Posts

The Case for Predictive Models 2024-04-03T18:22:20.243Z
Searching for Searching for Search 2024-02-14T23:51:20.162Z
Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies 2023-05-26T17:44:35.575Z
Conditioning Predictive Models: Open problems, Conclusion, and Appendix 2023-02-10T19:21:20.251Z
Mechanism Design for AI Safety - Agenda Creation Retreat 2023-02-10T03:05:56.467Z
Conditioning Predictive Models: Deployment strategy 2023-02-09T20:59:01.473Z
Conditioning Predictive Models: Interactions with other approaches 2023-02-08T18:19:22.670Z
Conditioning Predictive Models: Making inner alignment as easy as possible 2023-02-07T20:04:20.272Z
Conditioning Predictive Models: The case for competitiveness 2023-02-06T20:08:55.404Z
Conditioning Predictive Models: Outer alignment via careful conditioning 2023-02-02T20:28:58.955Z
Conditioning Predictive Models: Large language models as predictors 2023-02-02T20:28:46.612Z
Stop-gradients lead to fixed point predictions 2023-01-28T22:47:35.008Z
Underspecification of Oracle AI 2023-01-15T20:10:42.190Z
Proper scoring rules don’t guarantee predicting fixed points 2022-12-16T18:22:23.547Z
Mechanism Design for AI Safety - Reading Group Curriculum 2022-10-25T03:54:20.777Z
Mesa-optimization for goals defined only within a training environment is dangerous 2022-08-17T03:56:43.452Z
Announcing: Mechanism Design for AI Safety - Reading Group 2022-08-09T04:21:50.551Z
Abram Demski's ELK thoughts and proposal - distillation 2022-07-19T06:57:35.265Z
Bounded complexity of solving ELK and its implications 2022-07-19T06:56:18.152Z

Comments

Comment by Rubi J. Hudson (Rubi) on The Case for Predictive Models · 2024-04-04T16:05:48.224Z · LW · GW

Thanks for taking the time to write out your response. I think the last point you made gets at the heart of our difference in perspectives. 

  • You could hope for substantial coordination to wait for bigger models that you only use via CPM, but I think bigger models are much riskier than well elicited small models so this seems to just make the situation worse putting aside coordination feasibility.

If we're looking at current LLMs and asking whether conditioning provides an advantage in safely eliciting useful information, then for the most part I agree with your critiques. I also agree that bigger models are much riskier, but I have the expectation that we're going to get them anyway. With those more powerful models come new potential issues, like predicting manipulated observations and performative prediction, that we don't see in current systems.  Strategies like RLHF also become riskier, as deceptive alignment becomes more of a live possibility with greater capabilities.

My motivation for this approach is in raising awareness and addressing the risks that seem likely to arise in future predictive models, regardless of the ends to which they're used. Then, success in avoiding the dangers from powerful predictive models would open the possibility of using them to reduce all-cause existential risk.

Comment by Rubi J. Hudson (Rubi) on The Case for Predictive Models · 2024-04-03T20:35:49.824Z · LW · GW

I'd be very interested in hearing the reasons why you're skeptical of the approach, even a bare-bones outline if that's all you have time for.

Comment by Rubi J. Hudson (Rubi) on Abram Demski's ELK thoughts and proposal - distillation · 2024-02-08T00:08:16.255Z · LW · GW

Ah, ok, I see what you're saying now. I don't see any reason why restricting to input-space counterfactuals wouldn't work, beyond the issues described with predictor-state counterfactuals. There might be a performance hit from needing to make larger changes, and in the worst case, a larger minimum change size might hurt with specifying the direct reporter.

Comment by Rubi J. Hudson (Rubi) on Abram Demski's ELK thoughts and proposal - distillation · 2024-02-07T19:48:29.719Z · LW · GW

Sorry, I'm not quite clear what you mean by this, so I might be answering the wrong question.

I believe counterfactuals on the input space are a subset of counterfactuals on the predictor's state, because the input space's influence is through the predictor's state, but modifying the predictor's state can also reach states that don't correspond to any input. As such, I don't think counterfactuals on the input space add any power to the proposal.

Comment by Rubi J. Hudson (Rubi) on AI Views Snapshots · 2023-12-14T10:56:01.530Z · LW · GW

One consistent crux I have with people who aren't concerned about AI risk is that they believe massively more resources will be invested into technical safety before AGI is developed.

In the context of these statements, I would put it as something like "The number of people working full-time on technical AI Safety will increase by an order of magnitude by 2030".

Comment by Rubi J. Hudson (Rubi) on List of strategies for mitigating deceptive alignment · 2023-12-06T06:40:01.554Z · LW · GW

Long-term planning is another capability that is likely necessary for deceptive alignment and that could potentially be restricted. There would obviously be a large alignment tax, but there are potentially ways to mitigate that. It seems at least as promising as some of the other approaches you listed.

Comment by Rubi J. Hudson (Rubi) on New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" · 2023-11-18T02:02:18.572Z · LW · GW

I don't find goal misgeneralization vs. schemers to be as much of a dichotomy as this comment is making it out to be. While they may be largely distinct for the first period of training, the current rollout method for state-of-the-art models seems to be "give a model situational awareness and deploy it to the real world, use this to identify alignment failures, retrain the model, and repeat the last two steps". If you consider this all part of the training process (and I think that's a fair characterization), a model that starts with goal misgeneralization quickly becomes a schemer too.

Comment by Rubi J. Hudson (Rubi) on Game Theory without Argmax [Part 1] · 2023-11-16T02:52:25.766Z · LW · GW

I think this part uses an unfair comparison:

Suppose that X and W⁺ are small finite sets. A task can be implemented as a dictionary whose keys lie in X and whose values lie in W⁺, which uses finitely many bits. The functional can be implemented as a program which receives input of type X → W⁺ and returns output of type P(X). Easy!

In the subjective account, by contrast, the task requires infinite bits to specify, and the functional must somehow accept a representation of an arbitrary function W⁺ → ℝ. Oh no! This is especially troubling for embedded agency, where the agent's decision theory must run on a physical substrate.

If X and W⁺ are small finite sets, then any behavior can be described with a utility function requiring only a finite number of bits to specify. You only need to use ℝ as the domain when W⁺ is infinite, such as when outcomes are continuous, in which case the task dictionaries require infinitely many bits to specify too.
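
A quick back-of-the-envelope version of this point, with toy numbers and a K-level discretization that are my own assumptions rather than anything from the post: when both sets are small and finite, the utility-function representation is just another finite dictionary.

```python
import math

# Toy bit-counting: X (options) and W+ (outcomes) are small finite sets, and
# utilities are restricted to K discrete levels instead of arbitrary reals.
X = ["a", "b", "c"]
W_plus = ["w1", "w2", "w3", "w4"]
K = 16

# "Objective" account: a task as a dictionary X -> W+.
task_bits = len(X) * math.ceil(math.log2(len(W_plus)))

# "Subjective" account with finite-precision utilities: a dictionary W+ -> range(K).
utility_bits = len(W_plus) * math.ceil(math.log2(K))

print(task_bits, utility_bits)  # 6 and 16 bits; both finite, no reals required
```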

I think this is representative of an unease I have with the framing of this sequence. It seems to be saying that the more general formulation allows for agents that behave in ways utility maximizers cannot, but most of those behaviors can also be produced by maximizers of suitably chosen utility functions. I'm still waiting for the punchline of which AI-safety-relevant aspect requires higher-order game theory rather than just maximizing agents, particularly if you allow for informational constraints.

Comment by Rubi J. Hudson (Rubi) on Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · 2023-06-01T20:34:41.161Z · LW · GW

I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having the action effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction, which makes them less likely to overlook negative outcomes, while choosing based on a specified utility function leads to exactly that. This is all modulo ELK, of course.
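
To make the contrast concrete, here is a minimal toy sketch; the action names, outcome set, and numbers are purely illustrative and not from the post. The same conditional predictions are either shown to a human directly or fed through a specified utility function, and a utility that under-penalizes catastrophe quietly accepts a small catastrophe probability that a human looking at the raw distributions would notice.

```python
import numpy as np

# Conditional outcome distributions P(outcome | action), as a predictor might report them.
outcomes = ["good", "mediocre", "catastrophic"]
conditional = {
    "action_A": np.array([0.65, 0.34, 0.01]),
    "action_B": np.array([0.60, 0.40, 0.00]),
}

# (a) Human-in-the-loop: just display the distributions. The 1% chance of
#     catastrophe under action_A is directly visible to the decision maker.
for action, dist in conditional.items():
    print(action, dict(zip(outcomes, dist)))

# (b) Utility specification: a utility that scores catastrophe as merely 0,
#     rather than hugely negative, picks action_A anyway.
utility = np.array([1.0, 0.5, 0.0])
chosen = max(conditional, key=lambda a: float(conditional[a] @ utility))
print("argmax choice:", chosen)  # action_A, despite the catastrophe risk
```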

I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?

Comment by Rubi J. Hudson (Rubi) on Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · 2023-05-29T11:23:39.683Z · LW · GW

Thanks for the comment. I agree that, ideally, we would find a way not to have two wholly separate models and instead somehow train a model against itself. I think a potential issue with your proposal is that small perturbations could have discontinuous effects, the anticipation of which distorts predictions. However, it would be interesting to think about it further to see whether there's some way to avoid that issue.

Comment by Rubi J. Hudson (Rubi) on Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · 2023-05-29T11:10:27.202Z · LW · GW

Thanks Caspar, your comments here and on earlier drafts are appreciated. We'll expand more on the positioning within the related literature as we develop this into a paper.

As for your work on Decision Scoring Rules and the proposal in your comment, the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape its behavior into maximizing it. That seems very useful to me, since if we were able to properly specify the desired utility function, we could skip using predictive models and just train an AI to maximize it directly (modulo inner alignment).
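
For readers following along, here is a minimal sketch of the zero-sum scoring idea as I would summarize it; the function names are mine, the setup is heavily simplified relative to the post, and the cancellation argument assumes both predictions influence the world in the same way (e.g. only through the action the decision maker ends up taking).

```python
import numpy as np

def log_score(p, outcome):
    # Logarithmic scoring rule: reward for the probability assigned to the
    # outcome that actually occurred.
    return float(np.log(p[outcome]))

def zero_sum_rewards(p1, p2, outcome):
    # Each predictor is rewarded by its log score minus the other's, so any
    # effect the predictions have on which outcome occurs hits both scores
    # equally and cancels out of the zero-sum reward.
    s1, s2 = log_score(p1, outcome), log_score(p2, outcome)
    return s1 - s2, s2 - s1

# Example: two reports over three outcomes, with outcome 2 realized.
p1 = np.array([0.20, 0.30, 0.50])
p2 = np.array([0.25, 0.25, 0.50])
r1, r2 = zero_sum_rewards(p1, p2, outcome=2)
print(r1, r2)  # both assigned 0.5 to the realized outcome, so both rewards are 0
```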

Comment by Rubi J. Hudson (Rubi) on Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · 2023-05-29T10:46:28.662Z · LW · GW

For the first point, I agree that SGD pushes towards closing any gaps. My concern is that, at the moment, we don't know how small the gaps need to be to get the desired behavior (this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close throughout the training process seem important to study further.

For the second point, I think we are also in agreement. The training process could lead the AI to learn "If I predict that this action will destroy the world, the humans won't choose it", which then leads to dishonest predictions. However, I find it somewhat more plausible that the training process converges to a mesa-optimizer for the training objective (or something sufficiently close to it).

Comment by Rubi J. Hudson (Rubi) on Conditioning Predictive Models: Outer alignment via careful conditioning · 2023-02-20T03:22:06.444Z · LW · GW

In the first part of this sequence, we clarify that we are focusing on the case where the model is a predictive model of the world. The fourth part, on making inner alignment as easy as possible, outlines some reasons why we think this kind of predictive model is a possible (even likely) outcome of the training process. Of course, it is also possible that the model is not precisely a predictive model, but is still close enough to one that the content of "Conditioning Predictive Models" remains relevant.

Comment by Rubi J. Hudson (Rubi) on Conditioning Predictive Models: Interactions with other approaches · 2023-02-20T02:07:30.495Z · LW · GW

Yes, you are correct that RL with KL penalties only approximates a Bayesian update in the limit, after enough steps to converge. Determining the speed of this convergence, especially for LLMs, remains an area for future work.
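
As a toy numerical check of that limiting claim (the setup below is my own and not specific to the sequence): the policy that exactly maximizes expected reward minus a KL penalty to the reference policy is the reference policy reweighted by exp(reward / beta), i.e. a Bayesian-style update, and RL only reaches this after converging.

```python
import numpy as np

rng = np.random.default_rng(0)
pi0 = rng.dirichlet(np.ones(5))   # reference (pretrained) policy over 5 actions
r = rng.normal(size=5)            # reward for each action
beta = 0.5                        # strength of the KL penalty

# Closed-form optimum of  E_pi[r] - beta * KL(pi || pi0):
# pi*(a) is proportional to pi0(a) * exp(r(a) / beta).
pi_star = pi0 * np.exp(r / beta)
pi_star /= pi_star.sum()

def objective(pi):
    return float(pi @ r - beta * np.sum(pi * np.log(pi / pi0)))

# Random search over the simplex never beats the closed form (up to noise).
best_random = max(objective(rng.dirichlet(np.ones(5))) for _ in range(100_000))
print(objective(pi_star), best_random)  # the first value is the larger one
```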

Comment by Rubi J. Hudson (Rubi) on Mechanism Design for AI Safety - Agenda Creation Retreat · 2023-02-10T19:03:09.118Z · LW · GW

I'm not sure, but if anyone knows how to contact them, they could be a great fit.

Comment by Rubi J. Hudson (Rubi) on Underspecification of Oracle AI · 2023-01-16T16:57:10.628Z · LW · GW

While I personally believe that myopia is more likely than not to arise by default under the specified training procedure, there is no gradient pushing towards it, and, as noted in the post, there is currently no way to guarantee or test for it. Given that uncertainty, a discussion of non-myopic oracles seems worthwhile.

Additionally, a major point of this post is that myopia alone is not sufficient for safety: a myopic agent with an acausal decision theory can still behave in dangerous ways to influence the world over time. Even if we were guaranteed myopia by default, it would still be necessary to discuss decision rules.

Comment by Rubi J. Hudson (Rubi) on Underspecification of Oracle AI · 2023-01-16T16:40:15.136Z · LW · GW

I don't believe we considered logical counterfactuals as such, but it seems to me that those would be quite comparable to the counterfactual of replacing an oracle with a simpler system.

Comment by Rubi J. Hudson (Rubi) on Mechanism Design for AI Safety - Reading Group Curriculum · 2022-12-28T19:15:09.158Z · LW · GW

Not yet! We're now meeting on a monthly schedule, and there has only been one meeting since completing the list here. I'll look into finding a relevant paper on the subject, but if you have any recommendations please let me know.

Comment by Rubi J. Hudson (Rubi) on Where to be an AI Safety Professor · 2022-12-08T15:08:45.479Z · LW · GW

My impression is that the majority of the benefit from having professors working on AI safety is in mentorship for students who are already interested in AI safety, rather than recruitment. For example, I have heard that David Krueger's lab mostly consists of people who went to Cambridge specifically to work on AI safety under him. If that's the case, there's less value in working at a school with generally talented students and more value in working at a school with a supportive environment.

In general, it's good to recognize that what matters to AI safety professors is different from what matters to many other CS professors, and that optimizing for the same things as other PhD students is suboptimal. However, as Lawrence pointed out, it's already a rare case to have offers from multiple top schools, and even rarer not to have one offer dominate the others under both sets of values. It's a more relevant consideration for incoming PhD students, for whom having multiple good offers is more common.

I also like that your analysis can flow in reverse. Not all AI safety professors are in their schools' CS faculties; Jacob Steinhardt and Victor Veitch come to mind as examples in their schools' statistics faculties. For PhD students outside CS, the schools you identified as overachievers make excellent targets. On a personal note, that was an important factor in deciding where to do my PhD.

Comment by Rubi J. Hudson (Rubi) on Announcing: Mechanism Design for AI Safety - Reading Group · 2022-10-25T04:05:23.826Z · LW · GW

Update: the reading list has now been posted.

Comment by Rubi J. Hudson (Rubi) on Takeaways from our robust injury classifier project [Redwood Research] · 2022-09-22T04:47:15.742Z · LW · GW

It sounds like you have a number of ideas as to why robustness was not achieved and how to correct those issues. Why is the project over now, rather than continuing after making those updates?

Comment by Rubi J. Hudson (Rubi) on Announcing: Mechanism Design for AI Safety - Reading Group · 2022-08-22T23:05:32.569Z · LW · GW

Yeah, the full reading list will be posted publicly once it's finalized.

Thanks for the recommendation! I was planning on including something from yourself/Vince/out of FOCAL, but wasn't sure which option to go with.

Comment by Rubi J. Hudson (Rubi) on Mesa-optimization for goals defined only within a training environment is dangerous · 2022-08-18T03:30:35.794Z · LW · GW

I was thinking of RL systems, for the case where an agent learns the correct outcome to optimize for but in the wrong environment, though the same issue applies to mesa-optimizers within any neural net.

As for why it tries to restart the training environment, it needs a similar environment to meet a goal that is only defined within that environment. If the part that's unclear is what a training environment means for something like a neural net trained with supervised learning, the analogy would be that the AI can somehow differentiate between training data (or a subset of it) and deployment data and wants to produce its outputs from inputs with the training qualities.

Comment by Rubi J. Hudson (Rubi) on Abram Demski's ELK thoughts and proposal - distillation · 2022-08-15T19:57:07.583Z · LW · GW

Re-reading your prior comment, I think I misunderstood it initially.

Training a proposal head on a given reporter seems inefficient, since we want the proposals to change as the reporter changes. I am not entirely certain how to efficiently generate proposals, but some search process conditional on the reporter seems feasible.

Human simulators will need larger changes to the predictor state to answer certain questions, as the answer to the question must be visible to a human observer. The reporter is then trained with a penalization term on how large a change has to be made to the predictor to have it answer a certain way to specific questions, given an initial scenario.
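
A toy illustration of that penalty, with a setup that is entirely my own and far simpler than anything in the actual proposal: take the "predictor state" to be a vector, the "reporter" a simple function of it, and measure the smallest perturbation that makes the reporter give a target answer. A human-simulator-like reporter would tend to need larger perturbations, so this quantity can serve as a penalty term.

```python
import numpy as np

rng = np.random.default_rng(0)

def reporter(z, question_vec):
    # Stand-in reporter: answers "yes" (1.0) iff the predictor state z has a
    # positive dot product with a vector representing the question.
    return float(z @ question_vec > 0.0)

def min_change_penalty(z, question_vec, target, n_samples=20_000):
    # Size of the smallest perturbation to the predictor state z (found by
    # crude random search) that makes the reporter give the target answer.
    if reporter(z, question_vec) == target:
        return 0.0
    best = np.inf
    for _ in range(n_samples):
        delta = rng.normal(size=z.shape)
        delta *= rng.uniform(0.0, 5.0) / np.linalg.norm(delta)
        if reporter(z + delta, question_vec) == target:
            best = min(best, float(np.linalg.norm(delta)))
    return best

z = rng.normal(size=8)            # toy "predictor state"
q = rng.normal(size=8)            # toy "question"
flipped_answer = 1.0 - reporter(z, q)
print(min_change_penalty(z, q, target=flipped_answer))
```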

This proposal also works as an "audit" at the end, checking a variety of counterfactuals in order to catch human simulators, but this does not suggest a change to the reporter. Instead, it is a sign to scrap everything and start over.

Comment by Rubi J. Hudson (Rubi) on Bounded complexity of solving ELK and its implications · 2022-08-15T18:32:24.677Z · LW · GW

I think some generality is necessary, otherwise we'd have to retrain the reporter every time the predictor is updated. That would rule out a lot of desirable uses for a reporter, like using its output in the training process.

Comment by Rubi J. Hudson (Rubi) on Abram Demski's ELK thoughts and proposal - distillation · 2022-08-01T23:33:58.295Z · LW · GW

I think of the proposed changes as coming from the reporter, or at least dependent on the reporter. Then, if the reporter does not have a good model of what is going on in the predictor beyond what a human could guess, it will be unable to propose a counterfactual predictor state.

The issue with the training process as you describe it is part 3. It would require a direct translator to train on the difference between the desired and given answer. Instead, we want to train the reporter to do two functions, answer questions and propose changes. We could also just use the question answering functionality to do search over predictor state space without understanding it until we find a state that gives the desired answers to a set of questions.

Comment by Rubi J. Hudson (Rubi) on Bounded complexity of solving ELK and its implications · 2022-08-01T23:22:57.011Z · LW · GW

I don't necessarily think we'd get an incoherent output. Since the reporter needs to be able to generalize to new questions, I expect a direct translator to answer questions by running computations that understand a predictor (plus a model of natural language), rather than by implementing a function that maps the state of a particular predictor to answers for each question.

One reporter might only be able to understand the predictor up to a human level. If it gets a predictor with a human-level understanding of the world, it can act as a direct translator, but if it gets a more complex predictor it would act as a human simulator.

Comment by Rubi J. Hudson (Rubi) on On how various plans miss the hard bits of the alignment challenge · 2022-07-12T05:39:52.070Z · LW · GW

Or more generally increasing intelligence, for example through smart drugs or brain-computer interfaces.