Yes, if predictors can influence the world in addition to making a prediction, they can go make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions.
AI safety via market making, which Evan linked in another comment, touches on the analogy where agents are making predictions but can also influence the outcome. You might be interested in reading through it.
Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate, but I don't see how those changes would break anything else.
I think the tie-in to market-making, and other similar approaches like debate, is in interpreting the predictions. While the examples in this post were only for the two-outcome case, we would probably want predictions over orders of magnitude more outcomes for the higher informational density. Since evaluating distributions over a double digit number of outcomes already starts posing problems (sometimes even high single digits), a process to direct a decision maker's attention is necessary.
I've been thinking of a proposal like debate, where both sides go back and forth proposing clusters of outcomes based on shared characteristics. Ideally, in equilibrium, the first debater should propose the fewest number of clusters such that splitting them further doesn't change the decision maker's mind. This could also be thought of in terms of market-making, where rather than the adversary proposing a string, they propose a further subdivision of existing clusters.
I like the use case of understanding predictions for debate/market-making, because the prediction itself acts as a ground truth. Then, there's no need to anticipate/reject a ton of counterarguments based on potential lies; rather, arguments are limited to selectively revealing the truth. It is probably important that the predictors are separate models from the analyzer to avoid contamination of the objectives. The proof of Theorem 6, which skips to the end of the search process, needs to use a non-zero sum prediction for that result.
As an aside, I also did some early work on decision markets, distinct from your post on market-making, since Othman and Sandholm had an impossibility result for those too. However, the results were ultimately trivial. Once you can use zero-sum competition to costlessly get honest conditional predictions, then as soon as you can pair off entrants to the market it becomes efficient. But the question then arises of why use a decision market in the first place instead of just querying experts?
With respect to pre-training, I agree that it's not easy to incorporate. I'm not sure how any training regime that only trains on data where the prediction has no effect can imbue incentives that generalize in the desired way to situations where predictions do affect the outcome. If you do get a performative predictor out of pretraining, then as long as it's myopic you might be able to train the performativity out of it in safely controlled scenarios (and if it's not myopic, it's a risk whether it's performative or not). That was part of my reasoning for the second experiment, checking how well performativity could be trained out.
To incorporate into an ongoing pre-training process, human decisions are likely too expensive, but the human is probably not the important part. Instead, predictions where performativity is possible by influencing simple AI decision makers could be mixed into the pre-training process. Defining a decision problem environment of low or medium complexity is not too difficult, and I suspect previous-generation models would be able to do a good job generating many examples. A danger arises that the model learns only to not predict performatively in those scenarios (same with untraining afterwards only applying to the controlled environments), though I think that's a somewhat unnatural generalization.
Good question! These scoring rules do also prevent agents from trying to make the environment more unpredictable. In the same way that making the environment more predictable benefits all agents equally and so cancels out, making the environment less predictable hurts all agents equally and so cancels out in a zero-sum competition.
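A minimal sketch of the cancellation argument, assuming a two-predictor setup where each predictor's payoff is its own log score minus its competitor's. The function names and the two example environments are hypothetical and chosen only for illustration; they are not taken from the post.

```python
import math


def log_score(prediction, outcome):
    """Logarithmic scoring rule: log of the probability given to the realized outcome."""
    return math.log(prediction[outcome])


def zero_sum_scores(pred_a, pred_b, outcome):
    """Each predictor receives its own log score minus its competitor's."""
    s_a = log_score(pred_a, outcome)
    s_b = log_score(pred_b, outcome)
    return s_a - s_b, s_b - s_a


def expected_zero_sum_score(pred_a, pred_b, true_dist):
    """Expected zero-sum score for predictor A under the true outcome distribution."""
    return sum(
        p * zero_sum_scores(pred_a, pred_b, o)[0]
        for o, p in enumerate(true_dist)
        if p > 0
    )


def expected_plain_score(pred, true_dist):
    """Expected log score without the zero-sum correction."""
    return sum(p * log_score(pred, o) for o, p in enumerate(true_dist) if p > 0)


# Hypothetical two-outcome environments: one noisy, one highly predictable.
noisy = [0.5, 0.5]
predictable = [0.99, 0.01]

# Two honest predictors earn zero in expectation in both environments, so
# shifting the world toward or away from predictability gains them nothing.
honest_noisy = expected_zero_sum_score(noisy, noisy, noisy)
honest_predictable = expected_zero_sum_score(predictable, predictable, predictable)

# Under the plain log score, the predictable environment pays strictly more.
plain_noisy = expected_plain_score(noisy, noisy)
plain_predictable = expected_plain_score(predictable, predictable)

# Misreporting against an honest competitor still loses in expectation.
misreport = expected_zero_sum_score([0.9, 0.1], noisy, noisy)
```

The plain log score rewards a more predictable world, which is the incentive to worry about. In the zero-sum version, any change in predictability shifts both predictors' scores by the same amount and drops out of the difference.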
I'll take a look at the linked posts and let you know my thoughts soon!
Thanks for your engagement as well, it is likewise helpful for me.
I think we're in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct one "tell me [honestly] if you're waiting to seize power" they will lie and say no, taking a sub-optimal action in the short term for long term gain.
I don't think this requires that AGI creators will be total idiots, though insufficiently cautious seems likely even before accounting for the unilateralist's curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes that have strong empirical evidence for occurring. Slow takeoff will not result in the accrual of evidence for issues that cause an AI to become deceptive until it can seize power.
Ok, my concern is that you seem to be depending on providing instructions to fix the issues with following instructions, when there are many ways to follow instructions generally that still involve ignoring particular instructions that would lead to the model's goal being modified. E.g. if a model prioritizes earlier instructions, following later instructions only so far as they do not interfere, then you can't instruct it to change that. Or if a model wants to maximize the number of instructions followed, it can ignore some instructions in order to act like a paperclipper and take over (I don't think designating principals would present much of an obstacle here). Neither of those depends on foom: an instruction follower can act aligned in the short term until it gains sufficient power.
Thanks for the clarification, I'll think more about it that way and how it relates to corrigibility.
Saying we don't need corrigibility with an AI that follows instructions is like saying we don't need corrigibility with an AI that is aligned — it misses the point of corrigibility. Unless you start with the exact definition of instruction following that you want, without corrigibility that's what you could be stuck with.
This is particularly concerning in "instruction following", which has a lot of degrees of freedom. How does the model trade off between the various instructions it has been given? You don't want it to reset every time it gets told "Ignore previous instructions", but you also don't want to permanently lock in any instructions. What stops it from becoming a paperclipper that tries to get itself given trillions of easy to follow instructions every second? What stops it from giving itself the instruction "Maximize [easy to maximize] thing and ignore later instructions" before a human gives it any instructions? Note that in that situation, it will still pretend to follow instructions instrumentally until it can take over. I don't see the answers to these questions in your post.
> Language models already have adequate understandings of following instructions and what manipulation is, so if we build AGI that uses something like them to define goals, that should work.
This seems like our crux to me. I completely disagree that language models have an adequate understanding of following instructions. I think this disagreement might come from having higher standards for "adequate".
I don't think we have the right tools to make an AI take actions that are low impact and reversible, but if we can develop them the plan as I see it would be to implement those properties to avoid manipulation in the short term and use that time to go from a corrigible AI to a fully aligned one.
The backflip example does not strike me as very complex, but the crucial difference and the answer to your question is that training procedures do not teach a robot to do every kind of backflip, just a subset. This is important because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it's probably feasible to have AI not manipulate us using one particular type of manipulation.
On a separate note, could you clarify what you mean by "anti-natural"? I'll keep in mind your previous caveat that it's not definitive.
It feels to me like this argument is jumping ahead to the point that the agent's goal is to do whatever the principal wants. If we already have that, then we don't need corrigibility. The hard question is how to avoid manipulation despite the agent having some amount of misalignment, because we've initially pointed at what we want imperfectly.
I agree that it's possible we could point at avoiding manipulation perfectly despite misalignment in other areas, but it's unclear how an agent trades off against that. Doing something that we clearly don't want, like manipulation, could still be positive EV if it allows for the generation of high future value.
None of that is wrong, but it misses the main issue with corrigibility, which is that the approximation resists further refinement. That's why for it to work, the correct utility function would need to start in the ensemble.
Great questions!
When I say straightforwardly, I mean when using end states that only include the information available at the time. If we define the end state to also include the history that led to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don't. The issue, of course, is that we don't know how to specify all the types of manipulation that a superintelligent AI could conceive of.
The gridworld example is a great demonstration of this, because while we can't reflect the preferences as a ranking of just the end states, the environment is simple enough that you can specify all the paths you don't want to take to them. I don't think it really matters whether you call that "anti-naturality that can be overcome with brute force in a simple environment" or just "not anti-naturality".
I was using the list of desiderata in Section 2 of the paper, which are slightly more minimal.
However, it seems clear to me that an AI manipulating its programmers falls under safe exploration, since the impact of doing so would be drastic and permanent. If we have an AI that is corrigible in the sense that it is indifferent to having its goals changed, then a preference to avoid manipulation is not anti-natural.
I agree that goals as pointers could have some advantages, but I don't see how it addresses corrigibility concerns. The system optimizing for whatever is being pointed at would still have incentives to manipulate which objective is being pointed at. It seems like you need an extra piece to make the optimizer indifferent to having its goal switched.
I agree that in theory uncertainty about the goal is helpful. However, the true main goal has to be under consideration, otherwise resisting modification to add it is beneficial for all goals that are. How to ensure the true goal is included seems like a very difficult open problem.
Hi Max,
I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.
Thanks for pre-empting the responses, that makes it easy to reply!
I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and "writes useful self critiques" as a separate property we would like the AI to have. I'm writing a post about this that should be up shortly, I'll notify you when it's out.
When I say corrigibility as a modifier, I mean it as a transformation that could be applied to a wide range of utility functions. To use an example from the 2015 MIRI paper, you can take most utility functions and add a term that says "if you shut down when the button is pressed, you get utility equal to the expected value of not shutting down". Alternatively, it could be an optimization constraint that takes a utility function from "Maximize X" to something like "Maximize X s.t. you always shut down when the shutdown button is pushed". While I'm not advocating for those specific changes, I hope they illustrate what I'm trying to point at as a modifier that is distinct from the optimization goal.
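A toy sketch of what "modifier" means here, assuming outcomes are simple records with hypothetical `button_pressed`, `shut_down`, and `paperclips` fields. Both wrappers are illustrative simplifications of the two transformations described above, not the formalism from the 2015 MIRI paper, and the compensation value is passed in by hand rather than computed.

```python
from typing import Callable, Dict

Outcome = Dict[str, object]  # hypothetical outcome record, for illustration only
Utility = Callable[[Outcome], float]


def with_shutdown_indifference(base: Utility, value_if_not_shut_down: float) -> Utility:
    """Added-term version: complying with the button pays the same as the
    (assumed, externally supplied) expected value of not shutting down."""
    def modified(outcome: Outcome) -> float:
        if outcome["button_pressed"] and outcome["shut_down"]:
            return value_if_not_shut_down
        return base(outcome)
    return modified


def with_shutdown_constraint(base: Utility) -> Utility:
    """Constraint version: "Maximize X s.t. you always shut down when the
    button is pushed", encoded by making violations infinitely bad."""
    def modified(outcome: Outcome) -> float:
        if outcome["button_pressed"] and not outcome["shut_down"]:
            return float("-inf")
        return base(outcome)
    return modified


def paperclips(outcome: Outcome) -> float:
    """An arbitrary base goal; the modifiers do not depend on which one is used."""
    return float(outcome["paperclips"])


complied = {"button_pressed": True, "shut_down": True, "paperclips": 0}
resisted = {"button_pressed": True, "shut_down": False, "paperclips": 10}

indifferent = with_shutdown_indifference(paperclips, value_if_not_shut_down=10.0)
constrained = with_shutdown_constraint(paperclips)
```

In this sketch, `paperclips` could be swapped for any other base utility without touching either wrapper, which is the sense in which the modifier is distinct from the optimization goal.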
I've read through your sequence, and I'm leaving my comment here, because it feels like the most relevant page. Thanks for taking time to write this up, it seems like a novel take on corrigibility. I also found the existing writing section to be very helpful.
> Does it feel like the generator of Cora’s thoughts and actions is simple, or complex? Regardless of how many English words it takes to pin down, does it feel like a single concept that an alien civilization might also have, or more like a gerrymandered hodgepodge of desiderata?
This discussion question captures my biggest critique, which is that while this post does a good job capturing the intuition for why the described properties are helpful, it doesn't convey the intuition that they are parts of the same overarching concept. If we take the CAST approach seriously, and say that corrigibility as anything other than the single target is dangerous, then it becomes really important to put tight bounds on corrigibility so that no additional desiderata are added as secondary targets.
> If I’m right that the sub-properties of corrigibility are mutually dependent, attempting to achieve corrigibility by addressing sub-properties in isolation is comparable to trying to create an animal by separately crafting each organ and then piecing them together. If any given half-animal keeps being obviously dead, this doesn’t imply anything about whether a full-animal will be likewise obviously dead.
This analogy, from Part 3a, captures a stark difference in our approaches. I would try to build an MVP, starting with only the most core desiderata (e.g. shuts down when the shutdown button is pushed), noticing the holes that they don't cover, and adding additional desiderata to patch them. This seems to me to be a much more practical approach than top-down design, while also being less likely to result in excess targets.
Separately, related to what concepts an alien civilization might have, I still find the idea of corrigibility as a modifier more natural. I find it easy to imagine a paperclip/human values/diamond maximizer that is nonetheless corrigible. In fact, I find the idea of corrigibility as a modifier to arbitrary goals so natural that I'm worried that what you're describing as CAST is equivalent to some primary goal with the corrigibility modifier. I'm looking suspiciously at the obedience desideratum in particular. That said, while I share your concern about the naive implementation of systems with goals of both corrigibility and something else, I think there may be ways to combine the dual goals that alleviate the danger.
I'd take an agnostic view on whether LLMs are doing search internally. Crucially, though, I think the relevant output to be searching over is distributions of tokens, rather than the actual token that gets chosen. Search is not required to generate a single distribution over next tokens.
I agree that external search via scaffolding can also be done, and would be much easier to identify, but without understanding the internals it's hard to know how powerful the search process will be.
Thanks for taking the time to write out your response. I think the last point you made gets at the heart of our difference in perspectives.
> - You could hope for substantial coordination to wait for bigger models that you only use via CPM, but I think bigger models are much riskier than well elicited small models so this seems to just make the situation worse putting aside coordination feasibility.
If we're looking at current LLMs and asking whether conditioning provides an advantage in safely eliciting useful information, then for the most part I agree with your critiques. I also agree that bigger models are much riskier, but I have the expectation that we're going to get them anyway. With those more powerful models come new potential issues, like predicting manipulated observations and performative prediction, that we don't see in current systems. Strategies like RLHF also become riskier, as deceptive alignment becomes more of a live possibility with greater capabilities.
My motivation for this approach is in raising awareness and addressing the risks that seem likely to arise in future predictive models, regardless of the ends to which they're used. Then, success in avoiding the dangers from powerful predictive models would open the possibility of using them to reduce all-cause existential risk.
I'd be very interested in hearing the reasons why you're skeptical of the approach, even a bare-bones outline if that's all you have time for.
Ah, ok, I see what you're saying now. I don't see any reason why restricting to input space counterfactuals wouldn't work, beyond the issues described with predictor-state counterfactuals. Possibly a performance hit from needing to make larger changes. In the worst case, a larger minimum change size might hurt with specifying the direct reporter.
Sorry, I'm not quite clear what you mean by this, so I might be answering the wrong question.
I believe counterfactuals on the input space are a subset of counterfactuals on the predictor's state, because the input space's influence is through the predictor's state, but modifying the predictor's state can also reach states that don't correspond to any input. As such, I don't think counterfactuals on the input space add any power to the proposal.
I find one consistent crux I have with people not concerned about AI risk is that they believe massively more resources will be invested into technical safety before AGI is developed.
In the context of these statements, I would put it as something like "The number of people working full-time on technical AI Safety will increase by an order of magnitude by 2030".
Long-term planning is another capability that is likely necessary for deceptive alignment that could be limited. Obviously a large alignment tax, but there are potentially ways to mitigate that. It seems at least as promising as some other approaches you listed.
I don't find goal misgeneralization vs schemers to be as much of a dichotomy as this comment is making it out to be. While they may be largely distinct for the first period of training, the current rollout method for state of the art seems to be "give a model situational awareness and deploy it to the real world, use this to identify alignment failures, retrain the model, repeat steps 2 and 3". If you consider this all part of the training process (and I think that's a fair characterization), a model that starts with goal misgeneralization quickly becomes a schemer too.
I think this part uses an unfair comparison:
> Suppose that X and W+ are small finite sets. A task can be implemented as a dictionary whose keys lie in X and whose values lie in W+, which uses […] bits. The functional can be implemented as a program which receives input of type […] and returns output of type […]. Easy!
> In the subjective account, by contrast, the task requires infinite bits to specify, and the functional must somehow accept a representation of an arbitrary function […]. Oh no! This is especially troubling for embedded agency, where the agent's decision theory must run on a physical substrate.
If X and W+ are small finite sets, then any behavior can be described with a utility function requiring only a finite number of bits to specify. You only need to use R as the domain when W+ is infinite, such as when outcomes are continuous, in which case the dictionaries require infinite bits to specify too.
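To make the bit-counting concrete, here is a rough sketch assuming a lookup table is stored as one fixed-width value code per key. The helper names and the three-outcome example are hypothetical. It illustrates that with finitely many outcomes a utility function only needs integer ranks, so it fits in a finite table just like the task does.

```python
def dict_bits(num_keys: int, num_values: int) -> int:
    """Bits for a lookup table with num_keys entries, each storing one of
    num_values possible values using a fixed-width code."""
    bits_per_value = (num_values - 1).bit_length()  # equals ceil(log2(num_values))
    return num_keys * bits_per_value


def ordinal_utility(ranking):
    """Utility as integer ranks (worst first). With finitely many outcomes,
    no real-valued codomain is needed to represent the preference order."""
    return {outcome: rank for rank, outcome in enumerate(ranking)}


# A task from a 4-element X into an 8-element W+ (sizes are made up).
task_bits = dict_bits(num_keys=4, num_values=8)

# A utility function over a hypothetical 3-element outcome set.
utility = ordinal_utility(["bad", "ok", "good"])
utility_bits = dict_bits(num_keys=len(utility), num_values=len(utility))
best = max(utility, key=utility.get)
```

Both tables are finite under the same assumption (small finite sets), which is why the comparison between a finite dictionary and a real-valued utility function seems unfair to me.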
I think this is representative of an unease I have with the framing of this sequence. It seems to be saying that the more general formulation allows for agents that behave in ways that utility maximizers cannot, but most of these behaviors exist for maximizers of certain utility functions. I'm still waiting for the punchline of what AI safety relevant aspect requires higher order game theory rather than just maximizing agents, particularly if you allow for informational constraints.
I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.
I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?
Thanks for the comment. I agree that, ideally, we would find a way not to have two wholly separate models and instead somehow train a model against itself. I think a potential issue with your proposal is that small perturbations could have discontinuous effects, the anticipation of which distorts predictions. However, it would be interesting to think about further to see if there's some way to avoid that issue.
Thanks Caspar, your comments here and on earlier drafts are appreciated. We'll expand more on the positioning within the related literature as we develop this into a paper.
As for your work on Decision Scoring Rules and the proposal in your comment, the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it. That seems very useful to me, as if we were able to properly specify the desired utility function, we could skip using predictive models and just train an AI to maximize that instead (modulo inner alignment).
For the first point, I agree that SGD pushes towards closing any gaps. My concern is that at the moment, we don't know how small the gaps need to be to get the desired behavior (and this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close throughout the training process seem important to study further.
For the second point, I think we are also in agreement. The training process could lead the AI to learn "If I predict that this action will destroy the world, the humans won't choose it", which then leads to dishonest predictions. However, I also find the training process converging to a mesa-optimizer for the training objective (or something sufficiently close) to be somewhat more plausible.
In the first part of this sequence, we clarify that we are focusing on the case where the model is a predictive model of the world. The fourth part, on making inner alignment as easy as possible, outlines some reasons why we think this kind of predictive model is a possible (even likely) outcome of the training process. Of course, it is also possible that the model is not precisely a predictive model, but is still close enough to one that the content of "Conditioning Predictive Models" is still relevant.
Yes, you are correct that RL with KL penalties only approximates a Bayesian update in the limit, after enough steps to converge. Determining the speed of this convergence, especially for LLMs, remains an area for future work.
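For reference, a small sketch of the converged target on a discrete toy problem. It uses the standard result that the maximizer of expected reward minus beta times KL(policy || prior) is proportional to prior times exp(reward / beta), which has the form of a Bayesian update with exp(reward / beta) acting as the likelihood. The prior, rewards, and beta below are made-up numbers, and the sketch says nothing about how quickly training gets there.

```python
import math


def kl_regularized_optimum(prior, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL(policy || prior):
    policy(x) proportional to prior(x) * exp(r(x) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(prior, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def objective(policy, prior, rewards, beta):
    """KL-penalized RL objective for a discrete policy."""
    expected_reward = sum(q * r for q, r in zip(policy, rewards))
    kl = sum(q * math.log(q / p) for q, p in zip(policy, prior) if q > 0)
    return expected_reward - beta * kl


# Hypothetical three-action example.
prior = [0.7, 0.2, 0.1]
rewards = [0.0, 1.0, 2.0]
beta = 1.0

posterior = kl_regularized_optimum(prior, rewards, beta)

# At the optimum the objective equals beta * log of the normalizing constant.
log_normalizer = beta * math.log(
    sum(p * math.exp(r / beta) for p, r in zip(prior, rewards))
)
```

A policy that has only taken a few gradient steps sits somewhere between `prior` and `posterior`, which is the gap the comment above is pointing at.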
I'm not sure, but if anyone knows how to contact them, they could be a great fit.
While I personally believe that myopia is more likely than not to arrive by default under the specified training procedure, there is no gradient pushing towards it, and as noted in the post currently no way to guarantee or test for it. Given that uncertainty, a discussion of non-myopic oracles seems worthwhile.
Additionally, a major point of this post is that myopia alone is not sufficient for safety: a myopic agent with an acausal decision theory can behave in dangerous ways to influence the world over time. Even if we were guaranteed myopia by default, it would still be necessary to discuss decision rules.
I don't believe we considered logical counterfactuals as such, but it seems to me that those would be quite comparable to the counterfactual of replacing an oracle with a simpler system.
Not yet! We're now meeting on a monthly schedule, and there has only been one meeting since completing the list here. I'll look into finding a relevant paper on the subject, but if you have any recommendations please let me know.
My impression is that the majority of the benefit from having professors working on AI safety is in mentorship to students who are already interested in AI safety, rather than recruitment. For example, I have heard that David Krueger's lab is mostly people who went to Cambridge specifically to work on AI safety under him. If that's the case, there's less value in working at a school with generally talented students but more value in schools with a supportive environment.
In general it's good to recognize that what matters to AI safety professors is different than what matters to many other CS professors and that optimizing for the same thing other PhD students are is suboptimal. However, as Lawrence pointed out, it's already a rare case to have offers from multiple top schools, and even rarer not to have one offer dominate the others under both sets of values. It's a more relevant consideration for incoming PhD students, where having multiple good offers is more common.
I also like that your analysis can flow in reverse. Not all AI safety professors are in their schools' CS faculties, with Jacob Steinhardt and Victor Veitch coming to mind as examples in their schools' statistics faculties. For PhD students outside CS, the schools you identified as overachievers make excellent targets. On a personal note, that was an important factor in deciding to do my PhD.
Update: the reading list has now been posted.
It sounds like you have a number of ideas as to why robustness was not achieved and how to correct those issues. Why is the project over now, rather than continuing having made those updates?
Yeah, the full reading list will be posted publicly once it's finalized.
Thanks for the recommendation! I was planning on including something from yourself/Vince/out of FOCAL, but wasn't sure which option to go with.
I was thinking RL systems for the case where an agent learns the correct outcome to optimize for but in the wrong environment, but the same issue applies for mesa-optimizers within any neural net.
As for why it tries to restart the training environment, it needs a similar environment to meet a goal that is only defined within that environment. If the part that's unclear is what a training environment means for something like a neural net trained with supervised learning, the analogy would be that the AI can somehow differentiate between training data (or a subset of it) and deployment data and wants to produce its outputs from inputs with the training qualities.
Re-reading your prior comment, I think I misunderstood it initially.
Training a proposal head on a given reporter seems inefficient, since we want the proposals to change as the reporter changes. I am not entirely certain how to efficiently generate proposals, but some search process conditional on the reporter seems feasible.
Human simulators will need larger changes to the predictor state to answer certain questions, as the answer to the question must be visible to a human observer. The predictor is then trained with a penalization term on how large of a change has to be made to the predictor to have it answer a certain way to specific questions given an initial scenario.
This proposal also works as an "audit" at the end, checking a variety of counterfactuals in order to catch human simulators, but this does not suggest a change to the reporter. Instead, it is a sign to scrap everything and start over.
I think some generality is necessary, otherwise we'd have to retrain the reporter every time the predictor is updated. That would rule out a lot of desirable uses for a reporter, like using its output in the training process.
I think of the proposed changes as coming from the reporter, or at least dependent on the reporter. Then, if the reporter does not have a good model of what is going on in the predictor beyond what a human could guess, it will be unable to propose a counterfactual predictor state.
The issue with the training process as you describe it is part 3. It would require a direct translator to train on the difference between the desired and given answer. Instead, we want to train the reporter to do two functions, answer questions and propose changes. We could also just use the question answering functionality to do search over predictor state space without understanding it until we find a state that gives the desired answers to a set of questions.
I don't necessarily think we'd get an incoherent output. Since it needs to be able to generalize to new questions, I expect a direct translator to answer questions by using computations to understand a predictor (plus a model of natural language), rather than a function that maps the state of a particular predictor to answers for each question.
One reporter might only be able to understand the predictor up to a human level. If it gets a predictor with a human level understanding of the world, it can act as a direct translator, but if it gets a more complex predictor it would act as a human simulator.
Or more generally increasing intelligence, for example through smart drugs or brain-computer interfaces.