Nice, you've expressed the generalization argument for expecting goal-directedness really well. Most of the post seems to match my beliefs.
I’m moderately optimistic about blackbox control (maybe 50-70% risk reduction on high-stakes failures?).
I want you to clarify what this means, and try to get at some of the latent variables behind it.
One interpretation is that you mean any specific high-stakes attempt to subvert control measures is 50-70% likely to fail. But if we kept doing approximately the same set-up after this, then an attempt would soon succeed with high probability.
The way I think about blackbox control, the success condition is "getting an alignment solution that we trust". So another interpretation is you're saying 30-50% chance of this (maybe conditioned on no low-stakes sabotage)? If so, this implies a guess about the scale of effort required; could you give a description of that scale? Because if you're thinking Manhattan-project-scale with lots of exploration and freedom of experimentation, then this is a very different level of optimism than if you're thinking about a single PhD project scale.
I’m high on scheming probability (65% or higher on inside view but lower once I defer to people).
Why not higher? I don't see where the inside view uncertainty is coming from. Is it uncertainty over how future training will be done? Or uncertainty over the shape of mind-space-in-general, where accidental implicit biases related to e.g. {"conservativeness", "trying-hard", "follow heuristics instead of supergoal reasoning"} might make instrumental reward seeking unlikely by default?
It's not about building less useful technology, that's not what Abram or Ryan are talking about (I assume). The field of alignment has always been about strongly superhuman agents. You can have tech that is useful and also safe to use, there's no direct contradiction here.
Maybe one weak-ish historical analogy is explosives? Some explosives are unstable, and will easily explode by accident. Some are extremely stable, and can only be set off by a detonator. Early in the industrial chemistry tech tree, you only have access to one or two ways to make explosives. If you're desperate, you use these whether or not they are stable, because the risk-usefulness tradeoff is worth it. A bunch of your soldiers will die, and your weapons caches will be easier to destroy, but that's a cost you might be willing to pay. As your industrial chemistry tech advances, you invent many different types of explosive, and among these choices you find ones that are both stable and effective, because obviously this is better in every way.
Maybe another is medications? As medications advanced, as we gained choice and specificity in medications, we could choose medications that had both low side-effects and were effective. Before that, there was often a choice, and the correct choice was often to not use the medicine unless you were literally dying.
In both these examples, sometimes the safety-usefulness tradeoff was worth it, sometimes not. Presumably, in both cases, people often made the choice not to use unsafe explosives or unsafe medicine, because the risk wasn't worth it.
As it is with these technologies, so it is with AGI. There are a bunch of future paradigms of AGI building. The first one we stumble into isn't looking like one where we can precisely specify what it wants. But if we were able to keep experimenting and understanding and iterating after the first AGI, and we gradually developed dozens of ways of building AGI, then I'm confident we could find one that is just as intelligent and also could have its goals precisely specified.
My two examples above don't quite answer your question, because "humanity" didn't steer away from using them, just individual people at particular times. For examples where all or large sections of humanity steered away from using an extremely useful tech whose risks purportedly outweighed benefits: Project Plowshare, nuclear power in some countries, GMO food in some countries, viral bioweapons (as far as I know), eugenics, stem cell research, cloning. Also {CFCs, asbestos, leaded petrol, CO2 to some extent, radium, cocaine, heroin} after the negative externalities were well known.
I guess my point is that safety-usefulness tradeoffs are everywhere, and tech development choices that take into account risks are made all the time. To me, this makes your question utterly confused. Building technology that actually does what you want (which is be safe and useful) is just standard practice. This is what everyone does, all the time, because obviously safety is one of the design requirements of whatever you're building.
The main difference between the above technologies and AGI is that AGI is a trapdoor. The cost of messing up AGI is that you lose any chance to try again. AGI shares with some of the above technologies an epistemic problem. For many of them it isn't clear in advance, to most people, how much risk there actually is, and therefore whether the tradeoff is worth it.
After writing this, it occurred to me that maybe by "competitive" you meant "earlier in the tech tree"? I interpreted it in my comment as a synonym of "useful" in a sense that excluded safe-to-use.
Can you link to where RP says that?
Do you not see how they could be used here?
This one. I'm confused about what the intuitive intended meaning of the symbol is. Sorry, I see why "type signature" was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe is a boolean fact that is edited? But if so I don't know which fact it is, and I'm confused by the way you described it.
Because we're talking about priors and their influence, all of this is happening inside the agent's brain. The agent is going about daily life, and thinks "hm, maybe there is an evil demon simulating me who will give me -10^10^10 utility if I don't do what they want for my next action". I don't see why this is obviously ill-defined without further specification of the training setup.
Can we replace this with: "The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions lead to -10^10 utility."? This is what it's like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing it's in a simulation.
I'm not sure what the type signature of is, or what it means to "not take into account 's simulation". When makes decisions about which actions to take, it doesn't have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to "not take it into account"?
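For concreteness, here's a toy sketch (entirely illustrative, with made-up names and numbers, not anyone's actual proposal) of the standard decision loop, which is why "not taking the world model into account" doesn't look like an available operation to me:

```python
# Toy expected-utility decision loop: the agent's only access to consequences
# is through its (possibly malign) world model.

def choose_action(actions, world_model, utility):
    """Pick the action whose predicted outcomes have the highest expected utility."""
    def expected_utility(action):
        # world_model(action) -> {outcome: probability}. There is no separate,
        # model-free input telling the agent which of its own predictions to distrust.
        return sum(p * utility(outcome) for outcome, p in world_model(action).items())
    return max(actions, key=expected_utility)

# Hypothetical malign-ish world model: most actions are predicted to be catastrophic.
def world_model(action):
    if action == "comply_with_demon":
        return {"ok": 1.0}
    return {"catastrophe": 0.9, "ok": 0.1}

utility = {"ok": 0.0, "catastrophe": -10**10}.get

print(choose_action(["comply_with_demon", "ignore_demon"], world_model, utility))
# -> "comply_with_demon": the choice is driven entirely by the beliefs.
```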
So the way in which the agent "gets its beliefs" about the structure of the decision theory problem is via these logical-counterfactual-conditional operation
I think you've misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it's in a particular situation (e.g. with omega and knowing how omega will react to different actions). But in reality, an agent has to learn which kind of world it's in using an inductor. That's all I meant by "get its beliefs".
Well my response to this was:
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
But I'll expand: An agent doing that kind of game-theory reasoning needs to model the situation it's in. And to do that modelling it needs a prior. Which might be malign.
Malign agents in the prior don't feel like malign agents in the prior, from the perspective of the agent with the prior. They're just beliefs about the way the world is. You need beliefs in order to choose actions. You can't just decide to act in a way that is independent of your beliefs, merely because you've decided your beliefs are out to get you.
On top of this, how would you even decide that your beliefs are out to get you? Isn't this also a belief?
Yeah I know that bound, I've seen a very similar one. The problem is that mesa-optimisers also get very good prediction error when averaged over all predictions. So they exist well below the bound. And they can time their deliberately-incorrect predictions carefully, if they want to survive for a long time.
How does this connect to malign prior problems?
But why would you ever be able to solve the problem with a different decision theory? If the beliefs are manipulating it, it doesn't matter what the decision theory is.
To respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor.
Because we have the prediction error bounds.
Not ones that can rule out any of those things. My understanding is that the bounds are asymptotic or average-case in a way that makes them useless for this purpose. So if a mesa-inductor is found first that has a better prior, it'll stick with the mesa-inductor. And if it has goals, it can wait as long as it wants to make a false prediction that helps achieve its goals. (Or just make false predictions about counterfactuals that are unlikely to be chosen).
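Roughly the kind of bound I mean (a standard Solomonoff-style cumulative error bound, stated from memory, so treat the constant as approximate):

```latex
% If the environment is a computable measure \mu (binary alphabet) and M is the
% universal (Solomonoff) predictor, then roughly:
\[
\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[\Big(M(x_t{=}1\mid x_{<t}) - \mu(x_t{=}1\mid x_{<t})\Big)^{2}\right]
\;\lesssim\; \frac{\ln 2}{2}\,K(\mu).
\]
% Only the *total* error budget is bounded. Nothing constrains *when* the errors
% happen, so a hypothesis that predicts perfectly for a long time and then defects
% on a few carefully chosen bits sits comfortably within the bound.
```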
If I'm wrong then I'd be extremely interested in seeing your reasoning. I'd maybe pay $400 for a post explaining the reasoning behind why prediction error bounds rule out mesa-optimisers in the prior.
You also want one that generalises well, and doesn't do performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is actually about finding more effective approximate models for things in different regimes.
When we compare theories, we don't consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself.
E.g. the theory of evolution isn't quite code for a costly simulation. But it can be viewed as a set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn't involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.
Edit to respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor.
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
One thing to keep in mind is that time cut-offs will usually rule out our own universe as a hypothesis. Our universe is insanely compute inefficient.
So the "hypotheses" inside your inductor won't actually end up corresponding to what we mean by a scientific hypothesis. The only reason this inductor will work at all is that it's done a brute force search over a huge space of programs until it finds one that works. Plausibly it'll just find a better efficient induction algorithm, with a sane prior.
I'm not sure whether it implies that you should be able to make a task-based AGI.
Yeah I don't understand what you mean by virtues in this context, but I don't see why consequentialism-in-service-of-virtues would create different problems than the more general consequentialism-in-service-of-anything-else. If I understood why you think it's different then we might communicate better.
(Later you mention unboundedness too, which I think should be added to difficulty here)
By unbounded I just meant the kind of task where it's always possible to do better by using a better plan. It basically just means that an agent will select the highest difficulty version of the task that is achievable. I didn't intend it as a different thing from difficulty, it's basically the same.
I'm not sure about that, because the fact that the task is being completed in service of some virtue might limit the scope of actions that are considered for it. Again I think it's on me to paint a more detailed picture of the way the agent works and how it comes about in order for us to be able to think that through.
True, but I don't think the virtue part is relevant. This applies to all instrumental goals, see here (maybe also the John-Max discussion in the comments).
It could still be a competent agent that often chooses actions based on the outcomes they bring about. It's just that that happens as an inner loop in service of an outer loop which is trying to embody certain virtues.
I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us. Which I think is a good approach! But the open problems for making a task-based AGI still apply, in particular the inner alignment problems.
agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".
Obvious nitpick: It's just "gain as much power as is helpful for achieving whatever my goals are". I think maybe you think instrumental convergence has stronger power-seeking implications than it does. It only has strong implications when the task is very difficult.[1]
But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do we still have to worry that unless such AIs are motivated by certain very specific virtues, they will want to take over the world?
[...]
Is there any reason to expect that the best way to be a schmoyal schmend is to take over the world?
(Assuming that the inner loop <-> outer loop interface problem is solved, so the inner loop isn't going to take control). Depends on the tasks that the outer loop is giving to the part-capable-of-consequentialism. If it's giving nice easy bounded tasks, then no, there's no reason to expect it to take over the world as a sub-task.
But since we ultimately want the AGI to be useful for avoiding takeover from other AGIs, it's likely that some of the tasks will be difficult and/or unbounded. For those difficult unbounded tasks, becoming powerful enough to take over the world is often the easiest/best path.
- ^
I'm assuming soft optimisation here. Without soft optimisation, there's an incentive to gain power as long as that marginally increases the chance of success, which it usually does. Soft optimisation solves that problem.
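To illustrate what I mean by soft optimisation (I'm thinking of something quantilizer-flavoured; the numbers below are made up and purely illustrative):

```python
import random

# Toy quantilizer: instead of taking the argmax plan, sample uniformly from the
# top-q fraction of plans. (This is my quantilization-flavoured reading of
# "soft optimisation".)

def quantilize(plans, score, q=0.1, rng=random):
    ranked = sorted(plans, key=score, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return rng.choice(top)

# Under argmax, a power-grabbing plan wins whenever it adds any sliver of success
# probability. Sampling from the top decile keeps performance high while only
# rarely landing on the most extreme plans.
plans = [{"power_grab": s > 0.995, "success": s} for s in (i / 1000 for i in range(1000))]
print(quantilize(plans, score=lambda p: p["success"], q=0.1))
```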
But in practice, agents represent both of these in terms of the same underlying concepts. When those concepts change, both beliefs and goals change.
I like this reason to be unsatisfied with the EUM theory of agency.
One of the difficulties in theorising about agency is that all the theories are flexible enough to explain anything. Each theory is incomplete and vague in some way, so this makes the problem worse, but even when you make a detailed model of e.g. active inference, it ends up being pretty much formally equivalent to EUM.
I think the solution to this is to compare theories using engineering desiderata. Our goal is ultimately to build a safe AGI, so we want a theory that helps us reason about safety desiderata.
One of the really important safety desiderata is some kind of goal stability. When we build a powerful agent, we don't want it to change its mind about what's important. It should act to achieve known, predictable outcomes, even when it discovers facts and concepts we don't know about.
So my criticism of this research direction is that I don't think it'll be a good framework for making goal-stable agents. You want a framework that naturally models internal conflict of goals, and in particular you want to model this as conflict between agents. Conflict and cooperation between bounded, not-quite-rational agents is messy and hard to predict. Multi-agent systems are complex and detail dependent. Therefore it seems difficult to show that the overall agent will be stable.
(A reasonable response would be "but no proposed vague theories of bounded agency have this goal stability property, maybe this coalitional approach will turn out to help us come up with a solution", and that's true and fair enough, but I think research directions like this seem more promising).
I think the scheme you're describing caps the agent at moderate problem-solving capabilities. Not being able to notice past mistakes is a heck of a disability.
It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.
But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research") it feels like shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.
Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.
I guess I shouldn't respond too much in public until you've published the doc, but:
- If I'm interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments involving training (or otherwise messing with in some way) an AGI. I hope you have some empathy for the internal screaming I have toward this category of things.
- A bunch of the ideas do seem reasonable to want to try (given that you had AGIs to play with, and were very confident that doing so wouldn't allow them to escape or otherwise gain influence). I am sympathetic to the various ideas that involve gaining understanding of how to influence goals better by training in various ways.
- There are chunks of these ideas that definitely aren't "prosaic and relatively unenlightened ML research", and involve very-high-trust security stuff or non-trivial epistemic work.
- I'd be a little more sympathetic to these kinda desperate last-minute things if I had no hope in literally just understanding how to build task-AGI properly, in a well understood way. We can do this now. I'm baffled that almost all of the EA-alignment-sphere has given up on even trying to do this. From talking to people this weekend this shift seems downstream of thinking that we can make AGIs do alignment work, without thinking this through in detail.
The total quantity of risk reduction is unclear, but seems substantial to me. I'd guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time
Agree it's unclear. I think the chance of most of the ideas being helpful depends on some variables that we don't clearly know yet. I think a 90% risk reduction can't be right, because there's a lot of correlation between each of the things working or failing. And a lot of the risk comes from imperfect execution of the control scheme, which adds on top.
One underlying intuition that I want to express: The world where we are making proto-AGIs run all these experiments is pure chaos. Politically and epistemically and with all the work we need to do. I think pushing toward this chaotic world is much worse than other worlds we could push for right now.
But if I thought control was likely to work very well and saw a much more plausible path to alignment among the "stuff to try", I'd think it was a reasonable strategy.
I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively
On some axes, but won't there also be axes where AIs are more difficult than humans? Sycophancy&slop being the most salient. Misalignment issues being another.
This work isn't extremely easy to verify or scale up (such that I don't think "throw a billion dollars at it" just works),
This makes sense now. But I think this line should make you worry about whether you can make controlled AIs do it.
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.
I think we kind of agree here. The cruxes remain: I think that the metric for "behave well" won't be good enough for "real" large research acceleration. And "average case" means very little when it allows room for deliberate-or-not mistakes at the times when they can plausibly be gotten away with. [Edit: Or sabotage, escape, etc.]
Also, you need hardcore knowledge restrictions in order for the AI not to be able to tell the difference between I'm-doing-original-research vs humans-know-how-to-evaluate-this-work. Such restrictions are plausibly crippling for many kinds of research assistance.
FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in a online training sense) implies you can make faithful simulations.
I think there exists an extremely strong/unrealistic version of believing in "data-efficient long-horizon RL" that does allow this. I'm aware you don't believe this version of the statement, I was just using it to illustrate one end of a spectrum. Do you think the spectrum I was illustrating doesn't make sense?
Yep this is the third crux I think. Perhaps the most important.
To me it looks like you're making a wild guess that "prosaic and relatively unenlightened ML research" is a very large fraction of the necessary work for solving alignment, without any justification that I know of?
For all the pathways to solving alignment that I am aware of, this is clearly false. I think if you know of a pathway that just involves mostly "prosaic and relatively unenlightened ML research", you should write out this plan, why you expect it to work, and then ask OpenPhil to throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?
I'm not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the "core mistake" comment below, and the "faithful simulators" comment is another possibility.
Maybe another relevant thing that looks wrong to me: You will still get slop when you train an AI to look like it is epistemically virtuously updating its beliefs. You'll get outputs that look very epistemically virtuous, but it takes time and expertise to rank them in a way that reflects actual epistemic virtue level, just like other kinds of slop.
I don't see why you would have more trust in agents created this way.
(My parent comment was more of a semi-serious joke/tease than an argument, my other comments made actual arguments after I'd read more. Idk why this one was upvoted more, that's silly).
these are also alignment failures we see in humans.
Many of them have close analogies in human behaviour. But you seem to be implying "and therefore those are non-issues"???
There are many groups of humans (or individual humans) that, if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.
How is this evidence in favour of your plan ultimately resulting in a solution to alignment???
but these systems empirically often move in reasonable and socially-beneficial directions over time
Is this the actual basis of your belief in your plan to ultimately get a difficult scientific problem solved?
and i expect we can make AI agents a lot more aligned than humans typically are
Ahh I see. Yeah this is crazy, why would you expect this? I think maybe you're confusing yourself by using the word "aligned" here, can we taboo it? Human reflective instability looks like: they realize they don't care about being a lawyer and go become a monk. Or they realize they don't want to be a monk and go become a hippy (this one's my dad). Or they have a mid-life crisis and do a bunch of stereotypical mid-life crisis things. Or they go crazy in more extreme ways.
We have a lot of experience with the space of human reflective instabilities. We're pretty familiar with the ways that humans interact with tribes and are influenced by them, and sometimes break with them.
But the space of reflective-goal-weirdness is much larger and stranger than we have (human) experience with. There are a lot of degrees of freedom in goal specification that we can't nail down easily through training. Also, AIs will be much newer, much more in progress, than humans are (not quite sure how to express this, another way to say it is to point to the quantity of robustness&normality training that evolution has subjected humans to).
Therefore I think it's extremely, wildly wrong to expect "we can make AI agents a lot more [reflectively goal stable with predictable goals and safe failure-modes] than humans typically are".
but, Claude sure as hell seems to
Why do you even consider this relevant evidence?
[Edit 25/02/25:
To expand on this last point, you're saying:
If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.
It seems like you're doing the same dichotomy here, where you say it's either pretending or it's aligned. I know that they will act like they care about the law. We both see the same evidence, I'm not just ignoring it. I just think you're interpreting this evidence poorly, perhaps by being insufficiently careful about "alignment" as meaning "reflectively goal stable with predictable goals and predictable instabilities" vs "acts like a law-abiding citizen at the moment".
]
to the extent developers succeed in creating faithful simulators
There's a crux I have with Ryan which is "whether future capabilities will allow data-efficient long-horizon RL fine-tuning that generalizes well". As of last time we talked about it, Ryan says we probably will, I say we probably won't.
If we have the kind of generalizing ML that we can use to make faithful simulations, then alignment is pretty much solved. We make exact human uploads, and that's pretty much it. This is one end of the spectrum on this question.
There are weaker versions, which I think are what Ryan believes will be possible. In a slightly weaker case, you don't get something anywhere close to a human simulation, but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).
But I think the evidence is against this. Long horizon tasks are currently difficult to successfully train on, unless you have dense intermediate feedback. Capabilities progress in the last decade has come from leaning heavily on dense intermediate feedback.
I expect long-horizon RL to remain pretty low data efficiency (i.e. take a lot of data before it generalizes well OOD).
My guess is that your core mistake is here:
When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.
Obviously, all agents having undergone training to look "not egregiously misaligned", will not look egregiously misaligned. You seem to be assuming that there is mostly a dichotomy between "not egregiously misaligned" and "conniving to satisfy some other set of preferences". But there are a lot of messy places in between these two positions, including "I'm not really sure what I want" or <goals-that-are-highly-dependent-on-the-environment-e.g.-status-seeking>.
All AIs you train will be somewhere in this in-between messy place. What you are hoping for is that if you put a group of these together, they will "self-correct" and force/modify each other to keep pursuing the same goals-you-trained-them-to-look-like-they-wanted?
Is this basically correct? If so, this won't work, simply because this is absolute chaos and the goals-you-trained-them-to-look-like-they-wanted aren't enough to steer this chaotic system where you want it to go.
are these agents going to do sloppy research?
I think there are a few places where you somewhat misread your critics when they say "slop". It doesn't mean "bad". It means something closer to "very subtly bad in a way that is difficult to distinguish from quality work". Where the second part is the important part.
E.g. I find it difficult to use LLMs to help me do math or code weird algorithms, because they are good enough at outputting something that looks right. It feels like it takes longer to detect and fix their mistakes than it does to do it from scratch myself.
(Some) acceleration doesn't require being fully competitive with humans while deference does.
Agreed. The invention of calculators was useful for research, and the invention of more tools will also be helpful.
I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting new ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.
Maybe some kinds of "safety work", but real alignment involves a human obtaining a deep understanding of intelligence and agency. The path to this understanding probably isn't made of >90% moderate duration ML tasks. (You need >90% to get 5-10x because of communication costs; it's often necessary to understand details of experiment implementation to get insight from them. There are also costs from the AI making mistakes and not quite doing the experiments right.)
(vague memory from the in person discussions we had last year, might be inaccurate):
jeremy!2023: If you're expecting AI to be capable enough to "accelerate alignment research" significantly, it'll need to be a full-blown agent that learns stuff. And that'll be enough to create alignment problems because data-efficient long-horizon generalization is not something we can do.
joshc!2023: No way, all you need is AI with stereotyped skills. Imagine how fast we could do interp experiments if we had AIs that were good at writing code but dumb in other ways!
...
joshc!now:
Training AI agents so they can improve their beliefs (e.g. do research) as well as the best humans can.
Seems like the reasoning behind your conclusions has changed a lot since we talked, but the conclusions haven't changed much?
If you were an AI: Negative reward, probably a bad belief updating process.
In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes?
We can't reduce the domain of the utility function without destroying some information. If we tried to change the domain variables from [g, h, shutdown] to [g, shutdown], we wouldn't get the desired behaviour. Maybe you have a particular translation method in mind?
I don't mess up the medical test because true information is instrumentally useful to me, given my goals.
Yep that's what I meant. The goal u is constructed to make information about h instrumentally useful for achieving u, even if g is poorly specified. The agent can prefer h over ~h or vice versa, just as we prefer a particular outcome of a medical test. But because of the instrumental (information) value of the test, we don't interfere with it.
I think the utility indifference genre of solutions (which try to avoid preferences between shutdown and not-shutdown) are unnatural and create other problems. My approach allows the agent to shutdown even if it would prefer to be in the non-shutdown world.
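To gesture at the shape I have in mind (this is my compressed toy version, not necessarily the exact construction in the post): the goal is conditional on the underlying fact h that the validation step is designed to reveal, not on the test readout itself.

```latex
% Toy conditional goal over the variables (g, h, shutdown), where h is the
% underlying fact the validation step measures (e.g. "humans would actually
% endorse this plan"). Illustrative shape only.
\[
u \;=\;
\begin{cases}
g & \text{if } h \quad \text{(plan genuinely acceptable: go ahead and pursue } g\text{)}\\[2pt]
\mathbf{1}[\text{shutdown}] & \text{if } \lnot h \quad \text{(plan not acceptable: shut down)}
\end{cases}
\]
% Because u conditions on h itself, tampering with the validation readout doesn't
% change which branch applies; it only costs the agent its knowledge of which
% branch it's in -- the same reason you don't tamper with a medical test.
```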
With regards to the agent believing that it's impossible to influence the probability that its plan passes validation
This is a misinterpretation. The agent entirely has true beliefs. It knows it could manipulate the validation step. It just doesn't want to, because of the conditional shape of its goal. This is a common behaviour among humans, for example you wouldn't mess up a medical test to make it come out negative, because you need to know the result in order to know what to do afterwards.
I propose: the best planners must break the beta.
Because if a planner is going to be the best, it needs to be capable of finding unusual (better!) plans. If it's capable of finding those, there's ~no benefit of knowing the conventional wisdom about how to do it (climbing slang: beta).
Edit: or maybe: good planners don't need beta?
I think you're wrong to be psychoanalysing why people aren't paying attention to your work. You're overcomplicating it. Most people just think you're wrong upon hearing a short summary, and don't trust you enough to spend time learning the details. Whether your scenario is important or not, from your perspective it'll usually look like people are bouncing off for bad reasons.
For example, I read the executive summary. For several shallow reasons,[1] the scenario seemed unlikely and unimportant. I didn't expect there to be better arguments further on. So I stopped. Other people have different world models and will bounce off for different reasons.
Which isn't to say it's wrong (that's just my current weakly held guess). My point is just that even if you're correct, the way it looks a priori to most worldviews is sufficient to explain why people are bouncing off it and not engaging properly.
Perhaps I'll encounter information in the future that indicates my bouncing off was a mistake, and I'll go back.
- ^
There are a couple of layers of maybes, so the scenario doesn't seem likely. I expect power to be more concentrated. I expect takeoff to be faster. I expect capabilities to have a high cap. I expect alignment to be hard for any goal. Something about maintaining a similar societal structure without various chaotic game-board-flips seems unlikely. The goals-instilled-in-our-replacements are pretty specific (institution-aligned), and pretty obviously misaligned from overall human flourishing. Sure humans are usually myopic, but we do sometimes consider the consequences and act against local incentives.
I don't know whether these reasons are correct, or how well you've argued against them. They're weakly held and weakly considered, so I wouldn't have usually written them down. They are just here to make my point more concrete.
The description of how sequential choice can be defined is helpful, I was previously confused by how this was supposed to work. This matches what I meant by preferences over tuples of outcomes. Thanks!
We'd incorrectly rule out the possibility that the agent goes for (B+,B).
There's two things we might want from the idea of incomplete preferences:
1. To predict the actions of agents.
2. Because complete agents behave dangerously sometimes, and we want to design better agents with different behaviour.
I think modelling an agent as having incomplete preferences is great for (1). Very useful. We make better predictions if we don't rule out the possibility that the agent goes for B after choosing B+. I think we agree here.
For (2), the relevant quote is:
As a general point, you can always look at a decision ex post and back out different ways to rationalise it. The nontrivial task is here prediction, using features of the agent.
If we can always rationalise a decision ex post as being generated by a complete agent, then let's just build that complete agent. Incompleteness isn't helping us, because the behaviour could have been generated by complete preferences.
Perhaps I'm misusing the word "representable"? But what I meant was that any single sequence of actions generated by the agent could also have been generated by an outcome-utility maximizer (that has the same world model). This seems like the relevant definition, right?
That's not right
Are you saying that my description (following) is incorrect?
[incomplete preferences w/ caprice] would be equivalent to 1. choosing the best policy by ranking them in the partial order of outcomes (randomizing over multiple maxima), then 2. implementing that policy without further consideration.
Or are you saying that it is correct, but you disagree that this implies that it is "behaviorally indistinguishable from an agent with complete preferences"? If this is the case, then I think we might disagree on the definition of "behaviorally indistinguishable"? I'm using it like: If you observe a single sequence of actions from this agent (and knowing the agent's world model), can you construct a utility function over outcomes that could have produced that sequence.
Or consider another example. The agent trades A for B, then B for A, then declines to trade A for B+. That's compatible with the Caprice rule, but not with complete preferences.
This is compatible with a resolute outcome-utility maximizer (for whom A is a maximum). There's no rule that says an agent must take the shortest route to the same outcome (right?).
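A minimal check of that claim, with toy numbers of my own choosing:

```python
# The observed sequence "trade A->B, trade B->A, decline A->B+" is consistent with
# a *complete* outcome-utility maximizer using resolute choice (only the terminal
# outcome matters, not the route taken to it). Toy utilities:

utility = {"A": 1.0, "B+": 0.6, "B": 0.5}    # complete preferences, A on top

observed_path = ["A", "B", "A"]               # trades A->B, then B->A
declined_offer = "B+"                         # then declines A->B+

terminal = observed_path[-1]
reachable_terminals = {"A", "B", "B+"}        # outcomes some policy could end on

# The terminal outcome of the observed behaviour is utility-maximal...
assert terminal == max(reachable_terminals, key=utility.get)
# ...and declining the final trade is exactly what a maximizer of `utility` does.
assert utility[terminal] > utility[declined_offer]
print("rationalizable by a complete outcome-utility maximizer (resolute choice)")
```

The only way to rule this out is an extra assumption like "each individual trade must locally increase utility", which is exactly the step-by-step rule a resolute chooser rejects.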
As Gustafsson notes, if an agent uses resolute choice to avoid the money pump for cyclic preferences, that agent has to choose against their strict preferences at some point.
...
There's no such drawback for agents with incomplete preferences using resolute choice.
Sure, but why is that a drawback? It can't be money pumped, right? Agents following resolute choice often choose against their local strict preferences in other decision problems. (E.g. Newcomb's). And this is considered an argument in favour of resolute choice.
I think it's important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that's not quite the framing I'd use. It's not taking a "good" machine and breaking it, it's taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.
My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).
Hmm good point. Looking at your dialogues has changed my mind, they have higher karma than the ones I was looking at.
You might also be unusual on some axis that makes arguments easier. It takes me a lot of time to go over people's words and work out what beliefs are consistent with them. And the inverse, translating model to words, also takes a while.
Dialogues are more difficult to create (if done well between people with different beliefs), and are less pleasant to read, but are often higher value for reaching true beliefs as a group.
Dialogues seem under-incentivised relative to comments, given the amount of effort involved. Maybe they would get more karma if we could vote on individual replies, so it's more like a comment chain?
This could also help with skimming a dialogue because you can skip to the best parts, to see whether it's worth reading the whole thing.
The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X,Y,Z components of the algorithm, and X passes (e.g.) beliefs to Y in format b, and Z can be viewed as a function that takes information in format w and links it with... etc. And infrabayes might be the theory you use to explain what some of the internal datastructures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and in normal AI and ML theory) potentially has relevance, but it's not super clear which bits are most likely to be directly useful.
This seems like the ultimate goal of interp research (and it's a good goal). Or, I think the current story for heuristic arguments is using them to "explain" a trained neural network by breaking it down into something more like an X,Y,Z components explanation.
At this point, we can analyse the overall AI algorithm, and understand what happens when it updates its beliefs radically, or understand how its goals are stored and whether they ever change. And we can try to work out whether the particular structure will change itself in bad-to-us ways if it could self-modify. This is where it looks much more theoretical, like theoretical analysis of algorithms.
(The above is the "understood" end of the axis. The "not-understood" end looks like making an AI with pure evolution, with no understanding of how it works. There are many levels of partial understanding in between).
This kind of understanding is a prerequisite for the scheme in my post. This scheme could be implemented by modifying a well-understood AI.
Also what is its relation to natural language?
Not sure what you're getting at here.
Fair enough, good points. I guess I classify these LLM agents as "something-like-an-LLM that is genuinely creative", at least to some extent.
Although I don't think the first example is great, seems more like a capability/observation-bandwidth issue.
I'm not sure how this is different from the solution I describe in the latter half of the post.
Great comment, agreed. There was some suggestion of (3), and maybe there was too much. I think there are times when expectations about the plan are equivalent to literal desires about how the task should be done. For making coffee, I expect that it won't create much noise. But also, I actually want the coffee-making to not be particularly noisy, and if it's the case that the first plan for making coffee also creates a lot of noise as a side effect, this is a situation where something in the goal specification has gone horribly wrong (and there should be some institutional response).
Yeah I think I remember Stuart talking about agents that request clarification whenever they are uncertain about how a concept generalizes. That is vaguely similar. I can't remember whether he proposed any way to make that reflectively stable though.
From the perspective of this post, wouldn't natural language work a bit as a redundancy specifier in that case and so LLMs are more alignable than RL agents?
LLMs in their current form don't really cause Edge Instantiation problems. Plausibly this is because they internally implement many kinds of regularization toward "normality" (and also kinda quantilize by default). So maybe yeah, I think I agree with your statement in the sense that I think you intended it, as it refers to current technology. But it's not clear to me that this remains true if we made something-like-an-LLM that is genuinely creative (in the sense of being capable of finding genuinely-out-of-the-box plans that achieve a particular outcome). It depends on how exactly it implements its regularization/redundancy/quantilization and whether that implementation works for the particular OOD tasks we use it for.
Ultimately I don't think LLM-ish vs RL-ish will be the main alignment-relevant axis. RL trained agents will also understand natural language, and contain natural-language-relevant algorithms. Better to focus on understood vs not-understood.
Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it's reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn't clear that it won't self-modify (but it's hard to tell).
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”.
But there is a principled distinction. The distinction is whether the plan exploits differences between the goal specification and our actual goal. This is a structural difference, and we can detect it using information about our actual goal.
So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception.
My proposal is usually an exception to this, because it takes advantage of the structural difference between the two cases. The trick is that the validation set only contains things that we actually want. If it were to contain extra constraints beyond what we actually want, then yeah that creates an alignment tax.
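Here's roughly how I picture that structural check working (a simplified sketch of the shape of the proposal; all names and values below are made up):

```python
# Toy "plan validation" filter. The planner optimizes the (imperfect) goal
# specification; a held-out validation set encodes only things we actually want.

def propose_plans(goal_spec):
    # Stand-in for a powerful planner: candidate plans ranked by how well they
    # score on the specification (which may be exploitable).
    return [
        {"spec_score": 9.9, "mother_out_of_building": True, "building_intact": False},
        {"spec_score": 7.2, "mother_out_of_building": True, "building_intact": True},
    ]

def passes_validation(plan, validation_checks):
    # The checks contain only things we actually want, so rejecting a plan here
    # never blocks a genuinely good out-of-the-box solution.
    return all(check(plan) for check in validation_checks)

validation_checks = [lambda p: p["building_intact"]]

chosen = next((p for p in propose_plans("get my mother out of the building")
               if passes_validation(p, validation_checks)), None)
print(chosen)  # the spec-exploiting plan is filtered out; the second plan is chosen
```

The important asymmetry is that the planner never optimizes against the validation checks, and those checks encode only things we actually want, so filtering can block Goodharting without blocking genuinely clever solutions.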
The Alice and Bob example isn't a good argument against the independence axiom. The combined agent can be represented using a fact-conditional utility function. Include the event "get job offer" in the outcome space, so that the combined utility function is a function of that fact.
E.g.
Bob {A: 0, B: 0.5, C: 1}
Alice {A: 0.3, B: 0, C: 0}
Should merge to become
AliceBob {Ao: 0, Bo: 0.5, Co: 1, A¬o: 0.3, B¬o: 0, C¬o: 0}, where o="get job offer".
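Spelled out as a toy computation (using the same numbers as above):

```python
# Fact-conditional utility: the merged agent's utility is an ordinary function
# of (outcome, o), where o = "get job offer". Toy version of the merge above.

u_bob   = {"A": 0.0, "B": 0.5, "C": 1.0}
u_alice = {"A": 0.3, "B": 0.0, "C": 0.0}

def u_alicebob(outcome, o):
    # In the o branch the merged agent has Bob's utilities, in the ¬o branch
    # Alice's; no mixing of probabilities into the preferences is needed.
    return u_bob[outcome] if o else u_alice[outcome]

def expected_utility(lottery, p_offer):
    # lottery: {outcome: probability}; beliefs enter only through the expectation.
    return sum(p * (p_offer * u_alicebob(x, True) + (1 - p_offer) * u_alicebob(x, False))
               for x, p in lottery.items())

print(expected_utility({"A": 0.5, "C": 0.5}, p_offer=0.3))
```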
This is a far more natural way to combine agents. We can avoid the ontologically weird mixing of probabilities and preference implied by having preference () and also . Like... what does a geometrically rational agent actually care about, and why do its preferences change depending on its own beliefs and priors? A fact-conditional utility function is ontologically cleaner. Agents care about events in the world (potentially in different ways across branches of possibility, but it's still fundamentally caring about events).
This removes all the appeal of geometric rationality for me. The remaining intuitive appeal comes from humans having preferences that are logarithmic in most resources, which is more simply represented as one utility function rather than as a geometric average of many.
Excited to attend, the 2023 conference was great!
Can we submit talks?
Yeah I can see how Scott's quote can be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn't necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can kinda do so. This can happen during RLHF. Properties of this behaviour will be absorbed from pretraining data, including moral systems. I don't know how Scott is imagining this, but it needn't be an inner homunculus that has consistent goals.
I think the thread below with Daniel and Evan and Ryan is good clarification of what people historically believed (which doesn't put 0 probability on 'inner homunculi', but also didn't consider it close to being the most likely way that scheming consequentialist agents could be created, which is what Alex is referring to[1]). E.g. Ajeya is clear at the beginning of this post that the training setup she's considering isn't the same as pretraining on a myopic prediction objective.
- ^
When he says 'I think there's a ton of wasted/ungrounded work around "avoiding schemers", talking about that as though we have strong reasons to expect such entities.'
but his takes were probably a little more predictably unwelcome in this venue
I hope he doesn't feel his takes are unwelcome here. I think they're empirically very welcome. His posts seem to have a roughly similar level of controversy and popularity as e.g. so8res. I'm pretty sad that he largely stopped engaging with lesswrong.
There's definitely value to being (rudely?) shaken out of lazy habits of thinking [...] and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.
Yeah I agree, that's why I like to read Alex's takes.
Really appreciate dialogues like this. This kind of engagement across worldviews should happen far more, and I'd love to do more of it myself.[1]
Some aspects were slightly disappointing:
- Alex keeps putting (inaccurate) words in the mouths of people he disagrees with, without citation. E.g.
- 'we still haven't seen consistent-across-contexts agency from pretrained systems, a possibility seriously grappled with by eg The Parable of Predict-O-Matic).'
- That post was describing a very different kind of AI than generative language models. In particular, it is explicitly designed to minimize long run prediction error.[2] In fact, the surrounding posts in the sequence discuss myopia and suggest myopic algorithms might be more fundamental/incentivised by default.
- 'I think this is a better possible story than the "SGD selects for simplicity -> inner-goal structure" but I also want to note that the reason you give above is not the same as the historical supports offered for the homunculus.'
- "I think there's a ton of wasted/ungrounded work around "avoiding schemers", talking about that as though we have strong reasons to expect such entities. Off the top of my head: Eliezer and Nate calling this the "obvious result" of what you get from running ML processes. Ajeya writing about schemers, Evan writing about deceptive alignment (quite recently!), Habryka and Rob B saying similar-seeming things on Twitter" and "Again, I'm only critiquing the within-forward-pass version"
- I think you're saying here that all these people were predicting consistent-across-contexts inner homunculi from pretraining near-term LLMs? I think this is a pretty extreme strawman. In particular, most of their risk models (iirc) involve people explicitly training for outcome-achieving behaviour.
- 'And so people wasted a lot of time, I claim, worrying about that whole "how can I specify 'get my mother out of the building' to the outcome pump" thing'
- People spent time thinking about how to mitigate reward hacking? Yes. But that's a very reasonable problem to work on, with strong empirical feedback loops. Can you give any examples of people wasting time trying to specify 'get my mother out of the building'? I can't remember any. How would that even work?
- "And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude). "
- Who predicted this? You're making up bad predictions. Eliezer in particular has been pretty clear that he doesn't expect evidence of this form.
- Alex seemed to occasionally enjoy throwing out insults sideways toward third parties.
- E.g. "the LW community has largely written fanfiction alignment research". I think communication between the various factions would go better if statements like this were written without deliberate intention to insult. It could have just been "the LW community has been largely working from bad assumptions".
But I'm really glad this was published, I learned something about both Oliver and Alex's models, and I'd think it was very positive even if there were more insults :)
- ^
If anyone is interested?
- ^
Quote from the post: "Predict-O-Matic will be objective. It is a machine of prediction, is it not? Its every cog and wheel is set to that task. So, the answer is simple: it will make whichever answer minimizes projected predictive error. There will be no exact ties; the statistics are always messy enough to see to that. And, if there are, it will choose alphabetically."
- ^
Relevant quote from Evan in that post:
"Question: Yeah, so would you say that, GPT-3 is on the extreme end of world modeling. As far as what it's learned in this training process?
What is GPT-3 actually doing? Who knows? Could it be the case for GPT-3 that as we train larger and more powerful language models, doing pre-training will eventually result in a deceptively aligned model? I think that’s possible. For specifically GPT-3 right now, I would argue that it looks like it’s just doing world modeling. It doesn’t seem like it has the situational awareness necessary to be deceptive. And, if I had to bet, I would guess that future language model pre-training will also look like that and won’t be deceptive. But that’s just a guess, and not a super confident one.
The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But that’s not fully clear—prediction can still be quite complex.
Also, this all potentially changes if you start doing fine-tuning, like RLHF (reinforcement learning from human feedback). Then what you’re trying to get it to do might be quite complex—something like “maximize human approval.” If it has to learn a goal like that, learning the right proxies becomes a lot harder."
Tsvi has many underrated posts. This one was rated correctly.
I didn't previously have a crisp conceptual handle for the category that Tsvi calls Playful Thinking. Initially it seemed a slightly unnatural category. Now it's such a natural category that perhaps it should be called "Thinking", and other kinds should be the ones with a modifier (e.g. maybe Directed Thinking?).
Tsvi gives many theoretical justifications for engaging in Playful Thinking. I want to talk about one because it was only briefly mentioned in the post:
Your sense of fun decorrelates you from brain worms / egregores / systems of deference, avoiding the dangers of those.
For me, engaging in intellectual play is an antidote to political mindkilledness. It's not perfect. It doesn't work for very long. But it does help.
When I switch from intellectual play to a politically charged topic, there's a brief period where I'm just.. better at thinking about it. Perhaps it increases open-mindedness. But that's not it. It's more like increased ability to run down object-level thoughts without higher-level interference. A very valuable state of mind.
But this isn't why I play. I play because it's fun. And because it's natural? It's in our nature.
It's easy to throw this away under pressure, and I've sometimes done so. This post is a good reminder of why I shouldn't.