I think we currently do not have good gears-level models of many of the important questions of AI/cognition/alignment, and I think the way to get there is by treating it as a software/physicalist/engineering problem, not presupposing an already higher-level agentic/psychological/functionalist framing.
Here are two ways that a high-level model can be wrong:
- It isn't detailed enough, but once you learn the detail it adds up to basically the same picture. E.g. Newtonian physics, ideal gas laws. When you get a more detailed model, you learn more about which edge-cases will break it. But the model basically still works, and is valuable for working out the more detailed model.
- It's built out of confused concepts. E.g. free will, consciousness (probably), many ways of thinking about personal identity, four humors model. We're basically better off without this kind of model and should start from scratch.
It sounds like you're saying high-level agency-as-outcome-directed is wrong in the second way? If so, I disagree, it looks much more like the first way. I don't think I understand your beliefs well enough to argue about this, maybe there's something I should read?
I have a discomfort that I want to try to gesture at:
Are you ultimately wanting to build a piece of software that solves a problem so difficult that it needs to modify itself? My impression from the post is that you are thinking about this level of capability in a distant way, and mostly focusing on much earlier and easier regimes. I think it's probably very easy to work on legible low-level capabilities without making any progress on the regime that matters.
To me it looks important for researchers to have this ultimate goal constantly in their mind, because there are many pathways off-track. Does it look different to you?
Ultimately, this is a governance problem, not a technical problem. The choice to choose illegible capabilities is a political one.
I think this is a bad place to rely on governance, given the fuzziness of this boundary and the huge incentive toward capability over legibility. Am I right in thinking that you're making a large-ish gamble here on the way the tech tree shakes out (such that it's easy to see a legible-illegible boundary, and the legible approaches are competitive-ish) and also the way governance shakes out (such that governments decide that e.g. assigning detailed blame for failures is extremely important and worth delaying capabilities)?
I'm glad you're doing ambitious things, and I'm generally a fan of trying to understand problems from scratch in the hope that they dissolve or become easier to solve.
Literally compute and manpower. I can't afford the kind of cluster needed to even begin a pretraining research agenda, or to hire a new research team to work on this. I am less bottlenecked on the theoretical side atm, because I need to run into a lot of bottlenecks from actual grounded experiments first.
Why would this be a project that requires large scale experiments? Looks like something that a random PhD student with two GPUs could maybe make progress on. Might be a good problem to make a prize for even?
because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future
I personally doubt that this is true, which is maybe the crux here.
Would you like to do a dialogue about this? To me it seems clearly true in exactly the same way that having more time to pursue a goal makes it more likely you will achieve that goal.
It's possible another crux is related to Goodharting, whose danger I think you are exaggerating. When an agent actually understands what it wants, and/or understands the limits of its understanding, then Goodhart is easy to mitigate, and it should try hard to achieve its goals (i.e. optimize a metric).
There are multiple ways to interpret "being an actual human". I interpret it as pointing at an ability level.
"the task GPTs are being trained on is harder" => the prediction objective doesn't top out at (i.e. the task has more difficulty in it than).
"than being an actual human" => the ability level of a human (i.e. the task of matching the human ability level at the relevant set of tasks).
Or as Eliezer said:
I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.
In different words again: the tasks GPTs are being incentivised to solve aren't all solvable at a human level of capability.
You almost had it when you said:
- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'. But this comparison does not seem to be particularly informative.
It's more accurate if I edit it to:
- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina [text] well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'.
You say it's not particularly informative. Eliezer responds by explaining the argument it responds to, which provides the context in which this is an informative statement about the training incentives of a GPT.
The OP argument boils down to: the text prediction objective doesn't stop incentivizing higher capabilities once you get to human level capabilities. This is a valid counter-argument to: GPTs will cap out at human capabilities because humans generated the training data.
Your central point is:
Where GPT and humans differ is not some general mathematical fact about the task, but differences in what sensory data is a human and GPT trying to predict, and differences in cognitive architecture and ways how the systems are bounded.
You are misinterpreting the OP by thinking it's about comparing the mathematical properties of two tasks, when it was just pointing at the loss gradient of the text prediction task (at the location of a ~human capability profile). The OP works through text prediction sub-tasks where it's obvious that the gradient points toward higher-than-human inference capabilities.
You seem to focus too hard on the minima of the loss function:
notice that “what would the loss function like the system to do” in principle tells you very little about what the system will do
You're correct to point out that the minima of a loss function don't tell you much about the actual loss that could be achieved by a particular system. Like you say, the particular boundedness and cognitive architecture are more relevant to this question. But this is irrelevant to the argument being made, which is about whether the text prediction objective stops incentivising improvements above human capability.
The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning
I think a better lesson to learn is that communication is hard, and therefore we should try not to be too salty toward each other.
I sometimes think of alignment as having two barriers:
- Obtaining levers that can be used to design and shape an AGI in development.
- Developing theory that predicts the effect of your design choices.
My current understanding of your agenda, in my own words:
You're trying to create a low-capability AI paradigm that has way more levers. This paradigm centers on building useful systems by patching together LLM calls. You're collecting a set of useful tactics for doing this patching. You can rely on tactics in a similar way to how we rely on programming language features, because they are small and well-tested-ish. (1 & 2)
As new tactics are developed, you're hoping that expertise and robust theories develop around building systems this way. (3)
This by itself doesn't scale to hard problems, so you're trying to develop methods for learning and tracking knowledge/facts that interface with the rest of the system while remaining legible. (4)
Maybe with some additional tools, we build a relatively-legible emulation of human thinking on top of this paradigm. (5)
Have I understood this correctly?
I feel like the alignment section of this is missing. Is the hope that better legibility and experience allows us to solve the alignment problems that we expect at this point?
Maybe it'd be good to name some speculative tools/theory that you hope to have been developed for shaping CoEms, then say how they would help with some of:
- Unexpected edge cases in value specification
- Goal stability across ontology shifts
- Reflective stability of goals
- Optimization daemons or simpler self-reinforcing biases
- Maintaining interruptibility against instrumental convergence
Most alignment research skips straight to trying to resolve issues like these, at least in principle, and then often backs off to develop a relevant theory. I can see why you might want to do the levers part first, and have theory develop along with experience building things. But it's risky to do the hard part last.
but because the same solutions that will make AI systems beneficial will also make them safer
This is often not true, and I don't think your paradigm makes it true. E.g. often we lose legibility to increase capability, and that is plausibly also true during AGI development in the CoEm paradigm.
In practice, sadly, developing a true ELM is currently too expensive for us to pursue
Expensive why? Seems like the bottleneck here is theoretical understanding.
Yeah I read that prize contest post, that was much of where I got my impression of the "consensus". It didn't really describe which parts you still considered valuable. I'd be curious to know which they are? My understanding was that most of the conclusions made in that post were downstream of the Landauer limit argument.
Could you explain or directly link to something about the 4x claim? Seems wrong. Communication speed scales with distance, not area.
Jacob Cannell's brain efficiency post
I thought the consensus on that post was that it was mostly bullshit?
These seem right, but more importantly I think it would eliminate investing in new scalable companies. Or dramatically reduce it in the 50% case. So there would be very few new companies created.
(As a side note: Maybe our response to this proposal was a bit cruel. It might have been better to just point toward some econ reading material).
would hopefully include many people who understand that understanding constraints is key and that past research understood some constraints.
Good point, I'm convinced by this.
build on past agent foundations research
I don't really agree with this. Why do you say this?
That's my guess at the level of engagement required to understand something. Maybe just because when I've tried to use or modify some research that I thought I understood, I always realise I didn't understand it deeply enough. I'm probably anchoring too hard on my own experience here, other people often learn faster than me.
(Also I'm confused about the discourse in this thread (which is fine), because I thought we were discussing "how / how much should grantmakers let the money flow".)
I was thinking "should grantmakers let the money flow to unknown young people who want a chance to prove themselves."
I agree this would be a great program to run, but I want to call it a different lever to the one I was referring to.
The only thing I would change is that I think new researchers need to understand the purpose and value of past agent foundations research. I spent too long searching for novel ideas while I still misunderstood the main constraints of alignment. I expect you'd get a lot of wasted effort if you asked for out-of-paradigm ideas. Instead it might be better to ask people to understand and build on past agent foundations research, then gradually move away if they see other pathways after having understood the constraints. Now I see my work as mostly about trying to run into constraints for the purpose of better understanding them.
Maybe that wouldn't help though, it's really hard to make people see the constraints.
The main things I'm referring to are upskilling or career-transition grants, especially from the LTFF, in the last couple of years. I don't have stats; I'm assuming a lot were given out because I met a lot of people who had received them. Probably a bunch were also given out by the FTX Future Fund.
Also when I did MATS, many of us got grants post-MATS to continue our research. Relatively little seems to have come of these.
How are they falling short?
(I sound negative about these grants but I'm not, and I do want more stuff like that to happen. If I were grantmaking I'd probably give many more of some kinds of safety research grant. But "If a man has an idea just give him money and don't ask questions" isn't the right kind of change imo).
I think I disagree. This is a bandit problem, and grantmakers have tried pulling that lever a bunch of times. There hasn't been any field-changing research (yet). They knew it had a low chance of success so it's not a big update. But it is a small update.
Probably the optimal move isn't cutting early-career support entirely, but having a higher bar seems correct. There are other levers that are worth trying, and we don't have the resources to try every lever.
Also there are more grifters now that the word is out, so the EV is also declining that way.
(I feel bad saying this as someone who benefited a lot from early-career financial support).
My first exposure to rationalists was a Rationally Speaking episode where Julia recommended the movie Locke.
It's about a man pursuing difficult goals under emotional stress using few tools. For me it was a great way to be introduced to rationalism because it showed how a ~rational actor could look very different from a straw Vulcan.
It's also a great movie.
Nice.
A similar rule of thumb I find handy: divide 70 by the growth rate (in percent) to get the implied doubling time. I find it way easier to think about doubling times than growth rates.
E.g. 3% interest rate means 70/3 ≈ 23 year doubling time.
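For anyone wondering where the 70 comes from, the derivation is short (with $r$ the growth rate as a fraction, so 3% means $r = 0.03$):

$$(1+r)^t = 2 \;\Rightarrow\; t = \frac{\ln 2}{\ln(1+r)} \approx \frac{0.693}{r} \approx \frac{70}{100r},$$

using $\ln(1+r) \approx r$ for small $r$, with $100r$ being the rate in percent (70 is just the rounder numerator). Plugging in $r = 0.03$ gives $t \approx 0.693/0.0296 \approx 23.4$ years, matching the 70/3 shortcut.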
I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms.
I would bet against Eliezer being pessimistic about this, if we are assuming the algorithms are deeply-understood enough that we are confident that we can iterate on building AGI. I think there's maybe a problem with the way Eliezer communicates that gives people the impression that he's a rock with "DOOM" written on it.
I think the pessimism comes from there being several currently-unsolved problems that get in the way of "deeply-understood enough". In principle it's possible to understand these problems and hand-build a safe and stable AGI, it just looks a lot easier to hand-build an AGI without understanding them all, and even easier than that to train an AGI without even thinking about them.
I call most of these "instability" problems. Where the AI might for example learn more, or think more, or self-modify, and each of these can shift the context in a way that causes an imperfectly designed AI to pursue unintended goals.
Here are some descriptions of problems in that cluster: optimization daemons, ontology shifts, translating between our ontology and the AI's internal ontology in a way that generalizes, Pascal's mugging, reflectively stable preferences & decision algorithms, reflectively stable corrigibility, and correctly estimating future competence under different circumstances.
Some may be resolved by default along the way to understanding how to build AGI by hand, but it isn't clear. Some are kinda solved already in some contexts.
Intelligence/IQ is always good, but not a dealbreaker as long as you can substitute it with a larger population.
IMO this is pretty obviously wrong. There are some kinds of problem solving that scale poorly with population, just as there are some computations that scale poorly with parallelisation.
E.g. Project Euler problems.
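To make the analogy concrete, here's a toy sketch of my own (nothing to do with any specific Project Euler problem): a cleanly splittable computation speeds up with more workers, while a serial dependency chain gets no benefit no matter how many workers you add.

```python
# Toy cost model: idealized time as a function of problem size and worker count.

def time_to_sum(n_items, n_workers):
    # Summing a big list splits cleanly across workers (embarrassingly parallel).
    return n_items / n_workers

def time_to_iterate(n_steps, n_workers):
    # Iterating x -> f(x) -> f(f(x)) ... can't be split: each step needs the previous one.
    return n_steps

print(time_to_sum(10**6, 1000))      # 1000.0  -> 1000x speedup from 1000 workers
print(time_to_iterate(10**6, 1000))  # 1000000 -> no speedup at all
```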
When I said "problems we care about", I was referring to a cluster of problems that very strongly appear to not scale well with population. Maybe this is an intuitive picture of the cluster of problems I'm referring to.
I buy that such an intervention is possible. But doing it requires understanding the internals at a deep level. You can't expect SGD to implement the patch in a robust way. The patch would need to still be working after 6 months on an impossible problem, in spite of it actively getting in the way of finding the solution!
I'd be curious about why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing. At least for large scale thinking.
I was probably influenced by your ideas! I just (re?)read your post on the topic.
Tbh I think it's unlikely such a sweet spot exists, and I find your example unconvincing. The value of this kind of reflection for difficult problem solving directly conflicts with the "useful" assumption.
I'd be more convinced if you described the task where you expect an AI to be useful (significantly above current humans), and doesn't involve failing and reevaluating high-level strategy every now and then.
Extremely underrated post, I'm sorry I only skimmed it when it came out.
I found 3a,b,c to be strong and well written, a good representation of my view.
In contrast, 3d I found to be a weak argument that I didn't identify with. In particular, I don't think internal conflicts are a good way to explain the source of goal misgeneralization. To me it's better described as just overfitting or misgeneralization.[1] Edge cases in goals are clearly going to be explored by a stepping back process, if initial attempts fail. In particular if attempted pathways continue to fail. Whereas thinking of the AI as needing to resolve conflicting values seems to me to be anthropomorphizing in a way that doesn't seem to transfer to most mind designs.
You also used the word coherent in a way that I didn't understand.
Human intelligence seems easily useful enough to be a major research accelerator if it can be produced cheaply by AI
I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
humans provides a pretty strong intuitive counterexample
It's a good observation that humans seem better at stepping back inside of low-level tasks than at high-level life-purposes. For example, I got stuck on a default path of finishing a neuroscience degree, even though if I had reflected properly I would have realised it was useless for achieving my goals a couple of years earlier. I got got by sunk costs and normality.
However, I think this counterexample isn't as strong as you think it is. Firstly because it's incredibly common for people to break out of a default-path. And secondly because stepping back is usually preceded by some kind of failure to achieve the goal using a particular approach. Such failures occur often at small scales. They occur infrequently in most people's high-level life plans, because such plans are fairly easy and don't often raise flags that indicate potential failure. We want difficult work out of an AI. This implies frequent total failure, and hence frequent high-level stepping back. If it's doing alignment research, this is particularly true.
[1] Like for reasons given in section 4 of the misalignment and catastrophe doc.
Trying to write a new steelman of Matt's view. It's probably incorrect, but seems good to post as a measure of progress:
You believe in agentic capabilities generalizing, but also in additional high-level patterns that generalize and often overpower agentic behaviour. You expect training to learn all the algorithms required for intelligence, but also pick up patterns in the data like "research style", maybe "personality", maybe "things a person wouldn't do" and also build those into the various-algorithms-that-add-up-to-intelligence at a deep level. In particular, these patterns might capture something like "unwillingness to commandeer some extra compute" even though it's easy and important and hasn't been explicitly trained against. These higher level patterns influence generalization more than agentic patterns do, even though this reduces capability a bit.
One component of your model that reinforces this: Realistic intelligence algorithms rely heavily on something like caching training data and this has strong implications about how we should expect them to generalize. This gives an inductive-bias advantage to the patterns you mention, and a disadvantage to think-it-through-properly algorithms (like brute force search, or even human-like thinking).
We didn't quite get to talking about reflection, but this is probably the biggest hurdle in the way of getting such properties to stick around. I'll guess at your response: You think that an intelligence that doesn't-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort on this, reflection could be crushed with some small-ish capability penalty, but massive benefits for safety.
Not much to add, I haven't spent enough time thinking about structural selection theorems.
I'm a fan of making more assumptions. I've had a number of conversations with people who seem to make the mistake of not assuming enough. Sometimes leading them to incorrectly consider various things impossible. E.g. "How could an agent store a utility function over all possible worlds?" or "Rice's theorem/halting problem/incompleteness/NP-hardness/no-free-lunch theorems means it's impossible to do xyz". The answer is always nah, it's possible, we just need to take advantage of some structure in the problem.
Finding the right assumptions is really hard though, it's easy to oversimplify the problem and end up with something useless.
Good point.
What I meant by "updatelessness removes most of the justification" is the reason given here at the very beginning of "Against Resolute Choice". In order to make a money pump that leads the agent in a circle, the agent has to continue accepting trades around a full preference loop. But if it has decided on the entire plan beforehand, it will just do any plan that involves <1 trip around the preference loop. (Although it's unclear how it would settle on such a plan, maybe just stopping its search after a given time). It won't (I think?) choose any plan that does multiple loops, because they are strictly worse.
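A toy version of that, under my own simplified assumptions (a three-good preference cycle and a fixed fee per trade): any plan that completes a full loop ends holding the same good as some shorter plan, but with strictly less money, so an agent choosing among whole plans up front never picks one.

```python
# Goods cycle A -> B -> C -> A; each trade moves one step around the cycle and
# costs a small fee. Compare whole plans (number of trades) chosen up front.
FEE = 0.01
CYCLE = ["A", "B", "C"]

def plan_outcome(n_trades, start="A"):
    final_good = CYCLE[(CYCLE.index(start) + n_trades) % len(CYCLE)]
    return final_good, -(FEE * n_trades)   # (good held at the end, money spent)

for n_trades in range(7):
    print(n_trades, plan_outcome(n_trades))
# Plans that differ by a multiple of 3 trades (i.e. by whole loops) end on the
# same good, but the longer plan has strictly less money, so it's never chosen.
```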
After choosing this plan though, I think it is representable as VNM rational, as you say. And I'm not sure what to do with this. It does seem important.
However, I think Scott's argument here satisfies (a) (b) and (c). I think the independence axiom might be special in this respect, because the money pump for independence is exploiting an update on new information.
I think the problem might be that you've given this definition of heuristic:
A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.
Taking this definition seriously, it's easy to decompose a forward pass into such functions.
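As a concrete illustration of how cheap that definition is to satisfy (my own toy example, not from the post): any MLP layer already decomposes into per-neuron functions that are local, simple, and arithmetic.

```python
# A single MLP layer decomposed into per-neuron "heuristics": each is a local,
# simple arithmetic function (weighted sum + ReLU) of the previous layer's
# outputs. The definition is satisfied trivially, which is why it alone makes
# no predictions.
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 8)), rng.normal(size=4)
x = rng.normal(size=8)                     # previous layer's outputs

def neuron_heuristic(i, inputs):
    return max(0.0, float(W[i] @ inputs + b[i]))

layer_output = np.array([neuron_heuristic(i, x) for i in range(4)])
assert np.allclose(layer_output, np.maximum(W @ x + b, 0.0))
```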
But you have a much more detailed idea of a heuristic in mind. You've pointed toward some properties this might have in your point (2), but haven't put it into specific words.
Some options: A single heuristic is causally dependent on <5 heuristics below and influences <5 heuristics above. The inputs and outputs of heuristics are strong information bottlenecks with a limit of 30 bits. The function of a heuristic can be understood without reference to >4 other heuristics in the same layer. A single heuristic is used in <5 different ways across the data distribution. A model is made up of <50 layers of heuristics. Large arrays of parallel heuristics often output information of the same type.
Some combination of these (or similar properties) would turn the heuristics intuition into a real hypothesis capable of making predictions.
If you don't go into this level of detail, it's easy to trick yourself into thinking that (2) basically kinda follows from your definition of heuristics, when it really really doesn't. And that will lead you to never discover the value of the heuristics intuition, if it is true, and never reject it if it is false.
I've only skimmed this post, but I like it because I think it puts into words a fairly common model (that I disagree with). I've heard "it's all just a stack of heuristics" as an explanation of neural networks and as a claim that all intelligence is this, from several people. (Probably I'm overinterpreting other people's words to some extent, they probably meant a weaker/nuanced version. But like you say, it can be useful to talk about the strong version).
I think you've correctly identified the flaw in this idea (it isn't predictive, it's unfalsifiable, so it isn't actually explaining anything even if it feels like it is). You don't seem to think this is a fatal flaw. Why?
You seem to answer
However, the key interpretability-related claim is that heuristics based decompositions will be human-understandable, which is a more falsifiable claim.
But I don't see why "heuristics based decompositions will be human-understandable" is an implication of the theory. As an extreme counterexample, logic gates are interpretable, but when stacked up into a computer they are ~uninterpretable. It looks to me like you've just tacked an interpretability hypothesis onto a heuristics hypothesis.
Trying to think this through, I'll write a bit of a braindump just in case that's useful:
The futarchy hack can be split into two parts. The first is that conditioning on untaken actions makes most probabilities ill-defined; because there are no incentives to get it right, the market can settle to many equilibria. The second part is that there are various incentives for traders to take advantage of this for their own interests.
With your technique, I think the approach would be to duplicate each trader into two traders with the same knowledge, and make their joint earnings zero-sum.[1]
This removes one explicit incentive for a single trader to manipulate a value to cause a different action to happen, but only if it's doing so to make the distribution easier to predict and thereby improve its score. Potentially there are still other incentives, e.g. if the trader has preferences over the world, and these aren't eliminated.
Why doesn't this happen in LI already? LI is zero sum overall, because there is a finite pool of wealth. But this is shared among traders with different knowledge. If there is a wealthiest trader that has a particular piece of knowledge, it should manipulate actions to reduce variance to get a higher score. So the problem is that it's not zero-sum with respect to each piece of knowledge.
But, the first issue is entirely unresolved. The probabilities that condition on untaken actions will be path-dependent leftovers from the convergence procedure of LI, when the market was more uncertain about which action will be taken. I'd expect these to be fairly reasonable, but they don't have to be.
This reasonableness is coming from something though, and maybe this can be formalized.
[1] You'd have to build a lot more structure into the LI traders to guarantee they can't learn to cooperate and are myopic. But that seems doable. And it's the sort of thing I'd want to do anyway.
I appreciate that you tried. If words are failing us to this extent, I'm going to give up.
How about I assume there is some epsilon such that the probability of an agent going off the rails is greater than epsilon in any given year. Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others? This should reduce the risk to some fraction of epsilon, right?
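A back-of-the-envelope version of that, with purely illustrative numbers: if each of n uncorrelated agents goes off the rails with probability epsilon per year, and any single rogue agent gets contained by the rest, then catastrophe needs at least two simultaneous failures.

```python
# Illustrative only: n uncorrelated agents, each going off the rails with
# probability eps per year; any single rogue agent is contained by the others,
# so catastrophe needs >= 2 simultaneous failures in the same year.

def p_two_or_more_fail(n, eps):
    p_zero = (1 - eps) ** n
    p_one = n * eps * (1 - eps) ** (n - 1)
    return 1 - p_zero - p_one

eps = 0.01
print(eps)                          # 0.01      risk with a single agent
print(p_two_or_more_fail(3, eps))   # ~0.000298 a small fraction of eps
```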
(I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%, because that part seems wrong for quite simple reasons).
You could call this a failure of the AGI’s goal-related systems if you mean with that that the machinery failed to control its external effects in line with internally represented goals.
But this would be a problem with the control process itself.
So it's the AI being incompetent?
Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.
Yeah I think that would be a good response to my argument against premise 2). I've had a quick look at the list of theorems in the paper, I don't know most of them, but the ones I do know don't seem to support the point you're making. So I don't buy it. You could walk me through how one of these theorems is relevant to capping self-improvement of reliability?
For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures.
You don't have to simulate something to reason about it.
E.g. How can AGI code predict how its future code learned from unknown inputs will function in processing subsequent unknown inputs?
Garrabrant induction shows one way of doing self-referential reasoning.
- But what does it mean to correct for failures at the level of local software (bugs, viruses, etc)? What does it mean to correct for failures across some decentralised server network? What does it mean to correct for failures at the level of an entire machine ecosystem (which AGI effectively becomes)?
As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can't destroy the world/country, as a crazy dictator would.
I've reread and my understanding of point 3 remains the same. I wasn't trying to summarize points 1-5, to be clear. And by "goal-related systems" I just meant whatever is keeping track of the outcomes being optimized for.
Perhaps you could point me to my misunderstanding?
In practice, engineers know that complex architectures interacting with the surrounding world end up having functional failures (because of unexpected interactive effects, or noisy interference). With AGI, we are talking about an architecture here that would be replacing all our jobs and move to managing conditions across our environment. If AGI continues to persist in some form over time, failures will occur and build up toward lethality at some unknown rate. Over a long enough period, this repeated potential for uncontrolled failures pushes the risk of human extinction above 99%.
This part is invalid, I think.
My understanding of this argument is: 1) There is an extremely powerful agent, so powerful that if it wanted to it could cause human extinction. 2) There is some risk of its goal-related systems breaking, and this risk doesn't rapidly decrease over time. Therefore the risk adds up over time and converges toward 1.
This argument doesn't work because the two premises won't hold. For 2) An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure. For 1) Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.
So the risk of everyone dying should only come from a relatively short period after an agent (or agents) become powerful enough that killing everyone is an ~easy option.
To me it seems like one important application of this work is to understanding and fixing the futarchy hack in FixDT and in Logical Inductor decision theory. But I'm not sure whether your results can transfer to these settings, because of the requirement that the agents have the same beliefs.
Is there a reason we can't make duplicate traders in LI and have their trades be zero-sum?
I'm generally confused about this. Do you have thoughts?
The non-spicy answer is probably the LTFF, if you're happy deferring to the fund managers there. I don't know what your risk tolerance for wasting money is, but you can check whether they meet it by looking at their track record.
If you have a lot of time you might be able to find better ways to spend money than the LTFF can. (Like if you can find a good way to fund intelligence amplification as Tsvi said).
Are you aware that this is incompatible with Thornley's ideas about incomplete preferences? Thornley's decision rule might choose A.
I retract this part of the comment. I misinterpreted the comment that I linked to. Seems like they are compatible.
This is in contrast to Thornley's rule, which does sometimes choose the bottom path of the money pump, which makes it impossible to represent as a EU maximizer. This seems like real incomplete preferences.
It seems incorrect to me to describe Peterson's argument as formalizing the same counter-argument further (as you do in the paper), given how their proposals seem to have quite different properties and rely on different arguments.
I think I was wrong about this. I misinterpreted a comment made by Thornley, sorry! See here for details.
I think the above money pump works, if the agent sometimes chooses the A path, but I was incorrect in thinking that the caprice rule sometimes chooses the A path.
I misinterpreted one of EJT's comments as saying it might choose the A path. The last couple of days I've been reading through some of the sources he linked to in the original "there are no coherence theorems" post and one of them (Gustafsson) made me realize I was interpreting him incorrectly, by simplifying the decision tree in a way that doesn't make sense. I only realized this yesterday.
Now I think that the caprice rule is essentially equivalent to updatelessness. If I understand correctly, it would be equivalent to 1. choosing the best policy by ranking them in the partial order of outcomes (randomizing over multiple maxima), then 2. implementing that policy without further consideration. And this makes it immune to money pumps and renders any self-modification pointless. It also makes it behaviorally indistinguishable from an agent with complete preferences, as far as I can tell.
The same updatelessness trick seems to apply to all money pump arguments. It's what Scott uses in this post to avoid the independence money pump.
So currently I'm thinking updatelessness removes most of the justification for the VNM axioms (including transitivity!). But I'm confused because updateless policies still must satisfy local properties like "doesn't waste resources unless it helps achieve the goal", which is intuitively what the money pump arguments represent. So there must be some way to recover properties like this. Maybe via John's approach here.
But I'm only maybe 80% sure of my new understanding, I'm still trying to work through it all.
Yeah I'm on board with deontological-injunction shaped constraints. See here for example.
Perhaps instead of "attempting to achieve the goal at any cost" it would be better to say "being willing to disregard conventions and costs imposed on uninvolved parties, if considering those things would get in the way of the pursuit of the goal".
Nah I still disagree. I think part of why I'm interpreting the words differently is because I've seen them used in a bunch of places e.g. the lightcone handbook to describe the lightcone team. And to describe the culture of some startups (in a positively valenced way).
Being willing to be creative and unconventional -- sure, but this is just part of being capable and solving previously unsolved problems. But disregarding conventions that are important for cooperation that you need to achieve your goals? That's ridiculous.
Being willing to impose costs on uninvolved parties can't be what is implied by 'going hard' because that depends on the goals. An agent that cares a lot about uninvolved parties can still go hard at achieving its goals.
I suspect we may be talking past each other here.
Unfortunately we are not. I appreciate the effort you put into writing that out, but that is the pattern that I understood you were talking about, I just didn't have time to write out why I disagreed.
I expect this to continue to be true in the future
This is the main point where I disagree. The reason I don't buy the extrapolation is that there are some (imo fairly obvious) differences between current tech and human-level researcher intelligence, and those differences appear like they should strongly interfere with naive extrapolation from current tech. Tbh I thought things like o1 or alphaproof might cause the people who naively extrapolate from LLMs to notice some of these, because I thought they were simply overanchoring on current SoTA, and since the SoTA has changed I thought they would update fast. But it doesn't seem to have happened much yet. I am a little confused by this.
What observations lead you to suspect that this is a likely failure mode?
I didn't say likely, it's more an example of an issue that comes up so far when I try to design ways to solve other problems. Maybe see here for instabilities in trained systems, or here for more about that particular problem.
I'm going to drop out of this conversation now, but it's been good, thanks! I think there are answers to a bunch of your claims in my misalignment and catastrophe post.
My third link is down the thread from your link. I agree that from an outside view it's difficult to work out who is right. Unfortunately in this case one has to actually work through the details.
That post is clickbait. It only argues that the incompleteness money pump doesn't work. The reasons the incompleteness money pump does work are well summarized at a high level here or here (more specifically here and here, if we get into details).
Humans do seem to have strong preferences over immediate actions.
I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by future goals.
I expect that in multi-agent environments, there is significant pressure towards legibly having these kinds of strong preferences over immediate actions. As such, I expect that that structure of thing will show up in future intelligent agents, rather than being a human-specific anomaly.
Yeah same. Although legible commitments or decision theory can serve the same purpose better, it's probably harder to evolve because it depends on higher intelligence to be useful. The level of transparency of agents to each other and to us seems to be an important factor. Also there's some equilibrium, e.g. in an overly honest society it pays to be a bit more dishonest, etc.
It does unfortunately seem easy and useful to learn rules like honest-to-tribe or honest-to-people-who-can-tell or honest-unless-it's-really-important or honest-unless-I-can-definitely-get-away-with-it.
attempting to achieve their long-term goal at any cost
I think if you remove "at any cost", it's a more reasonable translation of "going hard". It's just attempting to achieve a long-term goal that is hard to achieve. I'm not sure what "at any cost" adds to it, but I keep on seeing people add it, or add monomaniacally, or ruthlessly. I think all of these are importing an intuition that shouldn't be there. "Going hard" doesn't mean throwing out your morality, or sacrificing things you don't want to sacrifice. It doesn't mean being selfish or unprincipled such that people don't cooperate with you. That would defeat the whole point.
It's not a binary choice between "care about process" and "care about outcomes" - it is possible and common to care about outcomes, and also to care about the process used to achieve those outcomes.
Yes!
It does seem to me that "we have a lot of control over the approaches the agent tends to take" is true and becoming more true over time.
No!
I doubt that systems trained with ML techniques have these properties. But I don't think e.g. humans or organizations built out of humans + scaffolding have these properties either
Yeah mostly true probably.
and I have a sneaking suspicion that the properties in question are incompatible with competitiveness.
I'm talking about stability properties like "doesn't accidentally radically change the definition of its goals when updating its world-model by making observations". I agree properties like this don't seem to be on the fastest path to build AGI.
I think you are mischaracterizing my beliefs here.
"almost all agents getting sufficiently high performance on sufficiently hard tasks score high on some metric of coherence."
This seems right to me. Maybe see my comment further up, I think it's relevant to arguments we've had before.
This might be fine if proving things about the internal structure of an agent is overkill and we just care about behavior?
We can't say much about the detailed internal structure of an agent, because there's always a lot of ways to implement an algorithm. But we do only care about (generalizing) behavior, so we only need some very abstract properties relevant to that.
I like this exchange and the clarifications on both sides. I'll add my response:
You're right that coherence arguments work by assuming a goal is about the future. But preferences over a single future timeslice is too specific, the arguments still work if it's multiple timeslices, or an integral over time, or larger time periods that are still in the future. The argument starts breaking down only when it has strong preferences over immediate actions, and those preferences are stronger than any preferences over the future-that-is-causally-downstream-from-those-actions. But even then it could be reasonable to model the system as a coherent agent during the times when its actions aren't determined by near-term constraints, when longer-term goals dominate.
(a relevant part of Eliezer's recent thread is "then probably one of those pieces runs over enough of the world-model (or some piece of reality causally downstream of enough of the world-model) that It can always do a little better by expending one more erg of energy.", but it should be read in context)
Another missing piece here might be: The whole point of building an intelligent agent is that you know more about the future-outcomes you want than you do about the process to get there. This is the thing that makes agents useful and valuable. And it's the main thing that separates agents from most other computer programs.
On the other hand, it does look like the anti-corrigibility results can be overcome by sometimes having strong preferences over intermediate times (i.e. over particular ways the world should go) rather than final-outcomes. This does seem important in terms of alignment solutions. And it takes some steam out of the arguments that go "coherent therefore incorrigible" (or it at least should add some caveats). But this only helps us if we have a lot of control over the preferences&constraints of the agent, and it has a couple of stability properties.
it'll choose something other than A.
Are you aware that this is incompatible with Thornley's ideas about incomplete preferences? Thornley's decision rule might choose A. [Edit: I retract this, it's wrong].
But suppose the agent were next to face a choice
If the choices are happening one after the other, are the preferences over tuples of outcomes? Or are the two choices in different counterfactuals? Or is it choosing an outcome, then being offered another outcome set that it could to replace it with?
VNM is only well justified when the preferences are over final outcomes, not intermediate states. So if your example contains preferences over intermediate states, then it confuses the matter because we can attribute the behavior to those preferences rather than incompleteness.
If you don't agree with Eliezer on 90% of the relevant issues, it's completely unconvincing.
Of course. What kind of miracle are you expecting?
It also doesn't go into much depth on many of the main counterarguments. And doesn't go into enough detail that it even gets close to "logically sound". And it's not as condensed as I'd like. And it skips over a bunch of background. Still, it's valuable, and it's the closest thing to a one-post summary of why Eliezer is pessimistic about the outcome of AGI.
The main value of list of lethalities as a one-stop shop is that you can read it and then be able to point to roughly where you disagree with Eliezer. And this is probably what you want if you're looking for canonical arguments for AI risk. Then you can look further into that disagreement if you want.
Reading the rest of your comment very charitably: It looks like your disagreements are related to where AGI capability caps out, and whether default goals involve niceness to humans. Great!
If I read your comment more literally, my guess would be that you haven't read list of lethalities, or are happy misrepresenting positions you disagree with.
he takes as an assumption that an AGI will be godlike level omnipotent
He specifically defines a dangerous intelligence level as around the level required to design and build a nanosystem capable of building a nanosystem (or any of several alternative example capabilities) (In point 3). Maybe your omnipotent gods are lame.
and that it will default to murderism
This is false. Maybe you are referring to how there isn't any section justifying instrumental convergence? But it does have a link, and it notes that it's skipping over a bunch of background in that area (-3). That would be a different assumption, but if you're deliberately misrepresenting it, then that might be the part that you are misrepresenting.
If you're looking for recent, canonical one-stop-shop, the answer is List of Lethalities.
(Just tried having claude turn the thread into markdown, which seems to have worked):
xuan (ɕɥɛn / sh-yen) @xuanalogue · Sep 3
Should AI be aligned with human preferences, rewards, or utility functions? Excited to finally share a preprint that @MicahCarroll @FranklinMatija @hal_ashton & I have worked on for almost 2 years, arguing that AI alignment has to move beyond the preference-reward-utility nexus!
This paper (https://arxiv.org/abs/2408.16984) is at once a critical review & research agenda. In it we characterize the role of preferences in AI alignment in terms of 4 preferentist theses. We then highlight their limitations, arguing for alternatives that are ripe for further research.
Our paper addresses each of the 4 theses in turn:
- T1. Rational choice theory as a descriptive theory of humans
- T2. Expected utility theory as a normative account of rational agency
- T3. Single-human AI alignment as pref. matching
- T4. Multi-human AI alignment as pref. aggregation
Addressing T1, we examine the limitations of modeling humans as (noisy) maximizers of utility functions (as done in RLHF & inverse RL), which fails to account for:
- Bounded rationality
- Incomplete preferences & incommensurable values
- The thick semantics of human values
As alternatives, we argue for:
- Modeling humans as resource-rational agents
- Accounting for how we do or do not commensurate / trade-off our values
- Learning the semantics of human evaluative concepts, which preferences do not capture
We then turn to T2, arguing that expected utility (EU) maximization is normatively inadequate. We draw on arguments by @ElliotThornley & others that coherent EU maximization is not required for AI agents. This means AI alignment need not be framed as "EU maximizer alignment".
Jeremy Gillen @jeremygillen1 · Sep 4
I'm fairly confident that Thornley's work that says preference incompleteness isn't a requirement of rationality is mistaken. If offered the choice to complete its preferences, an agent acting according to his decision rule should choose to do so.
As long as it can also shift around probabilities of its future decisions, which seems reasonable to me. See Why Not Subagents?
xuan (ɕɥɛn / sh-yen) @xuanalogue · Sep 4
Hi! So first I think it's worth clarifying that Thornley is focusing on what advanced AI agents will do, and is not as committed to saying something about the requirements of rationality (that's our interpretation).
But to the point of whether an agent would/should choose to complete its preferences, see Sami Petersen's more detailed argument on "Invulnerable Incomplete Preferences":
Regarding the trade between (sub)agents argument, I think that only holds in certain conditions -- I wrote a comment on that post discussing one intuitive case where trade is not possible / feasible.
Oops sorry I see you were linking to a specific comment in that thread -- will read, thanks!
Hmm okay, I read the money pump you proposed! It's interesting but I don't buy the move of assigning probabilities to future decisions. As a result, I don't think the agent is required to complete its preferences, but can just plan in advance to go for A+ or B.
I think Petersen's "Dynamic Strong Maximality" decision rule captures that kind of upfront planning (in a way that may go beyond the Caprice rule) while maintaining incompleteness, but I'm not 100% sure.
Yeah, there's a discussion of this in footnote 16 of the Petersen article: https://alignmentforum.org/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1#fnrefr2zvmaagbir
Jeremy Gillen @jeremygillen1 · Sep 4
The move of assigning probabilities to future actions was something Thornley started, not me. Embedded agents should be capable of this (future actions are just another event in the world). Although doesn't work with infrabeliefs, so maybe in that case the money pump could break.
I'm not as familiar with Petersen's argument, but my impression is that it results in actions indistinguishable from those of an EU maximizer with completed preferences (in the resolute choice case). Do you know any situation where it isn't representable as an EU maximizer?
This is in contrast to Thornley's rule, which does sometimes choose the bottom path of the money pump, which makes it impossible to represent as a EU maximizer. This seems like real incomplete preferences.
It seems incorrect to me to describe Peterson's argument as formalizing the same counter-argument further (as you do in the paper), given how their proposals seem to have quite different properties and rely on different arguments.
xuan (ɕɥɛn / sh-yen) @xuanalogue · Sep 4
I wasn't aware of this difference when writing that part of the paper! But AFAIK Dynamic Strong Maximality generalizes the Caprice rule, so that it behaves the same on the single-souring money pump, but does the "right thing" in the single-sweetening case.
Regarding whether DSM-agents are representable as EU maximizers, Petersen has a long section on this in the article (they call this the "Tramelling Concern"):
Jeremy Gillen @jeremygillen1 · 21h
Section 3.1 seems consistent with my understanding. Sami is saying that the DSM-agent arbitrarily chooses a plan among those that result in one of the maximally valued outcomes.
He calls this untrammeled, because even though the resulting actions could have been generated by an agent with complete preferences, it "could have" made another choice at the beginning.
But this kind of "incompleteness" looks useless to me. Intuitively: If AI designers are happy with each of several complete sets of preferences, they could arbitrarily choose one and then put them into an agent with complete preferences.
All Sami's approach does is let the AI do exactly that arbitrary choice just before it starts acting. If you want a locally coherent AI tool, as you discuss later in the paper, this approach won't help you.
You can get the kind of Taskish behavior you want by being very careful about the boundedness and locality of the preferences, and using separate locally bounded Tool AIs each with a separate task (as you describe in the paper).
But the local completeness proposal at the end of 3.2 in your paper will break if it is capable of weak forms of self-modification or commitment, due to the money pump argument.
I do think it's possible to make such local Taskish agents work. You'd just need to exploit the main problem with VNM, which is that it doesn't allow preferences over non-terminal outcomes.
Sorry for being so critical, overall I think the paper is good and all of the arguments I looked at outside of section 3 seem strong. Well done.
xuan (ɕɥɛn / sh-yen) @xuanalogue · 20h
Thank you! I'm still not sure if DSM-agents will have that failure mode. I think the cases that seem most important to think through is how they'll behave under uncertainty about whether shutdown / context switching will happen.
At least in the full Thornley proposal, there are cases where it's clear to me that having complete preferences over trajectory lengths will cause shutdown-avoidance/seeking, and that DSM-agents with incomplete preferences over trajectory lengths will avoid.
Perhaps those DSM-agents can be represented as having complete preferences once they've committed to a particular plan/policy. But if so then it seems like that preference ordering will have to be over something other than trajectories.
Jeremy Gillen @jeremygillen1 · 17h
I've been usually assuming the preferences are over final outcomes, as usual in VNM. Incompleteness is kinda useless if the preferences are over trajectories, because any behavior can be implemented anyway.
I think you're wrong that DSM-agents with incomplete preferences will avoid shutdown-avoidance/seeking. I'd be interested to hear the cases that are clear to you.
I've constructed a toy scenario that combines my money pump with the classic button manipulation scenario (for utility indifference) from section 4.2 of the Corrigibility paper.
The agent first makes an observation. Then it has an opportunity to deliberately trigger shutdown (if it does, then length 1 trajectory). If it doesn't shutdown, it does a length 2 trajectory, whose value depends on the initial observation.
With the observation it could get lucky (90%) or unlucky (10%). Possible outcomes are L, U and S. Preferences are L > U, L||S, U||S. (All different length trajectories are pref-gapped, as per Thornley's proposal).
Let's say that by default the agent deliberately chooses shutdown 50% of the time. (or any arbitrary probability, since, as per Thornley's proposal, stochastic choice between trajectory lengths is indicative of preferential gaps). Then the overall outcome is {L:45%, U:5%, S:50%}.
Then an agent could do strictly better by committing at the beginning to cause shutdown if it observes the unlucky observation, and cause shutdown 44.4% of the time if it observes the lucky observation.
The resulting outcome is {L:50%, U:0%, S:50%}. Since the relative probabilities of each trajectory length haven't changed, this should statewise dominate the outcome of the previous policy, so it is chosen by Thornley's decision rule. It's a pure shift of probability from U to L.
Under DSM (with resolute choice), the agent never implements a plan that is suboptimal from an earlier perspective, so it will also choose beforehand to commit to push the shutdown button if it observes the unlucky observation.
(This is a simplification, since the human button presser is missing, but it all works the same if you add the human and have actions that disconnect or press the button. I left all button control to the agent because it's sufficient to demonstrate that it will manipulate).
xuan (ɕɥɛn / sh-yen) @xuanalogue · 16h
Thanks, I'll think about this! The cases I think where complete preferences over trajectory lengths lead to shutdown avoidance are on pg. 42-43 of the original IPP write-up:
BTW it also seems like Thornley discusses cases like yours as forms of "Managing the News" in Section 15 of that same document, and acknowledges that it is a limitation!
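(Adding here rather than in the thread: a minimal check of the outcome arithmetic in the toy scenario above, with each policy parameterized by its shutdown probability after each observation.)

```python
# Observation: lucky 90%, unlucky 10%. Deliberate shutdown gives S; otherwise
# the length-2 trajectory gives L if lucky, U if unlucky.

def outcome_dist(p_shutdown_if_lucky, p_shutdown_if_unlucky):
    p_lucky, p_unlucky = 0.9, 0.1
    s = p_lucky * p_shutdown_if_lucky + p_unlucky * p_shutdown_if_unlucky
    l = p_lucky * (1 - p_shutdown_if_lucky)
    u = p_unlucky * (1 - p_shutdown_if_unlucky)
    return {"L": round(l, 3), "U": round(u, 3), "S": round(s, 3)}

print(outcome_dist(0.5, 0.5))   # default:   {'L': 0.45, 'U': 0.05, 'S': 0.5}
print(outcome_dist(4/9, 1.0))   # committed: {'L': 0.5,  'U': 0.0,  'S': 0.5}
```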
I had a good discussion with xuan on twitter about incomplete preferences. It was about some of the arguments in the new paper Beyond Preferences in AI Alignment. The arguments were based on the work of EJT and Sami Petersen, both of which I think are completely mistaken.
Am I correct in thinking that with Strong Maximality and resolute choice applied to a single-sweetening money pump, an agent will never take the bottom pathway, because it eliminates the A plan, since the A+ plan is strictly preferred?
If so, what's an example of a decision tree where the actions of an agent with incomplete preferences can't be represented as an agent with complete preferences?
That can't be right in general. Normal Nash equilibria can narrow down predictions of actions, e.g. in a competition game. This is despite each player's decision being dependent on the other player's action.
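To illustrate with a toy 2x2 game of my own choosing (not necessarily the competition game I had in mind): a unique Nash equilibrium pins down the predicted action profile even though each player's best response depends on the other's action.

```python
# Brute-force Nash check for a 2x2 game. Payoffs are (row player, column player).
# The unique equilibrium is ("D", "D"), so the solution concept narrows the
# prediction to one action profile despite the mutual dependence.
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
actions = ["C", "D"]

def is_nash(row, col):
    r, c = payoffs[(row, col)]
    row_best = all(payoffs[(a, col)][0] <= r for a in actions)
    col_best = all(payoffs[(row, a)][1] <= c for a in actions)
    return row_best and col_best

print([profile for profile in payoffs if is_nash(*profile)])  # [('D', 'D')]
```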
I think your comment illustrates my point. You're describing current systems and their properties, then implying that these properties will stay the same as we push up the level of goal-directedness to human-level. But you've not made any comment about why the goal-directedness doesn't affect all the nice tool-like properties.
don't see any obvious reason to expect much more cajoling to be necessary
It's the difference in levels of goal-directedness. That's the reason.
For example, I'm pretty optimistic about 1.8 million years MATS-graduate-level work building on top of other MATS-graduate-level work
I'm not completely sure what happens when you try this. But there seem to be two main options. Either you've got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else's problems. And then to solve those problems, need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes worth of skills along the way.
Or, you've got a large collection of not-quite-agents that aren't really capable of directing research but will often complete a well-scoped task if given it by someone who understands its limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That's a rather small resource. So your speedup isn't massive, it's only moderate, and you're on a time limit and didn't put much effort into getting a head start.