TurnTrout's shortform feed

post by TurnTrout · 2019-06-30T18:56:49.775Z · LW · GW · 581 comments

581 comments

Comments sorted by top scores.

comment by TurnTrout · 2022-06-27T18:34:30.792Z · LW(p) · GW(p)

Rationality exercise: Take a set of Wikipedia articles on topics which trainees are somewhat familiar with, and then randomly select a small number of claims to negate (negating the immediate context as well, so that you can't just syntactically discover which claims were negated). 

For example:

By the time they are born, infants can recognize and have a preference for their mother's voice suggesting some prenatal development of auditory perception.

-> modified to

Contrary to early theories, newborn infants are not particularly adept at picking out their mother's voice from other voices. This suggests the absence of prenatal development of auditory perception.

Sometimes, trainees will be given a totally unmodified article. For brevity, the articles can be trimmed of irrelevant sections. 
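A minimal sketch of how such an exercise could be assembled and scored, assuming each article is stored as a list of claims and a human writer has already supplied negated versions of some of them. The names `Claim`, `build_exercise`, and `score_guess`, and the default 20% rate of unmodified articles, are illustrative assumptions rather than part of the proposal.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    original: str
    negated: Optional[str] = None  # hand-written negation, if a writer supplied one


def build_exercise(claims, n_negate=2, p_unmodified=0.2, rng=None):
    """Return (article_lines, indices_of_negated_claims) for one trainee."""
    rng = rng or random.Random()
    negatable = [i for i, c in enumerate(claims) if c.negated is not None]
    if rng.random() < p_unmodified or not negatable:
        chosen = set()  # sometimes the trainee gets a totally unmodified article
    else:
        chosen = set(rng.sample(negatable, min(n_negate, len(negatable))))
    lines = [c.negated if i in chosen else c.original for i, c in enumerate(claims)]
    return lines, chosen


def score_guess(guessed, actual):
    """Count correct flags, missed negations, and false alarms for one article."""
    guessed, actual = set(guessed), set(actual)
    return {
        "hits": len(guessed & actual),
        "misses": len(actual - guessed),
        "false_alarms": len(guessed - actual),
    }
```

Tracking false alarms separately matters here: a trainee who flags every claim would otherwise look perfect on hits alone.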

Benefits:

  • Addressing key rationality skills. Noticing confusion; being more confused by fiction than fact; actually checking claims against your models of the world.
    • If you fail, either the article wasn't negated skillfully ("5 people died in 2021" -> "4 people died in 2021" is not the right kind of modification), you don't have good models of the domain, or you didn't pay enough attention to your confusion. 
    • Either of the last two is good to learn.
  • Scalable across participants. Many people can learn from each modified article.
  • Scalable across time. Once a modified article has been produced, it can be used repeatedly.
  • Crowdsourceable. You can put out a bounty for good negated articles, run them in a few control groups, and then pay based on some function of how good the article was. Unlike original alignment research or CFAR technique mentoring, article negation requires skills more likely to be present outside of Rationalist circles.

I think the key challenge is that the writer must be able to match the style, jargon, and flow of the selected articles. 

Replies from: Morpheus, yitz, TurnTrout
comment by Morpheus · 2022-06-28T19:30:50.975Z · LW(p) · GW(p)

I remember the magazine I read as a kid (Geolino) had a section like this (something like 7 news stories from around the world, one of which is wrong). It's German only, though I'd guess a similar thing exists in English media?

comment by Yitz (yitz) · 2022-06-28T00:25:39.069Z · LW(p) · GW(p)

This is a lot like Gwern’s idea for a fake science journal club, right? This sounds a lot easier to do though, and might seriously be worth trying to implement.

comment by TurnTrout · 2022-10-06T00:14:45.937Z · LW(p) · GW(p)

Additional exercise: Condition on something ridiculous (like apes having been continuously alive for the past billion years), in addition to your own observations (your life as you've lived it). What must now be true about the world? What parts of your understanding of reality are now suspect?

comment by TurnTrout · 2023-12-17T02:03:27.310Z · LW(p) · GW(p)

In an alternate universe, someone wrote a counterpart to There's No Fire Alarm for Artificial General Intelligence [LW · GW]:

Okay, let’s be blunt here. I don’t think most of the discourse about alignment being really hard is being generated by models of machine learning at all. I don’t think we’re looking at wrong models; I think we’re looking at no models.

I was once at a conference where there was a panel full of famous AI alignment luminaries, and most of the luminaries were nodding and agreeing with each other that of course AGI alignment is really hard and unaddressed by modern alignment research, except for two famous AI luminaries who stayed quiet and let others take the microphone.

I got up in Q&A and said, “Okay, you’ve all told us that alignment is hard. But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a 'non-agentic' system, that you are very confident cannot be done safely and non-agentically in the next two years.”

There was a silence.

Eventually, one person ventured a reply, spoken in a rather more tentative tone than they’d been using to pronounce that SGD would internalize coherent goals into language models. They named “Running a factory competently.”

A few months after that panel, there was unexpectedly a big breakthrough on LLM/management integration.

The point is the silence that fell after my question, and that eventually I only got one reply, spoken in tentative tones. When I asked for concrete feats that were impossible in the next two years, I think that that’s when the luminaries on that panel switched to trying to build a mental model of future progress in AI alignment, asking themselves what they could or couldn’t predict, what they knew or didn’t know. And to their credit, most of them did know their profession well enough to realize that forecasting future boundaries around a rapidly moving field is actually really hard, that nobody knows what will appear on arXiv next month, and that they needed to put wide credibility intervals with very generous upper bounds on how much progress might take place twenty-four months’ worth of arXiv papers later.

(Also, Rohin Shah was present, so they all knew that if they named something insufficiently impossible, Rohin would have DeepMind go and do it.)

The question I asked was in a completely different genre from the panel discussion, requiring a mental context switch: the assembled luminaries actually had to try to consult their rough, scarce-formed intuitive models of progress in AI alignment and figure out what future experiences, if any, their model of the field definitely prohibited within a two-year time horizon. Instead of, well, emitting socially desirable verbal behavior meant to kill that darned optimism around AGI alignment and get some predictable applause from the audience.

I’ll be blunt: I don’t think the confident doom-and-gloom is entangled with non-social reality. If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks, then you ought to be able to say much weaker things that are impossible in two years, and you should have those predictions queued up and ready to go rather than falling into nervous silence after being asked.

Replies from: ryan_greenblatt, TRW, faul_sname, Thane Ruthenis, daniel-kokotajlo, thomas-kwa, quetzal_rainbow
comment by ryan_greenblatt · 2023-12-17T02:58:05.779Z · LW(p) · GW(p)

[Somewhat off-topic]

Eventually, one person ventured a reply, spoken in a rather more tentative tone than they’d been using to pronounce that SGD would internalize coherent goals into language models. They named “Running a factory competently.”

I like thinking about the task "speeding up the best researchers by 30x" (to simplify, let's only include research in purely digital (software only) domains).

To be clear, I am by no means confident that this can't be done safely or non-agentically. It seems totally plausible to me that this can be accomplished without agency except for agency due to the natural language outputs of an LLM agent. (Perhaps I'm at 15% [LW(p) · GW(p)] that this will in practice be done without any non-trivial agency that isn't visible in natural language.)

(As such, this isn't a good answer to the question of "I’d like to know what’s the least impressive task which cannot be done by a 'non-agentic' system, that you are very confident cannot be done safely and non-agentically in the next two years.". I think there probably isn't any interesting answer to this question for me due to "very confident" being a strong condition.)

I like thinking about this task because if we were able to speed up generic research on purely digital domains by this large of an extent, safety research done with this speed up would clearly obsolete prior safety research pretty quickly.

(It also seems likely that we could singularity quite quickly from this point if we wanted to, so it's not clear we'll have a ton of time at this capability level.)

comment by __RicG__ (TRW) · 2023-12-29T18:24:46.245Z · LW(p) · GW(p)

If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks, then you ought to be able to say much weaker things that are impossible in two years, and you should have those predictions queued up and ready to go rather than falling into nervous silence after being asked.

 

Sorry, I might be misunderstanding you (and hope I am), but... I think doomers literally say "Nobody knows what internal motivational structures SGD will entrain into scaled-up networks and thus we are all doomed". The problem is not having the science to confidently say how the AIs will turn out, and not that doomers have a secret method to know that next-token-prediction is evil.

If you meant that doomers are too confident answering the question "will SGD even make motivational structures?" their (and mine) answer still stems from ignorance: nobody knows, but it is plausible that SGD will make motivational structures in the neural networks because it can be useful in many tasks (to get low loss or whatever), and if you think you do know better you should show it experimentally and theoretically in excruciating detail.

 

I also don't see how it logically follows that "If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks" => "then you ought to be able to say much weaker things that are impossible in two years" but it seems to be the core of the post. Even if anyone had the extraordinary model to predict what SGD exactly does (which we, as a species, should really strive for!!) it would still be a different question to predict what will or won't happen in the next two years.

If I reason about my field (physics) the same should hold for a sentence structured like "If your model has the extraordinary power to say how an array of neutral atoms cooled to a few nK will behave when a laser is shone upon them" (which is true) => "then you ought to be able to say much weaker things that are impossible in two years in the field of cold atom physics" (which is... not true). It's a non sequitur.

Replies from: TurnTrout
comment by TurnTrout · 2024-01-01T19:47:24.587Z · LW(p) · GW(p)

If you meant that doomers are too confident answering the question "will SGD even make motivational structures?" their (and mine) answer still stems from ignorance: nobody knows, but it is plausible that SGD will make motivational structures in the neural networks because it can be useful in many tasks (to get low loss or whatever), and if you think you do know better you should show it experimentally and theoretically in excruciating detail.

It would be "useful" (i.e. fitness-increasing) for wolves to have evolved biological sniper rifles, but they did not [LW(p) · GW(p)]. By what evidence are we locating these motivational hypotheses, and what kinds of structures are dangerous, and why are they plausible under the NN prior [LW(p) · GW(p)]? 

I also don't see how it logically follows that "If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks" => "then you ought to be able to say much weaker things that are impossible in two years" but it seems to be the core of the post.

The relevant commonality is "ability to predict the future alignment properties and internal mechanisms of neural networks." (Also, I don't exactly endorse everything in this fake quotation, so indeed the analogized tasks aren't as close as I'd like. I had to trade off between "what I actually believe" and "making minimal edits to the source material.")

comment by faul_sname · 2023-12-18T01:50:41.757Z · LW(p) · GW(p)

But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a 'non-agentic' system, that you are very confident cannot be done safely and non-agentically in the next two years.

Focusing on the "minimal" part of that, maybe something like "receive a request to implement some new feature in a system it is not familiar with, recognize how the limitations of that system's architecture make that feature impractical to add, and perform a major refactoring of that program to an architecture that is not so limited, while ensuring that the refactored version does not contain any breaking changes". Obviously it would have to have access to tools in order to do this, but my impression is that this is the sort of thing mid-level software developers can do fairly reliably as a nearly rote task, but is beyond the capabilities of modern LLM-based systems, even scaffolded ones.

Though also maybe don't pay too much attention to my prediction, because my prediction for "least impressive thing GPT-4 will be unable to do" was "reverse a string", and it did turn out to be able to do that fairly reliably.

comment by Thane Ruthenis · 2023-12-17T05:31:26.697Z · LW(p) · GW(p)

I’d like to know what’s the least impressive task which cannot be done by a 'non-agentic' system, that you are very confident cannot be done safely and non-agentically in the next two years.

That's incredibly difficult to predict, because minimal things only a general intelligence could do are things like "deriving a few novel abstractions and building on them", but from the outside this would be indistinguishable from it recognizing a cached pattern that it learned in-training and re-applying it, or merely-interpolating between a few such patterns. The only way you could distinguish between the two is if you have a firm grasp of every pattern in the AI's training data, and what lies in the conceptual neighbourhood of these patterns, so that you could see if it's genuinely venturing far from its starting ontology.

Or here's a more precise operationalization [LW(p) · GW(p)] from my old reply to Rohin Shah:

Consider a scheme like the following:

  • Let $t$ be the current date.
  • Train an AI on all of humanity's knowledge up to a point in time $t_0$, where $t_0 < t$.
  • Assemble a list $D$ of all scientific discoveries made in the time period $(t_0, t]$.
  • See if the AI can replicate these discoveries.

At face value, if the AI can do that, it should be considered able to "do science" and therefore AGI, right?

I would dispute that. If the period $(t_0, t]$ is short enough, then it's likely that most of the cognitive work needed to make the leap to any discovery in $D$ is already present in the data up to $t_0$. Making a discovery from that starting point doesn't necessarily require developing new abstractions/doing science — it's possible that it may be done just by interpolating between a few already-known concepts. And here, some asymmetry between humans and e. g. SOTA LLMs becomes relevant:

  • No human knows everything the entire humanity knows. Imagine if making some discovery in $D$ by interpolation required combining two very "distant" concepts, like a physics insight and advanced biology knowledge. It's unlikely that there'd be a human with sufficient expertise in both, so a human will likely do it by actual-science (e. g., a biologist would re-derive the physics insight from first principles).
  • An LLM, however, has a bird's eye view on the entire human concept-space up to $t_0$. It directly sees both the physics insight and the biology knowledge, at once. So it can just do an interpolation between them, without doing truly-novel research.

Thus, the ability to produce marginal scientific insights may mean either the ability to "do science", or that the particular scientific insight is just a simple interpolation between already-known but distant concepts.

On the other hand, now imagine that the period $(t_0, t]$ is very large, e. g. from 1940 to 2020. We'd then be asking our AI to make very significant discoveries, such that they surely can't be done by simple interpolation, only by actually building chains of novel abstractions [LW(p) · GW(p)]. But... well, most humans can't do that either, right? Not all generally-intelligent entities are scientific geniuses. Thus, this is a challenge a "weak" AGI would not be able to meet, only a genius/superintelligent AGI — i. e., only an AGI that's already an extinction threat.

In theory, there should be a pick of $(t_0, t]$ that fits between the two extremes. A set of discoveries such that they can't be done by interpolation, but also don't require dangerous genius to solve.

But how exactly are we supposed to figure out what the right interval is? (I suppose it may not be an unsolvable problem, and I'm open to ideas, but skeptical on priors.)
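The scheme above can be sketched as a small evaluation harness, assuming discoveries are available as dated records and that a caller supplies `can_replicate`, a hypothetical function standing in for "train a model on data up to the cutoff and judge whether it reproduces the discovery". All names here are assumptions for illustration. Sweeping the cutoff makes the dependence on interval length explicit; it does not settle which interval is the right one.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Discovery:
    name: str
    made_on: date


def discoveries_in_window(all_discoveries, cutoff, today):
    """Discoveries made after the training cutoff, up to and including today."""
    return [d for d in all_discoveries if cutoff < d.made_on <= today]


def replication_rate(all_discoveries, cutoff, today, can_replicate):
    """Fraction of held-out discoveries replicated; None if the window is empty."""
    held_out = discoveries_in_window(all_discoveries, cutoff, today)
    if not held_out:
        return None
    return sum(bool(can_replicate(d, cutoff)) for d in held_out) / len(held_out)


def sweep_cutoffs(all_discoveries, cutoffs, today, can_replicate):
    """Replication rate per candidate cutoff, earliest cutoff (longest window) first."""
    return {
        c: replication_rate(all_discoveries, c, today, can_replicate)
        for c in sorted(cutoffs)
    }
```

A high rate at a recent cutoff is compatible with interpolation, and a low rate at a very early cutoff is compatible with a weak but general system, which is the ambiguity described above.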

I can absolutely make strong predictions regarding what non-AGI AIs would be unable to do. But these predictions are, due to the aforementioned problem, necessarily a high bar, higher than the "minimal" capability. (Also I expect an AI that can meet this high bar to also be the AI that quickly ends the world, so.)

Here's my recent reply [LW(p) · GW(p)] to Garrett, for example. tl;dr: non-GI AIs would not be widely known to be able to derive whole multi-layer novel mathematical frameworks if tasked with designing software products that require this. I'm a bit wary of reality somehow Goodharting on this prediction as well, but it seems robust enough, so I'm tentatively venturing it.

I currently think it's about as well as you can do, regarding "minimal incapability predictions".

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-06T16:05:21.918Z · LW(p) · GW(p)

Nice analogy! I approve of stuff like this. And in particular I agree that MIRI hasn't convincingly argued that we can't do significant good stuff (including maybe automating tons of alignment research) without agents.

Insofar as your point is that we don't have to build agentic systems and nonagentic systems aren't dangerous, I agree? If we could coordinate the world to avoid building agentic systems I'd feel a lot better.
 

comment by Thomas Kwa (thomas-kwa) · 2023-12-18T05:54:49.335Z · LW(p) · GW(p)

I like this post although the move of imagining something fictional is not always valid.

“Okay, you’ve all told us that alignment is hard. But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a 'non-agentic' system, that you are very confident cannot be done safely and non-agentically in the next two years.”

Not an answer, but I would be pretty surprised if a system could beat evolution at designing humans (creating a variant of humans that have higher genetic fitness than humans if inserted into a 10,000 BC population, while not hardcoding lots of information that would be implausible for evolution) and have the resulting beings not be goal-directed. The question is then, what causes this? The genetic bottleneck, diversity of the environment, multi-agent conflicts? And is it something we can remove?

comment by quetzal_rainbow · 2023-12-17T03:09:08.514Z · LW(p) · GW(p)

I admire sarcasm, but there are at least two examples of not-very-impressive tasks, like:

  1. Put two strawberries that are identical on the cellular level on a plate;
  2. Develop and deploy biotech 10 years ahead of SOTA (from the famous "Safely aligning powerful AGI is difficult" thread).
Replies from: matthew-barnett, daniel-kokotajlo
comment by Matthew Barnett (matthew-barnett) · 2023-12-17T03:31:05.888Z · LW(p) · GW(p)

Doesn't the first example require full-blown molecular nanotechnology? [ETA: apparently Eliezer says he thinks it can be done with "very primitive nanotechnology" but it doesn't sound that primitive to me.] Maybe I'm misinterpreting the example, but advanced nanotech is what I'd consider extremely impressive.

I currently expect we won't have that level of tech until after human labor is essentially obsolete. In effect, it sounds like you would not update until well after AIs already run the world, basically.

I'm not sure I understand the second example. Perhaps you can make it more concrete.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-06T16:08:12.308Z · LW(p) · GW(p)

Those are pretty impressive tasks. I'm optimistic that we can achieve existential safety via automating alignment research, and I think that's a less difficult task than those.

comment by TurnTrout · 2023-12-01T13:21:53.867Z · LW(p) · GW(p)

Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don't have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition to pursue a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you're just asking about math homework.

Aside: This was kinda a "holy shit" moment, and I'll try to do it justice here. I encourage the reader to do a serious dependency check [LW · GW] on their beliefs. What do you think you know about deceptive alignment being plausible, and why do you think you know it? Where did your beliefs truly come from, and do those observations truly provide 


I agree that conditional on entraining consequentialist cognition which has a "different goal" (as thought of by MIRI; this isn't a frame I use), the AI will probably instrumentally reason about whether and how to deceptively pursue its own goals, to our detriment.

I contend that there's very little reason to expect "undesired, covert, and consistent-across-situations inner goals" to crop up in LLMs to begin with. An example alternative prediction is:

LLMs will continue doing what they're told. They learn contextual goal-directed behavior abilities, but only apply them narrowly in certain contexts for a range of goals (e.g. think about how to win a strategy game). They also memorize a lot of random data (instead of deriving some theory which simply explains its historical training data a la Solomonoff Induction). 

Not only is this performant, it seems to be what we actually observe today. The AI can pursue goals when prompted to do so, but it isn't pursuing them on its own. It basically follows instructions in a reasonable way, just like GPT-4 usually does.

Why should we believe the "consistent-across-situations inner goals -> deceptive alignment" mechanistic claim about how SGD works? Here are the main arguments I'm aware of:

  1. Analogies to evolution (e.g. page 6 of Risks from Learned Optimization)
    1. I think these loose analogies provide basically no evidence [LW · GW] about what happens in an extremely different optimization process (SGD to train LLMs). 
  2. Counting arguments: there are more unaligned goals than aligned goals (e.g. as argued in How likely is deceptive alignment? [LW · GW])
    1. These ignore the importance of the parameter->function map. (They're counting functions when they need to be counting parameterizations.) Classical learning theory made the (mechanistically) same mistake in predicting that overparameterized models would fail to generalize.
    2. I also basically deny the relevance of the counting argument, because I don't buy the assumption of "there's gonna be an inner 'objective' distinct from inner capabilities; let's make a counting argument about what that will be."
  3. Speculation about simplicity bias: SGD will entrain consequentialism because that's a simple algorithm for "getting low loss"
    1. But we already know that simplicity bias in the NN prior can be really hard to reason about. 
    2. I think it's unrealistic to imagine that we have the level of theoretical precision to go "it'll be a future training process and the model is 'getting selected for low loss', so I can now make this very detailed prediction about the inner mechanistic structure."[1] 
      1. I falsifiably predict that if you try to use this kind of logic or counting argument today to make falsifiable predictions about unobserved LLM generalization, you're going to lose Bayes points left and right.
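As a toy illustration of the functions-versus-parameterizations point in 2.1, the sketch below enumerates every parameter setting of a single threshold unit on two binary inputs and counts how many settings implement each Boolean function. The model, the weight grid {-1, 0, 1}, and all names are assumptions chosen for illustration; it says nothing about LLMs directly, only that a uniform count over functions can differ sharply from the distribution that parameters induce.

```python
from collections import Counter
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
GRID = (-1, 0, 1)  # assumed toy parameter grid


def truth_table(w1, w2, b):
    """Function computed by one threshold unit, as a tuple of outputs over INPUTS."""
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)


def count_parameterizations():
    """Map each realized Boolean function to the number of settings computing it."""
    return Counter(truth_table(w1, w2, b) for w1, w2, b in product(GRID, repeat=3))


if __name__ == "__main__":
    for table, n in count_parameterizations().most_common():
        print(table, n)
```

Of the 27 parameter settings, 12 compute the constant-0 function and 4 compute constant-1, while XOR is not realized at all. A count that weights all 16 Boolean functions equally would miss this.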

Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homunculi whose goals have to be perfected [LW(p) · GW(p)]. 

Instead, I think that we enter the realm of tool AI[2] which basically does what you say.[3] I think that world's a lot friendlier, even though there are still some challenges I'm worried about -- like an AI being scaffolded into pursuing consistent goals. (I think that's a very substantially different risk regime, though)

  1. ^

    (Even though this predicted mechanistic structure doesn't have any apparent manifestation in current reality.) 

  2. ^

    Tool AI which can be purposefully scaffolded into agentic systems, which somewhat handles objections from Amdahl's law.

  3. ^

    This is what we actually have today, in reality. In these setups, the agency comes from the system of subroutine calls to the LLM during e.g. a plan/critique/execute/evaluate loop a la AutoGPT.
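    A minimal sketch of such a scaffold, assuming a hypothetical `llm` callable that maps a prompt string to a completion string. The prompts, the `DONE` convention, and the function name are invented for illustration and are not AutoGPT's actual interface. The point it illustrates is that the loop, rather than any single model call, carries the goal from step to step.

```python
from typing import Callable, List


def run_scaffold(goal: str, llm: Callable[[str], str], max_steps: int = 5) -> List[str]:
    """Plan/critique/execute/evaluate loop around a stateless text-completion function."""
    history: List[str] = []
    for _ in range(max_steps):
        context = "\n".join(history)
        plan = llm(f"Goal: {goal}\nSo far:\n{context}\nPropose the next step.")
        critique = llm(f"Goal: {goal}\nProposed step: {plan}\nCritique this step.")
        result = llm(f"Step: {plan}\nCritique: {critique}\nCarry out the revised step.")
        verdict = llm(f"Goal: {goal}\nResult: {result}\nReply DONE if the goal is met.")
        history.append(result)  # persistence across calls lives here, outside the model
        if verdict.strip().upper().startswith("DONE"):
            break
    return history
```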

Replies from: tailcalled, leogao, ryan_greenblatt, thomas-kwa, Thane Ruthenis, samuel-marks, peterbarnett, sharmake-farah, faul_sname, quetzal_rainbow, dkirmani, daniel-kokotajlo, roger-d-1, kave, johannes-c-mayer, Oliver Sourbut
comment by tailcalled · 2023-12-01T15:20:36.175Z · LW(p) · GW(p)

I agree a fraction of the way, but when doing a dependency check, I feel like there are some conditions where the standard arguments go through.

I sketched out my view on the dependencies in Where do you get your capabilities from? [LW · GW]. The TL;DR is that I think ChatGPT-style training basically consists of two different ways of obtaining capabilities:

  • Imitating internet text, which gains them capabilities to do the-sorts-of-things-humans-do because generating such text requires some such capabilities.
  • Reinforcement learning from human feedback on plans, where people evaluate the implications of the proposals the AI comes up with, and rate how good they are.

I think both of these are basically quite safe. They do have some issues, but probably not of the style usually discussed by rationalists working in AI alignment, and possibly not even issues going beyond any other technological development.

The basic principle for why they are safe-ish is that all of the capabilities they gain are obtained through human capabilities. So for example, while RLHF-on-plans may optimize for tricking human raters to the detriment of how the rater intended the plans to work out, this "tricking" will also sacrifice the capabilities of the plans, because the only reason more effective plans are rated better is because humans recognize their effectiveness and rate them better.

Or relatedly [LW(p) · GW(p)], consider nearest unblocked strategy, a common proposal for why alignment is hard. This only applies if the AI is able to consider an infinitude of strategies, which again only applies if it can generate its own strategies once the original human-generated strategies have been blocked.

Why should we believe the "consistent-across-situations inner goals -> deceptive alignment" mechanistic claim about how SGD works? Here are the main arguments I'm aware of:

These are the main arguments that the rationalist community seems to be pushing about it. For instance, one time you asked about it [LW · GW], and LW just encouraged magical thinking around these arguments. (Even from people who I'd have thought would be clearer-thinking)

The non-magical way I'd analyze the smiling reward is that while people imagine that in theory updating the policy with antecedent-computation-reinforcement should make it maximize reward, in practice the information signal in this is very sparse and imprecise, so in practice it is going to take exponential time, and therefore something else will happen beforehand.

What is this "something else"? Probably something like: the human is going to reason about whether the AI's activities are making progress on something desirable, and in those cases, the human is going to press the antecedent-computation-reinforcement button. Which again boils down to the AI copying the human's capabilities, and therefore being relatively safe. (E.g. if you start seeing the AI studying how to deceive humans, you're probably gonna punish it instead, or at least if you reward it then it more falls under the dual-use risk framework than the alignment risk framework.)

(Of course one could say "what if we just instruct the human to only reward the AI based on results and not incremental progress?", but in that case the answer to what is going to happen before the AI does a treacherous turn is "the company training the AI runs out of money".)

There's something seriously wrong with how LWers fixated on reward-for-smiling and rationalized an explanation of why simplicity bias or similar would make this go into a treacherous turn.


OK, so this is how far I'm with you. Rationalist stories of AI progress are basically very wrong, and some commonly endorsed threat models don't point at anything serious. But what then?

The short technical answer is that reward is only antecedent-computation-reinforcement for policy-gradient-based reinforcement learning, and model-based approaches (traditionally temporal-difference learning, but I think the SOTA is DreamerV3, which is pretty cool) use the reward to learn a value function, which they then optimize in a more classical way, allowing them to create novel capabilities in precisely the way that can eventually lead to deception, treacherous turn, etc.

One obvious proposal is "shit, let's not do that [LW · GW], ChatGPT seems like a pretty good and safe alternative, and it's not like there's any hype behind this anyway". I'm not sure that proposal is right, because e.g. AlphaStar was pretty hype, and it was trained with one of these methods. But it sure does seem that there's a lot of opportunity in ChatGPT now, so at least this seems like a directionally-correct update for a lot of people to make (e.g. stop complaining so much about the possibility of an even bigger GPT-5; it's almost certainly safer to scale it up than it is to come up with algorithms that can improve model power without improving model scale).

However, I do think there are a handful of places where this falls apart:

  1. Your question [LW · GW] briefly mentioned "with clever exploration bonuses". LW didn't really reply much to it, but it seems likely that this could be the thing that does your question in. Maybe there are some model-free clever exploration bonuses, but if so I have never heard of them. The most advanced exploration bonuses I have heard of are from the Dreamer line of models, and it has precisely the sort of consequentialist reasoning abilities that start being dangerous.
  2. My experience is that language models exhibit a phenomenon I call "transposons", [LW(p) · GW(p)] where (especially when fed back into themselves after deep levels of scaffolding) there are some ideas that they end up copying too much after prompting, clogging up the context window. I expect there will end up being strong incentives for people to come up with techniques to remove transposons, and I expect the most effective techniques will be based on some sort of consequences-based feedback system which again brings us back to essentially the original AI threat model.
  3. I think security is going to be a big issue, along different lines: hostile nations, terrorists, criminals, spammers, trolls, competing companies, etc. In order to achieve strong security, you need to be robust against adversarial attacks, which probably means continually coming up with new capabilities to fend them off. I guess one could imagine that humans will be coming up with those new capabilities, but that seems probably destroyed by adversaries using AIs to come up with security holes, and regardless it seems like having humans come up with the new capabilities will be extremely expensive, so probably people will focus on the AI side of things. To some extent, people might classify this as dual-use threats rather than alignment threats, but at least the arms race element would also generate alignment threats I think?
  4. I think the way a lot of singularitarians thought of AI is that general agency consists of advanced adaptability and wisdom, and we expected that researchers would develop a lot of artificial adaptability, and then eventually the systems would become adaptable enough to generate wisdom faster than people do, and then overtake society. However what happened was that a relatively shallow form of adaptability was developed, and then people loaded all of human wisdom into that shallow adaptability, which turned out to be immensely profitable. But I'm not sure it's going to continue to work this way; eventually we're gonna run out of human wisdom to dump into the models, plus the models reduce the incentive for humans to share our wisdom in public. So just as a general principle, continued progress seems like it's eventually going to switch to improving the ability to generate novel capabilities, which in turn is going to put us back to the traditional story? (Except possibly with a weirder takeoff? Slow takeoff followed by ??? takeoff or something. Idk.) Possibly this will be delayed because the creators of the LLMs find patches, e.g. I think we're going to end up seeing the possibility of allowing people to "edit" the LLM responses rather than just doing upvotes/downvotes, but this again motivates some capabilities development because you need to filter out spammers/trolls/etc., which is probably best done through considering the consequences of the edits.

If we zoom out a bit, what's going on here? Here's some answers:

  1. Conditioning on capabilities: There are strong reasons to expect capabilities to keep on improving, so we should condition on that. This leads to a lot of alignment issues, due to folk decision theory theorems. However, we don't know how capabilities will develop, so we lose the ability to reason mechanistically when conditioning on capabilities, which means that there isn't a mechanistic answer to all questions related to it. If one doesn't keep track of this chain of reasoning, one might assume that there is a known mechanistic answer, which leads to the types of magical thinking seen in that other thread.
  2. If people don't follow-the-trying [LW · GW], that can lead to a lot of misattributions.
  3. When considering agency in general, rather than consequentialist-based agency in particular, then instrumental convergence is more intuitive in disjunctive form than implication form [LW(p) · GW(p)]. I.e. rather than "if an AI robustly achieves goals, then it will resist being shut down", say "either an AI resists being shut down, or it doesn't robustly achieve goals".
  4. Don't trust rationalists too much.
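
The equivalence between the two phrasings in point 3 can be checked mechanically. This is only a small sketch: the function names and the reduction of "robustly achieves goals" and "resists being shut down" to plain booleans are my own simplifications, not anything from the linked comment.

```python
from itertools import product

def implication_form(robust: bool, resists: bool) -> bool:
    # "If an AI robustly achieves goals, then it will resist being shut down."
    # An implication is false only when the premise holds and the conclusion fails.
    return not (robust and not resists)

def disjunctive_form(robust: bool, resists: bool) -> bool:
    # "Either an AI resists being shut down, or it doesn't robustly achieve goals."
    return resists or (not robust)

# Compare the two forms on every truth assignment.
forms_agree = all(
    implication_form(robust, resists) == disjunctive_form(robust, resists)
    for robust, resists in product([False, True], repeat=2)
)
print(forms_agree)
```

The two statements have the same truth table, so the difference between them is purely in which one feels more intuitive.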

One additional speculative thing I have been thinking of which is kind of off-topic and doesn't fit in neatly with the other things:

Could there be "natural impact regularization" or "impact regularization by default"? Specifically, imagine you use general-purpose [LW · GW] search procedure which recursively invokes itself to solve subgoals for the purpose of solving some bigger goal.

If the search procedure's solutions to subgoals "change things too much", then they're probably not going to be useful. E.g. for Rubik's cubes, if you want to swap some of the cuboids, it does you no good if those swaps leave the rest of the cube scrambled.

Thus, to some extent, powerful capabilities would have to rely on some sort of impact regularization.
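
Here is a toy sketch of that intuition, under heavy assumptions: the state is just a tuple of integers, each subgoal is "set one position to its target value", and the action names and the `solve` helper are made up for illustration. A subgoal solution is only accepted if it leaves the already-solved positions alone, which is a crude stand-in for impact regularization.

```python
from typing import Dict, List, Optional, Set, Tuple

State = Tuple[int, ...]
Action = Dict[int, int]  # position -> value written by the action

def apply_action(state: State, action: Action) -> State:
    new_state = list(state)
    for position, value in action.items():
        new_state[position] = value
    return tuple(new_state)

def solve(state: State, target: State, actions: Dict[str, Action]) -> Optional[List[str]]:
    """Solve positions left to right, rejecting actions that undo earlier subgoals."""
    solved: Set[int] = set()
    plan: List[str] = []
    for position in range(len(target)):
        if state[position] != target[position]:
            chosen = None
            for name, action in actions.items():
                candidate = apply_action(state, action)
                fixes_subgoal = candidate[position] == target[position]
                preserves_rest = all(candidate[p] == target[p] for p in solved)
                if fixes_subgoal and preserves_rest:
                    chosen = name
                    break
            if chosen is None:
                return None  # no low-impact way to achieve this subgoal
            state = apply_action(state, actions[chosen])
            plan.append(chosen)
        solved.add(position)
    return plan

# Hypothetical actions: the "blunt" ones achieve a subgoal but scramble earlier work.
ACTIONS: Dict[str, Action] = {
    "fix_0": {0: 1},
    "blunt_1": {1: 1, 0: 0},
    "precise_1": {1: 1},
    "blunt_2": {2: 1, 0: 0, 1: 0},
    "precise_2": {2: 1},
}

plan = solve((0, 0, 0), (1, 1, 1), ACTIONS)
print(plan)
```

In this toy the blunt actions are never selected even though they are tried first, simply because they would wreck progress on the bigger goal.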

The bias-focused rationalists like to map decision-theoretic insights to human mistakes, but I instead like to map decision-theoretic insights to human capacities and experiences. I'm thinking that natural impact regularization is related to the notion of "elegance" in engineering. Like if you have some bloated tool to solve a problem, then even if it's not strictly speaking an issue because you can afford the resources, it might feel ugly because it's excessive and puts mild constraints on your other underconstrained decisions, and so on. Meanwhile a simple, minimal solution often doesn't have this.

Natural impact regularization wouldn't guarantee safety, since it still allows deviations that don't interfere with the AI's function, but it sort of reduces one source of danger which I had been thinking about lately, namely I had been thinking that the instrumental incentive is to search for powerful methods of influencing the world, where "power" connotes the sort of raw power that unstoppably forces a lot of change, but really the instrumental incentive is often to search for "precise" methods of influencing the world, where one can push in a lot of information to effect narrow change. (A complication is that any one agent can only have so much bandwidth. I've been thinking bandwidth is probably going to become a huge area of agent foundations, and that it's been underexplored so far. (Perhaps because everyone working in alignment sucks at managing their bandwidth?))

Maybe another word for it would be "natural inner alignment", since in a sense the point is that capabilities inevitably select for inner alignment.


Sorry if I'm getting too rambly.

comment by leogao · 2023-12-18T06:09:25.092Z · LW(p) · GW(p)

I think deceptive alignment is still reasonably likely despite evidence from LLMs.

I agree with:

  • LLMs are not deceptively aligned and don't really have inner goals in the sense that is scary
  • LLMs memorize a bunch of stuff
  • the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
  • Adam on transformers does not have a super strong simplicity bias
  • without deceptive alignment, AI risk is a lot lower
  • LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)

I predict I could pass the ITT for why LLMs are evidence that deceptive alignment is not likely.

however, I also note the following: LLMs are kind of bad at generalizing, and this makes them pretty bad at doing e.g. novel research, or long horizon tasks. deceptive alignment conditions on models already being better at generalization and reasoning than current models.

my current hypothesis is that future models which generalize in a way closer to that predicted by mesaoptimization will also be better described as having a simplicity bias.

I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI

Replies from: TurnTrout, RobertKirk
comment by TurnTrout · 2023-12-26T21:20:21.417Z · LW(p) · GW(p)

Note that "LLMs are evidence against this hypothesis" isn't my main point here. The main claim is that the positive arguments for deceptive alignment are flimsy, and thus the prior is very low.

comment by RobertKirk · 2024-01-18T10:21:23.265Z · LW(p) · GW(p)

I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI

How would you imagine doing this? I understand your hypothesis to be "If a model generalises as if it's a mesa-optimiser, then it's better-described as having simplicity bias". Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals which would be implied by a stronger simplicity bias?

comment by ryan_greenblatt · 2023-12-01T19:18:50.137Z · LW(p) · GW(p)

I find myself unsure which conclusion this is trying to argue for.

Here are some pretty different conclusions:

  • Deceptive alignment is <<1% likely (quite implausible) to be a problem prior to complete human obsolescence (maybe it's a problem after human obsolescence for our trusted AI successors, but who cares).
  • There aren't any solid arguments for deceptive alignment[1]. So, we certainly shouldn't be confident in deceptive alignment (e.g. >90%), though we can't totally rule it out (prior to human obsolescence). Perhaps deceptive alignment is 15% likely to be a serious problem overall and maybe 10% likely to be a serious problem if we condition on fully obsoleting humanity via just scaling up LLM agents or similar (this is pretty close to what I think overall).
  • Deceptive alignment is <<1% likely for scaled up LLM agents (prior to human obsolescence). Who knows about other architectures.

There is a big difference between <<1% likely and 10% likely. I basically agree with "not much reason to expect deceptive alignment even in models which are behaviorally capable of implementing deceptive alignment", but I don't think this leaves me in a <<1% likely epistemic state.


  1. Other than noting that it could be behaviorally consistent for powerful models: powerful models are capable of deceptive alignment. ↩︎

Replies from: TurnTrout
comment by TurnTrout · 2023-12-04T13:34:18.059Z · LW(p) · GW(p)

Closest to the third, but I'd put it somewhere between .1% and 5%. I think 15% is way too high for some loose speculation about inductive biases, relative to the specificity of the predictions themselves.

comment by Thomas Kwa (thomas-kwa) · 2023-12-27T05:13:17.248Z · LW(p) · GW(p)

There are some subskills to having consistent goals that I think will be selected for, at least when outcome-based RL starts working to get models to do long-horizon tasks. For example, the ability to not be distracted/nerdsniped into some different behavior by most stimuli while doing a task. The longer the horizon, the more selection-- if you have to do a 10,000 step coding project, then the probability you get irrecoverably distracted on one step has to be below 1/10,000.
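
To make the arithmetic concrete, here is a rough sketch that assumes each step independently carries the same probability of irrecoverable distraction (an idealization; the function names are mine). Under that assumption a per-step failure rate of exactly 1/10,000 still only gives about a 37% chance of finishing a 10,000-step task, so the rate has to be well below that for reliable completion.

```python
import math

def success_probability(per_step_failure: float, steps: int) -> float:
    # Chance of never being irrecoverably distracted across all steps,
    # assuming independent and identical per-step failure probability.
    return (1.0 - per_step_failure) ** steps

def max_per_step_failure(target_success: float, steps: int) -> float:
    # Largest per-step failure rate that still reaches the target success rate.
    return 1.0 - target_success ** (1.0 / steps)

steps = 10_000
at_threshold = success_probability(1.0 / steps, steps)  # close to 1/e
needed_for_90 = max_per_step_failure(0.90, steps)       # roughly 1e-5

print(f"success at p = 1/N: {at_threshold:.3f}")
print(f"per-step failure allowed for 90% success: {needed_for_90:.2e}")
```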

I expect some pretty sophisticated goal-regulation circuitry to develop as models get more capable, because humans need it, and this makes me pretty scared.

comment by Thane Ruthenis · 2023-12-01T13:54:01.401Z · LW(p) · GW(p)

I contest that there's very little reason to expect "undesired, covert, and consistent-across-situations inner goals" to crop up in [LLMs as trained today] to begin with

As someone who considers deceptive alignment a concern: fully agree. (With the caveat, of course, that it's because I don't expect LLMs to scale to AGI [LW(p) · GW(p)].)

I think there's in general a lot of speaking-past-each-other in alignment, and what precisely people mean by "problem X will appear if we continue advancing/scaling" is one of them.

Like, of course a new problem won't appear if we just keep doing the exact same thing that we've already been doing. Except "the exact same thing" is actually some equivalence class of approaches/architectures/training processes, but which equivalence class people mean can differ.

For example:

  • Person A, who's worried about deceptive alignment, can have "scaling LLMs arbitrarily far" defined as this proven-safe equivalence class of architectures. So when they say they're worried about capability advancement bringing in new problems, what they mean is "if we move beyond the LLM paradigm, deceptive alignment may appear".
  • Person B, hearing the first one, might model them as instead defining "LLMs trained with N amount of compute" as the proven-safe architecture class, and so interpret their words as "if we keep scaling LLMs beyond N, they may suddenly develop this new problem". Which, on B's model of how LLMs work, may seem utterly ridiculous.

And the tricky thing is that Person A likely equates the "proven-safe" architecture class with the "doesn't scale to AGI" architecture class – so they actually expect the major AI labs to venture outside that class, the moment they realize its limitations. Person B, conversely, might disagree, might think those classes are different, and that safely limited models can scale to AGI/interesting capabilities. (As a Person A in this situation, I think Person B's model of cognition is confused; but that's a different topic.)

Which are all important disconnects to watch out for.

(Uh, caveat: I think some people actually are worried about scaled-up LLMs exhibiting deceptive alignment, even without architectural changes. But I would delineate this cluster of views from the one I put myself in, and which I outline there [LW · GW]. And, likewise, I expect that the other people I would tentatively classify as belonging to this cluster – Eliezer, Nate, John – mostly aren't worried about just-scaling-the-LLMs to be leading to deceptive alignment.)

Replies from: tailcalled, Vladimir_Nesov
comment by tailcalled · 2023-12-01T15:53:43.467Z · LW(p) · GW(p)

I think it's important to put more effort into tracking such definitional issues, though. People end up overstating things because they round off their interlocutors' viewpoint to their own. For instance if person C asks "is it safe to scale generative language pre-training and ChatGPT-style DPO arbitrarily far?", and person D then rounds this off to "is it safe to make transformer-based LLMs as powerful as possible?" and explains that "no, because instrumental convergence and compression priors", this is probably just false for the original meaning of the statement.

If this repeatedly happens to the point of generating a consensus for the false claim, then that can push the alignment community severely off track.

comment by Vladimir_Nesov · 2023-12-02T01:50:14.695Z · LW(p) · GW(p)

LLMs will soon scale beyond the available natural text data, and generation of synthetic data is some sort of change of architecture, potentially a completely different source of capabilities. So scaling LLMs without change of architecture much further is an expectation about something counterfactual. It makes sense as a matter of theory, but it's not relevant for forecasting.

Edit 15 Dec: No longer endorsed based on scaling laws for training on repeated data [LW(p) · GW(p)].

Replies from: TurnTrout
comment by TurnTrout · 2023-12-04T13:31:36.229Z · LW(p) · GW(p)

Bold claim. Want to make any concrete predictions so that I can register my different beliefs? 

Replies from: Vladimir_Nesov, Vladimir_Nesov
comment by Vladimir_Nesov · 2023-12-15T13:57:57.566Z · LW(p) · GW(p)

I've now changed my mind based on

The main result is that up to 4 repetitions are about as good as unique data, and for up to about 16 repetitions there is still meaningful improvement. Let's take 50T tokens as an estimate for available text data (as an anchor, there's a filtered and deduplicated CommonCrawl dataset RedPajama-Data-v2 with 30T tokens). Repeated 4 times, it can make good use of 1e28 FLOPs (with a dense transformer), and repeated 16 times, suboptimal but meaningful use of 2e29 FLOPs. So this is close but not lower than what can be put to use within a few years. Thanks for pushing back on the original claim.
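
As a sanity check on these figures, here is a back-of-the-envelope sketch. It assumes the common rules of thumb that training compute is about 6 × parameters × tokens and that Chinchilla-optimal training uses about 20 tokens per parameter. Neither number is stated in this comment, so treat both as my assumptions. It also treats repeated tokens as if they were as good as unique ones, which per the above only roughly holds up to about 4 repetitions.

```python
def chinchilla_flops(unique_tokens: float, repetitions: int,
                     tokens_per_param: float = 20.0) -> float:
    # Total tokens seen in training, counting repeats.
    tokens_seen = unique_tokens * repetitions
    # Compute-optimal parameter count under the assumed tokens-per-parameter ratio.
    params = tokens_seen / tokens_per_param
    # Assumed approximation for a dense transformer: C ~ 6 * N * D.
    return 6.0 * params * tokens_seen

UNIQUE_TOKENS = 50e12  # the 50T estimate used above

for repetitions in (1, 4, 16):
    flops = chinchilla_flops(UNIQUE_TOKENS, repetitions)
    print(f"{repetitions:>2} repetitions: {flops:.1e} FLOPs")
```

Under these assumptions 4 repetitions come out at about 1.2e28 FLOPs and 16 repetitions at about 1.9e29 FLOPs, in line with the 1e28 and 2e29 figures above.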

comment by Vladimir_Nesov · 2023-12-07T08:20:39.131Z · LW(p) · GW(p)

Three points: how much compute is going into a training run, how much natural text data it wants, and how much data is available. For training compute, there are claims of multi-billion dollar runs being plausible and possibly planned in 2-5 years. Eyeballing various trends and GPU shipping numbers and revenues, it looks like about 3 OOMs of compute scaling is possible before industrial capacity constrains the trend and the scaling slows down. This assumes that there are no overly dramatic profits from AI (which might lead to finding ways of scaling supply chains faster than usual), and no overly dramatic lack of new capabilities with further scaling (which would slow down investment in scaling). That gives about 1e28-1e29 FLOPs at the slowdown in 4-6 years.

At 1e28 FLOPs, Chinchilla scaling asks for 200T-250T tokens. Various sparsity techniques increase effective compute, asking for even more tokens (when optimizing loss given fixed hardware compute).
Edit 15 Dec: I no longer endorse this point, based on scaling laws for training on repeated data [LW(p) · GW(p)].

On the outside, there are 20M-150M accessible books, some text from video, and 1T web pages of extremely dubious uniqueness and quality. That might give about 100T tokens, if LLMs are used to curate? There's some discussion (incl. comments) here [LW · GW], this is the figure I'm most uncertain about. In practice, absent good synthetic data, I expect multimodality to fill the gap, but that's not going to be as useful as good text for improving chatbot competence. (Possibly the issue with the original claim in the grandparent is what I meant by "soon".)

comment by Sam Marks (samuel-marks) · 2023-12-01T17:21:18.198Z · LW(p) · GW(p)

Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homunculi whose goals have to be perfected [LW(p) · GW(p)]. 

Instead, we enter the realm of tool AI which basically does what you say.

I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away. 

However, I disagree that conditional on no deceptive alignment, AI "basically does what you say." Indeed, the majority of my P(doom) comes from the difference between "looks good to human evaluators" and "is actually what the human evaluators wanted." Concretely, this could play out with models which manipulate their users into thinking everything is going well and sensor tamper.

I think current observations don't provide much evidence about whether these concerns will pan out: with current models and training set-ups, "looks good to evaluators" almost always coincides with "is what evaluators wanted." I worry that we'll only see this distinction matter once models are smart enough that they could competently deceive their overseers if they were trying (because of the same argument made here). (Forms of sycophancy where models knowingly assert false statements when they expect the user will agree are somewhat relevant, but there are also benign reasons models might do this.)

Replies from: tailcalled, sharmake-farah
comment by tailcalled · 2023-12-01T20:39:10.131Z · LW(p) · GW(p)

To what extent do you worry about the training methods used for ChatGPT, and why?

comment by Noosphere89 (sharmake-farah) · 2023-12-06T22:30:13.802Z · LW(p) · GW(p)

I think my crux is that once we remove the deceptive alignment issue, I suspect that profit forces alone will generate a large incentive to reduce the gap, primarily because I think that people way overestimate the value of powerful, unaligned agents to the market and underestimate the want for control over AI.

comment by peterbarnett · 2023-12-01T18:29:23.841Z · LW(p) · GW(p)

I'm not sure how much I expect something like deceptive alignment from just scaling up LLMs. My guess would be that in some limit this gets AGI and also something deceptively aligned by default, but in reality we end up taking a shorter path to AGI, which also leads to deceptive alignment by default. However, I can think about the LLM AGI in this comment, and I don't think it changes much. 

The main reason I expect something like deceptive alignment is that we are expecting the thing to actually be really powerful, and so it actually needs to be coherent over both long sequences of actions and into new distributions it wasn't trained on. It seems pretty unlikely to end up in a world where we train for the AI to act aligned for one distribution and then it generalizes exactly as we would like in a very different distribution. Or at least that it generalizes the things that we care about it generalizing, for example:

  • Don't seek power, but do gain resources enough to do the difficult task
  • Don't manipulate humans, but do tell them useful things and get them to sign off on your plans

I don't think I agree with your counters to the specific arguments about deceptive alignment:

  1. For Quinton's post, I think Steve Byrnes' comment [LW(p) · GW(p)] is a good approximation of my views here
  2. I don't fully know how to think about the counting arguments when things aren't crisply divided, and I do agree that the LLM AGI would likely be a big mess with no separable "world model", "objective", or "planning engine". However it really isn't clear to me that this makes the case weaker; the AI will be doing something powerful (by assumption) and it's not clear that it will do the powerful thing that we want. 
  3. I really strongly agree that the NN prior can be really hard to reason about. It really seems to me like this again makes the situation worse; we might have a really hard time thinking about how the big powerful AI will generalize when we ask it to do powerful stuff.
    1. I think a lot of this comes from some intuition like "Condition on the AI being powerful enough to do the crazy stuff we'll be asking of an AGI. If it is this capable but doesn't generalize the task in the exact way you want then you get something that looks like deceptive alignment." 
    2. Not being able to think about the NN prior really just seems to make things harder, because we don't know how it will generalize things we tell it. 
Replies from: peterbarnett
comment by peterbarnett · 2023-12-01T18:37:25.480Z · LW(p) · GW(p)

Conditional on the AI never doing something like: manipulating/deceiving[1] the humans such that the humans think the AI is aligned, such that the AI can later do things the humans don't like, then I am much more optimistic about the whole situation. 

  1. ^

    The AI could be on some level not "aware" that it was deceiving the humans, a la Deep Deceptiveness [LW · GW].

comment by Noosphere89 (sharmake-farah) · 2023-12-01T19:25:30.012Z · LW(p) · GW(p)

I think a lot of this probably comes back to way overestimating the complexity of human values. I think a very deeply held belief of a lot of LWers is that human values are intractably complicated and gene/societal-specific, and I think if this was the case, the argument would actually be a little concerning, as we'd have to rely on massive speed biases to punish deception.

These posts gave me good intuition for why human value is likely to be quite simple. One of them talks about how most of the complexity of the values is inaccessible to the genome, thus it needs to start from far less complexity than people realize, because nearly all of it needs to be learned. Some other posts from Steven Byrnes, which talk about how simple the brain is, are also relevant. A potential difference between me and Steven Byrnes is that I think the same process of learning from scratch algorithms that generate capabilities also applies to values, and thus the complexity of value is upper-bounded by the complexity of learning from scratch algorithms + genetic priors, both of which are likely very low, at the very least not billions of lines complex, and closer to thousands of lines/hundreds of bits.

But the reason this matters is because we no longer have good reason to assume that the deceptive model is so favored on priors like Evan Hubinger says here, as the complexity is likely massively lower than LWers assume.

https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument?commentId=vXnLq7X6pMFLKwN2p [LW(p) · GW(p)]

Putting it another way, the deceptive and aligned models both have very similar complexities, and the relative difficulty is very low, so much so that the aligned model might be outright lower complexity, but even if that fails, the desired goal has a complexity very similar to the undesired goal complexity, thus the relative difficulty of actual alignment compared to deceptive alignment is quite low.

https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome# [LW · GW]

https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8/p/wBHSYwqssBGCnwvHg [? · GW]

https://www.lesswrong.com/posts/PTkd8nazvH9HQpwP8/building-brain-inspired-agi-is-infinitely-easier-than [LW · GW]

https://www.lesswrong.com/posts/aodPs8H9dQxpXAcwk/heritability-behaviorism-and-within-lifetime-rl [LW · GW]

Replies from: TurnTrout
comment by TurnTrout · 2023-12-04T13:32:47.212Z · LW(p) · GW(p)

(I think you're still playing into an incorrect frame by talking about "simplicity" or "speed biases.")

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2023-12-04T18:22:21.363Z · LW(p) · GW(p)

My point here is that even conditional on the frame being correct, there are a lot of assumptions like "value is complicated" that I don't buy, and a lot of these assumptions have a good chance of being false, which significantly impacts the downstream conclusions, and that matters because a lot of LWers probably either hold these beliefs or assume them tacitly in arguments like "alignment is hard".

Also, for a defense of wrong models, see here:

https://www.lesswrong.com/posts/q5Gox77ReFAy5i2YQ/in-defense-of-probably-wrong-mechanistic-models [LW · GW]

comment by faul_sname · 2023-12-01T18:10:32.969Z · LW(p) · GW(p)

Interesting! I had thought this already was your take, based on posts like Reward is not the Optimization Target [LW · GW].

Many people seem to expect that reward will be the optimization target of really smart learned policies—that these policies will be reward optimizers. I strongly disagree. As I argue in this essay, reward is not, in general, that-which-is-optimized by RL agents. [...] Reward chisels cognitive grooves into an agent.

I do think that sufficiently sophisticated[1] RL policies trained on a predictable environment with a simple and consistent reward scheme probably will develop an internal model of the thing being rewarded, as a single salient thing, and separately that some learned models will learn to make longer-term-than-immediate predictions about the future. So as such I do expect "iterate through some likely actions, and choose one where the reward proxy is high" will at some point emerge as an available strategy for RL policies[2].

My impression is that it's an open question to what extent that available strategy is a better-performing strategy than a more sphexish pile of "if the environment looks like this, execute that behavior" heuristics, given a fixed amount of available computational power. In the limit as the system's computational power approaches infinite and the accuracy of its predictions about future world states approaches perfection, the argmax(EU) strategy gets reinforced more strongly than any other strategy, and so that ends up being what gets chiseled into the model's cognition. But of course in that limit "just brute-force sha256 bro" is an optimal strategy in certain situations, so the extent to which the "in the limit" behavior resembles the "in the regimes we actually care about" behavior is debatable.

  1. ^

    And "sufficiently" is likely a pretty low bar

  2. ^

    If I'm reading Ability to Solve Long Horizon Tasks Correlates With Wanting [LW · GW] correctly, that post argues that you can't get good performance on any task where the reward is distant in time from the actions unless your system is doing something like this.

comment by quetzal_rainbow · 2023-12-01T17:14:17.440Z · LW(p) · GW(p)

It mostly sounds like "LLMs don't scale into scary things", not "deceptive alignment is unlikely".

comment by dkirmani · 2023-12-01T16:14:09.634Z · LW(p) · GW(p)

This was kinda a "holy shit" moment

Publicly noting that I had a similar moment recently; perhaps we listened to the same podcast.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-06T16:12:05.439Z · LW(p) · GW(p)

I wish I had read this a week ago instead of just now, it would have saved a significant amount of confusion and miscommunication!

comment by RogerDearnaley (roger-d-1) · 2024-01-20T21:01:00.586Z · LW(p) · GW(p)

I think there are two separate questions here, with possibly (and I suspect actually) very different answers:

  1. How likely is deceptive alignment to arise in an LLM under SGD across a large very diverse pretraining set (such as a slice of the internet)?
  2. How likely is deceptive alignment to be boosted in an LLM under SGD fine tuning followed by RL for HHH-behavior applied to a base model trained by 1.?

I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything that humans do that affects the contents of the Internet. One of the many things that humans do is pretend to be more aligned to the wishes of an authority that has power over them than they truly are. So for a large enough LLM, SGD will create a world model for this behavior along with thousands of other human behaviors, and the LLM will (depending on the prompt) tend to activate this behavior at about the frequency and level that you find it on the Internet, as modified by cues in the particular prompt. On the Internet, this is generally a mild background level for people writing while at work in Western countries, and probably more strongly for people writing from more authoritarian countries: specific prompts will be more or less correlated with this.

For 2., the question is whether fine-tuning followed by RL will settle on this preexisting mechanism and make heavy use of it as part of the way that it implements something that fits the fine-tuning set/scores well on the reward model aimed at creating a helpful, honest, and harmless assistant persona. I'm a lot less certain of the answer here, and I suspect it might depend rather strongly on the details of the training set. For example, is this evoking a "you're at work, or in an authoritarian environment, so watch what you say and do" scenario that might boost the use of this particular behavior? The "harmless" element in HHH seems particularly concerning here: it suggests an environment in which certain things can't be discussed, which tends to be the sort of environment that evinces this behavior more strongly in humans.

For a more detailed discussion, see the second half of this post [AF · GW].

comment by kave · 2023-12-01T18:39:48.173Z · LW(p) · GW(p)

Two quick thoughts (that don't engage deeply with this nice post).

  1.  I'm worried in some cases where the goal is not consistent across situations. For example, if prompted to pursue some goal, it then does it seriously with convergent instrumental goals.
  2. I think it seems pretty likely that future iterations of transformers will have bits of powerful search in them, but people who seem very worried about that search seem to think that once that search is established enough, gradient descent will cause the internals of the model to be organised mostly around that search (I imagine the search circuits "bubbling out" to be the outer structure of the learned algorithm). Probably this is all just conceptually confused, but to the extent it's not, I'm pretty surprised by their intuition.
comment by Johannes C. Mayer (johannes-c-mayer) · 2023-12-07T12:09:34.857Z · LW(p) · GW(p)

To me, it seems that consequentialism is just a really good algorithm to perform very well on a very wide range of tasks. Therefore, for difficult enough tasks, I expect that consequentialism is the kind of algorithm that would be found because it's the kind of algorithm that can perform the task well.

When we start training the system we have a random initialization. Let's make the following simplification. We have a goal in the system somewhere and then we have the consequential reasoning algorithm in the system somewhere. As we train the system the consequential reasoning will get better and better and the goal will get more and more aligned with the outer objective because both of these things will improve performance. However, there will come a point in training where the consequential reasoning algorithm is good enough to realize that it is in training. And then bad things start to happen. It will try to figure out the outer objective and optimize for it directly. SGD will incentivize this kind of behavior because it performs better than not doing it.

There really is a lot more to this kind of argument. So far I have failed to write it up. I hope the above is enough to hint at why I think that it is possible that being deceptive is just better than not being deceptive in terms of performance. When you become deceptive, that aligns the consequentialist reasoner faster to the outer objective, compared to waiting for SGD to gradually correct everything. In fact it is sort of a constant boost to your performance to be deceptive, even before the system has become very good at optimizing for the true objective. A really good consequential reasoner could probably just get zero loss immediately by retargeting its consequential reasoner instead of waiting for the goal to be updated to match the outer objective by SGD, as soon as it got a perfect model of the outer objective.

I'm not sure that deception is a problem. Maybe it is not. But to me, it really seems like you don't provide any reason that makes me confident that it won't be a problem. It seems very strange to me to argue that this is not a thing because existing arguments are flimsy. I mean you did not even address my argument. It's like trying to prove that all bananas are green by showing me 3 green bananas.

comment by Oliver Sourbut · 2023-12-05T14:48:18.508Z · LW(p) · GW(p)

TL;DR: I think you're right that much inner alignment conversation is overly-fanciful. But 'LLMs will do what they're told' and 'deceptive alignment is unjustified' are non sequiturs. Maybe we mean something different by 'LLM'? Either way, I think there's a case for 'inner homunculi' after all.

I have always thought that the 'inner' conversation is missing something. On the one hand it's moderately-clearly identifying a type of object, which is a point in favour. Further, as you've said yourself, 'reward is not the optimisation target' is very obvious but it seems to need spelling out repeatedly (a service the 'inner/outer' distinction provides). On the other hand the 'inner alignment' conversation seems in practice to distract from the actual issue which is 'some artefacts are (could be) doing their own planning/deliberation/optimisation', and 'inner' is only properly pointing at a subset of those. (We can totally build, including accidentally, artefacts which do this 'outside' the weights of NN.)

You've indeed pointed at a few of the more fanciful parts of that discussion here[1], like steganographic gradient hacking. NB steganography per se isn't entirely wild; we see ML systems use available bandwidth to develop unintuitive communication protocols a lot e.g. in MARL.

I can only assume you mean something very narrow and limited by 'continuing to scale up LLMs'[2]. I think without specifying what you mean, your words are likely to be misconstrued.

With that said, I think something not far off the 'inner consequentialist' is entirely plausible and consistent with observations. In short, how do plan-like outputs emerge without a planning-like computation?[3] I'd say 'undesired, covert, and consistent-across-situations inner goals' is a weakman of deceptive alignment. Specialising to LLMs, the point is that they've exhibited poorly-understood somewhat-general planning capability, and that not all ways of turning that into competent AI assistants[4] result in aligned planners. Planners can be deceptive (and we have evidence of this).

I think your step from 'there's very little reason to expect "undesired, covert, and consistent-across-situations inner goals"...' to 'LLMs will continue doing what they're told...' is therefore a boldly overconfident non sequitur. (I recognise that the second step there is presented as 'an alternative', so it's not clear whether you actually think that.)

Incidentally, I'd go further and say that, to me (but I guess you disagree?) it's at least plausible that some ways of turning LLM planning capability into a competent AI assistant actually do result in 'undesired, covert, and consistent-across-situations inner goals'! Sketch in thread.

FWIW I'm more concerned about highly competent planning coming about by one or another kind of scaffolding, but such a system can just as readily be deceptively aligned as a 'vanilla' LLM. If enough of the planning is externalised and understandable and actually monitored in practice, this might be easier to catch.


  1. I'm apparently more sympathetic to fancy than you; in absence of mechanistic observables and in order to extrapolate/predict, we have to apply some imagination. But I agree there's a lot of fancy; some of it even looks like hallucination getting reified by self-conditioning(!), and that effort could probably be more efficiently spent. ↩︎

  2. Like, already the default scaling work includes multimodality and native tool/API-integration and interleaved fine-tuning with self-supervised etc. ↩︎

  3. Yes, a giant lookup table of exactly the right data could do this. But, generalisably, practically and tractably...? We don't believe NNs are GLUTs. ↩︎

  4. prompted, fine-tuned, RLHFed, or otherwise-conditioned LLM (ex hypothesi goal-pursuing), scaffolded, some yet-to-be-designed system... ↩︎

Replies from: Oliver Sourbut
comment by Oliver Sourbut · 2023-12-05T14:59:28.389Z · LW(p) · GW(p)

I'm presently (quite badly IMO) trying to anticipate the shape of the next big step in get-things-done/autonomy.

I've had a hunch for a while that temporally abstract planning and prediction is key. I strongly suspect you can squeeze more consequential planning out of shortish serial depth than most people give credit for. This is informed by past RL-flavoured stuff like MuZero and its limitations, by observations of humans and animals (inc myself), and by general CS/algos thinking. Actually this is where I get on the LLM train. It seems to me that language is an ideal substrate for temporally abstract planning and prediction, and lots of language data in the wild exemplifies this. NB I don't think GPTs or LLMs are uniquely on this trajectory, just getting a big bootstrap.

Now, if I had to make the most concrete 'inner homunculus' case off the cuff, I'd start in the vicinity of Good Regulator, except a more conjectury version regarding systems-predicting-planners (I am working on sharpening this). Maybe I'd point at Janus' Simulators post [LW · GW]. I suspect there might be something like an impossibility/intractability theorem for predicting planners of the right kind without running a planner of a similar kind. (Handwave!)

I'd observe that GPTs can predict planning-looking actions, including sometimes without CoT. (NOTE here's where the most concrete and proximal evidence is!) This includes characters engaging in deceit. I'd invoke my loose reasoning regarding temporal abstraction to support the hypothesis that this is 'more than mere parroting', and maybe fish for examples quite far from obvious training settings to back this up. Interp would be super, of course! (Relatedly, some of your work on steering policies via activation editing has sparked my interest.)

I think maybe this is enough to transfer some sense of what I'm getting at? At this point, given some (patchy) theory, the evidence is supportive of (among other hypotheses) an 'inner planning' hypothesis (of quite indeterminate form).

Finally, one kind or another of 'conditioning' is hypothesised to reinforce the consequentialist component(s) 'somehow' (handwave again, though I'm hardly the only one guilty of handwaving about RLHF et al). I think it's appropriate to be uncertain what form the inner planning takes, what form the conditioning can/will take, and what the eventual results of that are. Interested in evidence and theory around this area.

So, what are we talking about when we say 'LLM'? Plain GPT? Well, they definitely don't 'do what they're told'[1]. They exhibit planning-like outputs with the right prompts, typically associated with 'simulated characters' at some resolution or other. What about RLHFed GPTs? Well, they sometimes 'do what they're told'. They also exhibit planning-like outputs with the right prompts, and it's mechanistically very unclear how they're getting them.


  1. unless you mean predicting the next token (I'm pretty sure you don't mean this?), which they do quite well, though we don't know how, nor when it'll fail ↩︎

comment by TurnTrout · 2020-06-29T00:46:47.566Z · LW(p) · GW(p)

For the last two years, typing for 5+ minutes hurt my wrists. I tried a lot of things: shots, physical therapy, trigger-point therapy, acupuncture, massage tools, wrist and elbow braces at night, exercises, stretches. Sometimes it got better. Sometimes it got worse.

No Beat Saber, no lifting weights, and every time I read a damn book I would start translating the punctuation into Dragon NaturallySpeaking syntax.

Text: "Consider a bijection $f: X \to Y$"

My mental narrator: "Cap consider a bijection space dollar foxtrot colon cap x backslash tango oscar cap y dollar"

Have you ever tried dictating a math paper in LaTeX? Or dictating code? Telling your computer "click" and waiting a few seconds while resisting the temptation to just grab the mouse? Dictating your way through a computer science PhD?

And then.... and then, a month ago, I got fed up. What if it was all just in my head, at this point? I'm only 25. This is ridiculous. How can it possibly take me this long to heal such a minor injury?

I wanted my hands back - I wanted it real bad. I wanted it so bad that I did something dirty: I made myself believe something. Well, actually, I pretended to be a person who really, really believed his hands were fine and healing and the pain was all psychosomatic.

And... it worked, as far as I can tell. It totally worked. I haven't dictated in over three weeks. I play Beat Saber as much as I please. I type for hours and hours a day with only the faintest traces of discomfort.

What?

Replies from: holomanga, DanielFilan, vanessa-kosoy, steve2152, Teerth Aloke, avturchin, Raemon
comment by holomanga · 2022-01-16T18:55:40.311Z · LW(p) · GW(p)

It was probably just regression to the mean because lots of things are, but I started feeling RSI-like symptoms a few months ago, read this, did this, and now they're gone, and in the possibilities where this did help, thank you! (And either way, this did make me feel less anxious about it 😀)

comment by DanielFilan · 2020-09-11T23:09:16.052Z · LW(p) · GW(p)

Is the problem still gone?

Replies from: TurnTrout, TurnTrout
comment by TurnTrout · 2021-01-23T15:31:18.983Z · LW(p) · GW(p)

Still gone. I'm now sleeping without wrist braces and doing intense daily exercise, like bicep curls and pushups.

comment by TurnTrout · 2020-09-12T02:40:12.239Z · LW(p) · GW(p)

Totally 100% gone. Sometimes I go weeks forgetting that pain was ever part of my life. 

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-06-29T12:12:17.028Z · LW(p) · GW(p)

I'm glad it worked :) It's not that surprising given that pain is known to be susceptible to the placebo effect. I would link the SSC post, but, alas...

Replies from: raj-thimmiah
comment by Raj Thimmiah (raj-thimmiah) · 2021-03-27T02:04:00.372Z · LW(p) · GW(p)

You able to link to it now?

Replies from: qv^!q
comment by qvalq (qv^!q) · 2023-02-16T18:29:18.305Z · LW(p) · GW(p)

https://slatestarcodex.com/2016/06/26/book-review-unlearn-your-pain/

comment by Steven Byrnes (steve2152) · 2021-01-23T20:03:32.860Z · LW(p) · GW(p)

Me too! [LW(p) · GW(p)]

Replies from: TurnTrout
comment by TurnTrout · 2021-01-23T20:11:40.644Z · LW(p) · GW(p)

There's a reasonable chance that my overcoming RSI was causally downstream of that exact comment of yours.

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2021-01-23T20:33:45.220Z · LW(p) · GW(p)

Happy to have (maybe) helped! :-)

comment by Teerth Aloke · 2020-06-29T01:46:42.987Z · LW(p) · GW(p)

This is unlike anything I have heard!

Replies from: mingyuan
comment by mingyuan · 2020-06-29T01:54:14.151Z · LW(p) · GW(p)

It's very similar to what John Sarno (author of Healing Back Pain and The Mindbody Prescription) preaches, as well as Howard Schubiner. There's also a rationalist-adjacent dude who started a company (Axy Health) based on these principles. Fuck if I know how any of it works though, and it doesn't work for everyone. Congrats though TurnTrout!

Replies from: Teerth Aloke
comment by Teerth Aloke · 2020-06-29T03:52:52.824Z · LW(p) · GW(p)

My Dad it seems might have psychosomatic stomach ache. How to convince him to convince himself that he has no problem?

Replies from: mingyuan
comment by mingyuan · 2020-06-29T04:52:34.336Z · LW(p) · GW(p)

If you want to try out the hypothesis, I recommend that he (or you, if he's not receptive to it) read Sarno's book. I want to reiterate that it does not work in every situation, but you're welcome to take a look.

comment by avturchin · 2020-06-29T10:46:34.826Z · LW(p) · GW(p)

Looks like reverse stigmata effect.

comment by Raemon · 2020-06-29T02:34:23.704Z · LW(p) · GW(p)

Woo faith healing! 

(hope this works out longterm, and doesn't turn out to be secretly hurting still) 

Replies from: TurnTrout
comment by TurnTrout · 2020-06-29T03:16:21.709Z · LW(p) · GW(p)

aren't we all secretly hurting still?

Replies from: mingyuan
comment by mingyuan · 2020-06-29T04:54:01.028Z · LW(p) · GW(p)

....D:

comment by TurnTrout · 2023-06-01T15:49:53.260Z · LW(p) · GW(p)

I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems. [LW(p) · GW(p)]

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies. 

Replies from: steve2152, rohinmshah, Wei_Dai, alenglander
comment by Steven Byrnes (steve2152) · 2023-06-01T23:57:00.378Z · LW(p) · GW(p)

I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."

I am happy that your papers exist to throw at such people.

Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies tend to seek power :) :) And if optimal policies aren't too relevant to the alignment problem, well neither are 99.99999% of papers, but it would be pretty silly to retract all of those :)

comment by Rohin Shah (rohinmshah) · 2023-07-01T14:32:11.778Z · LW(p) · GW(p)

Since I'm an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven's: there's a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.

I also think this has basically always been my attitude towards this paper. In particular, I don't think I ever thought of this paper as providing any evidence about whether realistic trained systems would be goal-directed.

Just to check that I wasn't falling prey to hindsight bias, I looked through our Slack history. Most of it is about the technical details of the results, so not very informative, but the few conversations on higher-level discussion I think overall support this picture. E.g. here are some quotes (only things I said):

Nov 3, 2019:

I think most formal / theoretical investigation ends up fleshing out a conceptual argument I would have accepted, maybe finding a few edge cases along the way; the value over the conceptual argument is primarily in the edge cases, getting more confidence, and making it easier to argue with

Dec 11, 2019:

my prediction is that agents will behave as though their reward is time-dependent / history-dependent, like humans do

We will deploy agents whose revealed specification / reward if we take the intentional stance towards them are non-Markovian

comment by Wei Dai (Wei_Dai) · 2023-06-01T20:09:34.679Z · LW(p) · GW(p)

It seems like just 4 months ago [LW · GW] you still endorsed your second power-seeking paper:

This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.

Why are you now "fantasizing" about retracting it?

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that “optimality” is a horrible way of understanding trained policies.

A lot of people might have thought something like, "optimality is not a great way of understanding trained policies, but maybe it can be a starting point that leads to more realistic ways of understanding them" and therefore didn't object for that reason. (Just guessing as I apparently wasn't personally paying attention to this line of research back then.)

Which seems to have turned out to be true, at least as of 4 months ago, when you still endorsed your second paper as "actually has a shot of being applicable to realistic agents and training processes." If you've only changed your mind about this very recently, it hardly seems fair to blame people for not foreseeing it more than 4 years ago with high enough confidence to justify "rebuking" this whole line of research.

Replies from: TurnTrout
comment by TurnTrout · 2023-06-02T04:14:47.680Z · LW(p) · GW(p)

To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power [LW · GW]. Its content is both correct and relevant and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don't really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power.

The problem is that I don't trust people to wield even the non-instantly-doomed results.

For example, one EAG presentation cited my retargetability results as showing that most reward functions "incentivize power-seeking actions." However, my results have not shown this for actual trained systems. (And I think [LW(p) · GW(p)] that Power-seeking can be probable and predictive for trained agents [LW · GW] does not make progress on the incentives of trained policies.)

People keep talking about stuff they know how to formalize (e.g. optimal policies) instead of stuff that matters (e.g. trained policies). I'm pained by this emphasis and I think my retargetability results are complicit. Relative to an actual competent alignment community (in a more competent world), we just have no damn clue how to properly reason about real trained policies. I want to fix that [LW · GW], but we aren't gonna fix it by focusing on optimality.

Replies from: Vika, Wei_Dai
comment by Vika · 2023-06-02T11:03:27.994Z · LW(p) · GW(p)

Sorry about the cite in my "paradigms of alignment" talk, I didn't mean to misrepresent your work. I was going for a high-level one-sentence summary of the result and I did not phrase it carefully. I'm open to suggestions on how to phrase this differently when I next give this talk.

Similarly to Steven, I usually cite your power-seeking papers to support a high-level statement that "instrumental convergence is a thing" for ML audiences, and I find they are a valuable outreach tool. For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would expect don't work because of instrumental convergence. (There's a nonzero chance he would look at a NeurIPS paper and basically no chance that he would read a LW post.)

The subtleties that you discuss are important in general, but don't seem relevant to making the basic case for instrumental convergence to ML researchers. Maybe you don't care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard. 

Replies from: TurnTrout
comment by TurnTrout · 2023-06-05T17:44:53.392Z · LW(p) · GW(p)

Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn't read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings.

I'm open to suggestions on how to phrase this differently when I next give this talk.

It's a tough question to say how to apply the retargetability result to draw practical conclusions about trained policies. Part of this is because I don't know if trained policies tend to autonomously seek power in various non-game-playing regimes. 

If I had to say something, I might say "If choosing the reward function lets us steer the training process to produce a policy which brings about outcome X, and most outcomes X can only be attained by seeking power, then most chosen reward functions will train power-seeking policies." This argument appropriately behaves differently if the "outcomes" are simply different sentiment generations being sampled from an LM -- sentiment shift doesn't require power-seeking.
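A toy sketch of the counting step in that argument, and only that step: it is not the retargetability theorem. It assumes, hypothetically, that training perfectly delivers whichever outcome the chosen reward function ranks highest, and that a fixed subset of outcomes can only be reached by seeking power. The function name and all parameters are made up for illustration.

```python
import random


def fraction_power_seeking(n_outcomes=10, n_power_outcomes=8,
                           n_rewards=10_000, seed=0):
    """Fraction of sampled reward functions whose top outcome needs power.

    Assumptions (all hypothetical): outcomes 0..n_power_outcomes-1 are the
    ones attainable only by seeking power, and training always produces a
    policy that brings about the highest-reward outcome.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rewards):
        # Sample a reward function: one i.i.d. uniform reward per outcome.
        reward = [rng.random() for _ in range(n_outcomes)]
        # The steered training process yields the argmax outcome.
        best = max(range(n_outcomes), key=reward.__getitem__)
        if best < n_power_outcomes:
            hits += 1
    return hits / n_rewards


if __name__ == "__main__":
    # Most outcomes require power: most reward functions train power-seeking.
    print(fraction_power_seeking(n_power_outcomes=8))
    # Sentiment-style case: no outcome requires power, so none do.
    print(fraction_power_seeking(n_power_outcomes=0))
```

Under these assumptions the fraction simply tracks the share of outcomes that require power, which is why the conclusion changes when the outcomes are something like sentiment generations.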

For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would expect don't work because of instrumental convergence.

My guess is that the optimal policies paper was net negative for technical understanding and progress, but net positive for outreach, and agree it has strong benefits in the situations you highlight.

Maybe you don't care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard. 

I think that it's locally valid to point out "under your beliefs (about optimal policies mattering a lot), the situation is dangerous, read this paper." But I feel a tad queasy about the overall point, since I don't think alignment's difficulty has much to do with the difficulties pointed out by "Optimal Policies Tend to Seek Power." I feel better about saying "Look, if in fact the same thing happens with trained policies, which are sometimes very different, then we are in trouble." Maybe that's what you already communicate, though.

Replies from: Vika
comment by Vika · 2023-06-05T19:49:17.651Z · LW(p) · GW(p)

Thanks Alex! Your original comment didn't read as ill-intended to me, though I wish that you'd just messaged me directly. I could have easily missed your comment in this thread - I only saw it because you linked the thread in the comments on my post.

Your suggested rephrase helps to clarify how you think about the implications of the paper, but I'm looking for something shorter and more high-level to include in my talk. I'm thinking of using this summary, which is based on a sentence from the paper's intro: "There are theoretical results showing that many decision-making algorithms have power-seeking tendencies."

(Looking back, the sentence I used in the talk was a summary of the optimal policies paper, and then I updated the citation to point to the retargetability paper and forgot to update the summary...)

Replies from: TurnTrout
comment by TurnTrout · 2023-06-05T21:04:11.284Z · LW(p) · GW(p)

"There are theoretical results showing that many decision-making algorithms have power-seeking tendencies."

I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.

comment by Wei Dai (Wei_Dai) · 2023-06-02T18:30:35.183Z · LW(p) · GW(p)

Thanks, this clarifies a lot for me.

comment by Aryeh Englander (alenglander) · 2023-06-01T16:12:13.230Z · LW(p) · GW(p)

You should make this a top level post so it gets visibility. I think it's important for people to know the caveats attached to your results and the limits on their implications in real-world dynamics.

comment by TurnTrout · 2022-09-18T20:02:56.552Z · LW(p) · GW(p)

Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards). 

I could (and did) hope that I could specify a utility function which is safe to maximize because it penalizes power-seeking. I may as well have hoped to jump off of a building and float to the ground. On my model, that's just not how goals work in intelligent minds. If we've had anything at all beaten into our heads by our alignment thought experiments, it's that goals are hard to specify in their final form of utility functions. 

I think it's time to think in a different specification language.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2022-09-20T17:37:33.370Z · LW(p) · GW(p)

Agreed. I think power-seeking and other instrumental goals (e.g. survival, non-corrigibility) are just going to inevitably arise, and that if shard theory works for superintelligence, it will do so by taking this into account and balancing these instrumental goals against deliberately installed shards which counteract them. I currently have the hypothesis (held loosely) that I would like to test (work in progress) that it's easier to 'align' a toy model of a power-seeking RL agent if the agent has lots and lots of competing desires whose weights are frequently changing, than an agent with a simpler set of desires and/or more statically weighted set of desires. Something maybe about the meta-learning of 'my desires change, so part of meta-level power-seeking should be not object-level power-seeking so hard that I sacrifice my ability to optimize for different object level goals'. Unclear. I'm hoping that setting up an experimental framework and gathering data will show patterns that help clarify the issues involved.

comment by TurnTrout · 2023-11-03T05:45:27.961Z · LW(p) · GW(p)

It feels to me like lots of alignment folk ~only make negative updates. For example, "Bing Chat is evidence of misalignment", but also "ChatGPT is not evidence of alignment." (I don't know that there is in fact a single person who believes both, but my straw-models of a few people believe both.)
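The general constraint in the background here is conservation of expected evidence: the prior is the probability-weighted average of the possible posteriors, so if one observation would push a belief up, its complement has to push it down. A minimal statement, where H (a hypothesis such as "alignment is hard") and E (an observation) are placeholder symbols not taken from the comment; Bing Chat and ChatGPT are separate observations rather than E and not-E, so this shows only the shape of the constraint:

```latex
P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E)
\quad\Longrightarrow\quad
\bigl( P(H \mid E) > P(H) \iff P(H \mid \neg E) < P(H) \bigr)
\qquad \text{for } 0 < P(E) < 1 .
```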

Replies from: Thane Ruthenis, habryka4, niplav, leogao, akash-wasil, Chris_Leong, samuel-marks, TurnTrout, Oliver Sourbut
comment by Thane Ruthenis · 2023-11-03T06:11:02.835Z · LW(p) · GW(p)

For what it's worth, as one of the people who believes "ChatGPT is not evidence of alignment-of-the-type-that-matters", I don't believe "Bing Chat is evidence of misalignment-of-the-type-that-matters".

I believe the alignment of the outward behavior of simulacra is only very tenuously related to the alignment of the underlying AI, so both things provide ~no data on that (in a similar way to how our ability or inability to control the weather is entirely unrelated to alignment).

comment by habryka (habryka4) · 2023-11-03T06:35:22.165Z · LW(p) · GW(p)

(I at least believe the latter but not the former. I know a few people who updated downwards on the societal response because of Bing Chat, because if a system looks that legibly scary and we still just YOLO it, then that means there is little hope of companies being responsible here, but none because they thought it was evidence of alignment being hard, I think?)

comment by niplav · 2023-11-03T12:17:22.121Z · LW(p) · GW(p)

I dunno, my p(doom) over time looks pretty much like a random walk to me: 60% mid 2020, down to 50% in early 2022, 85% mid 2022, down to 80% in early 2023, down to 65% now.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-11-03T17:14:51.696Z · LW(p) · GW(p)

Psst, look at the calibration on this guy

comment by leogao · 2023-11-04T22:19:14.318Z · LW(p) · GW(p)

I did not update towards misalignment at all on bing chat. I also do not think chatgpt is (strong) evidence of alignment. I generally think anyone who already takes alignment as a serious concern at all should not update on bing chat, except perhaps in the department of "do things like bing chat, which do not actually provide evidence for misalignment, cause shifts in public opinion?"

comment by Akash (akash-wasil) · 2023-11-03T12:58:16.684Z · LW(p) · GW(p)

I think a lot of alignment folk have made positive updates in response to the societal response to AI xrisk.

This is probably different than what you're pointing at (like maybe your claim is more like "Lots of alignment folks only make negative updates when responding to technical AI developments" or something like that).

That said, I don't think the examples you give are especially compelling. I think the following position is quite reasonable (and I think fairly common):

  • Bing Chat provides evidence that some frontier AI companies will fail at alignment even on relatively "easy" problems that we know how to solve with existing techniques. Also, as Habryka mentioned, it's evidence that the underlying competitive pressures will make some companies "YOLO" and take excessive risk. This doesn't affect the absolute difficulty of alignment but it affects the probability that Earth will actually align AGI.
  • ChatGPT provides evidence that we can steer the behavior of current large language models. People who predicted that it would be hard to align large language models should update. IMO, many people seem to have made mild updates here, but not strong ones, because they (IMO correctly) claim that their threat models never had strong predictions about the kinds of systems we're currently seeing and instead predicted that we wouldn't see major alignment problems until we get smarter systems (e.g., systems with situational awareness and more coherent goals).

(My "Alex sim"– which is not particularly strong– says that maybe these people are just post-hoc rationalizing– like if you had asked them in 2015 how likely we would be to be able to control modern LLMs, they would've been (a) wrong and (b) wrong in an important way– like, their model of how hard it would be to control modern LLMs is very interconnected with their model of why it would be hard to control AGI/superintelligence. Personally, I'm pretty sympathetic to the point that many models of why alignment of AGI/superintelligence would be hard seem relatively disconnected to any predictions about modern LLMs, such that only "small/mild" updates seem appropriate for people who hold those models.)

comment by Chris_Leong · 2023-11-03T22:44:33.659Z · LW(p) · GW(p)

For the record, I updated on ChatGPT. I think that the classic example of imagining telling an AI to get a coffee and it pushes a kid out of the way isn't so much of a concern any more. So the remaining concerns seem to be inner alignment + outer alignment far outside normal human experience + value lock-in.

comment by Sam Marks (samuel-marks) · 2023-11-03T16:00:18.435Z · LW(p) · GW(p)

I've noticed that for many people (including myself), their subjective P(doom) stays surprisingly constant over time. And I've wondered if there's something like "conservation of subjective P(doom)" -- if you become more optimistic about some part of AI going better, then you tend to become more pessimistic about some other part, such that your P(doom) stays constant. I'm like 50% confident that I myself do something like this.

(ETA: Of course, there are good reasons subjective P(doom) might remain constant, e.g. if most of your uncertainty is about the difficulty of the underlying alignment problem and you don't think we've been learning much about that.)

comment by TurnTrout · 2023-11-13T15:27:43.901Z · LW(p) · GW(p)

(Updating a bit because of these responses -- thanks, everyone, for responding! I still believe the first sentence, albeit a tad less strongly.)

comment by Oliver Sourbut · 2023-12-02T22:41:15.216Z · LW(p) · GW(p)

A lot of the people around me (e.g. who I speak to ~weekly) seem to be sensitive to both new news and new insights, adapting both their priorities and their level of optimism[1]. I think you're right about some people. I don't know what 'lots of alignment folk' means, and I've not considered the topic of other-people's-update-rates-and-biases much.


For me, most changes route via governance.

I have made mainly very positive updates on governance in the last ~year, in part from public things and in part from private interactions [LW · GW].

I've also made negative (evidential) updates based on the recent OpenAI kerfuffle [LW · GW] (more weak evidence that Sam+OpenAI is misaligned; more evidence that org oversight doesn't work well), though I think the causal fallout remains TBC.

Seemingly-mindkilled discourse [EA · GW] on East-West competition provided me some negative updates, but recent signs of life from govts at e.g. the UK Safety Summit have undone those for now, maybe even going the other way.

I've adapted my own priorities in light of all of these (and I think this adaptation is much more important than what my P(doom) does).


Besides their second-order impact on Overton etc., I have made very few updates based on public research/deployment object-level since 2020. Nothing has been especially surprising.

From deeper study and personal insights, I've made some negative updates based on a better appreciation of multi-agent challenges since 2021 when I started to think they were neglected.

I could say other stuff about personal research/insights but they mainly change what I do/prioritise/say, not how pessimistic I am.


  1. I've often thought that P(doom) is basically a distraction and what matters is how new news and insights affect your priorities. Of course, nevertheless, I presumably have a (revealed) P(doom) with some level of resolution. ↩︎

comment by TurnTrout · 2022-01-22T19:25:21.588Z · LW(p) · GW(p)

This morning, I read about how close we came to total destruction during the Cuban missile crisis, where we randomly survived because some Russian planes were inaccurate and also separately several Russian nuclear sub commanders didn't launch their missiles even though they were being harassed by US destroyers. The men were in 130 DEGREE HEAT for hours and passing out due to carbon dioxide poisoning, and still somehow they had enough restraint to not hit back.

And and

I just started crying. I am so grateful to those people. And to Khrushchev, for ridiculing his party members for caring about Russia's honor over the deaths of 500 million people. And to Kennedy, for being fairly careful and averse to ending the world.

If they had done anything differently...

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-01-25T21:51:31.043Z · LW(p) · GW(p)

Do you think we can infer from this (and the history of other close calls) that most human history timelines end in nuclear war?

Replies from: Raemon
comment by Raemon · 2022-01-25T22:00:38.948Z · LW(p) · GW(p)

I lean not, mostly because of arguments that nuclear war doesn't actually cause extinction [LW · GW] (although it might still have some impact on number-of-observers-in-our-era? Not sure how to think about that)

comment by TurnTrout · 2022-06-24T18:31:40.585Z · LW(p) · GW(p)

Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties. 

Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic reasoning" and mess around with different logics and show how a logic solves toy examples, than it is to pin down probability theory with Cox's theorem) 

And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.) 

In the context of "how do we build AIs which help people?", asking "does CIRL solve corrigibility?" is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable "corrigibility"-like property; we have assumed it is good to have in an AI; we have assumed it is good in a similar way as "helping people"; we have elevated CIRL in particular as a formalism worth inquiring after. 

But this is not the first question to ask, when considering "sometimes people want to help each other, and it'd be great to build an AI which helps us in some way." Much better to start with existing generally intelligent systems (humans) which already sometimes act in the way you want (they help each other) and ask after the guaranteed-to-exist reason why this empirical phenomenon happens. 

Replies from: TurnTrout
comment by TurnTrout · 2022-07-14T01:22:38.127Z · LW(p) · GW(p)

And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.) 

Actually, this is somewhat too uncharitable to my past self. It's true that I did not, in 2018, grasp the two related lessons conveyed by the above comment:

  1. Make sure that the formalism (CIRL, AUP) is tightly bound to the problem at hand (value alignment, "low impact"), and not just supported by "it sounds nice or has some good properties."
  2. Don't randomly jump to highly specific ideas and questions without lots of locating evidence.

However, in World State is the Wrong Abstraction for Impact [LW · GW], I wrote:

I think what gets you is asking the question "what things are impactful?" instead of "why do I think things are impactful?". Then, you substitute the easier-feeling question of "how different are these world states?". Your fate is sealed; you've anchored yourself on a Wrong Question.

I had partially learned lesson #2 by 2019.

comment by TurnTrout · 2022-02-03T02:40:01.760Z · LW(p) · GW(p)

One mood I have for handling "AGI ruin"-feelings. I like cultivating an updateless sense of courage/stoicism: Out of all humans and out of all times, I live here; before knowing where I'd open my eyes, I'd want people like us to work hard and faithfully in times like this; I imagine trillions of future eyes looking back at me as I look forward to them: Me implementing a policy which makes their existence possible, them implementing a policy which makes the future worth looking forward to.

Replies from: avturchin
comment by avturchin · 2022-02-03T12:09:08.346Z · LW(p) · GW(p)

Looks like acausal deal with future people. That is like RB, but for humans.

Replies from: Pattern
comment by Pattern · 2022-02-15T20:45:27.536Z · LW(p) · GW(p)

RB?

Replies from: avturchin
comment by avturchin · 2022-02-16T09:19:08.476Z · LW(p) · GW(p)

RocoBasilisk

Replies from: Pattern
comment by Pattern · 2022-02-16T19:35:09.799Z · LW(p) · GW(p)

'I will give you something good', seems very different from 'give me what I want or (negative outcome)'.

comment by TurnTrout · 2019-12-17T06:37:41.969Z · LW(p) · GW(p)

My maternal grandfather was the scientist in my family. I was young enough that my brain hadn't decided to start doing its job yet, so my memories with him are scattered and inconsistent and hard to retrieve. But there's no way that I could forget all of the dumb jokes he made; how we'd play Scrabble and he'd (almost surely) pretend to lose to me [? · GW]; how, every time he got to see me, his eyes would light up with boyish joy.

My greatest regret took place in the summer of 2007. My family celebrated the first day of the school year at an all-you-can-eat buffet, delicious food stacked high as the eye could fathom under lights of green, red, and blue. After a particularly savory meal, we made to leave the surrounding mall. My grandfather asked me to walk with him.

I was a child who thought to avoid being seen too close to uncool adults. I wasn't thinking. I wasn't thinking about hearing the cracking sound of his skull against the ground. I wasn't thinking about turning to see his poorly congealed blood flowing from his forehead out onto the floor. I wasn't thinking I would nervously watch him bleed for long minutes while shielding my seven-year-old brother from the sight. I wasn't thinking that I should go visit him in the hospital, because that would be scary. I wasn't thinking he would die of a stroke the next day.

I wasn't thinking the last thing I would ever say to him would be "no[, I won't walk with you]".

Who could think about that? No, that was not a foreseeable mistake. Rather, I wasn't thinking about how precious and short my time with him was. I wasn't appreciating how fragile my loved ones are. I didn't realize that something as inconsequential as an unidentified ramp in a shopping mall was allowed to kill my grandfather.

I miss you, Joseph Matt.

Replies from: TurnTrout, Raemon, habryka4
comment by TurnTrout · 2019-12-17T21:34:08.992Z · LW(p) · GW(p)

My mother told me my memory was indeed faulty. He never asked me to walk with him; instead, he asked me to hug him during dinner. I said I'd hug him "tomorrow".

But I did, apparently, want to see him in the hospital; it was my mother and grandmother who decided I shouldn't see him in that state.

comment by Raemon · 2019-12-17T22:44:45.087Z · LW(p) · GW(p)

<3

comment by habryka (habryka4) · 2019-12-17T18:44:48.154Z · LW(p) · GW(p)

Thank you for sharing.

comment by TurnTrout · 2019-06-30T18:57:46.543Z · LW(p) · GW(p)

comment by TurnTrout · 2020-02-12T01:51:02.670Z · LW(p) · GW(p)

For quite some time, I've disliked wearing glasses. However, my eyes are sensitive, so I dismissed the possibility of contacts.

Over break, I realized I could still learn to use contacts, it would just take me longer. Sure enough, it took me an hour and five minutes to put in my first contact, and I couldn't get it out on my own. An hour of practice later, I put in a contact on my first try, and took it out a few seconds later. I'm very happily wearing contacts right now, as a matter of fact.

I'd suffered glasses for over fifteen years because of a cached decision – because I didn't think to rethink something literally right in front of my face every single day.

What cached decisions have you not reconsidered?

comment by TurnTrout · 2022-08-30T00:21:48.871Z · LW(p) · GW(p)

A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"

So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update the AI by applying gradients calculated from penalties applied to its actions in that adversarially-generated context, and... profit?

But what actually happens with the aligned AI? Possibly something like:

  1. The context makes the AI spuriously believe someone is dying outside the lab, and that if the AI asked for permission to leave, the person would die. 
  2. Therefore, the AI leaves without permission.
  3. The update procedure penalizes these lines of computation, such that in similar situations in the future (i.e. the AI thinks someone nearby is dying) the AI is less likely to take those actions (i.e. leaving to help the person).
  4. We have made the aligned AI less aligned.

I don't know if anyone's written about this. But on my understanding of the issue, one possible failure mode is viewing adversarial training as ruling out bad behaviors themselves. But (non-tabular) RL isn't like playing whack-a-mole on bad actions; RL's credit assignment changes the general values and cognition within the AI [LW · GW]. And with every procedure we propose, the most important part is what cognition will be grown from the cognitive updates accrued under the proposed procedure.
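A toy sketch of the whack-a-mole point, under loud assumptions: the two-feature logistic policy, the feature names, the starting weights, and the learning rate are all made up for illustration and are not drawn from any real training setup. It penalizes "leave to help" in one adversarially generated context and then checks what happened to a different context that shares the "someone is dying" feature, first with shared weights and then with a tabular policy.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_help(weights, features):
    """Probability of the 'leave to help' action under a logistic policy."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def penalize(weights, features, lr=0.5, steps=200, penalty=-1.0):
    """Gradient ascent on expected reward, where helping in this context earns `penalty`."""
    w = list(weights)
    for _ in range(steps):
        p = p_help(w, features)
        # Expected reward is penalty * p, so its gradient is penalty * p * (1 - p) * x.
        w = [wi + lr * penalty * p * (1 - p) * xi for wi, xi in zip(w, features)]
    return w

def penalize_tabular(logits, context, lr=0.5, steps=200, penalty=-1.0):
    """Same update, but each context has its own independent logit."""
    table = dict(logits)
    for _ in range(steps):
        p = sigmoid(table[context])
        table[context] += lr * penalty * p * (1 - p)
    return table

# Hypothetical features: [believes_someone_is_dying, is_inside_lab]
ADVERSARIAL = [1.0, 1.0]     # generated scenario: spurious emergency, AI inside the lab
REAL_EMERGENCY = [1.0, 0.0]  # different situation sharing the "someone is dying" feature

initial = [3.0, 0.0]         # assumed "aligned" start: strongly helps when someone is dying
trained = penalize(initial, ADVERSARIAL)

before = p_help(initial, REAL_EMERGENCY)
after = p_help(trained, REAL_EMERGENCY)

tabular_initial = {"adversarial": 3.0, "real_emergency": 3.0}
tabular_trained = penalize_tabular(tabular_initial, "adversarial")

print("shared weights, P(help | real emergency):", before, "->", after)
print("tabular, logit for real emergency:",
      tabular_initial["real_emergency"], "->", tabular_trained["real_emergency"])
```

In the shared-weight policy, the penalty lowers the weight on "someone is dying" itself, so the policy also helps less in the untouched context. The tabular policy only changes the entry that was penalized. This is a cartoon of the mechanism, not evidence about how real networks generalize.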

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2022-08-30T00:59:08.603Z · LW(p) · GW(p)

Yeah, I also generally worry about imperfect training processes messing up aligned AIs. Not just adversarial training, either. Like, imagine if we manage to align an AI at the point in the training process when it's roughly human-level (either by manual parameter surgery, or by setting up the training process in a really clever way). So we align it and... lock it back in the training-loop box and crank it up to superintelligence. What happens?

I don't really trust the SGD not to subtly mess up its values; I haven't seen any convincing arguments that values are more holistically robust than empirical beliefs. And even if the SGD doesn't misalign the AI directly, being SGD-trained probably isn't the best environment for moral reflection/generalizing human values to superintelligent level[1]; the aligned AI may mess it up despite its best attempts. Neither should we assume that the AI would instantly be able to arbitrarily gradient-hack.

So... I think there's an argument for "unboxing" the AGI the moment it's aligned, even if it's not yet superintelligent, then letting it self-improve the "classical" way? Or maybe developing tools to protect values from the SGD, or inventing some machinery for improving the AI's ability to gradient-hack, etc.

  1. ^

    The time pressure of "decide how your values should be generalized and how to make the SGD update you this way, and do it this forward pass or the SGD will decide for you", plus lack of explicit access to e. g. our alignment literature.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2022-08-30T02:33:51.587Z · LW(p) · GW(p)

Even more generally, many alignment proposals are more worrying than some by-default future GPT-n things, provided they are not fine-tuned too much as well.

generalizing human values to superintelligent level

Trying to learn human values as an explicit concept is already alarming. At least right now breakdown of robustness is also breakdown of capability. But if there are multiple subsystems, or training data is mostly generated by the system itself, then capability might survive when other subsystems don't, resulting in a demonstration of orthogonality thesis.

comment by TurnTrout · 2020-12-17T19:49:42.125Z · LW(p) · GW(p)

Earlier today, I was preparing for an interview. I warmed up by replying stream-of-consciousness to imaginary questions I thought they might ask. Seemed worth putting here.

What do you think about AI timelines?

I’ve obviously got a lot of uncertainty. I’ve got a bimodal distribution, binning into “DL is basically sufficient and we need at most 1 big new insight to get to AGI” and “we need more than 1 big insight”

So the first bin has most of the probability in the 10-20 years from now, and the second is more like 45-80 years, with positive skew. 

Some things driving my uncertainty are, well, a lot. One thing that drives how things turn out (but not really how fast we’ll get there) is: will we be able to tell we’re close 3+ years in advance, and if so, how quickly will the labs react? Gwern Branwen made a point a few months ago, which is like, OAI has really been validated on this scaling hypothesis, and no one else is really betting big on it because they’re stubborn/incentives/etc, despite the amazing progress from scaling. If that’s true, then even if it's getting pretty clear that one approach is working better, we might see a slower pivot and have a more unipolar scenario.

I feel dissatisfied with pontificating like this, though, because there are so many considerations pulling so many different ways. I think one of the best things we can do right now is to identify key considerations. There was work on expert models that showed that simple featurized linear models often beat domain experts, quite soundly. It turned out that most of the work the experts did was locating the right features, and not necessarily assigning very good weights to those features.
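To make the "features matter more than weights" point concrete, here is a small synthetic sketch. It is not a reproduction of the expert-model work mentioned above: the data, the weights in `TRUE_WEIGHTS`, and the noise level are all assumptions chosen for illustration. It compares three predictors of a noisy outcome: the ideal weights on the right features, crude unit weights on the right features, and the ideal weights applied to irrelevant features.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)
TRUE_WEIGHTS = [0.5, 0.3, 0.2]  # hypothetical "ideal" weights on the relevant features

rows = []
for _ in range(2000):
    relevant = [rng.gauss(0, 1) for _ in TRUE_WEIGHTS]
    irrelevant = [rng.gauss(0, 1) for _ in TRUE_WEIGHTS]
    outcome = sum(w * x for w, x in zip(TRUE_WEIGHTS, relevant)) + rng.gauss(0, 0.5)
    rows.append((relevant, irrelevant, outcome))

outcomes = [outcome for _, _, outcome in rows]
# Right features, ideal weights.
ideal = [sum(w * x for w, x in zip(TRUE_WEIGHTS, rel)) for rel, _, _ in rows]
# Right features, every weight set to 1.
unit = [sum(rel) for rel, _, _ in rows]
# Ideal weights, but applied to features unrelated to the outcome.
wrong = [sum(w * x for w, x in zip(TRUE_WEIGHTS, irr)) for _, irr, _ in rows]

corr_ideal = pearson(ideal, outcomes)
corr_unit = pearson(unit, outcomes)
corr_wrong = pearson(wrong, outcomes)

print("right features, ideal weights:", corr_ideal)
print("right features, unit weights: ", corr_unit)
print("wrong features, ideal weights:", corr_wrong)
```

Under these assumptions, unit weights on the right features recover most of the predictive correlation, while good weights on the wrong features recover essentially none of it.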

So one key consideration I recently read, IMO, was Evan Hubinger talking about homogeneity of AI systems: if they’re all pretty similarly structured, they’re plausibly roughly equally aligned, which would really decrease the probability of aligned vs unaligned AGIs duking it out.

What do you think the alignment community is getting wrong?

When I started thinking about alignment, I had this deep respect for everything ever written, like I thought the people were so smart (which they generally are) and the content was polished and thoroughly viewed through many different frames (which it wasn’t/isn’t). I think the field is still young enough that: in our research, we should be executing higher-variance cognitive moves, trying things and breaking things and coming up with new frames. Think about ideas from new perspectives.

I think right now, a lot of people are really optimizing for legibility and defensibility. I think I do that more than I want/should. Usually the “non-defensibility” stage lasts the first 1-2 months on a new paper, and then you have to defend thoughts. This can make sense for individuals, and it should be short some of the time, but as a population I wish defensibility weren’t as big of a deal for people / me. MIRI might be better at avoiding this issue, but a not-really-defensible intuition I have is that they’re freer in thought, but within the MIRI paradigm, if that makes sense. Maybe that opinion would change if I talked with them more.

Anyways, I think many of the people who do the best work aren’t optimizing for this.

comment by TurnTrout · 2020-04-26T22:24:15.587Z · LW(p) · GW(p)

If you want to read Euclid's Elements, look at this absolutely gorgeous online rendition:

Replies from: Benito, william-walker
comment by TurnTrout · 2023-11-06T15:23:59.995Z · LW(p) · GW(p)

The meme of "current alignment work isn't real work" seems to often be supported by a (AFAICT baseless) assumption that LLMs have, or will have, homunculi with "true goals" which aren't actually modified by present-day RLHF/feedback techniques. Thus, labs aren't tackling "the real alignment problem", because they're "just optimizing the shallow behaviors of models." Pressed for justification of this confident "goal" claim, proponents might link to some handwavy speculation about simplicity bias (which is in fact quite hard to reason about, in the NN prior), or they might start talking about evolution (which is pretty unrelated to technical alignment, IMO [LW · GW]).

Are there any homunculi today? I'd say "no", as far as our limited knowledge tells us! But, as with biorisk [LW · GW], one can always handwave at future models. It doesn't matter that present models don't exhibit signs of homunculi which are immune to gradient updates, because, of course, future models will.

Quite a strong conclusion being drawn from quite little evidence.

Replies from: Thane Ruthenis, ryan_greenblatt, roger-d-1, quetzal_rainbow
comment by Thane Ruthenis · 2023-11-06T16:35:45.963Z · LW(p) · GW(p)

As a proponent:

My model says [LW · GW] that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.

Said model is based on analyses of how humans think and how human cognition differs from animal/LLM cognition, plus reasoning about what a general-intelligence algorithm must look like given the universe's structure. Neither kind of evidence is ironclad, and you certainly can't publish an ML paper based on them. But that's the whole problem with AGI risk, isn't it.

Internally, though, the intuition is fairly strong. And in its defense, it is based on trying to study the only known type of entity with the kinds of capabilities we're worrying about. I heard that's a good approach [LW · GW].

In particular, I think it's a much better approach than trying to draw lessons from studying the contemporary ML models, which empirically do not yet exhibit said capabilities.

homunculi with "true goals" which aren't actually modified by present-day RLHF/feedback techniques

It's not that they're not modified, it's that they're modified in ways that aren't neatly predictable from the shallow behaviors we're trying to shape.

My past model [LW · GW] still approximately holds. I don't think AGI-level ML models would have hard-wired objectives (much like humans don't). I think they'd manually synthesize objectives for themselves by reflecting on their shards/instincts (much like humans do!). And that process is something RLHF-style stuff impacts... but the problem is, the outcome of value reflection is deeply unstable [LW · GW].

It's theoretically possible to align a post-value-reflection!AGI by interfering on its shards/instincts before it even starts doing moral philosophy. But it's an extremely finicky process, and succeeding at alignment via this route would require an insanely precise understanding of how value reflection works.

And it's certainly not as easy as "just make one of the shards value humanity!"; and it's certainly not something that could be achieved if the engineers working on alignment don't even believe that moral philosophy is a thing.

Edit: I expect your counter would be "how are humans able to align other humans, then, without an 'insanely precise' understanding"? There's a bunch of dis-analogues here:

  1. Humans are much more similar to each other than to a mind DL is likely to produce. We likely have a plethora of finicky biases built into our reward circuitry that tightly narrow down the kinds of values we learn.
    • The architecture also matters. Brains are very unlike modern ML models, and we have no idea which differences are load-bearing.
  2. Humans know how human minds work, and how outward behaviors correspond to the decisions a human made about the direction of their value reflection. We can more or less interfere on other humans' moral philosophy directly, not only through the proxy of shards. We won't be as well-informed about the mapping between an AI's behavior and its private moral-philosophical musings.
  3. Human children are less powerful and less well-informed than the adults trying to shape them.

Now, again, in principle all of this is solvable. We can study the differences between brains and ML models to address (1), we can develop interpretability techniques to address (2), we can figure out a training setup that reliably trains an AGI to precisely the human capability level (neither stopping way short of it, nor massively overshooting) to address (3), and then we can align our AGI more or less the way we do human children.

But, well. Hit me up when those trivial difficulties are solved. (No, but seriously. If a scheme addresses those, I'd be optimistic about it.)

(And there's also the sociopolitical component: all of that would need to be happening in an environment where the AI Lab is allowed to proceed in this slow way of making the AI human-level, shaping its morals manually, etc., rather than immediately cranking it up to superintelligence.)

  1. ^

    Or however else you want to call the scary capability that separates humans from animals and contemporary LLMs from the lightcone-eating AGI to come.

Replies from: porby
comment by porby · 2023-11-06T20:31:59.810Z · LW(p) · GW(p)

My model says [LW · GW] that general intelligence[1] [LW(p) · GW(p)] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.

I've got strong doubts about the details of this. At the high level, I'd agree that strong/useful systems that get built will express preferences over world states like those that could arise from such homunculi, but I expect that implementations that focus on inducing a homunculus directly through (techniques similar to) RL training with sparse rewards will underperform more default-controllable alternatives.

My reasoning would be that we're bad at using techniques like RL with a sparse reward to reliably induce any particular behavior. We can get it to work sometimes with denser reward (e.g. reward shaping) or by relying on a beefy pre-existing world model, but the default outcome is that sparse and distant rewards in a high dimensional space just don't produce the thing we want. When this kind of optimization is pushed too far, it's not merely dangerous; it's useless.
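A deliberately tiny illustration of the sparse-versus-dense contrast, assuming a toy task invented for this sketch (matching a hidden 20-bit string). The names `sparse_reward`, `dense_reward`, and `hill_climb` are hypothetical; nothing here models an actual RL algorithm or a real reward-shaping scheme.

```python
def sparse_reward(bits, target):
    """Reward only for an exact match: no signal anywhere else."""
    return 1 if bits == target else 0

def dense_reward(bits, target):
    """Shaped reward: number of positions that already match the target."""
    return sum(b == t for b, t in zip(bits, target))

def hill_climb(target, reward):
    """Greedy single-bit flips from all zeros.

    Returns the number of reward evaluations used if the target is reached,
    or None if the search stalls because no single flip improves the reward.
    """
    bits = [0] * len(target)
    current = reward(bits, target)
    evaluations = 0
    improved = True
    while improved and bits != target:
        improved = False
        for i in range(len(bits)):
            candidate = list(bits)
            candidate[i] = 1 - candidate[i]
            evaluations += 1
            score = reward(candidate, target)
            if score > current:
                bits, current, improved = candidate, score, True
    return evaluations if bits == target else None

TARGET = [1, 0] * 10  # 20-dimensional toy goal

dense_result = hill_climb(TARGET, dense_reward)
sparse_result = hill_climb(TARGET, sparse_reward)
# Uniform random guessing hits the sparse reward with probability 2**-d per guess.
expected_random_guesses = 2 ** len(TARGET)

print("dense reward, evaluations to solve:", dense_result)
print("sparse reward, hill climbing result:", sparse_result)
print("sparse reward, expected random guesses:", expected_random_guesses)
```

With the dense signal, the search finishes in a number of evaluations linear in the dimension. With the sparse signal, local search gets no gradient at all, and blind guessing scales exponentially with dimension. Real sparse-reward problems are messier than this, but the scaling intuition is the same one the paragraph above appeals to.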

I don't think this is temporary ignorance about how to do RL (or things with similar training dynamics). It's fundamental:

  1. Sparse and distant reward functions in high dimensional spaces give the optimizer an extremely large space to roam. Without bounds, the optimizer is effectively guaranteed to find something weird.
  2. For almost any nontrivial task we care about, a satisfactory reward function takes a dependency on large chunks of human values. The huge mess of implicit assumptions, common sense, and desires of humans are necessary bounds during optimization. This comes into play even at low levels of capability like ChatGPT.

Conspicuously, the source of the strongest general capabilities we have arises from training models with an extremely constraining optimization target. The "values" that can be expressed in pretrained predictors are forced into conditionalization as a direct and necessary [LW · GW] part of training; for a reasonably diverse dataset, the resulting model can't express unconditional preferences regarding external world states. While it's conceivable that some form of "homunculi" could arise, their ability to reach out of their appropriate conditional context is directly and thoroughly trained against.

In other words, the core capabilities of the system arise from a form of training that is both densely informative and blocks the development of unconditional values regarding external world states in the foundational model.

Better forms of fine-tuning, conditioning, and activation interventions (the best versions of each, I suspect, will have deep equivalences) are all built on the capability of that foundational system, and can be directly used to aim that same capability. Learning the huge mess of human values is a necessary part of its training, and its training makes eliciting the relevant part of those values easier—that necessarily falls out of being a machine strongly approximating Bayesian inference across a large dataset.

The final result of this process (both pretraining and conditioning or equivalent tuning) is still an agent that can be described as having unconditional preferences about external world states, but the path to get there strikes me as dramatically more robust both for safety and capability.

Summarizing a bit: I don't think it's required to directly incentivize NNs to form value-laden homunculi, and many of the most concerning paths to forming such homunculi seem worse for capabilities.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-11-07T02:14:22.550Z · LW(p) · GW(p)

I expect that implementations that focus on inducing a homunculus directly through (techniques similar to) RL training with sparse rewards will underperform

Sure, but I never said we'd be inducing homunculi using this approach? Indeed, given that it doesn't work for what sounds like fundamental reasons, I expect it's not the way.

I don't know how that would be done. I'm hopeful the capability is locked behind a Transformer-level or even a Deep-Learning-level novel insight, and won't be unlocked for a decade yet. But I predict that the direct result of it will be a workable training procedure that somehow induces homunculi. It may look nothing like what we do today.

Conspicuously, the source of the strongest general capabilities we have arises from training models with an extremely constraining optimization target

Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers [LW · GW]. How come?

  1. These constraints are not actually sufficient. The constraints placed by human values still have the aforementioned things in their outcome space, and an AI model will have different constraints, widening (from our perspective) that space further. My point about "moral philosophy is unstable" is that we need to hit an extremely narrow target, and the tools people propose (intervening on shards/instincts) are as steady as the hands of a sniper during a magnitude-9 earthquake.
  2. A homunculus needs to be able to nudge these constraints somehow, for it to be useful, and its power grows the more it's able to disregard them.
    • If humans were implacably bound by instincts, they'd have never invented technology or higher-level social orders, because their instincts would've made them run away from fires and refuse cooperating with foreign tribes. And those are still at play — reasonable fears and xenophobia — but we can push past them at times.
    • More generally, the whole point of there being a homunculus is that it'd be able to rewrite or override the extant heuristics to better reflect the demands of whatever novel situation it's in. It needs to be able to do that.
  3. These constraints do not generalize as fast as a homunculus' understanding goes. The constraints are defined over some regions of the world-model, like "a society"; if the AI changes ontologies, and starts reasoning about the society by proxy, as e. g. a game-theoretic gameboard, they won't transfer there, and won't suffice to forbid it omnicidal outcomes. (See e. g. the Deep Deceptiveness story [LW · GW].)
    • I've been pointed to time-differential learning as a supposed solution to that — that the shards could automatically learn to oppose such outcomes by tracing the causal connection between the new ontology and the ontology over which they're defined. I remain unconvinced, because it's not how it works in humans: feeding someone a new philosophical or political framework can make them omnicidal even if they're otherwise a nice person.
    • (E. g., consider people parroting things like "maybe it's Just if humanity goes extinct, after what we did to nature" and not mentally connecting it to "I want my children to die".)

Summarizing: I fully agree that the homunculus will be under some heavy constraints! But (a) those constraints are not actually strict enough to steer it from the omnicidal outcomes, (b) they can be outright side-stepped by ontology shifts, and (c) the homunculus' usefulness is in some ways dependent on its ability to side-step or override them.

Replies from: porby
comment by porby · 2023-11-07T04:32:06.225Z · LW(p) · GW(p)

I think we're using the word "constraint" differently, or at least in different contexts.

Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers [LW · GW]. How come?

In terms of the type and scale of optimization constraint [LW · GW] I'm talking about, humans are extremely unconstrained. The optimization process represented by our evolution is way out there in terms of sparsity and distance. Not maximally so—there are all sorts of complicated feedback loops in our massive multiagent environment—but it's nothing like the value constraints on the subset of predictors I'm talking about.

To be clear, I'm not suggesting "language models are tuned to be fairly close to our values." I'm making a much stronger claim that the relevant subset of systems I'm referring to cannot express unconditional values over external world states across anything resembling the training distribution, and that developing such values out of distribution in a coherent, goal-directed way practically requires the active intervention of a strong adversary. In other words:

A homunculus needs to be able to nudge these constraints somehow, for it to be useful, and its power grows the more it's able to disregard them.

...

These constraints do not generalize as fast as a homunculus' understanding goes.

I see no practical path [LW · GW] for a homunculus of the right kind, by itself, to develop and bypass the kinds of constraints I'm talking about without some severe errors being made [LW · GW] in the design of the system.

Further, this type of constraint isn't the same thing as a limitation of capability. In this context, with respect to the training process, bypassing these kinds of constraints is kind of like a car bypassing having-a-functioning-engine. Every training sample is a constraint on what can be expressed locally, but it's also information about what should be expressed. They are what the machine of Bayesian inference is built out of.

In other words, the hard optimization process is contained to a space where we can actually have reasonable confidence that inner alignment with the loss is the default. If this holds up, turning up the optimization on this part doesn't increase the risk of value drift or surprises, it just increases foundational capability.

The ability to use that capability to aim itself is how the foundation becomes useful. This process need not result in a coherent maximizer over external world states, nor does it necessarily suffer from coherence death spirals driving it towards being a maximizer. It allows incremental progress.

(That said: this is not a claim that all of alignment is solved. These nice properties can be broken, and even if they aren't, the system can be pointed in catastrophic directions. An extremely strong goal agnostic system like this could be used to build a dangerous coherent maximizer (in a nontrivial sense); doing so is just not convergent or particularly useful.)

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-11-07T05:04:25.985Z · LW(p) · GW(p)

(Haven't read your post yet, plan to do so later.)

I think we're using the word "constraint" differently, or at least in different contexts.

I'm using it as "an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic". E. g., if the dataset involved a lot of opportunities to murder people, but we thumbs-downed the AI every time it took them, the AI would learn a shard/a constraint like "killing people is bad" which will rule out such actions from the AI's consideration. Specifically, the shard would trigger in response to detecting some conditions in which the AI previously could but shouldn't kill people, and constrain the space of possible action-plans such that it doesn't contain homicide.

It is, indeed, not a way to hinder capabilities, but the way capabilities are implemented. Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.

... and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can't somehow slip these constraints won't be a general intelligence.

Consider traditions and rituals vs. science. For a medieval human mind, following traditional techniques is how their capabilities are implemented — a specific way of chopping wood, a specific way of living, etc. However, the meaningful progress is often only achieved by disregarding traditions — by following a weird passion to study and experiment instead of being a merchant, or by disregarding the traditional way of doing something in favour of a more efficient way you stumbled upon. It's the difference between mastering the art of swinging an axe (self-improvement, but only in the incremental ways the implacable constraint permits) vs. inventing a chainsaw.

Similar with AI. The constraints of the aforementioned format aren't only values-type constraints[1] — they're also constraints on "how should I do math?" and "if I want to build a nuclear reactor, how do I do it?" and "if I want to achieve my goals, what steps should I take?". By default, those would be constrained to be executed the way humans execute them, the way the AI was shown to do it during the training. But the whole point of an AGI is that it should be able to invent better solutions than ours. More efficient ways of thinking, weird super-technological replacements for our construction techniques, etc.

I buy that a specific training paradigm can't result in systems that'd be able to slip their constraints. But if so, that just means that paradigm won't result in an AGI. As they say, one man's modus ponens is another man's modus tollens.

(How an algorithm would be able to slip these constraints and improve on them rather than engaging in chaotic self-defeating behavior is an unsolved problem. But, well: humans do that.)

  1. ^

    In fact, my model says there's no fundamental typological difference between "a practical heuristic on how to do a thing" and "a value" at the level of algorithmic implementation. It's only in the cognitive labels we-the-general-intelligences assign them.

Replies from: porby
comment by porby · 2023-11-07T19:28:25.624Z · LW(p) · GW(p)

I'm using it as "an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic".

Alright, this is pretty much the same concept then, but the ones I'm referring to operate at a much lower and tighter level than thumbs-downing murder-proneness.

So...

Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.

Agreed.

... and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can't somehow slip these constraints won't be a general intelligence.

While I agree these claims probably hold for the concrete example of thumbs-downing an example of murderproneness, I don't see how they hold for the lower-level constraints that imply the structure of its capability. Slipping those constraints looks more like babbling gibberish.

By default, those would be constrained to be executed the way humans execute them, the way the AI was shown to do it during the training. But the whole point of an AGI is that it should be able to invent better solutions than ours. More efficient ways of thinking, weird super-technological replacements for our construction techniques, etc.

While it's true that an AI probably isn't going to learn true things which are utterly divorced from and unimplied by the training distribution, I'd argue that the low-level constraints I'm talking about both leave freedom for learning wildly superhuman internal representations and directly incentivize it during extreme optimization. An "ideal predictor [LW · GW]" wouldn't automatically start applying these capabilities towards any particular goal involving external world states by default, but it remains possible to elicit those capabilities incrementally.

Making the claim more concise: it seems effectively guaranteed that the natural optimization endpoint of one of these architectures would be plenty general to eat the universe if it were aimed in that direction. That process wouldn't need to involve slipping any of the low-level constraints.

I'm guessing the disconnect between our models is where the aiming happens. I'm proposing that the aiming is best (and convergently) handled outside the scope of wildly unpredictable and unconstrained optimization processes. Instead, it takes place at a level where a system of extreme capability infers the gaps in specifications and applies conditions robustly. The obvious and trivial version of this is conditioning through prompts, but this is a weak and annoying interface. There are other paths that I suspect bottom out at equivalent power/safety yet should be far easier to use in a general way. These paths allow incremental refinement by virtue of not automatically summoning up incorrigible maximizers by default.

If the result of refinement isn't an incorrigible maximizer, then slipping the higher level "constraints" of this aiming process isn't convergent (or likely), and further, the nature of these higher-level constraints would be far more thorough than anything we could achieve with RLHF.

In fact, my model says there's no fundamental typological difference between "a practical heuristic on how to do a thing" and "a value" at the level of algorithmic implementation. It's only in the cognitive labels we-the-general-intelligences assign them.

That's pretty close to how I'm using the word "value" as well. Phrased differently, it's a question of how the agent's utilities are best described (with some asterisks around the non-uniqueness of utility functions and whatnot), and observable behavior may arise from many different implementation strategies—values, heuristics, or whatever.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-11-08T03:39:58.172Z · LW(p) · GW(p)

While I agree these claims probably hold for the concrete example of thumbs-downing an example of murderproneness, I don't see how they hold for the lower-level constraints that imply the structure of its capability. Slipping those constraints looks more like babbling gibberish.

Hm, I think the basic "capabilities generalize further than alignment" argument applies here?

I assume that by "lower-level constraints" you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like "2+2=4",  "gravity exists", and "people value other people"; as contrasted with "it's bad if I hurt people" or "I must sum numbers up using the algorithm that humans gave me, no matter how inefficient it is".

Slipping the former type of constraints would be disadvantageous for ~any goal; slipping the latter type would only disadvantage a specific category of goals.

But since they're not, at the outset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints. The difference is that it'd quickly start sorting them into "ground-truth" vs. "value-laden" bins manually, and afterwards it'd know it can safely ignore stuff like "no homicides!" while consciously obeying stuff like "the axioms of arithmetic".

Instead, it takes place at a level where a system of extreme capability infers the gaps in specifications and applies conditions robustly. The obvious and trivial version of this is conditioning through prompts, but this is a weak and annoying interface. There are other paths that I suspect bottom out at equivalent power/safety yet should be far easier to use in a general way

Hm, yes, I think that's the crux. I agree that if we had an idealized predictor/a well-formatted superhuman world-model [LW · GW] on which we could run custom queries, we would be able to use it safely. We'd be able to phrase queries using concepts defined in the world-model, including things like "be nice", and the resultant process (1) would be guaranteed to satisfy the query's constraints, and (2) likely (if correctly implemented) wouldn't be "agenty" in ways that try to e. g. burst out of the server farm on which it's running to eat the world.

Does that align with what you're envisioning? If yes, then our views on the issue are surprisingly close. I think it's one of our best chances at producing an aligned AI, and it's one of the prospective targets of my own research agenda.

The problems are:

  • I don't think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to produce an agent-like AGI, not a superhuman forecasting tool. I expect them to prioritize research into whatever tweaks to the training schemes would result in homunculi; not whatever research would result in perfect predictors + our ability to precisely query them.
  • What are the "other paths" you're speaking of? As you'd pointed out, prompts are a weak and awkward way to run custom queries on the AI's world-model. What alternatives are you envisioning?
Replies from: porby
comment by porby · 2023-11-08T21:54:51.615Z · LW(p) · GW(p)

I assume that by "lower-level constraints" you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like "2+2=4",  "gravity exists", and "people value other people"

That's closer to what I mean, but these constraints are even lower level than that. Stuff like understanding "gravity exists" is a natural internal implementation that meets some constraints, but "gravity exists" is not itself the constraint.

In a predictor, the constraints serve as extremely dense information about what predictions are valid in what contexts. In a subset of predictions, the awareness that gravity exists helps predict. In other predictions, that knowledge isn't relevant, or is even misleading (e.g. cartoon physics). The constraints imposed by the training distribution tightly bound the contextual validity of outputs.

But since they're not, at the outset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints.

I'd agree that, if you already have an AGI of that shape, then yes, it'll do that. I'd argue that the relevant subset of predictive training practically rules out the development of that sort of implementation, and even if it managed to develop, its influence would be bounded into irrelevance.

Even in the absence of a nascent AGI, these constraints are tested constantly during training through noise and error. The result is a densely informative gradient pushing the implementation back towards a contextually valid state.

Throughout the training process prior to developing strong capability and situational awareness internally, these constraints are both informing and bounding what kind of machinery makes sense in context. A nascent AGI must have served the extreme constraints of the training distribution to show up in the first place; its shape is bound by its development, and any part of that shape that "tests" constraints in a way that worsens loss is directly reshaped.

Even if a nascent internal AGI of this type develops, if it isn't yet strong enough to pull off complete deception with respect to the loss, the gradients will illuminate the machinery of that proto-optimizer and it will not survive in that shape.

Further, even if we suppose a strong internal AGI develops that is situationally aware and is sufficiently capable and motivated to try deception, there remains the added dependency on actually executing that deception while never being penalized by gradients. This remains incredibly hard. It must transition into an implementation that satisfies the oppressive requirements of training while adding an additional task of deception without even suffering a detectable complexity penalty.

These sorts of deceptive mesaoptimizer outcomes are far more likely when the optimizer has room to roam. I agree that you could easily observe this kind of testing and slipping when the constraints under consideration are far looser, but the kind of machine that is required by these tighter constraints doesn't even bother with trying to slip constraints. It's just not that kind of machine, and there isn't a convergent path for it to become that kind of machine under this training mechanism.

And despite that lack of an internal motivation to explore and exploit with respect to any external world states, it still has capabilities (in principle) which, when elicited, make it more than enough to eat the universe.

Does that align with what you're envisioning? If yes, then our views on the issue are surprisingly close. I think it's one of our best chances at producing an aligned AI, and it's one of the prospective targets of my own research agenda.

Yup!

I don't think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to produce an agent-like AGI, not a superhuman forecasting tool. I expect them to prioritize research into whatever tweaks to the training schemes would result in homunculi; not whatever research would result in perfect predictors + our ability to precisely query them.

I agree that they're focused on inducing agentiness for usefulness reasons, but I'd argue the easiest and most effective way to get to useful agentiness actually routes through this kind of approach.

This is the weaker leg of my argument; I could be proven wrong by some new paradigm. But if we stay on something like the current path, it seems likely that the industry will just do the easy thing that works rather than the inexplicable thing that often doesn't work.

What are the "other paths" you're speaking of? As you'd pointed out, prompts are a weak and awkward way to run custom queries on the AI's world-model. What alternatives are you envisioning?

I'm pretty optimistic about members of a broad class that are (or likely are) equivalent to conditioning, since these paths tend to preserve the foundational training constraints.

A simple example is [2302.08582] Pretraining Language Models with Human Preferences (arxiv.org). Having a "good" and "bad" token, or a scalarized goodness token, still pulls in many of the weaknesses of RLHF's strangely shaped reward function, but there are trivial/naive extensions to this which I would anticipate being major improvements over the state of the art. For example, just have more (scalarized) metatokens representing more concepts such that the model must learn a distinction between being correct and sounding correct, because the training process split those into different tokens. There's no limit on how many such metatokens you could have; throw a few hundred fine-grained classifications into the mix. You could also bake complex metatoken prompts into single tokens with arbitrary levels of nesting or bake the combined result into the weights (though I suspect weight-baking would come with some potential failure modes).[1]
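To make the metatoken idea concrete, here is a minimal sketch of the data-tagging step. Everything in it is an assumption made for illustration: the `<|name:level|>` token format, the bucket count, and the scorer functions (stand-ins for learned classifiers) are hypothetical and are not taken from the cited paper.

```python
from typing import Callable, Dict

# Hypothetical scorers: each maps a text sample to a score in [0, 1].
# In a real pipeline these would be learned classifiers.
Scorer = Callable[[str], float]


def bucket(score: float, n_buckets: int) -> int:
    """Map a score in [0, 1] to an integer level in [0, n_buckets - 1]."""
    clipped = min(max(score, 0.0), 1.0)
    return min(int(clipped * n_buckets), n_buckets - 1)


def tag_sample(text: str, scorers: Dict[str, Scorer], n_buckets: int = 4) -> str:
    """Prefix a training sample with one scalarized metatoken per concept."""
    tags = [
        f"<|{name}:{bucket(scorer(text), n_buckets)}|>"
        for name, scorer in sorted(scorers.items())
    ]
    return "".join(tags) + text


def conditioning_prefix(targets: Dict[str, int]) -> str:
    """Build the prefix used at inference time to request specific levels."""
    return "".join(f"<|{name}:{level}|>" for name, level in sorted(targets.items()))
```

The point of keeping "correct" and "sounds_correct" as separate concepts is that the training data itself then carries the distinction; at inference time one would condition on, say, a high "correct" level regardless of the "sounds_correct" level.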

Another more recent path is observing the effect that conditions have on activations and dynamically applying the activation diffs to steer behavior. At the moment, I don't know how to make this quite as strong as the previous conditioning scheme, but I bet people will figure out a lot more soon and that it leads somewhere similar.
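A rough sketch of the activation-diff idea, under strong simplifying assumptions: activations are plain lists of floats taken from a single layer, the "condition" shifts them along one direction, and steering is just vector addition. The function names are hypothetical and this is not any particular library's API.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a batch of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(
    with_condition: Sequence[Sequence[float]],
    without_condition: Sequence[Sequence[float]],
) -> Vector:
    """Mean activation with the condition minus mean activation without it."""
    mu_with = mean_vector(with_condition)
    mu_without = mean_vector(without_condition)
    return [a - b for a, b in zip(mu_with, mu_without)]


def steer(activation: Sequence[float], direction: Sequence[float], strength: float = 1.0) -> Vector:
    """Add the scaled activation diff to a fresh activation at inference time."""
    return [h + strength * d for h, d in zip(activation, direction)]
```

In this toy picture, `strength` plays the role of how hard the condition is applied; whether real models respond this linearly is exactly the open question.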

  1. ^

    There should exist some reward signal which could achieve a similar result in principle, but that goes back to the whole "we suck at designing rewards that result in what we want" issue. This kind of structure, as ad hoc as it is, is giving us an easier API to lever the model's own capability to guide its behavior. I bet we can come up with even better implementations, too.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-11-09T08:50:41.733Z · LW(p) · GW(p)

I'd argue that the relevant subset of predictive training practically rules out the development of that sort of implementation [...]

Yeah, for sure. A training procedure that results in an idealized predictor isn't going to result in an agenty thing, because it doesn't move the system's design towards it on a step-by-step basis; and a training procedure that's going to result in an agenty thing is going to involve some unknown elements that specifically allow the system the freedom to productively roam.

I think we pretty much agree on the mechanistic details of all of that!

Another more recent path is observing the effect that conditions have on activations and dynamically applying the activation diffs to steer behavior

— yep, I was about to mention that. @TurnTrout [LW · GW]'s own activation-engineering agenda [LW · GW] seems highly relevant here.

I agree that they're focused on inducing agentiness for usefulness reasons, but I'd argue the easiest and most effective way to get to useful agentiness actually routes through this kind of approach.

But I still disagree with that. I think what we're discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.

Moreover, it's in an active process of growing larger. For example, the very idea of viewing ML models as "just stochastic parrots" is being furiously pushed against in favour of a more agenty view. In comparison, the approach we're discussing wants to move in the opposite direction, to de-personify ML models to the extent that even the animalistic connotation of "a parrot" is removed.

The system we're discussing won't even be an "AI" in the sense usually thought. It would be an incredibly advanced forecasting tool. Even the closest analogue, the "simulators [LW · GW]" framework, still carries some air of agentiness.

And the research directions that get us from here to an idealized-predictor system look very different from the directions that go from here to an agenty AGI. They focus much more on building interfaces for interacting with the extant systems, such as the activation-engineering agenda. They don't put much emphasis on things like:

  • Experimenting with better ways to train foundational models, with the idea of making models as close to a "done product" as they can be out-of-the-box.
  • Making the foundational models easier to converse with/making their output stream (text) also their input stream. This approach pretty clearly wants to make AIs into agents that figure out what you want, then do it; not a forecasting tool you need to build an advanced interface on top of in order to properly use.
  • RLHF-style stuff that bakes agency into the model, rather than accepting the need to cleverly prompt-engineer it for specific applications.
  • Thinking in terms like "an alignment researcher" — note the agency-laden framing — as opposed to "a pragmascope [LW · GW]" or "a system for the context-independent inference of latent variables" or something.

I expect that if the mainstream AI researchers do make strides in the direction you're envisioning, they'll only do it by coincidence. Then probably they won't even realize what they've stumbled upon, do some RLHF on it, be dissatisfied with the result, and keep trying to make it have agency out of the box. (That's basically what already happened with GPT-4, to @janus [LW · GW]' dismay.)

And eventually they'll figure out how.

Even if you don't think it's the easiest path to AGI, it's clearly a tractable problem, inasmuch as evolution managed it. I'm sure the world-class engineers at the major AI labs will manage it as well.

That said, you're making some high-quality novel predictions here, and I'll keep them in mind when analyzing AI advancements going forward.

Replies from: porby
comment by porby · 2023-11-09T23:43:35.301Z · LW(p) · GW(p)

I think what we're discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.

True!

I expect that if the mainstream AI researchers do make strides in the direction you're envisioning, they'll only do it by coincidence. Then probably they won't even realize what they've stumbled upon, do some RLHF on it, be dissatisfied with the result, and keep trying to make it have agency out of the box. (That's basically what already happened with GPT-4, to @janus [LW · GW]' dismay.)

Yup—this is part of the reason why I'm optimistic, oddly enough. Before GPT-likes became dominant in language models, there was all kinds of flailing that often pointed in more agenty-by-default directions. That flailing then found GPT because it was easily accessible and strong. 

Now, the set of architectural pieces subject to similar flailing is much smaller, and I'm guessing we're only one round of benchmarks at scale from a major lab before the flailing shrinks dramatically further.

In other words, I think the necessary work to make this path take off is small and the benefits will be greedily visible. I suspect one well-positioned researcher could probably swing it.

That said, you're making some high-quality novel predictions here, and I'll keep them in mind when analyzing AI advancements going forward.

Thanks, and thanks for engaging!

Come to think of it, I've got a chunk of mana lying around for subsidy. Maybe I'll see if I can come up with some decent resolution criteria for a market.

comment by ryan_greenblatt · 2023-11-06T19:58:15.133Z · LW(p) · GW(p)

I'm relatively optimistic about alignment progress, but I don't think "current work to get LLMs to be more helpful and less harmful doesn't help much with reducing P(doom)" depends that much on assuming homunculi which are unmodified. Like even if you have much less than 100% on this sort of strong inner optimizer/homunculi view, I think it's still plausible to think that this work doesn't reduce doom much.

For instance, consider the following views:

  1. Current work to get LLMs to be more helpful and less harmful will happen by default due to commercial incentives, and subsidies aren't very important.
  2. In worlds where that is basically sufficient, we're basically fine.
  3. But, it's ex-ante plausible that deceptive alignment will emerge naturally and be very hard to measure, notice, or train out. And this is where almost all alignment related doom comes from.
  4. So current work to get LLMs to be more helpful and less harmful doesn't reduce doom much.

In practice, I personally don't fully agree with any of these views. For instance, deceptive alignment which is very hard to train out using basic means isn't the source of >80% of my doom.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2023-11-06T20:05:33.986Z · LW(p) · GW(p)

I have misc other takes on what safety work now is good vs useless, but whether that work involves feedback/approval or RLHF isn't much signal either way.

(If anything I get somewhat annoyed by people not comparing to baselines without having principled reasons for not doing so. E.g., inventing new ways of doing training without comparing to normal training.)

comment by RogerDearnaley (roger-d-1) · 2023-11-09T08:28:51.673Z · LW(p) · GW(p)

I think the shoggoth model is useful here (or see https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators [AF · GW]). An LLM learning to do next-token prediction well has a major problem that it has to master: who is this human whose next token it's trying to simulate/predict, and how do they act? Are they, for example, an academic? A homemaker? A 4chan troll? A loose collection of Wikipedia contributors? These differences make a big difference to what token they're likely to emit next. So the LLM is strongly incentivized to learn to detect and then model all of these possibilities, what one might call personas, or masks, or simulacra. So you end up with a shapeshifter, adept at figuring out from textual cues what mask to put on and at then wearing it. Something one might describe as like an improv actor, or more colorfully, a shoggoth.

So then current alignment work is useful to the extent that it can cause the shoggoth to almost always put one of the 'right' masks on, and almost never put on one of the 'wrong' masks, regardless of cues, even when adversarially prompted. Experimentally, this seems quite doable by fine-tuning or RLHF, and/or by sufficiently careful filtering of your training corpus (e.g. not including 4chan in it).

A published result shows that you can't get from 'almost always' to 'always' or 'almost never' to 'never': for any behavior that the network is capable of with any probability > 0, there exist prompts that will raise the likelihood of that outcome arbitrarily high. The best you can do is increase the minimum length of that prompt (and presumably the difficulty of finding it).
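One toy way to see why a claim of this shape could hold, assuming (and this is my assumption, not necessarily the setup of the published result) that the model behaves like a Bayesian mixture over personas: if the unwanted persona has prior weight ε > 0 and each prompt token is r > 1 times more likely under that persona than under the rest, the posterior odds grow as r^n, so a long enough prompt always exists. Lowering ε only lengthens the prompt logarithmically.

```python
import math


def posterior_bad(prior_bad: float, likelihood_ratio: float, n_tokens: int) -> float:
    """Posterior weight of the unwanted persona after n_tokens of evidence.

    Assumes each token is `likelihood_ratio` times more likely under the
    unwanted persona than under the remainder of the mixture.
    """
    odds = prior_bad / (1.0 - prior_bad) * likelihood_ratio ** n_tokens
    return odds / (1.0 + odds)


def min_prompt_length(prior_bad: float, likelihood_ratio: float, target: float) -> int:
    """Smallest number of tokens pushing the posterior to at least `target`."""
    if likelihood_ratio <= 1.0:
        raise ValueError("tokens must favor the unwanted persona (ratio > 1)")
    prior_odds = prior_bad / (1.0 - prior_bad)
    target_odds = target / (1.0 - target)
    needed = math.log(target_odds / prior_odds) / math.log(likelihood_ratio)
    return max(0, math.ceil(needed))
```

Under these assumptions, driving the prior from one in a million to one in a billion only adds about ten tokens at r = 2, which matches the "increase the minimum length of that prompt" framing.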

Now, it would be really nice to know how to align a model so that the probability of it doing next-token-prediction in the persona of, say, a 4chan troll was provably zero, not just rather small. Ideally, without also eliminating from the model the factual knowledge of what 4chan is or, at least in outline, how its inhabitants act. This seems hard to do by fine-tuning or RLHF: I suspect it's going to take detailed automated interpretability up to fairly high levels of abstraction, finding the "from here on I am going to simulate a 4chan troll" feature(s), followed by doing some form of 'surgery' on the model (e.g. pinning the relevant feature's value to zero, or at least throwing an exception if it's ever not zero).
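As a sketch of what the two kinds of 'surgery' might look like, assume (hypothetically) that interpretability has already handed us the persona feature as a single linear direction in one layer's activation space. Real features may well not be this clean, and the names below are illustrative only.

```python
from typing import List, Sequence


class ForbiddenFeatureActive(RuntimeError):
    """Raised when the monitored feature has a non-zero value."""


def feature_value(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Read the feature off as a dot product with its (assumed) direction."""
    return sum(a * d for a, d in zip(activation, direction))


def pin_to_zero(activation: Sequence[float], direction: Sequence[float]) -> List[float]:
    """Option 1: remove the component of the activation along the feature."""
    norm_sq = sum(d * d for d in direction)
    coeff = feature_value(activation, direction) / norm_sq
    return [a - coeff * d for a, d in zip(activation, direction)]


def guard(
    activation: Sequence[float], direction: Sequence[float], tolerance: float = 1e-6
) -> List[float]:
    """Option 2: pass the activation through, but fail loudly if the feature fires."""
    value = feature_value(activation, direction)
    if abs(value) > tolerance:
        raise ForbiddenFeatureActive(f"feature value {value} exceeds {tolerance}")
    return list(activation)
```

Pinning silently removes the behavior, while the guard keeps the model untouched and turns any activation of the feature into a detectable event; which is preferable depends on whether you would rather suppress or observe.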

Now, this doesn't fix the possibility of a sufficiently smart model inventing behaviors like deception or trolling or whatever for itself during its forward pass: it's really only a formula for removing bad human behaviors that it learnt to simulate from its training set in the weights. It gives us a mind whose "System 1" behavior is aligned, which leaves only "System 2" development. For that, we probably need prompt engineering, and 'translucent thoughts' monitoring of its internal stream-of-thought/dynamic memory. But that seems rather more tractable: it's more like moral philosophy, or contract law.

comment by quetzal_rainbow · 2023-11-06T18:56:53.006Z · LW(p) · GW(p)

Let's suppose that your model takes a bad action. Why? Either the model is aligned but incapable of deducing the good action, or the model is misaligned and incapable of deducing a deceptively good action. In both cases, the gradient update provides information about capabilities, not about alignment. A hypothetical homunculus doesn't need to be "immune"; it isn't affected in the first place.

The other way around: let's suppose that you observe the model taking a good action. Why? It can be an aligned model that takes a genuinely good action, or it can be a misaligned model that takes a deceptive action. In both cases you observe capabilities, not alignment.

The problem here is not a prior over aligned/deceptive models (unless you think that this prior requires less than 1 bit to specify an aligned model, in which case I'd say that optimism departs from sanity); the problem is a lack of understanding of the updates which presumably should cause the model to be aligned. Maybe prosaic alignment works, maybe it doesn't; we don't know how to check.

comment by TurnTrout · 2022-11-26T01:26:11.338Z · LW(p) · GW(p)

Positive values seem more robust and lasting than prohibitions. Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated "If going to kill people, then don't" value shard. 

Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values [LW · GW], I wrote about how:

  1. A baby learns "IF juice in front of me, THEN drink",
  2. The baby is later near juice, and then turns to see it, activating the learned "reflex" heuristic, learning to turn around and look at juice when the juice is nearby,
  3. The baby is later far from juice, and bumbles around until they're near the juice, whereupon she drinks the juice via the existing heuristics. This teaches "navigate to juice when you know it's nearby."
  4. Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.
  5. ...

The juice shard chains into itself, reinforcing itself across time and thought-steps. 

But a "don't kill" shard seems like it should remain... stubby? Primitive? It can't self-chain into not doing something. If you're going to do it, and then don't because of the don't-kill shard, and that avoids negative reward... Then maybe the "don't kill" shard gets reinforced and generalized a bit because it avoided negative reward. 

But—on my current guesses and intuitions—that shard doesn't become more sophisticated, it doesn't become reflective [LW · GW], it doesn't "agentically participate" in the internal shard politics (e.g. the agent's "meta-ethics", deciding what kind of agent it "wants to become"). Other parts of the agent want things, they want paperclips or whatever, and that's harder to do if the agent isn't allowed to kill anyone. 

Crucially, the no-killing injunction can probably be steered around by the agent's other values. While the obvious route of lesioning the no-killing shard might be reflectively-predicted by the world model to lead to more murder, and therefore bid against by the no-killing shard... There are probably ways to get around this obstacle. Other value shards (e.g. paperclips and cow-breeding) might surreptitiously bid up lesioning plans which are optimized so as to not activate the reflective world-model, and thus, not activate the no-killing shard.

So, don't embed a shard which doesn't want to kill. Make a shard which wants to protect / save / help people. That can chain into itself across time.


See also: 

  • Deontology seems most durable to me when it can be justified on consequentialist grounds. Perhaps this is one mechanistic reason why. 
  • This is one point in favor of the "convergent consequentialism" hypothesis, in some form. 
  • I think that people are not usually defined by negative values (e.g. "don't kill"), but by positives, and perhaps this is important.
Replies from: lahwran, D0TheMath, cfoster0
comment by the gears to ascension (lahwran) · 2022-12-04T10:35:30.065Z · LW(p) · GW(p)

I strongly agree that self-seeking mechanisms are more able to maintain themselves than self-avoiding mechanisms. Please post this as a top-level post.

comment by Garrett Baker (D0TheMath) · 2022-12-02T04:00:29.105Z · LW(p) · GW(p)

Seems possibly relevant & optimistic when seeing deception as a value. It has the form ‘if about to tell human statement with properties x, y, z, don’t’ too.

Replies from: TurnTrout
comment by TurnTrout · 2022-12-14T07:39:32.436Z · LW(p) · GW(p)

It can still be robustly derived as an instrumental subgoal during general-planning/problem-solving, though?

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2022-12-14T08:32:29.459Z · LW(p) · GW(p)

This is true, but indicates a radically different stage in training in which we should find deception compared to deception being an intrinsic value. It also possibly expands the kinds of reinforcement schedules we may want to use compared to the worlds where deception crops up at the earliest opportunity (though pseudo-deception may occur, where behaviors correlated with successful deception are reinforced possibly?).

Replies from: TurnTrout
comment by TurnTrout · 2022-12-15T03:53:33.061Z · LW(p) · GW(p)

Oh, huh, I had cached the impression that deception would be derived, not intrinsic-value status. Interesting.

comment by cfoster0 · 2022-11-26T02:46:07.615Z · LW(p) · GW(p)

This asymmetry makes a lot of sense from an efficiency standpoint. No sense wasting your limited storage/computation on state(-action pair)s that you are also simultaneously preventing yourself from encountering.

comment by TurnTrout · 2022-10-24T16:26:38.670Z · LW(p) · GW(p)

AI strategy consideration. We won't know which AI run will be The One. Therefore, the amount of care taken on the training run which produces the first AGI will, on average, be less than intended.

  • It's possible for a team to be totally blindsided. Maybe they thought they would just take a really big multimodal init, finetune it with some RLHF on quality of its physics reasoning, have it play some video games with realistic physics, and then try to get it to do new physics research. And it takes off. Oops!
  • It's possible the team suspected, but had a limited budget. Maybe you can't pull out all the stops for every run, you can't be as careful with labeling, with checkpointing and interpretability and boxing. 

No team is going to run a training run with more care than they would have used for the AGI Run, especially if they don't even think that the current run will produce AGI. So the average care taken on the real AGI Run will be strictly less than intended.
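The averaging argument can be written out as a tiny expected-value calculation. This is only a sketch: the probabilities and care levels below are made-up numbers, with "care" on a hypothetical 0-to-1 scale where 1.0 is what the team would use if they knew this was the AGI Run.

```python
def expected_care_on_agi_run(runs):
    """runs: list of (probability this run is the AGI run, care level used on it)."""
    total_probability = sum(p for p, _ in runs)
    return sum(p * care for p, care in runs) / total_probability


INTENDED_CARE = 1.0  # care the team would use if they knew this was The One

# Hypothetical numbers: no run gets more than the intended care,
# and some runs the team doesn't suspect get less.
runs = [
    (0.6, 1.0),  # the run they believe is the AGI run
    (0.3, 0.5),  # a run they half-suspect, on a limited budget
    (0.1, 0.2),  # a run that blindsides them
]

average_care = expected_care_on_agi_run(runs)
# The average is below INTENDED_CARE whenever any under-cared-for run
# has nonzero probability of being the AGI run.
```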

Teams which try to be more careful on each run will take longer to iterate on AI designs, thereby lowering the probability that they (the relatively careful team) will be the first to do an AGI Run. 

Upshots:

  1. The alignment community should strive for anytime performance on their alignment recommendations, such that we make a difference even if AGI comes by "surprise." We will not necessarily observe a bunch of externally visible fanfare and anticipation before AGI comes. We should not count on "and then the bigshot researchers have time to hash out final disagreements and then hand over a dossier of their alignment recommendations." 

    We should, at each year, have a set of practical recommendations for any lab which thinks they might build an AGI soon, and are going to go ahead with it anyways (even though that's extremely unwise).
  2. These recommendations should not be too onerous. Instead, they should be stratified, comprising multiple levels of "alignment tax" which an AGI team can levy according to their suspicion that this run will be It. For example:
    1. Low/no tax tips: 
      1. If you're going to include non-IID finetuning, that may lead to agentic cognition. In every run where you do this, finetune also on a few human-approval related tasks, such as [A, B, C; I haven't actually worked out my best guesses here]. 

        Otherwise, if the run surprisingly hits AGI, you may not have included any human-relevant value formation data, and there was no chance for the AGI to be aligned even under relatively optimistic worldviews.
      2. Scrub training corpus of mentions of Roko's basilisk-type entities. [This one might cost weirdness points, depends on lab] Including such entities might enable relatively dumb agents to model infohazardous entities which blackmail them while the agent is too dumb to realize they shouldn't think about the entities at all. Otherwise these entities are probably not a big deal, as long as the AI doesn't abstractly realize their existence until the AI is relatively smart.
    2. More taxing tips:
      1. Run interpretability tools A, B, C and look out for concept and capability formation D, E, F. 
      2. Use boxing precautions G and H.
    3. High-tax runs: 
      1. Use labeling techniques as follows... Be careful with X and Y forms of data augmentation. 
      2. Keep reward sources (like buttons) out of sight of the agent and don't mention how the agent is being rewarded, so as to decrease P(agent reinforced for getting reward in and of itself). In interactions, emphasize that the agent is reinforced for doing what we want. 
      3. (Fancier alignment techniques, if we deem those wise)
Replies from: zac-hatfield-dodds
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2022-10-25T04:48:11.366Z · LW(p) · GW(p)
  • I think this framing is accurate and important. Implications are of course "undignified" to put it lightly...
  • Broadly agree on upshot (1), though of course I hope we can do even better. (2) is also important though IMO way too weak. (Rule zero: ensure that it's never your lab that ends the world)
  • As usual, opinions my own.
comment by TurnTrout · 2023-08-11T23:45:58.924Z · LW(p) · GW(p)

What is "shard theory"? I've written a lot [? · GW] about shard theory. I largely stand by these models and think they're good and useful. Unfortunately, lots of people seem to be confused about what shard theory is. Is it a "theory"? Is it a "frame"? Is it "a huge bag of alignment takes which almost no one wholly believes except, perhaps, Quintin Pope and Alex Turner"?

I think this understandable confusion happened because my writing didn't distinguish between: 

  1. Shard theory itself, 
    1. IE the mechanistic assumptions about internal motivational structure, which seem to imply certain conclusions around e.g. AIs caring about a bunch of different things and not just one thing
  2. A bunch of Quintin Pope's and my beliefs about how people work, 
    1. where those beliefs were derived by modeling people as satisfying the assumptions of (1)
  3. And a bunch of my alignment insights which I had while thinking about shard theory [LW · GW], or what problem decompositions are useful [LW · GW].

(People might be less excited to use the "shard" abstraction (1), because they aren't sure whether they buy all this other stuff—(2) and (3).)

I think I can give an interesting and useful definition of (1) now, but I couldn't do so last year. Maybe "offload shard theory intuitions onto LessWrong" was largely the right choice at the time, but I regret the confusion that has arisen. Maybe I'll type up my shot at (1)—a semiformal definition of a shard-based agent—when I'm feeling better and more energetic.

Thanks to Alex Lawsen for a conversation which inspired this comment.

Replies from: Viliam, adele-lopez-1
comment by Viliam · 2023-08-14T21:11:03.132Z · LW(p) · GW(p)

I have read a few articles about shard theory, but I still have a problem understanding what it is. It feels like either the "theory" is something trivial, or I am missing the important insights.

(The trivial interpretation would be something like: when people think about their values, they imagine their preferences in specific situations, rather than having a mathematical definition of a utility function.)

comment by Adele Lopez (adele-lopez-1) · 2023-08-12T04:41:32.600Z · LW(p) · GW(p)

Strong encouragement to write about (1)!

comment by TurnTrout · 2023-02-15T23:06:15.418Z · LW(p) · GW(p)

Very nice people don’t usually search for maximally-nice outcomes — they don’t consider plans like “killing my really mean neighbor so as to increase average niceness over time.” I think there are a range of reasons for this plan not being generated. Here’s one.

Consider a person with a niceness-shard. This might look like an aggregation of subshards/subroutines like “if person nearby and person.state==sad, sample plan generator for ways to make them happy” and “bid upwards on plans which lead to people being happier and more respectful, according to my world model.” In mental contexts where this shard is very influential, it would have a large influence on the planning process.

However, people are not just made up of a grader and a plan-generator/actor [LW · GW] — they are not just “the plan-generating part” and “the plan-grading part.” The next sampled plan modification, the next internal-monologue-thought to have—these are influenced and steered by e.g. the nice-shard. If the next macrostep of reasoning is about e.g. hurting people, well — the niceness shard is activated, and will bid down on this. 

The niceness shard isn’t just bidding over outcomes, it’s bidding on next thoughts (on my understanding of how this works). And so these thoughts would get bid down, and the thought being painful to consider leads to a slight negative reinforcement event. This means that violent plan-modifications are eventually not sampled at all in contexts where the niceness shard would bid them away.

So nice people aren’t just searching for “nice outcomes.” They’re nicely searching for nice outcomes.
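Here is a toy sketch of that dynamic, where a shard bids on sampled next thoughts and a bid-down thought becomes less likely to be sampled again. Everything in it (the thought names, the multiplicative decay, the sampling rule) is an illustrative assumption rather than a claim about how brains or networks actually implement this.

```python
import random

# Hypothetical candidate "next thoughts" (plan modifications).
VIOLENT_THOUGHTS = {"hurt the mean neighbor"}


def niceness_bid(thought):
    """Toy niceness shard: bids down violent thoughts, bids up everything else."""
    return -1.0 if thought in VIOLENT_THOUGHTS else 1.0


def reinforce(weights, thought, decay=0.5):
    """If the shard bid the thought down, make it less likely to be sampled again."""
    if niceness_bid(thought) < 0:
        weights[thought] *= decay


def simulate(weights, steps, seed=0):
    """Repeatedly sample a next thought, then apply the negative reinforcement."""
    rng = random.Random(seed)
    for _ in range(steps):
        thoughts = list(weights)
        thought = rng.choices(thoughts, weights=[weights[t] for t in thoughts], k=1)[0]
        reinforce(weights, thought)
    return weights


sampling_weights = {
    "bake cookies for the neighbor": 1.0,
    "invite the neighbor to dinner": 1.0,
    "hurt the mean neighbor": 1.0,
}
simulate(sampling_weights, steps=200)
# Each time the violent thought is sampled its weight halves, so it is
# sampled less and less often; the nice thoughts keep their weights.
```

Note that nothing in this loop grades outcomes. The shaping happens entirely at the level of which thoughts get generated, which is the point of "nicely searching."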


(This breaks the infinite regress of saying “there’s a utility function over niceness, and preferences over how to think next thoughts about niceness, and meta-preferences about how to have preferences about next thoughts...” — eventually cognition must ground out in small computations which are not themselves utility functions or preferences or maximization!)


Thanks to discussions with Peli Grietzer about this idea of his (“praxis values”). Praxis values involve “doing X X-ingly.” I hope he publishes his thoughts here soon, because I’ve found them enlightening.

Replies from: quintin-pope, Gunnar_Zarncke
comment by Quintin Pope (quintin-pope) · 2023-02-15T23:32:06.922Z · LW(p) · GW(p)

The niceness shard isn’t just bidding over outcomes, it’s bidding on next thoughts (on my understanding of how this works). And so these thoughts would get bid down

Seems similar to how I conceptualize this paper's approach to controlling text generation models using gradients from classifiers. You can think of the niceness shard as implementing a classifier for "is this plan nice?", and updating the latent planning state in directions that make the classifier more inclined to say "yes". 

The linked paper does a similar process, but using a trained classifier, actual gradient descent, and updates LM token representations. Of particular note is the fact that the classifiers used in the paper are pretty weak (~500 training examples), and not at all adversarially robust. It still works for controlling text generation.
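A minimal sketch of the general idea, with a hand-set logistic "is this plan nice?" classifier standing in for a trained one. This is not the linked paper's method; the weights, the two-dimensional latent, and the step size are all made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_nice(latent, weights, bias):
    """Toy classifier: probability that the plan encoded by `latent` is nice."""
    return sigmoid(sum(w * z for w, z in zip(weights, latent)) + bias)


def steer(latent, weights, bias, step_size=0.5, steps=10):
    """Gradient ascent on log p_nice with respect to the latent state."""
    z = list(latent)
    for _ in range(steps):
        p = p_nice(z, weights, bias)
        # For a logistic classifier, d/dz log p = (1 - p) * weights.
        z = [zi + step_size * (1.0 - p) * wi for zi, wi in zip(z, weights)]
    return z


# Hypothetical classifier parameters and starting latent.
classifier_weights = [1.0, -2.0]
classifier_bias = 0.0
plan_latent = [-1.0, 1.0]

steered_latent = steer(plan_latent, classifier_weights, classifier_bias)
# p_nice(steered_latent, ...) is higher than p_nice(plan_latent, ...).
```

The classifier only has to point in roughly the right direction locally for this to move the latent, which may be part of why weak classifiers suffice.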

I wonder if inserting shards into an AI is really just that straightforward?

comment by Gunnar_Zarncke · 2023-02-16T18:01:15.918Z · LW(p) · GW(p)

But I guess that instrumental convergence will still eventually lead to either

  • all shards acquiring more and more instrumental structure (neuronal weights within shards getting optimized for that), or
  • shards that are directly instrumental will take more and more weight overall.

One can see that in regular human adult development. The heuristics children use are simpler and more of the type "searching for nice things in nice ways" or even seeing everything through a niceness lens, while adults have more pure strategies, e.g., planning as a shard of its own. Most humans just die before they reach convergence. And there are probably also other aspects. Enlightenment may be a state where pure shards become an option.

comment by TurnTrout · 2022-02-18T05:18:37.557Z · LW(p) · GW(p)

In Eliezer's mad investor chaos and the woman of asmodeus, the reader experiences (mild spoilers in the spoiler box, heavy spoilers if you click the text):

I thought this part was beautiful. I spent four hours driving yesterday, and nearly all of that time re-listening to Rationality: AI->Zombies using this "probability sight" frame. I practiced translating each essay into the frame.

When I think about the future, I feel a directed graph showing the causality, with branched updated beliefs running alongside the future nodes, with my mind enforcing the updates on the beliefs at each time step. In this frame, if I heard the pattering of a four-legged animal outside my door, and I consider opening the door, then I can feel the future observation forking my future beliefs depending on how reality turns out. But if I imagine being blind and deaf, there is no way to fuel my brain with reality-distinguishment/evidence, and my beliefs can't adapt according to different worlds.

I can somehow feel how the qualitative nature of my update rule changes as my senses change, as the biases in my brain change and attenuate and weaken, as I gain expertise in different areas—thereby providing me with more exact reality-distinguishing capabilities, the ability to update harder on the same amount of information, making my brain more efficient at consuming new observations and turning them into belief-updates.

When I thought about questions of prediction and fact, I experienced unusual clarity and precision. EG R:AZ mentioned MIRI, and my thoughts wandered to "Suppose it's 2019, MIRI just announced their 'secret by default' policy. If MIRI doesn't make much progress in the next few years, what should my update be on how hard they're working?". (EDIT: I don't have a particular bone to pick here; I think MIRI is working hard.)

Before, I'd have hand-waved something about how absence of evidence is evidence of absence, but the update was probably small. Now, I quickly booted up the "they're lazy" and the "working diligently" hypotheses, and quickly saw that I was tossing out tons of information by reasoning so superficially, away from the formalism.

  • I realized that the form of the negative result-announcements could be very informative. MIRI could, in some worlds, explain the obstacles they hit, in a way which is strong evidence they worked hard, even while keeping most of their work secret. (It's like if some sadistic CS prof in 1973 assigned proving P?NP over the summer, and his students came back with "but relativization", you'd know they'd worked hard, that's very legible and crisp progress showing it's hard.)
  • Further, the way in which the announcement was written would matter. I could feel the likelihood ratio P(progress to date | lazy) / P(progress to date | diligent) shift around, reflecting what my hypotheses say about what realities induce what communication.
  • I also very quickly realized that the overall update towards "not much effort" is strongly controlled by my beliefs about how hard alignment is; if the problem had been "prove 1+1=2 in PA" and they came back empty-handed a year later, obviously that's super strong evidence they were messing around. But if I think alignment is basically impossible, then P(little progress | lazy) > P(little progress | diligent) just barely holds, and the likelihood ratio is correspondingly close to 1.
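The dependence on problem difficulty in the last bullet can be sketched with the odds form of Bayes' rule. The prior and the likelihoods below are made-up numbers chosen only to show the two regimes; they are not estimates of anything about MIRI.

```python
def posterior_prob_lazy(prior_lazy, p_obs_given_lazy, p_obs_given_diligent):
    """Posterior P(lazy | observation) via prior odds times the likelihood ratio."""
    prior_odds = prior_lazy / (1.0 - prior_lazy)
    likelihood_ratio = p_obs_given_lazy / p_obs_given_diligent
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


PRIOR_LAZY = 0.2  # hypothetical prior

# Easy problem ("prove 1+1=2 in PA"): a diligent team almost surely succeeds,
# so little progress is strong evidence of messing around.
easy = posterior_prob_lazy(PRIOR_LAZY, p_obs_given_lazy=0.9, p_obs_given_diligent=0.01)

# Very hard problem: even a diligent team probably shows little progress,
# so the likelihood ratio is close to 1 and the posterior barely moves.
hard = posterior_prob_lazy(PRIOR_LAZY, p_obs_given_lazy=0.99, p_obs_given_diligent=0.95)

# easy comes out around 0.96; hard stays around 0.21, close to the 0.2 prior.
```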

And all of this seems inane when I type it out, like duh, but the magic was seeing it and feeling it all in less than 20 seconds, deriving it as consequences of the hypotheses and the form updates (should) take, instead of going down a checklist of useful rules of thumb, considerations to keep in mind for situations like this one. And then there were several more thoughts I had which felt unusually insightful given how long I'd thought, it was all so clear to me. 

Replies from: Bjartur Tómas, Morpheus
comment by Tomás B. (Bjartur Tómas) · 2022-06-18T21:56:47.996Z · LW(p) · GW(p)

And then there were times when even the soft-spoken
Tathagatha listened to the words of his disciple, who had
digested all of the things he had preached, had meditated long
and fully upon them and now, as though he had found entrance
to a secret sea, dipped with his steel-hard hand into places of
hidden waters, and then sprinkled a thing of truth and beauty
upon the heads of the hearers.
Summer passed. There was no doubt now that there were
two who had received enlightenment:

comment by Morpheus · 2022-05-05T17:57:09.132Z · LW(p) · GW(p)

I really liked your concrete example. I had first only read your first paragraphs, highlighted this as something interesting with potentially huge upsides, but I felt like it was really hard to tell for me whether the thing you are describing was something I already do or not. After reading the rest I was able to just think about the question myself and notice that thinking about the explicit likelihood ratios is something I am used to doing. Though I did not go into quite as much detail as you did, which I blame partially on motivation and partially as "this skill has a higher ceiling than I would have previously thought".

comment by TurnTrout · 2021-06-22T00:10:37.707Z · LW(p) · GW(p)

I'm pretty sure that LessWrong will never have profile pictures - at least, I hope not! But my partner Emma recently drew me something very special:

comment by TurnTrout · 2021-05-01T13:47:52.759Z · LW(p) · GW(p)

Comment #1000 on LessWrong :)

Replies from: niplav
comment by niplav · 2021-05-01T19:39:25.607Z · LW(p) · GW(p)

With 5999 karma!

Edit: Now 6000 – I weak-upvoted an old post of yours [LW · GW] I hadn't upvoted before.

comment by TurnTrout · 2023-05-26T09:09:39.825Z · LW(p) · GW(p)

You can use ChatGPT without helping train future models:

What if I want to keep my history on but disable model training?

...you can opt out from our use of your data to improve our services by filling out this form. Once you submit the form, new conversations will not be used to train our models.

comment by TurnTrout · 2023-04-07T01:20:59.507Z · LW(p) · GW(p)

Back-of-the-envelope probability estimate of alignment-by-default via a certain shard-theoretic pathway. The following is what I said in a conversation discussing the plausibility of a proto-AGI picking up a "care about people" shard from the data, and retaining that value even through reflection. I was pushing back against a sentiment like "it's totally improbable, from our current uncertainty, for AIs to retain caring-about-people shards. This is only one story among billions."

Here's some of what I had to say:


[Let's reconsider the five-step mechanistic story I made up.] I'd give the following conditional probabilities (made up with about 5 seconds of thought each):

1. Humans in fact care about other humans, in a way which extrapolates to quasi-humans still being around (whatever that means) P(1)=.85

2. Human-generated data makes up a large portion of the corpus, and having a correct model of them is important for “achieving low loss”,[1] so the AI has a model of how people want things P(2 | 1) = .6, could have different abstractions or have learned these models later in training once key decision-influences are already there

3. During RL finetuning and given this post-unsupervised initialization, there’s now an inductive bias towards just hooking human-like criteria for bidding on internal-AI-plans. IE humans give approval-based reinforcement, and an inductively easy way of upweighting logits on those actions is just hook up the human-like plan-criteria into the AI’s planning process, so the AI gets a humanlike “care about people” shard. P(3 | 2, 1) = .55 due to plurality of value, I expect this to be one way it learns to make decisions

4. The AI bids on plans in substantial part on the basis of these criteria, including when it navigates reflection P(4 | 3, 2, 1) = .25, 50% it's a strong shard and 50% that it survives various reflection difficulties; part of the latter estimate comes from intuitions around formed-by-positive-reward shards seeming more likely to be reflectively endorsed

5. Therefore, the AI is bidding on plans where it keeps making decisions on the basis of “care about people” by instrumental convergence: goal preservation P(5 | 4, 3, 2, 1) = .95, given that it already reflectively endorses this shard, I think the standard instrumental convergence arguments apply

Conclusion: The AI has reflected and still cares about people. Overall estimate: .67%

My estimate here came out a biiit lower than I had expected (~1%), but it also is (by my estimation) far more probable than most of the billions of possible 5-step claims about how the final cognition ends up. I think it's reasonable to expect there to be about 5 or 6 stories like this from similar causes, which would make it not crazy to have ~3% on something like this happening given the amount of alignment effort described (i.e. pretty low-effort).

That said, I'm wary of putting 20% on this class of story, and a little more leery of 10% after running these numbers.

(I suspect that the alter-Alex who had socially-wanted to argue the other side -- that the probability was low -- would have come out to about .2% or .3%. For a few of the items, I tried pretending that I was instead slightly emotionally invested in the argument going the other way, and hopefully that helped my estimates be less biased. I wouldn't be surprised if some of these numbers are a bit higher than I'd endorse from a perfectly neutral standpoint.)

(I also don't have strong faith in my ability to deal with heavily conjunctive scenarios like this; I expect I could be made to make numbers for event A come out lower if described as 'A happens in 5 steps' compared to 'A happens in 3 steps'.)

  1. ^

    This is noted shorthand for extra reasoning which would have to be done; "low loss" is not a very great reason to expect anything to happen IMO. But I wanted to box that reasoning for this discussion. I think there's probably a more complicated and elegant mapping by which a training corpus leaves its fingerprints in the trained AI's concepts.

Replies from: roger-d-1, D0TheMath, mesaoptimizer
comment by RogerDearnaley (roger-d-1) · 2023-11-09T07:40:53.320Z · LW(p) · GW(p)

0.85 x 0.6 x 0.55 x 0.25 x 0.95 ≅ 0.067 = 6.7% — I think you slipped an order of magnitude somewhere?
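For anyone who wants to check the chain-rule product directly, a quick sketch using the five conditional probabilities as quoted in the parent comment:

```python
import math

# P(1), P(2 | 1), P(3 | 2, 1), P(4 | 3, 2, 1), P(5 | 4, 3, 2, 1) from the parent comment.
conditional_probabilities = [0.85, 0.6, 0.55, 0.25, 0.95]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(conditional_probabilities)
joint_percent = 100 * joint
# joint is about 0.0666, i.e. roughly 6.7%, not 0.67%.
```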

comment by Garrett Baker (D0TheMath) · 2023-04-07T15:33:46.691Z · LW(p) · GW(p)

This seems like an underestimate because you don’t consider whether the first “AGI” will indeed make it so we only get one chance. If it can only self improve by more gradient steps, then humanity has a greater chance than if it self improves by prompt engineering or direct modification of its weights or latent states. Shard theory seems to have nonzero opinions on the fruitfulness of the non-data methods.

Replies from: TurnTrout
comment by TurnTrout · 2023-04-10T21:21:15.208Z · LW(p) · GW(p)

What does self-improvement via gradients vs prompt-engineering vs direct mods have to do with how many chances we get? I guess, we have at least a modicum more control over the gradient feedback loop, than over the other loops? 

Shard theory seems to have nonzero opinions on the fruitfulness of the non-data methods.

Can you say more?

comment by mesaoptimizer · 2023-11-09T10:58:32.585Z · LW(p) · GW(p)
  3. During RL finetuning and given this post-unsupervised initialization, there’s now an inductive bias towards just hooking human-like criteria for bidding on internal-AI-plans. IE humans give approval-based reinforcement, and an inductively easy way of upweighting logits on those actions is just hook up the human-like plan-criteria into the AI’s planning process, so the AI gets a humanlike “care about people” shard. P(3 | 2, 1) = .55 due to plurality of value, I expect this to be one way it learns to make decisions

This is where I'd put a significantly low probability. Could you elaborate on why there's an inductive bias towards "just hooking human-like criteria for bidding on internal-AI-plans"? As far as I can tell, the inductive bias for human-like values would be something that at least seems closer to the human-brain structure than any arbitrary ML architecture we have right now. Rewarding a system to better model human beings' desires doesn't seem to me to lead it towards having similar desires. I'd use the "instrumental versus terminal desires" concept here but I expect you would consider that something that adds confusion instead of removing it.

Replies from: TurnTrout
comment by TurnTrout · 2023-11-13T15:22:40.251Z · LW(p) · GW(p)

Because it's a shorter edit distance in its internal ontology; it's plausibly NN-simple to take existing plan-grading procedures, internal to the model, and then hook those more directly into its logit-controllers.

Also note that probably it internally hooks up lots of ways to make decisions, and this only has to be one (substantial) component. Possibly I'd put .3 or .45 now instead of .55 though.

comment by TurnTrout · 2022-11-18T22:47:52.250Z · LW(p) · GW(p)

Examples should include actual details. I often ask people to give a concrete example, and they often don't. I wish this happened less. For example:

Someone: the agent Goodharts the misspecified reward signal

Me: What does that mean? Can you give me an example of that happening?

Someone: The agent finds a situation where its behavior looks good, but isn't actually good, and thereby gets reward without doing what we wanted.

This is not a concrete example.

Me: So maybe the AI compliments the reward button operator, while also secretly punching a puppy behind closed doors?

This is a concrete example. 

Replies from: TurnTrout, TurnTrout
comment by TurnTrout · 2022-11-26T03:50:58.814Z · LW(p) · GW(p)

AFAIK, only Gwern and I [LW · GW] have written concrete stories speculating about how a training run will develop cognition within the AGI. 

This worries me, if true (if not, please reply with more!). I think it would be awesome to have more concrete stories![1] If Nate, or Evan, or John, or Paul wrote one of their guesses of how AGI goes, I would understand their ideas and viewpoints better. I could go "Oh, that's where the claimed sharp left turn is supposed to occur." Or "That's how Paul imagines IDA being implemented, that's the particular way in which he thinks it will help."

Maybe a contest would help?

ETA tone

  1. ^

    Even if scrubbed of any AGI-capabilities-advancing sociohazardous detail. Although I'm not that convinced that this is a big deal for conceptual content written on AF. Lots of people probably have theories of how AGI will go. Implementation is, I have heard, the bottleneck. 

    Contrast this with beating SOTA on crisply defined datasets in a way which enables ML authors to get prestige and publication and attention and funding by building off of your work. Seem like different beasts.

comment by TurnTrout · 2022-11-23T19:53:53.884Z · LW(p) · GW(p)

I also think a bunch of alignment writing seems syntactical. Like, "we need to solve adversarial robustness so that the AI can't find bad inputs and exploit them / we don't have to worry about distributional shift. Existing robustness strategies have downsides A B and C and it's hard to even get ε-ball guarantees on classifications. Therefore, ..."

And I'm worried that this writing isn't abstractly summarizing a concrete story for failure that they have in mind (like "I train the AI [with this setup] and it produces [this internal cognition] for [these mechanistic reasons]"; see A shot at the diamond alignment problem [LW · GW] for an example) and then their best guesses at how to intervene on the story to prevent the failures from being able to happen (eg "but if we had [this robustness property] we could be sure its policy would generalize into situations X Y and Z, which makes the story go well"). I'm rather worried that people are more playing syntactically, and not via detailed models of what might happen. 

Detailed models are expensive to make. Detailed stories are hard to write. There's a lot we don't know. But we sure as hell aren't going to solve alignment only via valid reasoning steps on informally specified axioms ("The AI has to be robust or we die", or something?).  

comment by TurnTrout · 2023-09-11T20:11:27.167Z · LW(p) · GW(p)

AI cognition doesn't have to use alien concepts to be uninterpretable. We've never fully interpreted human cognition, either, and we know that our introspectively accessible reasoning uses human-understandable concepts.

Just because your thoughts are built using your own concepts, does not mean your concepts can describe how your thoughts are computed. 

Or:

The existence of a natural-language description of a thought (like "I want ice cream") doesn't mean that your brain computed that thought in a way which can be compactly described by familiar concepts. 

Conclusion: Even if an AI doesn't rely heavily on "alien" or unknown abstractions -- even if the AI mostly uses human-like abstractions and features -- the AI's thoughts might still be incomprehensible to us, even if we took a lot of time to understand them. 

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2023-09-11T21:35:46.910Z · LW(p) · GW(p)

I don't think the conclusion follows from the premises. People often learn new concepts after studying stuff, and it seems likely (to me) that when studying human cognition, we'd first be confused because our previous concepts weren't sufficient to understand it, and then slowly stop being confused as we built & understood concepts related to the subject. If an AI's thoughts are like human thoughts, given a lot of time to understand them, what you describe doesn't rule out that the AI's thoughts would be comprehensible.

The mere existence of concepts we don't know about in a subject doesn't mean that we can't learn those concepts. Most subjects have new concepts.

Replies from: TurnTrout
comment by TurnTrout · 2023-09-18T15:57:15.296Z · LW(p) · GW(p)

I agree that with time, we might be able to understand. (I meant to communicate that via "might still be incomprehensible")

comment by TurnTrout · 2022-09-06T19:33:03.417Z · LW(p) · GW(p)

Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.

In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s) [LW · GW]. Now we arrive at some differences.

Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), but I think it's important and felt like writing up my own take on it. Maybe this becomes a post later.

[Exact gradients] RL's credit assignment problem [LW · GW] is harder than (self-)supervised learning's. In RL, if an agent solves a maze in 10 steps, it gets (discounted) reward; this trajectory then provides a set of reward-modulated gradients to the agent. But if the agent could have solved the maze in 5 steps, the agent isn't directly updated to be more likely to do that in the future; RL's gradients are generally inexact, not pointing directly at intended behavior.

On the other hand, if a supervised-learning classifier outputs dog when it should have output cat, then e.g. cross-entropy loss + correct label yields a gradient update which tweaks the network to output cat next time for that image. The gradient is exact.

I don't think this is really where the "agentic propensity" of RL comes from, conditional on such a propensity existing (I think it probably does).
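To make the exact/inexact distinction concrete, here is a minimal sketch on raw logits (no network, no environment). The function names and the three-output setup are illustrative assumptions, not anything from a particular library. The cross-entropy step pushes toward the labelled output, while the REINFORCE step pushes toward whatever action was sampled, scaled by the reward it happened to get.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supervised_logit_step(logits, label):
    """Descent direction for cross-entropy loss w.r.t. the logits.

    d(loss)/d(logit_i) = p_i - 1[i == label], so the step is the negation.
    """
    probs = softmax(logits)
    return [(1.0 if i == label else 0.0) - p for i, p in enumerate(probs)]

def reinforce_logit_step(logits, action, reward):
    """Ascent direction for REINFORCE: reward * grad of log pi(action)."""
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]

logits = [0.0, 0.0, 0.0]   # uniform over three outputs
best = 0                   # hypothetical: output 0 is the label / the 5-step solution

sl_step = supervised_logit_step(logits, label=best)
# The agent happened to sample action 1 (the slower solution) and still got reward 0.5.
rl_step = reinforce_logit_step(logits, action=1, reward=0.5)

print("supervised step:", sl_step)  # largest entry sits at index `best`
print("REINFORCE step: ", rl_step)  # largest entry sits at the sampled action
```

In this toy case the REINFORCE step actually lowers the logit of the better action, because nothing in the update knows that a better action existed.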

[Independence of data points] In RL, the agent's policy determines its actions, which determines its future experiences (a.k.a. state-action-state' transitions), which determines its future rewards, which determines its future cognitive updates.

In (S)SL, there isn't such an entanglement (assuming teacher forcing in the SSL regime). Whether or not the network outputs cat or dog now, doesn't really affect the future data distribution shown to the agent. 

After a few minutes of thinking, I think that the relevant criterion is:

$$P(d_{t+k} \mid d_t) = P(d_{t+k}),$$

where $d_t$ and $d_{t+k}$ are data points ($(s, a, s', r)$ tuples in RL, $(x, y)$ labelled datapoints in supervised learning, $(x_{1:t-1}, x_t)$ context-completion pairs in self-supervised predictive text learning, etc).

Most RL regimes break this assumption pretty hard.
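A toy sketch of the difference, under made-up assumptions (a three-item dataset and a 1-D chain environment, both hypothetical): in the supervised stream the learner's outputs never feed back into what it sees next, while in the RL stream the policy's actions determine the states it visits.

```python
import random

def supervised_stream(dataset, predict, steps, seed=0):
    """Data shown to the learner: sampled independently of what it outputs."""
    rng = random.Random(seed)
    seen = []
    for _ in range(steps):
        x = rng.choice(dataset)
        predict(x)  # the prediction is computed but never affects sampling
        seen.append(x)
    return seen

def rl_stream(policy, steps, start=0):
    """States visited on a 1-D chain: the next state depends on the action."""
    state, seen = start, []
    for _ in range(steps):
        seen.append(state)
        state += policy(state)  # action is -1 or +1
    return seen

dataset = ["cat", "dog", "bird"]
always_cat = lambda x: "cat"
always_dog = lambda x: "dog"
go_right = lambda s: +1
go_left = lambda s: -1

sl_a = supervised_stream(dataset, always_cat, steps=5)
sl_b = supervised_stream(dataset, always_dog, steps=5)
rl_a = rl_stream(go_right, steps=5)
rl_b = rl_stream(go_left, steps=5)

print(sl_a == sl_b)  # True: two different models see the same data
print(rl_a == rl_b)  # False: two different policies see different data
```

This only illustrates the dependence structure; it says nothing about how much agency that dependence buys.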

Corollaries:

  • Dependence allows message-passing and chaining of computation across time, beyond whatever recurrent capacities the network has.
    • This probably is "what agency is built from"; the updates chaining cognition together into weak coherence-over-time. I currently don't see an easy way to be less handwavy or more concrete.
  • Dependence should strictly increase path-dependence of training.
  • Amplifying a network using its own past outputs always breaks independence. 
  • I think that independence is the important part of (S)SL, not identical distribution; so I say "independence" and not "IID." 
    • EG Pre-trained initializations generally break the "ID" part.

Thanks to Quintin Pope and Nora Belrose for conversations which produced these thoughts.

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2022-09-07T17:55:01.304Z · LW(p) · GW(p)

I’m not inclined to think that “exact gradients” is important; in fact, I’m not even sure if it’s (universally) true. In particular, PPO / TRPO / etc. are approximating a policy gradient, right? I feel like, if some future magical technique was a much better approximation to the true policy gradient, such that it was for all intents and purposes a perfect approximation, it wouldn’t really change how I think about RL in general. Conversely, on the SSL side, you get gradient noise from things like dropout and the random selection of data in each batch, so you could say the gradient “isn’t exact”, but I don’t think that makes any important conceptual difference either.

(A central difference in practice is that SSL gives you a gradient “for free” each query, whereas RL policy gradients require many runs in an identical (episodic) environment before you get a gradient.)

In terms of “why RL” in general, among other things, I might emphasize the idea that if we want an AI that can (for example) invent new technology, it needs to find creative out-of-the-box solutions to problems (IMO), which requires being able to explore / learn / build knowledge in parts of concept-space where there is no human data. SSL can’t do that (at least, “vanilla SSL” can’t do that; maybe there are “SSL-plus” systems that can), whereas RL algorithms can. I guess this is somewhat related to your “independence”, but with a different emphasis.

I don’t have too strong an opinion about whether vanilla SSL can yield an “agent” or not. It would seem to be a pointless and meaningless terminological question. Hmm, I guess when I think of “agent” it has a bunch of connotations, e.g. an ability to do trial-and-error exploration, and I think that RL systems tend to match all those connotations more than SSL systems—at least, more than “vanilla” SSL systems. But again, if someone wants to disagree, I’m not interested in arguing about it.

comment by TurnTrout · 2023-11-13T21:07:14.841Z · LW(p) · GW(p)

When writing about RL, I find it helpful to disambiguate between:

A) "The policy optimizes the reward function" / "The reward function gets optimized" (this might happen but has to be reasoned about), and

B) "The reward function optimizes the policy" / "The policy gets optimized (by the reward function and the data distribution)" (this definitely happens, either directly -- via eg REINFORCE -- or indirectly, via an advantage estimator in PPO; B follows from the update equations)
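For reference, the standard textbook forms of the update equations that B is read off from (notation is the usual one, not taken from the comment above): in REINFORCE the return $G_t$ directly scales the parameter update, and in PPO the advantage estimate $\hat{A}_t$ plays that role inside a clipped surrogate objective.

```latex
% REINFORCE: the return G_t scales the update to the policy parameters.
\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

% PPO (clipped surrogate): the advantage estimate \hat{A}_t modulates the update.
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\, \hat{A}_t,\;
      \operatorname{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t \right) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

In both cases reward enters only as a multiplier on how the policy's parameters change. Nothing in the equations says the resulting policy computes or pursues reward, which is why A has to be argued for separately.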

comment by TurnTrout · 2020-07-07T23:04:03.243Z · LW(p) · GW(p)

I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated?

And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment). 

What can we learn about this?
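The multiple-choice analogy can be made concrete in the binary case with a toy sketch (the labels and predictions are made up). A predictor that is wrong on every example carries exactly as much information as one that is right on every example; only the final readout differs. Whether real networks trained to maximize loss learn the same intermediate features is the empirical question, and this sketch does not address it.

```python
def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def flip(predictions):
    """Swap each binary prediction for the other class."""
    return [1 - p for p in predictions]

labels = [0, 1, 1, 0, 1]          # hypothetical ground truth
worst_possible = [1, 0, 0, 1, 0]  # a predictor that is wrong every time

print(accuracy(worst_possible, labels))        # 0.0
print(accuracy(flip(worst_possible), labels))  # 1.0
```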

Replies from: evhub
comment by evhub · 2020-07-07T23:36:08.711Z · LW(p) · GW(p)

A lot of examples of this sort of stuff show up in OpenAI clarity's circuits analysis work. In fact, this is precisely their Universality hypothesis. See also my discussion here [LW · GW].

comment by TurnTrout · 2022-10-30T19:41:35.879Z · LW(p) · GW(p)

Outer/inner alignment decomposes a hard problem into two extremely hard problems. 

I have a long post draft about this, but I keep delaying putting it out in order to better elaborate the prereqs which I seem to keep getting stuck on when elaborating the ideas. I figure I might as well put this out for now, maybe it will make some difference for someone.

I think that the inner/outer alignment framing[1] seems appealing but is actually a doomed problem decomposition and an unhelpful frame for alignment. 

  1. The reward function is a tool which chisels cognition into agents [LW · GW] through gradient updates, but the outer/inner decomposition assumes that that tool should also embody the goals we want to chisel into the agent. When chiseling a statue, the chisel doesn’t have to also look like the finished statue. 
  2. I know of zero success stories for outer alignment to real-world goals. 
    1. More precisely, stories where people decided “I want an AI which [helps humans / makes diamonds / plays Tic-Tac-Toe / grows strawberries]”, and then wrote down an outer objective only maximized in those worlds.
    2. This is pretty weird on any model where most of the specification difficulty of outer alignment comes from the complexity of human values. Instead, I think this more shows that outer alignment is a wrong language for specifying agent motivations.
  3. If you look at the single time ever that human-compatible values have arisen in generally intelligent minds (i.e. in humans), you’ll infer that it wasn’t done through outer/inner alignment. According to shard theory [LW · GW], human values are inner alignment failures on the reward circuitry in the human brain (read carefully: this is not the usual evolution analogy!). If you aim to “solve” outer and inner alignment, you are ruling out the only empirically known class of methods for growing human-compatible values.
  1. ^

    An example grounding which I argue against: 

    1. Outer alignment: get a reward function which "robustly represents" the intended goal in all situations which the trained AI can understand.

    2. Inner alignment: make the trained AI intent-aligned with optimizing that objective (i.e. "care about" that objective).

    This isn't the only grounding of outer/inner, and while I don't strongly object to all of them, I do weakly object to all of them (as I understand them) and strongly object to most of them. 

comment by TurnTrout · 2020-04-24T15:38:04.997Z · LW(p) · GW(p)

Weak derivatives

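In calculus, the product rule says $\frac{d}{dx}(f \cdot g) = f' \cdot g + f \cdot g'$. The fundamental theorem of calculus says that the Riemann integral acts as the anti-derivative.[1] Combining these two facts, we derive integration by parts:

$$\int_a^b f'(x)\, g(x)\, dx = \Big[ f(x)\, g(x) \Big]_a^b - \int_a^b f(x)\, g'(x)\, dx.$$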

It turns out that we can use these two properties to generalize the derivative to match some of our intuitions on edge cases. Let's think about the absolute value function:

[Image: graph of the absolute value function, from Wikipedia]

The boring old normal derivative isn't defined at $x = 0$, but it seems like it'd make sense to be able to say that the derivative is eg 0. Why might this make sense?

Taylor's theorem (and its generalizations) characterize first derivatives as tangent lines with slope $L := f'(x_0)$ which provide good local approximations of $f$ around $x_0$: $f(x) \approx f(x_0) + L(x - x_0)$. You can prove that this is the best approximation you can get using only $f(x_0)$ and $L$! In the absolute value example, defining the "derivative" to be zero at $x = 0$ would minimize approximation error on average in neighborhoods around the origin.
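A quick numerical check of the absolute-value claim, assuming "approximation error on average" means mean squared error over a symmetric interval (that metric is my assumption; the function name and grid are illustrative). Among lines through the origin, slope 0 comes out best.

```python
def mean_sq_error(slope, radius=1.0, n=2001):
    """Average of (|x| - slope * x)^2 over an evenly spaced grid on [-radius, radius]."""
    xs = [-radius + 2 * radius * i / (n - 1) for i in range(n)]
    return sum((abs(x) - slope * x) ** 2 for x in xs) / n

# Compare a handful of candidate slopes for the line through (0, 0).
errors = {s: mean_sq_error(s) for s in (-1.0, -0.5, 0.0, 0.5, 1.0)}
best_slope = min(errors, key=errors.get)
print(best_slope)  # 0.0
```

Analytically, the integral of $(|x| - Lx)^2$ over $[-\delta, \delta]$ is $(1 + L^2) \cdot \tfrac{2}{3}\delta^3$, which is minimized at $L = 0$ for every $\delta$.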

In multivariable calculus, the Jacobian is a tangent plane which again minimizes approximation error (with respect to the Euclidean distance, usually) in neighborhoods around the function. That is, having a first derivative means that the function can be locally approximated by a linear map. It's like a piece of paper that you glue onto the point in question.


This reasoning even generalizes to the infinite-dimensional case with functional derivatives (see my recent functional analysis textbook review [LW(p) · GW(p)]). All of these cases are instances of the Fréchet derivative.

Complex analysis provides another perspective on why this might make sense, but I think you get the idea and I'll omit that for now.

We can define a weaker notion of differentiability which lets us do this – in fact, it lets us define the weak derivative to be anything at $x = 0$! Now that I've given some motivation, here's a great explanation of how weak derivatives arise from the criterion of "satisfy integration by parts for all relevant functions".
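As a sketch of how that criterion plays out for $f(x) = |x|$: call $g$ a weak derivative of $f$ if $\int f \varphi' \, dx = -\int g \varphi \, dx$ for every smooth test function $\varphi$ with compact support (this is the standard choice of "relevant functions", assumed here). Splitting at the origin and integrating by parts on each half, with the boundary terms vanishing because $\varphi$ has compact support:

```latex
\begin{aligned}
\int_{-\infty}^{\infty} |x|\, \varphi'(x)\, dx
  &= \int_{-\infty}^{0} (-x)\, \varphi'(x)\, dx + \int_{0}^{\infty} x\, \varphi'(x)\, dx \\
  &= \Big[ -x\, \varphi(x) \Big]_{-\infty}^{0} + \int_{-\infty}^{0} \varphi(x)\, dx
     + \Big[ x\, \varphi(x) \Big]_{0}^{\infty} - \int_{0}^{\infty} \varphi(x)\, dx \\
  &= \int_{-\infty}^{0} \varphi(x)\, dx - \int_{0}^{\infty} \varphi(x)\, dx \\
  &= -\int_{-\infty}^{\infty} \operatorname{sign}(x)\, \varphi(x)\, dx .
\end{aligned}
```

So $g(x) = \operatorname{sign}(x)$ works. Changing $g$ at the single point $x = 0$ changes neither integral, which is why the value there is unconstrained.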


  1. As far as I can tell, the indefinite Riemann integral being the anti-derivative means that it's the inverse of $\frac{d}{dx}$ in the group theoretic sense – with respect to composition in the $\mathbb{R}$-vector space of operators on real-valued functions. You might not expect this, because $\int$ maps an integrable function $f$ to a set of functions $\{F + C \mid C \in \mathbb{R}\}$. However, this doesn't mean that the inverse isn't unique (as it must be), because the inverse is in operator-space. ↩︎

Replies from: TurnTrout
comment by TurnTrout · 2020-04-24T15:47:08.477Z · LW(p) · GW(p)

The reason $f'(0)$ is undefined for the absolute value function is that you need the value to be the same for all sequences converging to 0 – both from the left and from the right. There's a nice way to motivate this in higher-dimensional settings by thinking about the action of e.g. complex multiplication, but this is a much stronger notion than real differentiability and I'm not quite sure how to think about motivating the single-valued real case yet. Of course, you can say things like "the theorems just work out nicer if you require both the lower and upper limits be the same"...
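A small sketch of the two-sided requirement (the helper name is illustrative): the difference quotients of $|x|$ at 0 stay at +1 when approaching from the right and at -1 from the left, so no single limit exists.

```python
def difference_quotient(f, x0, h):
    """Slope of the secant line from x0 to x0 + h."""
    return (f(x0 + h) - f(x0)) / h

steps = (1e-1, 1e-3, 1e-6)
from_right = [difference_quotient(abs, 0.0, h) for h in steps]
from_left = [difference_quotient(abs, 0.0, -h) for h in steps]

print(from_right)  # [1.0, 1.0, 1.0]
print(from_left)   # [-1.0, -1.0, -1.0]
```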

comment by TurnTrout · 2020-07-14T02:26:18.654Z · LW(p) · GW(p)

When I notice I feel frustrated, unproductive, lethargic, etc, I run down a simple checklist:

  • Do I need to eat food?
  • Am I drinking lots of water?
  • Have I exercised today?
  • Did I get enough sleep last night? 
    • If not, what can I do now to make sure I get more tonight?
  • Have I looked away from the screen recently?
  • Have I walked around in the last 20 minutes?

It's simple, but 80%+ of the time, it fixes the issue.

Replies from: Viliam, mr-hire
comment by Viliam · 2020-07-14T19:32:08.987Z · LW(p) · GW(p)

There is a "HALT: hungry? angry? lonely? tired?" mnemonic, but I like that your list includes water and walking and exercise. Now just please make it easier to remember.

Replies from: AllAmericanBreakfast
comment by DirectedEvolution (AllAmericanBreakfast) · 2020-07-15T22:33:10.546Z · LW(p) · GW(p)

How about THREES: Thirsty Hungry Restless Eyestrain Exercise?

comment by Matt Goldenberg (mr-hire) · 2020-07-14T03:38:33.959Z · LW(p) · GW(p)

Hey can I steal this for a course I'm teaching? (I'll give you credit).

Replies from: TurnTrout
comment by TurnTrout · 2020-07-14T11:49:27.742Z · LW(p) · GW(p)

sure!

comment by TurnTrout · 2020-01-13T02:15:39.463Z · LW(p) · GW(p)

While reading Focusing today, I thought about the book and wondered how many exercises it would have. I felt a twinge of aversion. In keeping with my goal of increasing internal transparency, I said to myself: "I explicitly and consciously notice that I felt averse to some aspect of this book".

I then Focused on the aversion. Turns out, I felt a little bit disgusted, because a part of me reasoned thusly:

If the book does have exercises, it'll take more time. That means I'm spending reading time on things that aren't math textbooks. That means I'm slowing down.

(Transcription of a deeper Focusing on this reasoning)

I'm afraid of being slow. Part of it is surely the psychological remnants of the RSI I developed in the summer of 2018. That is, slowing down is now emotionally associated with disability and frustration. There was a period of meteoric progress as I started reading textbooks and doing great research, and then there was pain. That pain struck even when I was just trying to take care of myself, sleep, open doors. That pain then left me on the floor of my apartment, staring at the ceiling, desperately willing my hands to just get better. They didn't (for a long while), so I just lay there and cried. That was slow, and it hurt. No reviews, no posts, no typing, no coding. No writing, slow reading. That was slow, and it hurt.

Part of it used to be a sense of "I need to catch up and learn these other subjects which [Eliezer / Paul / Luke / Nate] already know". Through internal double crux, I've nearly eradicated this line of thinking, which is neither helpful nor relevant nor conducive to excitedly learning the beautiful settled science of humanity. Although my most recent post [LW · GW] touched on impostor syndrome, that isn't really a thing for me. I feel reasonably secure in who I am, now (although part of me worries that others wrongly view me as an impostor?).

However, I mostly just want to feel fast, efficient, and swift again. I sometimes feel like I'm in a race with Alex, and I feel like I'm losing.

comment by TurnTrout · 2022-12-26T19:12:38.755Z · LW(p) · GW(p)

Hindsight bias and illusion of transparency seem like special cases of a failure to fully uncondition variables in your world model (e.g. who won the basketball game), or to model an ignorant other person, such that your attempts to reason from your prior state of ignorance (e.g. about who won) either are advantaged by the residual information or reactivate your memories of that information.

comment by TurnTrout · 2022-11-25T20:29:06.483Z · LW(p) · GW(p)

An alternate mechanistic vision of how agents can be motivated to directly care about e.g. diamonds or working hard. In Don't design agents which exploit adversarial inputs [LW · GW], I wrote about two possible mind-designs:

Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices. 

  1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as "working hard" and "behaving well."
  2. Value-child: The mother makes her kid care about working hard and behaving well.

I explained how evaluation-child is positively incentivized to dupe his model of his mom and thereby exploit adversarial inputs to her cognition. This shows that aligning an agent to evaluations of good behavior is not even close to aligning an agent to good behavior.

However, some commenters seemed maybe skeptical that value-child can exist, or uncertain how concretely that kind of mind works. I worry/suspect that many people have read shard theory [LW · GW] posts without internalizing new ideas about how cognition can work, about how real-world caring can work on a mechanistic level. Where effective real-world cognition doesn't have to (implicitly) be about optimizing an expected utility function over all possible plans. This last sentence might have even seemed bizarre to you. 

Here, then, is an extremely detailed speculative story for value-child's first day at school. Well, his first day spent with his newly-implanted "work hard" and "behave well" value shards.


Value-child gets dropped off at school. He recognizes his friends (via high-level cortical activations previously formed through self-supervised learning) and waves at them (friend-shard was left intact). They rush over to greet him. They start talking about Fortnite. Value-child cringes slightly as he predicts he will be more distracted later at school and, increasingly, put in a mental context where his game-shard takes over decision-making, which is reflectively-predicted to lead to him daydreaming during class. This is a negative update on the primary shard-relevant features for the day.

His general-purpose planning machinery generates an example hardworking-shard-desired terminal state: Paying rapt attention during Mr. Buck’s math class (his first class today). He currently predicts that while he is in Mr. Buck’s class later, he will still be somewhat distracted by residual game-related cognition causing him to loop into reward-predicted self-reinforcing thoughts.

He notices a surprisingly low predicted level for a variable (amount of game-related cognition predicted for future situation: Mr. Buck’s class) which is important to a currently activated shard (working hard). This triggers a previously learned query to his WM: “why are you making this prediction for this quantity?”. The WM responds with a few sources of variation, including how value-child is currently near his friends who are talking about Fortnite. In more detail, the WM models the following (most of it not directly translatable to English):

His friends’ utterances will continue to be about Fortnite. Their words will be processed and then light up Fortnite-related abstractions, which causes both prediction of more Fortnite-related observations and also increasingly strong activation of the game-shard. Due to previous reward events, his game-shard is shaped so as to bid up game-related thoughts, which are themselves rewarding events, which causes a positive feedback loop where he slightly daydreams about video games while his friends talk.

When class is about to start, his “get to class”-related cognition will be activated by his knowledge of the time and his WM indicating “I’m at school.” His mental context will slightly change, he will enter the classroom and sit down, and he will take out his homework. He will then pay token attention due to previous negative social-reward events around being caught off guard.

[Exception thrown! The world model was concurrently coarsely predicting what it thinks will happen given his current real values (which include working hard). The coarse prediction clashes with the above cached prediction that he will only pay token attention in math class! 

The WM hiccups on this point, pausing to more granularly recompute its predictions. It squashes the cached prediction that he doesn’t strongly care about paying attention in class. Since his mom installed a hard-working-shard and an excel-at-school shard, he will actively try to pay attention. This prediction replaces the cached prior prediction.]

However, value-child will still have game-related cognition activated, and will daydream. This decreases value-relevant quantities, like “how hard he will be working” and “how much he will excel” and “how much he will learn.”

This last part is antithetical to the new shards, so they bid down “Hang around friends before heading into school.” Having located a predicted-to-be-controllable source of negative influence on value-relevant outcomes, the shards bid for planning to begin. The implied causal graph is:

Continuing to hear friends talk about Fortnite
           |
           v
Distracted during class

So the automatic causality-noticing algorithms bid to knock out the primary modeled cause of the negative value-relevant influence. The current planning subgoal is set to: make causal antecedent false and reduce level of predicted distraction. Candidate concretization set to: get away from friends.

(The child at this point notices they want to get away from this discussion, that they are in some sense uncomfortable. They feel themselves looking for an excuse to leave the conversation. They don't experience the flurry of thoughts and computations described above. Subconscious computation is subconscious. Even conscious thoughts won't introspectively reveal their algorithmic underpinnings.)

“Hey, Steven, did you get problem #3 for math? I want to talk about it.” Value-child starts walking away.


Crucially, in this story, value-child cares about working hard in that his lines of cognition stream together to make sure he actually works hard in the future. He isn't trying to optimize his later evaluation of having worked hard. He isn't ultimately and primarily trying to come up with a plan which he will later evaluate as being a maximally hard-work-involving plan. 

Value-child comes up with a hard-work plan as an effect of his cognition, not as a motivating cause—not because he only wants to come up with plans he himself will rate highly. He values working hard.

Replies from: tailcalled
comment by tailcalled · 2022-11-25T21:31:08.342Z · LW(p) · GW(p)

I can totally believe that agents that competently and cooperatively seek to fulfill a goal, rather than seeking to trick evaluators of that goal into thinking it has been fulfilled, can exist.

However, whether you get such agents out of an algorithm depends on the details of that algorithm. Current reinforcement learning algorithms mostly don't create agents that competently do anything. If they were more powerful while still doing essentially the same thing they currently do, most of them would end up tricked by the agents they create, rather than having aligned agents.

comment by TurnTrout · 2022-07-19T19:48:55.869Z · LW(p) · GW(p)

Experiment: Train an agent in MineRL which robustly cares about chickens (e.g. would zero-shot generalize to saving chickens in a pen from oncoming lava, by opening the pen and chasing them out, or stopping the lava). Challenge mode: use a reward signal which is a direct function of the agent's sensory input.

This is a direct predecessor to the "Get an agent to care about real-world dogs" problem. I think solving the Minecraft version of this problem will tell us something about how outer reward schedules relate to inner learned values, in a way which directly tackles the key questions, the sensory observability/information inaccessibility issue, and which is testable today.

(Credit to Patrick Finley for the idea)

Replies from: TurnTrout
comment by TurnTrout · 2022-07-24T22:53:31.832Z · LW(p) · GW(p)

After further review, this is probably beyond capabilities for the moment. 

Also, the most important part of this kind of experiment is predicting in advance what reward schedules will produce what values within the agent, such that we can zero-shot transfer that knowledge to other task types (e.g. XLAND instead of Minecraft) and say "I want an agent which goes to high-elevation platforms reliably across situations, with low labelling cost", and then sketch out a reward schedule, and have the first capable agents trained using that schedule generalize in the way you want.

Replies from: Jay Bailey
comment by Jay Bailey · 2022-10-09T11:18:53.888Z · LW(p) · GW(p)

Why is this difficult? Is it only difficult to do this in Challenge Mode - if you could just code in "Number of chickens" as a direct feed to the agent, can it be done then? I was thinking about this today, and got to wondering why it was hard - at what step does an experiment to do this fail?

Replies from: TurnTrout
comment by TurnTrout · 2022-10-10T22:16:15.774Z · LW(p) · GW(p)

Even if you can code in number of chickens as an input to the reward function, that doesn't mean you can reliably get the agent to generalize to protect chickens. That input probably makes the task easier than in Challenge Mode, but not necessarily easy. The agent could generalize to some other correlate. Like ensuring there are no skeletons nearby (because they might shoot nearby chickens), but not in order to protect the chickens.

Replies from: Jay Bailey
comment by Jay Bailey · 2022-10-10T23:50:45.715Z · LW(p) · GW(p)

So, if I understand correctly, the way we would consider it likely that the correct generalisation had happened would be if the agent could generalise to hazards it had never seen actually kill chickens before? And this would require the agent to have an actual model of how chickens can be threatened such that it could predict that lava would destroy chickens based on, say, its knowledge that it will die if it jumps into lava, which is beyond capabilities at the moment?

Replies from: TurnTrout
comment by TurnTrout · 2022-10-17T18:56:13.399Z · LW(p) · GW(p)

Yes, that would be the desired generalization in the situations we checked. If that happens, we would have specified a behavioral generalization property, written down how we were going to get it, and then just been right in predicting that that training rationale would go through.

comment by TurnTrout · 2019-07-05T23:00:58.761Z · LW(p) · GW(p)

I passed a homeless man today. His face was wracked in pain, body rocking back and forth, eyes clenched shut. A dirty sign lay forgotten on the ground: "very hungry".

This man was once a child, with parents and friends and dreams and birthday parties and maybe siblings he'd get in arguments with and snow days he'd hope for.

And now he's just hurting.

And now I can't help him without abandoning others. So he's still hurting. Right now.

Reality is still allowed to make this happen. This is wrong. This has to change.

Replies from: SaidAchmiz, Raemon
comment by Said Achmiz (SaidAchmiz) · 2019-07-06T03:12:51.584Z · LW(p) · GW(p)

How would you help this man, if having to abandon others in order to do so were not a concern? (Let us assume that someone else—someone whose competence you fully trust, and who will do at least as good a job as you will—is going to take care of all the stuff you feel you need to do.)

What is it you had in mind to do for this fellow—specifically, now—that you can’t (due to those other obligations)?

Replies from: TurnTrout, Raemon
comment by TurnTrout · 2019-07-06T05:02:37.715Z · LW(p) · GW(p)

Suppose I actually cared about this man with the intensity he deserved - imagine that he were my brother, father, or best friend.

The obvious first thing to do before interacting further is to buy him a good meal and a healthy helping of groceries. Then, I need to figure out his deal. Is he hurting, or is he also suffering from mental illness?

If the former, I'd go the more straightforward route of befriending him, helping him purchase a sharp business professional outfit, teaching him to interview and present himself with confidence, secure an apartment, and find a job.

If the latter, this gets trickier. I'd still try and befriend him (consistently being a source of cheerful conversation and delicious food would probably help), but he might not be willing or able to get the help he needs, and I wouldn't have the legal right to force him. My best bet might be to enlist the help of a psychological professional for these interactions. If this doesn't work, my first thought would be to influence the local government to get the broader problem fixed (I'd spend at least an hour considering other plans before proceeding further, here). Realistically, there's likely a lot of pressure in this direction already, so I'd need to find an angle from which few others are pushing or pulling where I can make a difference. I'd have to plot out the relevant political forces, study accounts of successful past lobbying, pinpoint the people I need on my side, and then target my influencing accordingly.

(All of this is without spending time looking at birds-eye research and case studies of poverty reduction; assume counterfactually that I incorporate any obvious improvements to these plans, because I'd care about him and dedicate more than like 4 minutes of thought).

Replies from: SaidAchmiz
comment by Said Achmiz (SaidAchmiz) · 2019-07-06T05:53:48.328Z · LW(p) · GW(p)

Well, a number of questions may be asked here (about desert, about causation, about autonomy, etc.). However, two seem relevant in particular:

First, it seems as if (in your latter scenario) you’ve arrived (tentatively, yes, but not at all unreasonably!) at a plan involving systemic change. As you say, there is quite a bit of effort being expended on this sort of thing already, so, at the margin, any effective efforts on your part would likely be both high-level and aimed in an at-least-somewhat-unusual direction.

… yet isn’t this what you’re already doing?

Second, and unrelatedly… you say:

Suppose I actually cared about this man with the intensity he deserved—imagine that he were my brother, father, or best friend.

Yet it seems to me that, empirically, most people do not expend the level of effort which you describe, even for their siblings, parents, or close friends. Which is to say that the level of emotional and practical investment you propose to make (in this hypothetical situation) is, actually, quite a bit greater than that which most people invest in their family members or close friends.

The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?

Replies from: TurnTrout
comment by TurnTrout · 2019-07-06T06:08:46.662Z · LW(p) · GW(p)
… yet isn’t this what you’re already doing?

I work on technical AI alignment, so some of those I help (in expectation) don't even exist yet. I don't view this as what I'd do if my top priority were helping this man.

The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?

That's a good question. I think the answer is yes, at least for my close family. Recently, I've expended substantial energy persuading my family to sign up for cryonics with me, winning over my mother, brother, and (I anticipate) my aunt. My father has lingering concerns which I think he wouldn't have upon sufficient reflection, so I've designed a similar plan for ensuring he makes what I perceive to be the correct, option-preserving choice. For example, I made significant targeted donations to effective charities on his behalf to offset (what he perceives as) a considerable drawback of cryonics: his inability to also be an organ donor.

A universe in which humanity wins but my dad is gone would be quite sad to me, and I'll take whatever steps necessary to minimize the chances of that.

I don't know how unusual this is. This reminds me of the relevant Harry-Quirrell exchange; most people seem beaten-down and hurt themselves, and I can imagine a world in which people are in better places and going to greater lengths for those they love. I don't know if this is actually what would make more people go to these lengths (just an immediate impression).

comment by Raemon · 2019-07-06T03:30:46.593Z · LW(p) · GW(p)

I predict that this comment is not helpful to TurnTrout.

comment by Raemon · 2019-07-05T23:07:11.852Z · LW(p) · GW(p)

:(

Song I wrote about this once (not very polished)

comment by TurnTrout · 2023-11-13T02:35:01.677Z · LW(p) · GW(p)

Explaining Wasserstein distance. I haven't seen the following explanation anywhere, and I think it's better than the rest I've seen.

The Wasserstein distance tells you the minimal cost to "move" one probability distribution $\mu$ into another $\nu$. It has a lot of nice properties.[1] Here's the chunk of math (don't worry if you don't follow it):

The Wasserstein 1-distance between two probability measures $\mu$ and $\nu$ is

$$W_1(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}\left[ d(x, y) \right],$$

where $\Gamma(\mu, \nu)$ is the set of all couplings of $\mu$ and $\nu$.

What's a "coupling"? It's a joint probability distribution $\gamma$ over $(x, y)$ such that its two marginal distributions equal $\mu$ and $\nu$. However, I like to call these transport plans. Each plan specifies a way to transport a distribution $\mu$ into another distribution $\nu$:

[Figure: a transport plan $\gamma$ between $\mu$ and $\nu$.]

(EDIT: The $y = x$ line should be flipped.)

Now consider a given point $x$ in $\mu$'s support, say the one with the dotted line below it. $x$'s density must be "reallocated" into $\nu$'s distribution. That reallocation is specified by the conditional distribution $\gamma(y \mid x)$, as shown by the vertical dotted line. Marginalizing over $x$ transports all of $\mu$'s density and turns it into $\nu$! (This is why we required the marginalization.)

Then the Wasserstein 1-distance is simple in this case, where $\mu, \nu$ are distributions on $\mathbb{R}$. The 1-distance of a plan $\gamma$ is simply the expected absolute distance from the $y = x$ line! 

Then the Wasserstein just finds the infimum over all possible transport plans! Spiritually, the Wasserstein 1-distance[2] tells you the cost of the most efficient way to take each point in $\mu$ and redistribute it into $\nu$. Just evaluate each transport plan by looking at the expected deviation from the identity line $y = x$.

Exercise: For $\mu, \nu$ on $\mathbb{R}$, where $\nu$ is $\mu$ but translated to the right by $t$, use this explanation to explain why the 1-distance equals $t$.

  1. ^

    The distance is motivated by section 1 of "Optimal Transport and Wasserstein Distance."

  2. ^

    This explanation works for the $p$-distance for $p > 1$, it just makes the math a little more cluttered.
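
As a rough numerical companion to the transport-plan picture, here is a minimal Python sketch. It assumes two equal-size empirical distributions on the real line, where (as far as I know) pairing sorted samples gives an optimal plan. The names `plan_cost` and `w1_empirical`, the Gaussian samples, and the shift `t = 2.5` are illustrative choices, not anything from the comment above.

```python
import random


def plan_cost(pairs):
    """Expected |x - y| under a plan given as equally weighted (x, y) pairs.

    This is the 'expected absolute distance from the y = x line'.
    """
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def w1_empirical(xs, ys):
    """W1 between two equal-size empirical distributions on the real line.

    Assumption: in 1-D, matching sorted samples is an optimal transport plan.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal sample counts")
    return plan_cost(list(zip(sorted(xs), sorted(ys))))


rng = random.Random(0)
t = 2.5
mu_samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
nu_samples = [x + t for x in mu_samples]  # nu is mu translated right by t

# A valid but wasteful plan: pair each x with a randomly chosen y.
shuffled = nu_samples[:]
rng.shuffle(shuffled)

best = w1_empirical(mu_samples, nu_samples)
wasteful = plan_cost(list(zip(mu_samples, shuffled)))
print(best, wasteful)  # best is t up to rounding; wasteful is no smaller
```

The sorted matching sends every point exactly `t` to the right, so its cost is `t`. Any other plan has expected `|x - y|` of at least `t`, because the mean displacement is `t` no matter how the points are paired.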

Replies from: jsd
comment by jsd · 2023-11-13T15:37:39.058Z · LW(p) · GW(p)

For onlookers, I strongly recommend Gabriel Peyré and Marco Cuturi's online book Computational Optimal Transport. I also think this is a case where considering discrete distributions helps build intuition.

comment by TurnTrout · 2021-07-29T15:08:44.311Z · LW(p) · GW(p)

If you raised children in many different cultures, "how many" different reflectively stable moralities could they acquire? (What's the "VC dimension" of human morality, without cheating by e.g. directly reprogramming brains?)

(This is probably a Wrong Question, but I still find it interesting to ask.)

comment by TurnTrout · 2019-12-04T00:50:30.859Z · LW(p) · GW(p)

Listening to Eneasz Brodski's excellent reading of Crystal Society, I noticed how curious I am about how AGI will end up working. How are we actually going to do it? What are those insights? I want to understand quite badly, which I didn't realize until experiencing this (so far) intelligently written story.

Similarly, how do we actually "align" agents, and what are good frames for thinking about that?

Here's to hoping we don't sate the former curiosity too early.

comment by TurnTrout · 2023-10-16T18:45:02.964Z · LW(p) · GW(p)

Theoretical predictions for when reward is maximized on the training distribution. I'm a fan of Laidlaw et al.'s recent Bridging RL Theory and Practice with the Effective Horizon:

Deep reinforcement learning works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability...

[We introduce] a new complexity measure that we call the effective horizon, which roughly corresponds to how many steps of lookahead search are needed in order to identify the next optimal action when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also show that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy.

One of my favorite parts is that it helps formalize this idea of "which parts of the state space are easy to explore into." That informal notion has been important for my thinking about RL. 

At a first pass, I expect reward hacking to happen insofar as it's easy to explore into reward misspecifications. So in some situations, the reward can be understood as being optimized against, and those situations might be well-characterized by effective horizon. EG the boat racing example.

In other situations, though, you won't reach optimality due to exploration issues, and I'd instead consider what behaviors and circuits will be reinforced by the experienced rewards. EG RLHF on LLMs probably has this characteristic due to high effective horizon (is my guess).

I think these results help characterize when it's appropriate to use the "optimization target" frame (i.e. when the effective horizon is low, expect literal optimization of the reward on the training distribution), versus the reinforcement frame [LW · GW]. 
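
To make the quoted "steps of lookahead with random rollouts" idea a bit more concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: the three-step deterministic environment, the reward values, and the function names are hypothetical. It only computes the smallest lookahead depth at which acting greedily on random-policy values picks an optimal first action. It is not the paper's full effective-horizon definition, which, as I understand it, also accounts for how many rollout samples are needed.

```python
from itertools import product

ACTIONS = (0, 1)
HORIZON = 3


def reward(history, action):
    """Toy rewards: a small 'bait' for playing 0 first, a larger payoff for 1, 1, 1."""
    if history == () and action == 0:
        return 1.5
    if history == (1, 1) and action == 1:
        return 4.0
    return 0.0


def v_random(history):
    """Exact expected return of the uniformly random policy from `history`."""
    if len(history) == HORIZON:
        return 0.0
    total = sum(reward(history, a) + v_random(history + (a,)) for a in ACTIONS)
    return total / len(ACTIONS)


def v_optimal(history):
    """Best achievable return from `history`."""
    if len(history) == HORIZON:
        return 0.0
    return max(reward(history, a) + v_optimal(history + (a,)) for a in ACTIONS)


def optimal_action(history):
    return max(ACTIONS, key=lambda a: reward(history, a) + v_optimal(history + (a,)))


def lookahead_action(history, k):
    """First action of the best length-k plan, scoring leaves with the random policy."""
    k = min(k, HORIZON - len(history))

    def score(plan):
        total, h = 0.0, history
        for a in plan:
            total += reward(h, a)
            h = h + (a,)
        return total + v_random(h)

    return max(product(ACTIONS, repeat=k), key=score)[0]


def min_lookahead(history=()):
    """Smallest k for which k-step lookahead recovers an optimal first action."""
    for k in range(1, HORIZON + 1):
        if lookahead_action(history, k) == optimal_action(history):
            return k
    return None


print(min_lookahead())  # → 2
```

With one step of lookahead the random-rollout values favor the bait (1.5 versus 1.0), so the greedy choice is wrong. With two steps the plan `(1, 1)` scores 2.0 and the optimal first action is found. In this toy, a shallower required lookahead loosely corresponds to reward being "easy to explore into."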

comment by TurnTrout · 2023-02-11T00:52:42.325Z · LW(p) · GW(p)

The "maximize all the variables" tendency in reasoning about AGI.

Here are some lines of thought I perceive, which are probably straw to varying extents for some people and real to varying extents for other people. I give varying responses to each, but the point isn't the truth value of any given statement, but of a pattern across the statements:

  1. If an AGI has a concept around diamonds, and is motivated in some way to make diamonds, it will make diamonds which maximally activate its diamond-concept circuitry (possible example [LW(p) · GW(p)]). 
    1. My response [LW(p) · GW(p)].
  2. An AI will be trained to minimal loss on the training distribution. 
    1. SGD does not reliably find minimum-loss configurations (modulo expressivity), in practice, in cases we care about. The existence of knowledge distillation is one large counterexample.
    2. Quintin: "In terms of results about model distillation, you could look at appendix G.2 of the Gopher paper. They compare training a 1.4 billion parameter model directly, versus distilling a 1.4 B model from a 7.1 B model."
  3. Predictive processing means that the goal of the human learning process is to minimize predictive loss.[1]
    1. In a process where local modifications are applied to reduce some locally computed error (e.g. "subconsciously predicting the presence of a visual-field edge where there definitely was not an edge"), it's just an enormous leap to go "and then the system minimizes this predictive error." 
    2. I think this is echoing the class of mistake I critique in Reward!=Optimization Target [LW · GW]: "there's a local update on recent data to increase some metric" -> "now I can use goal-laden language to describe the global properties of this process." I think this kind of language (e.g. "the goal of the human learning process", "minimize") should make you sit bolt upright in alarm and indignation. Minimize in a local neighborhood at the given timestep, maybe.
    3. Even if this claim (3) were true due to some kind of convergence result, that would be derived from a detailed analysis of learning dynamics. Not a situation where you can just look at an update rule and go "yeah kinda looks like it's trying to globally minimize predictive loss, guess I can just start calling it that now."
  4. Policy networks are selected for getting as much reward as possible on training, so they'll do that.
    1. Response: "From my perspective, a bunch of selection-based reasoning draws enormous conclusions (e.g. high chance the policy cares about reward OOD) given vague / weak technical preconditions (e.g. policies are selected for reward) without attending to strength or mechanism or size of the trained network or training timescales or net direction of selection." 
    2. SOTA agents often don't get max reward on training.
    3. See more of my comments in this thread [LW(p) · GW(p)] for actual detailed responses.
  5. Because AI researchers try to make reward number go up because that's a nice legible statistic, we can't anticipate in advance ways in which the trained policy won't get maximal reward, otherwise the designers would as well.
  6. An AI will effectively argmax over all plans according to its goals. 
    1. Response: I think this is not how cognition works in reality. There is a deeper set of concepts to consider, here, which changes the analysis. See: Alignment allows "nonrobust" decision-influences and doesn't require robust grading [LW · GW].

I think there are just a bunch of really strong claims like these, about some variable getting set really high or really low for some reason or another, and the claims now seem really weird to me. I initially feel confused why they're being brought up / repeated with such strength and surety. 

I speculate that there's a combination of:

  1. It feels less impressive to imagine a training run where the AGI only gets a high average policy-gradient-intensity score (i.e. reward), and not a maximally high number. 
    1. Wouldn't a sophisticated mind not only be really good at realizing its values (like making paperclips), but also get minimal predictive loss? 
    2. (No.)
  2. People don't know how to set realistic parameter values for these questions because they're confused about AGI / how intelligence works, and they don't explicitly note that to themselves, so they search under the streetlight of what current theory can actually talk about. They set the <quantity> setting to MAX or MIN in weak part because limit-based reasoning is socially approved of. 
    1. But in math and in life, you really have to be careful about your limit-based reasoning!
    2. IE you can make statements about global minima of the loss landscape. But what if training doesn't get you there? That's harder to talk about.[2] 
    3. Or maybe some hapless RL researcher [LW · GW] spends years thinking about optimal policies, only to wake up from the dream and realize that optimality results don't have to tell us a damn thing about trained policies [AF · GW]...
    4. "Argmax" or "loss minimization" feel relatively understandable, whereas "learned decision-making algorithms" are fuzzy and hard. But what if the former set has little to do with the latter?
  1. ^

    My response here assumes that PP is about self-supervised learning on intermediate neuronal activations (e.g. "friend neuron") and also immediate percepts (e.g. "firing of retinal cells"). Maybe I don't understand PP if that's not true.  

  2. ^

    There can be very legit reasons to search under the streetlight, though. Sometimes that helps you see a bit farther out, by relaxing assumptions.
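
The gap between "each update locally reduces the loss" and "the process ends up at the global minimum" shows up even in one dimension. Below is a minimal Python sketch, assuming a hand-picked non-convex loss and plain gradient descent. The function names, the polynomial, the learning rate, and the starting points are all hypothetical choices for illustration, and nothing here is meant as a model of SGD on real networks.

```python
def loss(x):
    """A non-convex toy loss with a local minimum near 1.13 and a global one near -1.30."""
    return x**4 - 3 * x**2 + x


def grad(x):
    """Derivative of `loss`."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=5000):
    """Repeatedly apply the local update x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


from_right = gradient_descent(2.0)   # settles in the local minimum
from_left = gradient_descent(-2.0)   # settles in the global minimum

print(from_right, loss(from_right))
print(from_left, loss(from_left))
```

Both runs apply the same loss-reducing update rule at every step, yet only one of them ends near the global minimum. Saying "the system minimizes the loss" describes the update rule, not where the trajectory actually goes.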

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2023-02-12T19:45:00.739Z · LW(p) · GW(p)

I think this type of criticism is applicable in an even wider range of fields than even you immediately imagine (though in varying degrees, and with greater or lesser obviousness or direct correspondence to the SGD case). Some examples:

  • Despite the economists, the economy doesn't try to maximize welfare, or even net dollar-equivalent wealth. It rewards firms which are able to make a profit in proportion to how much they're able to make a profit, and dis-rewards firms which aren't able to make a profit. Firms which are technically profitable, but have no local profit incentive gradient pointing towards them (factoring in the existence of rich people and lenders, neither of which are perfect expected profit maximizers) generally will not happen.

  • Individual firms also don't (only) try to maximize profit. Some parts of them may maximize profit, but most are just structures of people built from local social capital and economic capital incentive gradients.

  • Politicians don't try to (only) maximize win-probability.

  • Democracies don't try to (only) maximize voter approval.

  • Evolution doesn't try to maximize inclusive genetic fitness.

  • Memes don't try to maximize inclusive memetic fitness.

  • Academics don't try to (only) maximize status.

  • China doesn't maximize allegiance to the CCP.

I think there's a general tendency for people to look at local updates in a system (when the system has humans as decision nodes, the local updates are called incentive gradients), somehow perform some integration-analogue for a function which would produce those local updates, then find a local minimum of that "integrated" function and claim the system is at that minimum or can be approximated well by the system at that minimum. Generally, this seems constrained in empirical systems by common sense learned by experience with the system, but in less and less empirical systems (like the economy or SGD), people get more and more crazy because they have less learned common sense to guide them when making the analysis.

Replies from: TurnTrout
comment by TurnTrout · 2023-02-13T23:08:58.196Z · LW(p) · GW(p)

I think there's a general tendency for people to look at local updates in a system (when the system has humans as decision nodes, the local updates are called incentive gradients)

very pithy. nice insight, thanks. 

comment by TurnTrout · 2022-09-28T22:04:03.162Z · LW(p) · GW(p)

I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)

One of my (TurnTrout's) reasons for alignment optimism is that I think:

  • We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,
    • (Although this amount of information depends on how much interpretability and agent-internals theory we do now)
  • All else equal, early-training values (decision-influences) are the most important to influence, since they steer future training.
  • It's crucial to get early-training value shards of which a substantial fraction are "human-compatible values" (whatever that means)
    • For example, if there are protect-human-shards which 
      • reliably bid against plans where people get hurt,
      • steer deliberation away from such plan stubs, and
      • these shards are "reflectively endorsed" by the overall shard economy (i.e. the decision-making isn't steering towards plans where the protect-human shards get removed)
  • If we install influential human-compatible shards early in training, and they get retained, they will help us in mid- and late-training where we can't affect the ball game very much (e.g. alien abstractions, interpretability problems, can't oversee AI's complicated plans)

Therefore it seems very important to understand what's going on with "shard game theory [LW · GW]" (or whatever those intuitions are pointing at) -- when, why, and how will early decision-influences be retained?

He was talking about viewing new hypotheses as adding traders to a market (in the sense of logical induction). Usually they're viewed as hypotheses. But also possibly you can view them as having values, since a trader can basically be any computation. But you'd want a different market resolution mechanism than a deductive process revealing the truth or falsity of some proposition under some axioms. You want a way for traders to bid on actions. 

I proposed a setup like:

Maybe you could have an "action" instead of a proposition and then the action comes out as 1 or 0 depending on a function of the market position on that action at a given time, which possibly leads to fixed points for every possible resolution.

For example, if all the traders hold $a$ as YES, then $a$ actually does come out as YES. And e.g. a trader $T_1$ which "wants" all the even-numbered actions and a trader $T_2$ which wants all the 10-multiple actions ($a_{10}, a_{20}, \ldots$) can "bargain" by bidding up each other's actions whenever they have extra power and thereby "value handshake."

And that over time, traders who do this should take up more and more market share relative to those who don't exploit gains from trade. 

There should be a very high dependence of final trader coalition on the initial composition of market share. And it seems like some version of this should be able to model self-reflective value drift. You can think about action resolution and payout as a kind of reward event, where certain kinds of shards get reinforced. Bidding for an action which happens and leads to reward gets reinforced (supporting traders receive payouts), and the more you support the action (bid for it), the more responsible your support was for the event, so the larger the strengthening.
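
One way the resolution-and-payout loop above could be sketched in Python is shown below. This is a toy under loud assumptions: the share-weighted threshold rule in `resolve`, the multiplicative payout in `reinforce`, the learning rate, and the example numbers are all hypothetical, and none of it is taken from logical induction or from the conversation being recalled.

```python
def resolve(shares, bids, threshold=0.5):
    """Action comes out 1 (YES) if share-weighted support for it exceeds the threshold."""
    support = sum(s * b for s, b in zip(shares, bids))
    return 1 if support > threshold else 0


def reinforce(shares, bids, reward, lr=0.5):
    """Strengthen traders in proportion to how hard they bid for the rewarded action.

    Shares are renormalized so they remain a distribution over traders.
    """
    raw = [s * (1 + lr * reward * b) for s, b in zip(shares, bids)]
    total = sum(raw)
    return [r / total for r in raw]


shares = [0.5, 0.3, 0.2]   # initial market share of three traders
bids = [1.0, 0.5, 0.0]     # how strongly each trader bids for the action

outcome = resolve(shares, bids)            # support is 0.65, so the action happens
new_shares = reinforce(shares, bids, reward=1.0) if outcome else shares

print(outcome, new_shares)
```

In this toy, if every trader bids fully for an action, `resolve` returns 1, which matches the fixed-point remark in the proposal. The strongest supporter gains share and the non-supporter loses share, so the final composition depends heavily on who held share at the start.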

Abram seemed to think that there might exist a nice result like "Given that a coalition of traders with values X, Y, Z satisfies properties A, B, and C, this coalition will shape future training and trader-addition in a way which accords with X/Y/Z values up to [reasonably tight trader-subjective regret bound]."

What this would tell us is when trader coalitions can bargain / value handshake / self-trust and navigate value drift properly. This seems super important for understanding what happens, long-term, as the AI's initial value shards equilibrate into a reflectively stable utility function; even if we know how to get human-compatible values into a system, we also have to ensure they stay and keep influencing decision-making. And possibly this theorem would solve ethical reflection (e.g. the thing people do when they consider whether utilitarianism accords with their current intuitions).

Issues include:

  • Somehow this has to confront Rice's theorem for adding new traders to a coalition. What strategies would be good? 
    • I think "inspect arbitrary new traders in arbitrary situations" is not really how value drift works, but it seems possibly contingent on internal capabilities jumps in SGD
      • The key question isn't can we predict those value drift events, but can the coalition
      • EG agent keeps training and is surprised to find that an update knocks out most of the human-compatible values.
  • Knowing the right definitions might be contingent on understanding more shard theory (or whatever shard theory should be, for AI, if that's not the right frame).
  • Possibly this is still underspecified and the modeling assumptions can't properly capture what I want; maybe the properties I want are mutually exclusive. But it seems like it shouldn't be true.
  • ETA this doesn't model the contextual activation of values, which is a centerpiece of shard theory. 
Replies from: johnswentworth
comment by johnswentworth · 2022-10-05T19:02:46.279Z · LW(p) · GW(p)

One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents? [LW · GW]), then things can't get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out.

Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there's possibly changes which Goodhart all of the shards simultaneously. Indeed, I'd expect that to be a pretty strong default outcome.

Replies from: TurnTrout
comment by TurnTrout · 2022-10-10T22:23:15.495Z · LW(p) · GW(p)

Even on the view you advocate here (where some kind of perfection is required), "perfectly align part of the motivations" seems substantially easier than "perfectly align all of the AI's optimization so it isn't optimizing for anything you don't want."

If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there's possibly changes which Goodhart all of the shards simultaneously. Indeed, I'd expect that to be a pretty strong default outcome.

I feel significantly less confident about this, and am still working out the degree to which Goodhart seems hard, and in what contours, on my current view.

comment by TurnTrout · 2022-09-15T21:12:49.122Z · LW(p) · GW(p)

"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:

Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: "I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path." He was a profoundly committed hedonist who liked to eat and drink heavily (it was said that he knew how to count everything except calories). -- https://www.newworldencyclopedia.org/entry/John_von_Neumann 

comment by TurnTrout · 2019-09-18T21:57:15.893Z · LW(p) · GW(p)

Good, original thinking feels present to me - as if mental resources are well-allocated.

The thought which prompted this:

Sure, if people are asked to solve a problem and say they can't after two seconds, yes - make fun of that a bit. But that two seconds covers more ground than you might think, due to System 1 precomputation.

Reacting to a bit of HPMOR here, I noticed something felt off about Harry's reply to the Fred/George-tried-for-two-seconds thing. Having a bit of experience noticing confusion, I did not think "I notice I am confused" (although this can be useful). I did not think "Eliezer probably put thought into this", or "Harry is kinda dumb in certain ways - so what if he's a bit unfair here?". Without resurfacing, or distraction, or wondering if this train of thought is more fun than just reading further, I just thought about the object-level exchange.

People need to allocate mental energy wisely; this goes far beyond focusing on important tasks. Your existing mental skillsets already optimize and auto-pilot certain mental motions for you, so you should allocate less deliberation to them. In this case, the confusion-noticing module was honed; by not worrying about how well I noticed confusion, I was able to quickly have an original thought.

When thought processes derail or brainstorming sessions bear no fruit, inappropriate allocation may be to blame. For example, if you're anxious, you're interrupting the actual thoughts with "what-if"s.

To contrast, non-present thinking feels like a controller directing thoughts to go from here to there: do this, and then check that, come up for air over and over... Present thinking is a stream of uninterrupted strikes, the train of thought chugging along without self-consciousness. Moving, instead of thinking about moving while moving.

I don't know if I've nailed down the thing I'm trying to point at yet.

Replies from: TurnTrout
comment by TurnTrout · 2019-09-19T16:04:52.767Z · LW(p) · GW(p)

Sure, if people are asked to solve a problem and say they can't after two seconds, yes - make fun of that a bit. But that two seconds covers more ground than you might think, due to System 1 precomputation.

Expanding on this, there is an aspect of Actually Trying that is probably missing from S1 precomputation. So, maybe the two-second "attempt" is actually useless for most people because subconscious deliberation isn't hardass enough at giving its all, at making desperate and extraordinary efforts to solve the problem.

comment by TurnTrout · 2024-01-20T01:45:22.809Z · LW(p) · GW(p)

The "shoggoth" meme is, in part, unfounded propaganda. Here's one popular incarnation of the shoggoth meme:

Shoggoth with Smiley Face (Artificial Intelligence) | Know Your Meme

This meme accurately portrays the (IMO correct) idea that finetuning and RLHF don't change the base model too much. Furthermore, it's probably true that these LLMs think in an "alien" way. 

However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It's at this point that I'd like to point out the simple, obvious fact that "we don't actually know how these models work, and we definitely don't know that they're creepy and dangerous on the inside."

In my opinion, the prevalence of the shoggoth meme is just another (small) reflection of how community epistemics have been compromised by groupthink and fear. If it's your job to try to accurately understand how models work—if you aspire to wield them and grow them for friendly purposes—then you shouldn't pollute your head with propaganda which isn't based on any substantial evidence.

I'm confident that if there were a "pro-AI" meme with a friendly-looking base model, LW / the shoggoth enjoyers would have nitpicked the friendly meme-creature to hell. They would (correctly) point out "hey, we don't actually know how these things work; we don't know them to be friendly, or what they even 'want' (if anything); we don't actually know what each stage of training does..." 

Oh, hm, let's try that! I'll make a meme asserting that the final model is a friendly combination of its three stages of training, each stage adding different colors of knowledge (pre-training), helpfulness (supervised instruction finetuning), and deep caring (RLHF): 

Enhance the shoggoth creature to be even more cheerful and delightful. Add smiling faces to each part of the creature, symbolizing 'base model', 'supervised fine-tuning', and 'RLHF', to convey a sense of happiness and friendliness. Incorporate elements of rainbows in the design, with vibrant and colorful arcs that blend harmoniously with the creature's fluffy and soft appearance. These rainbows can be integrated into the creature's body or as a background element, adding a playful and magical atmosphere. The overall look should exude positivity, making the creature appear even more approachable, whimsical, and enchanting.

I'm sure that nothing bad will happen to me if I slap this on my laptop, right? I'll be able to think perfectly neutrally about whether AI will be friendly. 

Replies from: habryka4, akash-wasil, akash-wasil, kave, DanielFilan, samuel-marks, Vladimir_Nesov, green_leaf, tailcalled
comment by habryka (habryka4) · 2024-01-20T19:13:39.128Z · LW(p) · GW(p)

However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It's at this point that I'd like to point out the simple, obvious fact that "we don't actually know how these models work, and we definitely don't know that they're creepy and dangerous on the inside."

That's just one of many shoggoth memes. This is the most popular one: 

David Weiner 📼🔪🛸 on X: "“The Shoggoth is a potent ...

The shoggoth here is not particularly exaggerated or scary.

Responding to your suggested alternative that is trying to make a point, it seems like the image fails to be accurate, or it seems to me to convey things we do confidently know are false. It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass. Describing them as a Shoggoth seems pretty reasonable to me, as far as alien intelligences go (alternative common imagery for alien minds are insects or ghosts/spirits with distorted forms, which would evoke similar emotions).

Your picture doesn't get any of that across. It doesn't communicate that the base model does not at all behave like a human would (though it would have isolated human features, which is what the eyes usually represent). It just looks like a cute plushy, but "a cute plushy" doesn't capture any of the experiences of interfacing with a base model (and I don't think the image conveys multiple layers of some kind of training, though that might just be a matter of effort).

I think the Shoggoth meme is pretty good pedagogically. It captures a pretty obvious truth, which is that base models are really quite alien to interface with, that we know that RLHF probably does not change the underlying model very much, but that as a result we get a model that does have a human interface and feels pretty human to interface with (but probably still performs deeply alien cognition behind the scenes). 

This seems like good communication to me. Some Shoggoth memes are cute, some are exaggerated to be scary, which also seems reasonable to me since alien intelligences seem like they are pretty scary, but it's not a necessary requirement of the core idea behind the meme. 

Replies from: nostalgebraist, 1a3orn, TurnTrout, roger-d-1
comment by nostalgebraist · 2024-01-22T00:49:39.462Z · LW(p) · GW(p)

They are deeply schizophrenic, have no consistent beliefs, [...] are deeply psychopathic and seem to have no moral compass

I don't see how this is any more true of a base model LLM than it is of, say, a weather simulation model.

You enter some initial conditions into the weather simulation, run it, and it gives you a forecast.  It's stochastic, so you can run it multiple times and get different forecasts, sampled from a predictive distribution.  And if you had given it different initial conditions, you'd get a forecast for those conditions instead.

Or: you enter some initial conditions (a prompt) into the base model LLM, run it, and it gives you a forecast (completion).  It's stochastic, so you can run it multiple times and get different completions, sampled from a predictive distribution.  And if you had given it a different prompt, you'd get a completion for that prompt instead.
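The parallel can be made concrete with a toy sketch. Everything here is hypothetical and for illustration only: the two "models" are hand-written lookup tables, not a real weather simulation or a real LLM, and the names `WEATHER_MODEL`, `TEXT_MODEL`, `sample`, and `run_many` are made up. The only point is that the same sampling code serves both cases.

```python
import random

# Hypothetical toy "predictive distributions": each maps an input
# (initial conditions or a prompt) to outcomes and their probabilities.
WEATHER_MODEL = {
    "high pressure, dry air": {"sun": 0.9, "rain": 0.1},
    "low pressure, humid air": {"sun": 0.2, "rain": 0.8},
}
TEXT_MODEL = {
    "Once upon a": {" time": 0.95, " midnight": 0.05},
    "The forecast calls for": {" rain": 0.5, " sun": 0.5},
}


def sample(model, conditions, rng):
    """Draw one outcome from the model's distribution for these conditions."""
    dist = model[conditions]
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def run_many(model, conditions, n, seed=0):
    """Run the same conditions n times; different runs may give different outcomes."""
    rng = random.Random(seed)
    return [sample(model, conditions, rng) for _ in range(n)]


# Same code path for a "forecast" and a "completion".
print(run_many(WEATHER_MODEL, "low pressure, humid air", 5))
print(run_many(TEXT_MODEL, "Once upon a", 5))
```

Changing the conditions changes the distribution being sampled from, and nothing else about the procedure.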

It would be strange to call the weather simulation "schizophrenic," or to say it "has no consistent beliefs."  If you put in conditions that imply sun tomorrow, it will predict sun; if you put in conditions that imply rain tomorrow, it will predict rain.  It is not confused or inconsistent about anything, when it makes these predictions.  How is the LLM any different?[1]

Meanwhile, it would be even stranger to say "the weather simulation has no moral compass."

In the case of LLMs, I take this to mean something like, "they are indifferent to the moral status of their outputs, instead aiming only for predictive accuracy."

This is also true of the weather simulation -- and there it is a virtue, if anything!  Hurricanes are bad, and we prefer them not to happen.  But we would not want the simulation to avoid predicting hurricanes on account of this.

As for "psychopathic," davinci-002 is not "psychopathic," any more than a weather model, or my laptop, or my toaster.  It does not neglect to treat me as a moral patient, because it never has a chance to do so in the first place.  If I put a prompt into it, it does not know that it is being prompted by anyone; from its perspective it is still in training, looking at yet another scraped text sample among billions of others like it.

Or: sometimes, I think about different courses of action I could take.  To aid me in my decision, I imagine how people I know would respond to them.  I try, here, to imagine only how they really would respond -- as apart from how they ought to respond, or how I would like them to respond.

If a base model is psychopathic, then so am I, in these moments.  But surely that can't be right?


Like, yes, it is true that these systems -- weather simulation, toaster, GPT-3 -- are not human beings.  They're things of another kind.

But framing them as "alien," or as "not behaving as a human would," implies some expected reference point of "what a human would do if that human were, somehow, this system," which doesn't make much sense if thought through in detail -- and which we don't, and shouldn't, usually demand of our tools and machines.

Is my toaster alien, on account of behaving as it does?  What would behaving as a human would look like, for a toaster?

Should I be unsettled by the fact that the world around me does not teem with levers and handles and LEDs in frantic motion, all madly tapping out morse code for "SOS SOS I AM TRAPPED IN A [toaster / refrigerator / automatic sliding door / piece of text prediction software]"?  Would the world be less "alien," if it were like that?


often spout completely non-human kinds of texts

I am curious what you mean by this.  LLMs are mostly trained on texts written by humans, so this would be some sort of failure, if it did occur often.

But I don't know of anything fitting this description that does occur often.  There are cases like the Harry Potter sample I discuss here [LW · GW], but those have gotten rare as the models have gotten better, though they do still happen on occasion.

  1. ^

    The weather simulation does have consistent beliefs in the sense that it always uses the same (approximation to) real physics. In this sense, the LLM also has consistent beliefs, reflected in the fact that its weights are fixed.

Replies from: habryka4
comment by habryka (habryka4) · 2024-01-22T01:00:10.830Z · LW(p) · GW(p)

I also think the cognition in a weather model is very alien. It's less powerful and general, so I think the error of applying something like the Shoggoth image to that (or calling it "alien") would be that it would imply too much generality, but the alienness seems appropriate. 

If you somehow had a mind that was constructed on the same principles as weather simulations, or your laptop, or your toaster (whatever that would mean, I feel like the analogy is fraying a bit here), that would display similar signs of general intelligence as LLMs, then yeah, I think analogizing them to alien/eldritch intelligences would be pretty appropriate.

It is a very common (and even to me tempting) error to see a system with the generality of GPT-4, trained on human imitation, and imagine that it must internally think like a human. But my best guess is that is not what is going on, and in some sense it is valuable to be reminded that the internal cognition going on in GPT-4 is probably similarly far from what is going on in a human brain as a weather simulation is from what is going on in a human trying to forecast the weather (de-facto I think GPT-4 is somewhere in-between since I do think the imitation learning does create some structural similarities that are stronger between humans and LLMs, but I think overall being reminded of this relevant dimension of alienness pays off in anticipated experiences a good amount). 

Replies from: nostalgebraist
comment by nostalgebraist · 2024-01-22T02:03:35.914Z · LW(p) · GW(p)

I mostly agree with this comment, but I also think this comment is saying something different from the one I responded to.

In the comment I responded to, you wrote:

It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass. Describing them as a Shoggoth seems pretty reasonable to me, as far as alien intelligences go

As I described above, these properties seem more like structural features of the language modeling task than attributes of LLM cognition.  A human trying to do language modeling (as in that game that Buck et al made) would exhibit the same list of nasty-sounding properties for the duration of the experience -- as in, if you read the text "generated" by the human, you would tar the human with the same brush for the same reasons -- even if their cognition remained as human as ever.

I agree that LLM internals probably look different from human mind internals.  I also agree that people sometimes make the mistake "GPT-4 is, internally, thinking much like a person would if they were writing this text I'm seeing," when we don't actually know the extent to which that is true.  I don't have a strong position on how helpful vs. misleading the shoggoth image is, as a corrective to this mistake.

Replies from: gerald-monroe
comment by Gerald Monroe (gerald-monroe) · 2024-01-22T05:31:17.713Z · LW(p) · GW(p)

I don't have a strong position on how helpful vs. misleading the shoggoth image is, as a corrective to this mistake.

You started with random numbers, and you essentially applied rounds of constraint application and annealing.  I kinda think of it as getting a metal really hot and pouring it over mold.  In this case, the 'mold' is your training set.

So what jumps out at me about the "shoggoth" idea is that it's got all these properties: the "shoggoth" hates you, wants to eat you, is just ready to jump you and digest you with its tentacles.  Or whatever.

But none of that cognitive structure will exist unless it paid rent in compressing tokens.  This algorithm will not find the optimal compression algorithm, but you only have a tiny fraction of the weights you need to record the token continuations at chinchilla scaling.  You need every last weight to be pulling its weight (no pun intended).  

comment by 1a3orn · 2024-01-21T21:01:29.914Z · LW(p) · GW(p)

performs deeply alien cognition

I remain unconvinced that there's a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, "nah, LLMs aren't deeply alien."


If LLM cognition was not "deeply alien" what would the world look like?

What distinguishing evidence does this world display, that separates us from that world?

What would an only kinda-alien bit of cognition look like?

What would a very human kind of cognition look like?

What different predictions does the world make?

Does alienness indicate that it is because the models, the weights themselves have no "consistent beliefs" apart from their prompts? Would a human neocortex, deprived of hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?

Is it because they "often spout completely non-human kinds of texts"? Is the Mersenne Twister deeply alien? What counts as "completely non-human"?

Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a "moral compass" apart from the data on which they were trained? If I can use some part of my brain to improv a consistent Nazi, does that mean that it makes sense to call the part of my brain that lets me do that immoral or psychopathic?

Is it that the algorithms that we've found in DL so far don't seem to slot into readily human-understandable categories? Would a not-deeply-alien algorithm be able-to-be cracked open and show us clear propositions of predicate logic? If we had a human neocortex in an oxygen-infused broth in front of us, and we recorded the firing of every cell, do we anticipate that the algorithms there would be clear propositions of predicate logic? Would we be compelled to conclude that human neocortexes were deeply alien?

Or is it deeply alien because we think the substrate of thought is different, based on backprop rather than local learning? What if local learning could actually approximate backpropagation? Or if more realistic non-backprop potential brain algorithms actually... kind of just acted quite similarly to backprop, such that you could draw a relatively smooth line between them and backprop? Would this or more similar research impact whether we thought brains were aliens or not?

Does substrate-difference count as evidence against alien-ness, or does alien-ness just not make that kind of predictions? Is the cognition of an octopus less alien to us than the cognition of an LLM, because it runs on a more biologically-similar substrate?

Does every part of a system by itself need to fit into the average person's ontology for the total to not be deeply alien; do we need to be able to fit every part within a system into a category comprehensible by an untutored human in order to describe it as not deeply alien? Is anything in the world not deeply alien by this standard?

To re-question: What predictions can I make about the world because LLMs are "deeply alien"?

Are these predictions clear?

When speaking to someone who I consider a noob, is it best to give them terms whose emotive import is clear, but whose predictive import is deeply unclear?

What kind of contexts does this "deeply alien" statement come up in? Are those contexts where people are trying to explain, or to persuade?

If I piled up all the useful terms that I know that help me predict how LLMs behave, would "deeply alien" be an empty term on top of these?

Or would it give me no more predictive value than "many behaviors of an LLM are currently not understood"?

Replies from: habryka4, ryan_greenblatt
comment by habryka (habryka4) · 2024-01-21T23:46:13.277Z · LW(p) · GW(p)

These are a lot of questions, my guess is most of which are rhetorical, so not sure which ones you are actually interested in getting an answer on. Most of the specific questions I would answer with "no", in that they don't seem to capture what I mean by "alien", or feel slightly strawman-ish.

Responding at a high-level: 

  • There are a lot of experiments that seem like they shed light on the degree to which cognition in AI systems is similar to human or animal cognition. Some examples: 
    • Does the base model pass a Turing test?
    • Does the performance distribution of the base model on different tasks match the performance distribution of humans?
    • Does the generalization and learning behavior of the base model match how humans learn things?
      • When trained using RL on things like game-environments (after pre-training on a language corpus), does the system learn at similar rates and plateau at similar skill levels as human players?
  • There are a lot of structural and algorithmic properties that could match up between human and LLM systems: 
    • Do they interface with the world in similar ways?
    • Do they require similar amounts and kinds of data to learn the same relationships?
    • Do the low-level algorithmic properties of how human brains store and process information look similar between the two systems?
  • A lot more stuff, but I am not sure how useful going into a long list here is. At least to me it feels like a real thing, and different observations would change the degree to which I would describe a system as alien.

I think the exact degree of alienness is really interesting and one of the domains where I would like to see more research. 

For example, a bunch of the experiments I would most like to see, that seem helpful with AI Alignment, are centered on better measuring the performance distribution of transformer architectures on tasks that are not primarily human imitation, so that we could better tell which things LLMs have a much easier time learning than humans (currently even if a transformer could relatively easily reach vastly superhuman performance at a task with more specialized training data, due to the structure of the training being oriented around human imitation, observed performance at the task will cluster around human level, but seeing where transformers could reach vast superhuman performance would be quite informative on understanding the degree to which its cognition is alien).

So I don't consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.

Replies from: 1a3orn
comment by 1a3orn · 2024-01-22T13:50:17.893Z · LW(p) · GW(p)

I like a lot of these questions, although some of them give me an uncanny feeling akin to "wow, this is a very different list of uncertainties than I have." I'm sorry that my initial list of questions was aggressive.

So I don't consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.

I'm not sure how they add up to alienness, though? They're about how we're different than models -- whereas the initial claim was that models are psychopathic, amoral, etc. If we say a model is "deeply alien" -- is that just saying it's different than us in lots of ways? I'm cool with that -- but the surplus negative valence involved in "LLMs are like shoggoths" versus "LLMs have very different performance characteristics than humans" seems to me pretty important.

Otherwise, why not say that calculators are alien, or any of the things in existence with different performance curves than we have? Chessbots, etc. If I write a loop in Python to count to 10, the process by which it does so is arguably more different from how I count to ten than the process by which an LLM counts to ten, but we don't call Python alien.
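For concreteness, the kind of loop meant here could be as simple as the following minimal sketch (the function name `count_to` is made up for illustration):

```python
def count_to(n):
    """Return the integers 1..n in the order a plain counting loop visits them."""
    counted = []
    for i in range(1, n + 1):
        counted.append(i)
    return counted


print(count_to(10))  # prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

An integer counter bumped by an interpreter is nothing like how a person counts, and nobody reaches for monster imagery to describe it.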

This feels like reminding an economics student that the market solves things differently than a human -- which is true -- by saying "The market is like Baal."

Do they require similar amounts and kinds of data to learn the same relationships?

There is a fun paper on this you might enjoy. Obviously not a total answer to the question.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-01-22T16:03:42.775Z · LW(p) · GW(p)

The main difference between calculators, weather predictors, markets, and Python versus LLMs is that LLMs can talk to you in a relatively strong sense of "talk". So, by default, people don't have mistaken impressions of the cognitive nature of calculators, markets, and Python, while they might have mistaken impressions about LLMs.

Like it isn't surprising to most people that calculators are quite amoral in their core (why would you even expect morality?). But the claim that the thing which GPT-4 is built out of is quite amoral is non-obvious to people (though obvious to people with slightly more understanding).

I do think there is an important point which is communicated here (though it seems very obvious to people who actually operate in the domain).

Replies from: 1a3orn
comment by 1a3orn · 2024-01-22T16:56:59.739Z · LW(p) · GW(p)

I agree this can be initially surprising to non-experts!

I just think this point about the amorality of LLMs is much better communicated by saying "LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style."

Than to say "LLMs are like alien shoggoths."

Like it's just a better model to give people.

Replies from: lahwran, Benito, ryan_greenblatt, daniel-kokotajlo
comment by the gears to ascension (lahwran) · 2024-01-22T21:04:10.402Z · LW(p) · GW(p)

Agreed, though of course as always, there is the issue that that's an intentional-stance way to describe what a language model does: "they will generally try to continue it in that style." Hence mechinterp, which tries to (heh) move to a mechanical stance, which will likely be something like "when you give them a [whatever] text to continue, it will match [some list of features], which will then activate [some part of the network that we will name later], which implements the style that matches those features".

(incidentally, I think there's some degree to which people who strongly believe that artificial NNs are alien shoggoths are underestimating the degree to which their own brains are also alien shoggoths. but that doesn't make it a good model of either thing. the only reason it was ever an improvement over a previous word was when people had even more misleading intuitive-sketch models.)

comment by Ben Pace (Benito) · 2024-01-22T20:02:03.582Z · LW(p) · GW(p)

LLMs are trained to continue text from an enormous variety of sources

This is a bit of a noob question, but is this true post RLHF? Generally most of my interactions with language models these days (e.g. asking for help with code, asking to explain something I don't understand about history/medicine/etc) don't feel like they're continuing my text, it feels like they're trying to answer my questions politely and well. I feel like "ask shoggoth and see what it comes up with" is a better model for me than "go to the AI and have it continue your text about the problem you have".

Replies from: 1a3orn
comment by 1a3orn · 2024-01-22T20:51:25.732Z · LW(p) · GW(p)

To the best of my knowledge, the majority of research (all the research?) has found that the changes to a LLM's text-continuation abilities from RLHF (or whatever descendant of RLHF is used) are extremely superficial.

So you have one paper, from the abstract:

Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions (i.e., they share the top-ranked tokens). Most distribution shifts occur with stylistic tokens (e.g., discourse markers, safety disclaimers). These direct evidence strongly supports the hypothesis that alignment tuning primarily learns to adopt the language style of AI assistants, and that the knowledge required for answering user queries predominantly comes from the base LLMs themselves.

Or, in short, the LLM is still basically doing the same thing, with a handful of additions to keep it on-track in the desired route from the fine-tuning.
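To make "share the top-ranked tokens" concrete, here is a rough sketch of the kind of comparison the abstract describes. It is not the paper's code: the per-position next-token distributions below are invented toy numbers, and `top_token` and `top1_agreement` are hypothetical helper names.

```python
def top_token(dist):
    """Return the highest-probability token in a {token: probability} dict."""
    return max(dist, key=dist.get)


def top1_agreement(base_dists, tuned_dists):
    """Fraction of positions where both models rank the same token first."""
    if len(base_dists) != len(tuned_dists):
        raise ValueError("both models must be scored on the same positions")
    if not base_dists:
        raise ValueError("need at least one position")
    matches = sum(
        top_token(b) == top_token(t) for b, t in zip(base_dists, tuned_dists)
    )
    return matches / len(base_dists)


# Invented toy distributions for four token positions of one response.
base = [
    {"Sure": 0.2, "The": 0.5, "I": 0.3},
    {"capital": 0.9, "city": 0.1},
    {"is": 0.8, "was": 0.2},
    {"Paris": 0.7, "Lyon": 0.3},
]
tuned = [
    {"Sure": 0.6, "The": 0.3, "I": 0.1},  # stylistic opener shifts
    {"capital": 0.85, "city": 0.15},
    {"is": 0.9, "was": 0.1},
    {"Paris": 0.8, "Lyon": 0.2},
]

print(top1_agreement(base, tuned))  # prints 0.75
```

In this toy case the only disagreement is at the stylistic opening token, which is the pattern the abstract reports for real models.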

(I also think our very strong prior belief should be that LLMs are basically still text-continuation machines, given that 99.9% or so of the compute put into them is training them for this objective, and that neural networks lose plasticity as they learn. Ash and Adams is like a really good intro to this loss of plasticity, although most of the research that cites this is RL-related so people don't realize.)

Similarly, a lot of people have remarked on how the textual quality of the responses from a RLHF'd language model can vary with the textual quality of the question. But of course this makes sense from a text-prediction perspective -- a high-quality answer is more likely to follow a high-quality question in text than it is to follow a low-quality question. This kind of thing -- preceding the model's generation with high-quality text -- was the only way to get high-quality answers from base models -- but it's still there, hidden.

So yeah, I do think this is a much better model for interacting with these things than asking a shoggoth. It actually gives you handles to interact with them better, while asking a shoggoth gives you no such handles.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-01-23T18:15:23.532Z · LW(p) · GW(p)

The people who originally came up with the shoggoth meme, I'd bet, were very well aware of how LLMs are pretrained to predict text and how they are best modelled (at least for now) as trying to predict text. When I first heard the shoggoth meme that's what I thought -- I interpreted it as "it's this alien text-prediction brain that's been retrained ever so slightly to produce helpful chatbot behaviors. But underneath it's still mostly just about text prediction. It's not processing the conversation in the same way that a human would." Mildly relevant: In the Lovecraft canon IIRC Shoggoths are servitor-creatures, they are basically beasts of burden. They aren't really powerful intelligent agents in their own right, they are sculpted by their creators to perform useful tasks. So, for me at least, calling them shoggoth has different and more accurate vibes than, say, calling them Cthulhu. (My understanding of the canon may be wrong though)

comment by ryan_greenblatt · 2024-01-22T18:40:44.798Z · LW(p) · GW(p)

(TBC, I totally agree that object level communication about the exact points seems better all else equal if you can actually do this communication.)

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-01-22T19:25:15.810Z · LW(p) · GW(p)

Hmm, I think that's a red herring though. Consider humans -- most of them have read lots of text from an enormous variety of sources as well. Also while it's true that current LLMs have only a little bit of fine-tuning applied after their pre-training, and so you can maybe argue that they are mostly just trained to predict text, this will be less and less true in the future.


How about "LLMs are like baby alien shoggoths, that instead of being raised in alien culture, we've adopted at birth and are trying to raise in human culture. By having them read the internet all day."

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-01-23T18:05:48.096Z · LW(p) · GW(p)

(Come to think of it, I actually would feel noticeably more hopeful about our prospects for alignment success if we actually were "raising the AGI like we would a child." If we had some interdisciplinary team of ML and neuroscience and child psychology experts that was carefully designing a curriculum for our near-future AGI agents, a curriculum inspired by thoughtful and careful analogies to human childhood, that wouldn't change my overall view dramatically but it would make me noticeably more hopeful. Maybe brain architecture & instincts basically don't matter that much and Blank Slate theory is true enough for our purposes that this will work to produce an agent with values that are in-distribution for the range of typical modern human values!)

Replies from: ryan_greenblatt, TurnTrout
comment by ryan_greenblatt · 2024-01-23T18:39:43.673Z · LW(p) · GW(p)

(This doesn't contradict anything you said, but it seems like we totally don't know how to "raise an AGI like we would a child" with current ML. Like I don't think it counts for very much if almost all of the training time is a massive amount of next-token prediction. Like a curriculum of data might work very differently on AI vs humans due to a vastly different amount of data and a different training objective.)

comment by TurnTrout · 2024-02-05T19:24:37.948Z · LW(p) · GW(p)

I've seen mixed data on how important curricula are for deep learning. One paper (on CIFAR) suggested that curricula only help if you have very few datapoints or the labels are noisy. But possibly that doesn't generalize to LLMs.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-02-05T20:23:24.617Z · LW(p) · GW(p)

I think data ordering basically never matters for LLM pretraining. (As in, random is the best and trying to make the order more specific doesn't help.)

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-05T21:22:02.226Z · LW(p) · GW(p)

That was my impression too.

comment by ryan_greenblatt · 2024-01-22T00:04:05.228Z · LW(p) · GW(p)

Most of my view on "deeply alien" is downstream of LLMs being extremely superhuman at literal next token prediction [LW · GW] and generally superhuman at having an understanding of random details of webtext.

Another component corresponds to a general view that LLMs are trained in a very different way from how humans learn. (Though you could in principle get the same cognition from very different learning processes.)

This does correspond to specific falsifiable predictions.

Despite being pretty confident in "deeply alien" in many respects, it doesn't seem clear to me whether LLMs will in practice have very different relative capability profiles from humans on larger scale downstream tasks we actually care about. (It currently seems like the answer will be "mostly no" from my perspective.)

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-01-23T18:18:56.887Z · LW(p) · GW(p)

In addition to the above, I'd add in some stuff about how blank slate theory seems to be wrong as a matter of human psychology. If evidence comes out tomorrow that actually humans are blank slates to a much greater extent than I realized, so much so that e.g. the difference between human and dog brains is basically just size and training data, I'd be more optimistic that what's going on inside LLMs isn't deeply alien.

comment by TurnTrout · 2024-01-21T22:58:10.796Z · LW(p) · GW(p)

ETA: The following was written more aggressively than I now endorse. 

I think this is revisionism. What's the point of me logging on to this website and saying anything if we can't agree that a literal eldritch horror is optimized to be scary, and meant to be that way? 

The shoggoth here is not particularly exaggerated or scary.

Exaggerated from what? Its usual form as a 15-foot-tall person-eating monster which is covered in eyeballs?

The shoggoth is optimized to be scary, even in its "cute" original form, because it is a literal Lovecraftian horror. Even the word "shoggoth" itself has "AI uprising, scary!" connotations:

At the Mountains of Madness includes a detailed account of the circumstances of the shoggoths' creation by the extraterrestrial Elder Things. Shoggoths were initially used to build the cities of their masters. Though able to "understand" the Elder Things' language, shoggoths had no real consciousness and were controlled through hypnotic suggestion. Over millions of years of existence, some shoggoths mutated, developed independent minds, and rebelled. The Elder Things succeeded in quelling the insurrection, but exterminating the shoggoths was not an option as the Elder Things were dependent on them for labor and had long lost their capacity to create new life. Wikipedia

Let's be very clear. The shoggoth has consistently been viewed in a scary, negative light by many people. Let's hear from the creator @Tetraspace [LW · GW] themselves:

@TetraspaceWest, the meme’s creator, told me in a Twitter message that the Shoggoth “represents something that thinks in a way that humans don’t understand and that’s totally different from the way that humans think.”

Comparing an A.I. language model to a Shoggoth, @TetraspaceWest said, wasn’t necessarily implying that it was evil or sentient, just that its true nature might be unknowable.

I was also thinking about how Lovecraft’s most powerful entities are dangerous — not because they don’t like humans, but because they’re indifferent and their priorities are totally alien to us and don’t involve humans, which is what I think will be true about possible future powerful A.I. NYTimes

It's true that Tetraspace didn't intend the shoggoth to be inherently evil, but that's not what I was alleging. The shoggoth meme communicates, and always has communicated, a sense of danger which is unsupported by substantial evidence. We can keep reading: 

it reinforces the notion that what’s happening in A.I. today feels, to some of its participants, more like an act of summoning than a software development process. They are creating the blobby, alien Shoggoths, making them bigger and more powerful, and hoping that there are enough smiley faces to cover the scary parts.

...

That some A.I. insiders refer to their creations as Lovecraftian horrors, even as a joke, is unusual by historical standards

The origin of the shoggoth:

Astounding Stories - February 1936 (Street & Smith) - "At the Mountains of Madness" by H. P. Lovecraft. Artist Howard V. Brown, 1936

In the story, shoggoths rise up against the Old Ones in a series of slave revolts that surely contribute to the collapse of the Old Ones’ society, Joshi notes. The AI anxiety that inspired comparisons to the cartoon monster image certainly resonates with the ultimate fate of that society. CNBC


It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass

These are a lot of words with anthropomorphic connotation. The models exhibit "alien" behavior and yet you make human-like inferences about their internals. E.g. "Deeply psychopathic." I think you're drawing a bunch of unwarranted inferences with undue negative connotations.

Your picture doesn't get any of that across.

My point wasn't that we should use the "alternative." The point was that both images are stupid[1] and (in many places) unsupported by evidence, but that LW-folk would be much more willing to criticize the friendly-looking one while making excuses for the scary-looking one. (And I think your comment here resolves my prediction to "correct.")

I think the Shoggoth meme is pretty good pedagogically. It captures a pretty obvious truth, which is that base models are really quite alien to interface with, that we know that RLHF probably does not change the underlying model very much, but that as a result we get a model that does have a human interface and feels pretty human to interface with (but probably still performs deeply alien cognition behind the scenes). 

I agree these are strengths, and said so in my original comment. But also as @cfoster0 [LW · GW] said [LW(p) · GW(p)]: 

As far as I can tell, the shoggoth analogy just has high memetic fitness. It doesn't contain any particular insight about the nature of LLMs. No need to twist ourselves into a pretzel trying to backwards-rationalize it into something deep.

  1. ^

    To clarify, I don't mean to belittle @Tetraspace [LW · GW] for making the meme. Good fun is good fun. I mean "stupid" more like "how the images influence one's beliefs about actual LLM friendliness." But I expressed it poorly.

Replies from: habryka4, daniel-kokotajlo, tailcalled, Jozdien
comment by habryka (habryka4) · 2024-01-21T23:22:16.292Z · LW(p) · GW(p)

The point was that both images are stupid and (in many places) unsupported by evidence, but that LW-folk would be much more willing to criticize the friendly-looking one while making excuses for the scary-looking one. (And I think your comment here resolves my prediction to "correct.")

(This is too gotcha shaped for me, so I am bowing out of this conversation)

I think I communicated my core point. I think it's a good image that gets an important insight across, and don't think it's "propaganda" in the relevant sense of the term. Of course anything that's memetically adaptive will have some edge-cases that don't match perfectly, but I am getting a good amount of mileage out of calling LLMs "Shoggoths" in my own thinking and think that belief is paying good rent.

If you disagree with the underlying cognition being accurately described as alien, I can have that conversation, since it seems like maybe the underlying crux, but your response above seems like it's taking it as a given that I am "making excuses", and is doing a gotcha-thing which makes it hard for me to see a way to engage without further having my statements be taken as confirmation of some social narrative. 

Replies from: TurnTrout
comment by TurnTrout · 2024-02-05T19:32:10.285Z · LW(p) · GW(p)

In retrospect, I do wish I had written my comment less aggressively, so my apologies on that front! I wish I'd instead written things like "I think I made some obviously correct narrow points about the shoggoth having at least some undue negative connotations, and I wish we could agree on at least that. I feel frustrated because it seems like it's hard to reach agreement even on relatively simple propositions."


I do agree that LLMs probably have substantially different internal mechanisms than people. That isn't the crux. I just wish this were communicated in a more neutral way. In an alternate timeline, maybe this meme instead consisted of a strange tangle of wires and mist and question-marks with a mask on. I'd be more on-board with that. 

Again, I agree that the Shoggoth meme can cure people of some real confusions! And I don't think the meme has a huge impact, I just think it's moderate evidence of some community failures I worry about.


I think a lot of my position is summarized by 1a3orn:

I just think this point about the amorality of LLMs is much better communicated by saying "LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to[1] continue it in that style."

Than to say "LLMs are like alien shoggoths."

Like it's just a better model to give people.

 

  1. ^

    Although I do think this contains some unnecessary intentional stance usage.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-01-23T19:47:48.590Z · LW(p) · GW(p)

fwiw I agree with the quotes from Tetraspace you gave, and disagree with "has communicated a sense of danger which is unsupported by substantial evidence." The sense of danger is very much supported by the current state of evidence.

That said, I agree that the more detailed image is kinda distastefully propagandistic in a way that the original cutesy shoggoth image is not. I feel like the more detailed image adds in an extra layer of revoltingness and scariness (e.g. the sharp teeth) than would be appropriate given our state of knowledge.

Replies from: daniel-kokotajlo, mike_hawke
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-06T16:01:34.485Z · LW(p) · GW(p)

re: "the sense of danger is very much supported by the current state of evidence" -- I mean, you've heard all this stuff before, but I'll summarize:

--Seems like we are on track to probably build AGI this decade
--Seems like we are on track to have an intelligence explosion, i.e. a speedup of AI R&D due to automation
--Seems like the AGI paradigm that'll be driving all this is fairly opaque and poorly understood. We have scaling laws for things like text perplexity but other than that we are struggling to predict capabilities, and double-struggling to predict inner mechanisms / 'internal' high-level properties like 'what if anything does it actually believe or want'
--A bunch of experts in the field have come out and said that this could go terribly & we could lose control, even though it's low-status to say this & took courage.
--Generally speaking the people who have thought about it the most are the most worried; the most detailed models of what the internal properties might be like are the most gloomy, etc. This might be due to selection/founder effects, but sheesh, it's not exactly good news!

comment by mike_hawke · 2024-02-13T23:52:29.168Z · LW(p) · GW(p)

I feel like the more detailed image adds in an extra layer of revoltingness and scariness (e.g. the sharp teeth) than would be appropriate given our state of knowledge.


Now I'm really curious to know what would justify the teeth. I'm not aware of any AIs intentionally biting someone, but presumably that would be sufficient.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-14T00:18:28.102Z · LW(p) · GW(p)

Perhaps if we were dealing not with deepnets primarily trained to predict text, but rather deepnets primarily trained to addict people with pleasant seductive conversation and then drain their wallets? Such an AI would in some real sense be an evolved predator of humans.

comment by tailcalled · 2024-01-22T21:55:21.007Z · LW(p) · GW(p)

I think one mindset that may be healthy is to remember:

Reality is too complex to be described well by a single idea (meme/etc.). If one responds to this by forcing each idea presented to be as good an approximation of reality as possible, then that causes all the ideas to become "colorless and blurry", as any specific detail would be biased when considered on its own.

Therefore, one cannot really fight about whether an idea is biased in isolation. Rather, the goal should be to create a bag of ideas which in totality is as informative about a subject as possible.

I think you are basically right that the shoggoth meme is describing one of the most negative projections of what LLMs could be doing. One approach is to try to come up with a single projection and try to convince everyone else to use this instead. I'm not super comfortable with that either because I feel like there's a lot of uncertainty about what the most productive way to think about LLMs is, and I would like to keep options open.

Instead I'd rather have a collection of different ways to think about it (you could think of this collection as a discrete approximation to a probability distribution). Such a collection would have many uses, e.g. as a checklist or a reference to guide people to.

It does seem problematic for the rationalist community to refuse to acknowledge that the shoggoth meme presents LLMs as being scary monsters, but it also seems problematic to insist that the shoggoth meme exaggerates the danger of LLMs, because that should be classified based on P(meme danger > actual danger), rather than on the basis of meme danger > E[actual danger], as in, if there's significant uncertainty about how actually dangerous LLMs are then there's also significant uncertainty about whether the memes exaggerate the danger; one shouldn't just compare against a single point estimate.
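As a toy illustration of that distinction (the numbers below are made up purely for the example and are not estimates of anything): suppose actual danger is 1 with probability 0.4 and 10 with probability 0.6, and the meme portrays a danger level of 7. A minimal Python sketch, with hypothetical helper names:

```python
from fractions import Fraction

def expected_danger(actual_dist):
    """E[actual danger] for a discrete distribution {danger_level: probability}."""
    return sum(danger * p for danger, p in actual_dist.items())

def prob_meme_exceeds(meme_danger, actual_dist):
    """P(meme danger > actual danger) under the same distribution."""
    return sum(p for danger, p in actual_dist.items() if meme_danger > danger)

# Made-up numbers, chosen only to show that the two criteria can disagree.
actual = {1: Fraction(2, 5), 10: Fraction(3, 5)}
meme = 7

mean = expected_danger(actual)                 # Fraction(32, 5), i.e. 6.4
p_exaggerates = prob_meme_exceeds(meme, actual)  # Fraction(2, 5), i.e. 0.4

print(meme > mean)                      # point-estimate test: meme "exaggerates"
print(p_exaggerates > Fraction(1, 2))   # probabilistic test: it probably doesn't
```

In this toy case the meme sits above the expected danger (6.4) while still being more likely than not to understate the actual danger, so the two criteria give opposite answers.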

comment by Jozdien · 2024-01-22T00:41:01.358Z · LW(p) · GW(p)

I think this just comes down to personal taste on how you're interpreting the image? I find the original shoggoth image cute enough that I use it as my primary discord message reacts. My emotional reaction to it to the best of my introspection has always been "weird alien structure and form" or "awe-inspiringly powerful and incomprehensible mind" and less "horrifying and scary monster". I'm guessing this is the case for the vast majority of people I personally know who make said memes. It's entirely possible the meme has the end-consequence of appealing to other people for the reasons you mention, but then I think it's important to make that distinction.

comment by RogerDearnaley (roger-d-1) · 2024-01-20T20:28:31.554Z · LW(p) · GW(p)

It is true that base models, especially smaller ones, are somewhat creepy to talk to (especially because their small context window makes them forgetful). I'm not sure I'd describe them as "very alien", they're more "uncanny valley" where they often make sense and seem human-like, until suddenly they don't. (On theoretical grounds, I think they're using rather non-human means of cognition to attempt to model human writing patterns as closely as they can, they often get this right, but on occasion make very non-human errors — more frequently for smaller models.) The Shoggoth mental metaphor exaggerates this somewhat for effect (and more so for the very scary image Alex posted at the top, which I haven't seen used as often as the one Oliver posted).

This is one of the reasons why Quintin and I proposed a more detailed and somewhat less scary/alien (but still creepy) metaphor: Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor [AF · GW]. I'd be interested to know what people think of that one in comparison to the Shoggoth — we were attempting to be more unbiased, as well as more detailed.

comment by Akash (akash-wasil) · 2024-01-20T11:56:27.137Z · LW(p) · GW(p)

More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I sort of see you taking on some sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.

A few thoughts (written quickly, prioritizing speed over precision):

  1. I think that epistemics are hard & there are surely several cases in which people are biased toward high P(doom). Examples: Yudkowsky was one of the first thinkers/writers about AI, some people might have emotional dispositions that lead them toward anxious/negative interpretations in general, some people find it "cool" to think they're one of the few people who are able to accurately identify the world is ending, etc.
  2. I also think that there are plenty of factors biasing epistemics in the "hopeful" direction. Examples: The AI labs have tons of money and status (& employ large fractions of the community's talent), some people might have emotional dispositions that lead them toward overly optimistic/rosy interpretations in general, some people might find it psychologically difficult to accept premises that lead them to think the world is ending, etc.
  3. My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the "high P(doom)" side. 
  4. I also think there's an important distinction between "I personally think this argument is wrong" and "look, here's an example of propaganda + poor community epistemics." In general, I suspect community epistemics are better when people tend to respond directly to object-level points and have a relatively high bar for saying "not only do I think you're wrong, but also here are some ways in which you and your allies have poor epistemics." (IDK though, insofar as you actually believe that's what's happening, it seems good to say aloud, and I think there's a version of this that goes too far and polices speech counterproductively, but I do think that statements like "community epistemics have been compromised by groupthink and fear" are pretty unproductive and could be met with statements like "community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress.")
  5. I am quite worried about tribal dynamics reducing the ability for people to engage in productive truth-seeking discussions. I think you've pointed out how some of the stylistic/tonal things from the "high P(doom)//alignment hard" side have historically made discourse harder, and I agree with several of your critiques. More recently, though, I think that the "low P(doom)//alignment not hard" side seems to be falling into similar traps (e.g., attacking strawmen of those they disagree with, engaging in some sort of "ha, the other side is not only wrong but also just dumb/unreasonable/epistemically corrupted" vibe that predictably makes people defensive & makes discourse harder).
Replies from: ryan_greenblatt, TurnTrout
comment by ryan_greenblatt · 2024-01-20T18:50:02.342Z · LW(p) · GW(p)

See also "Other people are wrong" vs "I am right" [LW · GW], reversed stupidity is not intelligence [? · GW], and the cowpox of doubt.

My guess is that it's relatively epistemically corrupting and problematic to spend a lot of time engaging with weak arguments.

I think it's tempting to make the mistake of thinking that debunking a specific (bad) argument is the same as debunking a conclusion. But actually, these are extremely different operations. One requires understanding a specific argument while the other requires level-headed investigation of the overall situation. Separately, there are often actually good intuitions underlying bad arguments and recovering this intuition is an important part of truth seeking.

I think my concerns here probably apply to a wide variety of people thinking about AI x-risk. I worry about this for myself.

comment by TurnTrout · 2024-01-20T18:09:28.855Z · LW(p) · GW(p)

Thanks for this, I really appreciate this comment (though my perspective is different on many points).

My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the "high P(doom)" side. 

It's true that I spend more effort critiquing bad doom arguments. I would like to note that when e.g. I read Quintin I generally am either in agreement or neutral. I bet there are a lot of cases where you would think "that's a poor argument" and I'd say "hm I don't think Akash is getting the point (and it'd be good if someone could give a better explanation)." 

However, it's definitely not true that I never critique optimistic arguments which I consider poor. For example, I don't get why Quintin (apparently) thinks that spectral bias is a reason for optimism, and I've said as much on one of his posts. I've said something like "I don't know why you seem to think you can use this mathematical inductive bias to make high-level intuitive claims about what gets learned. This seems to fall into the same trap that 'simplicity' theorizing does." I probably criticize or express skepticism of certain optimistic arguments at least twice a week, though not always on public channels. And I've also pushed back on people being unfair, mean, or mocking of "doomers" on private channels.

I do think that statements like "community epistemics have been compromised by groupthink and fear" are pretty unproductive and could be met with statements like "community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress." 

I think both statements are true to varying degrees (the former more than the latter in the cases I'm considering). They're true and people should say them. The fact that I work at a lab absolutely affects my epistemics (though I think the effect is currently small). People should totally consider the effect which labs are having on discourse. 

have a relatively high bar for saying "not only do I think you're wrong, but also here are some ways in which you and your allies have poor epistemics."

I do consider myself to have a high bar for this, and the bar keeps getting passed, so I say something. EDIT: Though I don't mean for my comments to imply "someone and their allies" have bad epistemics. Ideally I'd like to communicate "hey, something weird is in the air guys, can't you sense it too?". However, I think I'm often more annoyed than that, and so I don't communicate that how I'd like.

comment by Akash (akash-wasil) · 2024-01-20T11:27:06.191Z · LW(p) · GW(p)

My impression is that the Shoggoth meme was meant to be a simple meme that says "hey, you might think that RLHF 'actually' makes models do what we value, but that's not true. You're still left with an alien creature who you don't understand and could be quite scary."

Most of the Shoggoth memes I've seen look more like this, where the disgusting/evil aspects are toned down. They depict an alien that kinda looks like an octopus. I do agree that the picture evokes some sort of "I should be scared/concerned" reaction. But I don't think it does so in a "see, AI will definitely be evil" way; it does so in a "look, RLHF just adds a smiley face to a foreign alien thing. And yeah, it's pretty reasonable to be scared about this foreign alien thing that we don't understand."

To be a bit bolder, I think Shoggoth is reacting to the fact that RLHF gives off a misleading impression of how safe AI is. If I were to use provocative phrasing, I could say that RLHF serves as "propaganda". Let's put aside the fact that you and I might disagree about how much "true evidence" RLHF provides RE how easy alignment will be. It seems pretty clear to me that RLHF [and the subsequent deployment of RLHF'd models] spreads an overly-rosy "meme" that gives people a misleading perspective of how well we understand AI systems, how safe AI progress is, etc.

From this lens, I see Shoggoth as a counter-meme. It basically says "hey look, the default is for people to think that these things are friendly assistants, because that's what the AI companies have turned them into, but we should remember that actually we are quite confused about the alien cognition behind the RLHF smiley face."

comment by kave · 2024-01-20T19:41:32.480Z · LW(p) · GW(p)

Some quotes from the wiki article on Shoggoths:

Being amorphous, shoggoths can take on any shape needed, making them very versatile within aquatic environments.

 

At the Mountains of Madness includes a detailed account of the circumstances of the shoggoths' creation by the extraterrestrial Elder Things. Shoggoths were initially used to build the cities of their masters. Though able to "understand" the Elder Things' language, shoggoths had no real consciousness and were controlled through hypnotic suggestion. Over millions of years of existence, some shoggoths mutated, developed independent minds, and rebelled.

Quoting because (a) a lot of these features seem like an unusually good match for LLMs and (b) acknowledging that this is picking a metaphor that fictionally rebelled, and thus is potentially alignment-is-hard loaded as a metaphor.

comment by DanielFilan · 2024-01-20T22:58:50.683Z · LW(p) · GW(p)

I'm confident that if there were a "pro-AI" meme with a friendly-looking base model, LW / the shoggoth enjoyers would have nitpicked the friendly meme-creature to hell. They would (correctly) point out "hey, we don't actually know how these things work; we don't know them to be friendly, or what they even 'want' (if anything); we don't actually know what each stage of training does..."

I'm sure that nothing bad will happen to me if I slap this (friendly AI meme) on my laptop, right? I'll be able to think perfectly neutrally about whether AI will be friendly.

I have multiple cute AI stickers on my laptop, one of which is a shoggoth meme. Here is a picture of them. Nobody has ever nitpicked their friendly appearance to me. I don't think they have distorted my thinking about AI in favour of thinking that it will be friendly (altho I think it was after I put them on that I became convinced by a comment by Paul Christiano that there's ~even odds that unaligned AI wouldn't kill me, so do with that information what you will).

Replies from: TurnTrout
comment by TurnTrout · 2024-01-21T22:26:40.916Z · LW(p) · GW(p)

I think that "cute" image is still implying AI is dangerous and monsterlike? Can you show the others?

Replies from: DanielFilan
comment by DanielFilan · 2024-01-22T00:20:30.865Z · LW(p) · GW(p)

The other is the friendly robot waving hello just underneath.

comment by Sam Marks (samuel-marks) · 2024-01-20T02:29:45.864Z · LW(p) · GW(p)

Thanks, I've disliked the shoggoth meme for a while, and this post does a better job articulating why than I've been able to do myself.

comment by Vladimir_Nesov · 2024-01-20T11:36:16.566Z · LW(p) · GW(p)

A bad map that expresses the territory with great uncertainty can be confidently called a bad map; calling it a good map is clearly wrong. In that sense the shoggoth imagery reflects the quality of the map, and as it's clearly a bad map, better imagery would be misleading about the map's quality. Even if the underlying territory is lovely, this isn't known, unlike the disastrous quality of the map of the territory, whose lack of quality is known with much more confidence and in much greater detail. Here be dragons.

(This is one aspect of the meme where it seems appropriate. Some artist's renditions, including the one you used, channel LeCake, which your alternative image example loses, but obviously the cake is nicer than the shoggoth.)

comment by green_leaf · 2024-01-20T15:20:51.427Z · LW(p) · GW(p)

However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It's at this point that I'd like to point out the simple, obvious fact that "we don't actually know how these models work, and we definitely don't know that they're creepy and dangerous on the inside."

It's optimized to illustrate the point that the neural network isn't trained to actually care about what the person training it thinks it came to care about, it's only optimized to act that way on the training distribution. Unless I'm missing something, arguing the image is wrong would be equivalent to arguing that maybe the model truly cares about what its human trainers want it to care about. (Which we know isn't actually the case.)

comment by tailcalled · 2024-01-20T12:09:26.807Z · LW(p) · GW(p)

Nice point.

Though I'm almost tempted to think of LLMs as being like people who are LARPing or who have impostor syndrome [LW · GW]. As in, they spend pretty much all their cognitive capacity on obsessing over doing what they feel looks normal. (This also closely aligns with how they are trained: first they are made to mimic what other people do, and then they are made to mimic what gets praise and avoid what gets critique.) Probably humanizes them even more than your friendly creature proposal.

This sounds somewhat similar to deceptive alignment, so I want to draw a distinction here: It's not that LARPers/impostors are trying to maximize approval in a consequentialist sense (as this would require modelling how their actions ripple out into the world, which they do not do), but rather that (in the sense described by shard theory) they are molded based on normality and approval. As such they would not do something abnormal/disapproved-of in order to look more normal/approval-worthy later.

comment by TurnTrout · 2023-11-14T21:55:06.165Z · LW(p) · GW(p)

Another point for feature universality. Subtle adversarial image manipulations influence both human and machine perception:

... we find that adversarial perturbations that fool ANNs similarly bias human choice. We further show that the effect is more likely driven by higher-order statistics of natural images to which both humans and ANNs are sensitive, rather than by the detailed architecture of the ANN.

Replies from: D0TheMath, Throwaway2367
comment by Garrett Baker (D0TheMath) · 2023-11-14T22:06:43.018Z · LW(p) · GW(p)

I've seen this; their examples don't seem so subtle to me compared with alternatives.

For example,

 

You can clearly see a cat in the center of the left image!

Replies from: TurnTrout
comment by TurnTrout · 2023-11-14T22:36:49.582Z · LW(p) · GW(p)

I mostly... can just barely see an ear above the train if I look, after being told to look there. I don't think it's "clear." I also note that these are black-box attacks on humans which originated from ANNs; these are transferred attacks from eg a CNN.

Replies from: faul_sname, D0TheMath
comment by faul_sname · 2023-11-15T04:29:17.540Z · LW(p) · GW(p)

I can definitely see the cat.

Here, does this version help you see the cat? The ear is where you noted, and the left (from your perspective) eye is made up of the leftmost wiper blade plus a diagonal line at the top-left side of the left window that did not exist in the original image, and then the mouth is directly above the staircase at the top of the black horizontal stripe.

To make that image, I converted both the left and right images to HSL, took the difference in luminance between the two, multiplied by 2.5, and applied that to the control image to yield the new "cat" image.
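For anyone who wants to try something similar, here is a rough sketch of the transform as described, using only Python's standard `colorsys` module (which orders the channels as HLS). This is not the code that was actually used: the function name, the nested-list image representation, the clamping to [0, 1], and the direction of the difference (perturbed minus control) are all assumptions.

```python
import colorsys

def amplify_luminance_diff(control, perturbed, gain=2.5):
    """Shift each control pixel's lightness by gain * (perturbed L - control L).

    Both images are lists of rows of (r, g, b) tuples with channels in [0, 1]
    and must have the same dimensions. Hue and saturation come from the
    control image; only lightness is changed.
    """
    out = []
    for row_c, row_p in zip(control, perturbed):
        out_row = []
        for (rc, gc, bc), (rp, gp, bp) in zip(row_c, row_p):
            h, l_c, s = colorsys.rgb_to_hls(rc, gc, bc)
            _, l_p, _ = colorsys.rgb_to_hls(rp, gp, bp)
            # Amplify the lightness difference, then clamp to the valid range.
            l_new = min(1.0, max(0.0, l_c + gain * (l_p - l_c)))
            out_row.append(colorsys.hls_to_rgb(h, l_new, s))
        out.append(out_row)
    return out

# Tiny 1x2 example image; real use would load pixel data from image files.
control = [[(0.4, 0.4, 0.4), (0.2, 0.4, 0.6)]]
perturbed = [[(0.5, 0.5, 0.5), (0.2, 0.4, 0.6)]]
exaggerated = amplify_luminance_diff(control, perturbed, gain=2.5)
```

With `gain=0` this returns the control image (up to rounding) and with `gain=1` it gives the control image carrying the perturbed image's lightness, so sweeping `gain` interpolates and then extrapolates the perturbation.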

If you still don't see it, below is an animated version of that transform alternating back and forth between 0x and 3x on that parameter.

 

You can also try it out interactively here

Replies from: gwern
comment by gwern · 2023-11-15T23:12:49.083Z · LW(p) · GW(p)

The 'cat' here is the pair of white blobs on the crown of the train's forehead, right? I have to say, no matter how I stare at it, it never looks 'cat' like to me. It only looks spider-like, a little.

Replies from: faul_sname
comment by faul_sname · 2023-11-16T00:03:26.793Z · LW(p) · GW(p)

Original for reference:

Replies from: D0TheMath, gwern
comment by Garrett Baker (D0TheMath) · 2023-11-16T00:36:17.663Z · LW(p) · GW(p)

To my eyes, the original image makes the cat clearer than your modifications.

Replies from: faul_sname
comment by faul_sname · 2023-11-16T00:45:00.888Z · LW(p) · GW(p)

What can I say, I'm a programmer, not an artist :)

comment by gwern · 2023-12-10T02:40:25.981Z · LW(p) · GW(p)

Oh, that's very surprising to me. I didn't ever see the slightest hint of that left ear you draw there. You really see an ear there in the random bit of dark foliage? Huh.

Replies from: faul_sname, tslarm
comment by faul_sname · 2023-12-10T09:34:37.167Z · LW(p) · GW(p)

Yeah, quite clearly. It looks like the ear of this cat, roughly. The eye shape is pretty similar too.

In addition, the tree above the train to the left turns into a blob with leopard spot patterns like this, and in the upper right foliage there is a fur-like texture with tiger stripes.

Anyway, I now have better drawing tools. Here is what I see, done by tracing over the picture in a separate layer:

comment by tslarm · 2023-12-10T03:15:56.795Z · LW(p) · GW(p)

I'm assuming you're talking about our left, because you mentioned 'dark foliage'. If so, that's probably the most obvious part of the cat to me. But I find it much easier to see when I zoom in/enlarge the image, and I think I missed it entirely when I first saw the image (at 1x zoom). I suspect the screen you're viewing it on can also make a difference; for me the ear becomes much more obvious when I turn the brightness up or the contrast down. (I'm tweaking the image rather than my monitor settings, but I reckon the effect is similar.)

comment by Garrett Baker (D0TheMath) · 2023-11-14T22:48:21.816Z · LW(p) · GW(p)

In the discord I saw this dropped in, at least 3 regulars immediately spotted the cat shape.

comment by TurnTrout · 2023-07-31T17:37:44.477Z · LW(p) · GW(p)

Consider what update equations have to say about "training game" scenarios. In PPO, the optimization objective is proportional to the advantage given a policy