Many arguments for AI x-risk are wrong

post by TurnTrout · 2024-03-05T02:31:00.990Z · LW · GW · 76 comments

Contents

  Tracing back historical arguments
  Many arguments for doom are wrong
  The counting argument for AI “scheming” provides ~0 evidence
    The counting argument for extreme overfitting
    Recovering the counting argument?
    The counting argument doesn't count
  Other clusters of mistakes
  Conclusion

The following is a lightly edited version of a memo I wrote for a retreat. It was inspired by a draft of Counting arguments provide no evidence for AI doom [LW · GW]. I think that my post covers important points not made by the published version of that post.

I'm also thankful for the dozens of interesting conversations and comments at the retreat.

I think that the AI alignment field is partially founded on fundamentally confused ideas. I’m worried about this because, right now, a range of lobbyists and concerned activists and researchers are in Washington making policy asks. Some of these policy proposals seem to be based on erroneous or unsound arguments.[1]

The most important takeaway from this essay is that the (prominent) counting arguments for “deceptively aligned” or “scheming” AI provide ~0 evidence that pretraining + RLHF will eventually become intrinsically unsafe. That is, they provide ~0 evidence that even if we don't train AIs to achieve goals, the AIs will be "deceptively aligned" anyway. This has important policy implications.


Disclaimers:

  1. I am not putting forward a positive argument for alignment being easy. I am pointing out the invalidity of existing arguments, and explaining the implications of rolling back those updates.

  2. I am not saying "we don't know how deep learning works, so you can't prove it'll be bad." I'm saying "many arguments for deep learning -> doom are weak. I undid those updates and am now more optimistic."

  3. I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it's not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2]

Tracing back historical arguments

In the next section, I'll discuss the counting argument. In this one, I want to demonstrate how often foundational alignment texts make crucial errors. Nick Bostrom's Superintelligence, for example:

A range of different methods can be used to solve “reinforcement-learning problems,” but they typically involve creating a system that seeks to maximize a reward signal. This has an inherent tendency to produce the wireheading failure mode when the system becomes more intelligent. Reinforcement learning therefore looks unpromising. (p.253)

To be blunt, this is nonsense. I have long meditated on the nature of "reward functions" during my PhD in RL theory. In the most useful and modern RL approaches, "reward" is a tool used to control the strength of parameter updates to the network. [LW · GW][3] It is simply not true that "[RL approaches] typically involve creating a system that seeks to maximize a reward signal." There is not a single case where we have used RL to train an artificial system which intentionally “seeks to maximize” reward.[4] Bostrom spends a few pages making this mistake at great length.[5]
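To make the "reward controls update strength" point concrete, here is a minimal REINFORCE-style policy-gradient sketch (illustrative only; this is not the actual PPO+Adam equations, and the toy network, shapes, and numbers are made up):

```python
import torch

# Minimal illustrative sketch of a policy-gradient update (REINFORCE-style,
# not full PPO+Adam). The point: the scalar "reward"/advantage just scales
# how strongly each (state, action) pair's log-probability is pushed up.

policy = torch.nn.Linear(4, 2)           # toy policy: 4-dim state -> 2 action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update(states, actions, advantages):
    """states: (N, 4), actions: (N,), advantages: (N,) scalar weights."""
    logits = policy(states)
    logp = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Each datapoint's gradient is multiplied by its advantage. A large
    # advantage acts like a large per-example learning rate; a negative
    # advantage pushes that action's probability down. Nothing here builds
    # an internal representation of "reward" for the policy to pursue.
    loss = -(advantages.detach() * logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# toy usage
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
advantages = torch.randn(8)              # in PPO these come from GAE, not raw reward
update(states, actions, advantages)
```

(As footnote 6 notes, real PPO scales updates by the advantage rather than the raw reward; the sketch follows that convention.)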

After making a false claim, Bostrom goes on to dismiss RL approaches to creating useful, intelligent, aligned systems. But, as a point of further fact, RL approaches constitute humanity's current best tools for aligning AI systems today! Those approaches are pretty awesome. No RLHF, then no GPT-4 (as we know it).

In arguably the foundational technical AI alignment text, Bostrom makes a deeply confused and false claim, and then perfectly anti-predicts what alignment techniques are promising.

I'm not trying to rag on Bostrom personally for making this mistake. Foundational texts, ahead of their time, are going to get some things wrong. But that doesn't save us from the subsequent errors which avalanche from this kind of early mistake. These deep errors have costs measured in tens of thousands of researcher-hours. Due to the “RL->reward maximizing” meme, I personally misdirected thousands of hours [LW(p) · GW(p)] on proving power-seeking theorems.

Unsurprisingly, if you have a lot of people speculating for years using confused ideas and incorrect assumptions, and they come up with a bunch of speculative problems to work on… If you later try to adapt those confused “problems” to the deep learning era, you’re in for a bad time. Even if you, dear reader, don’t agree with the original people (i.e. MIRI and Bostrom), and even if you aren’t presently working on the same things… The confusion has probably influenced what you’re working on.

I think that’s why some people take “scheming AIs [LW · GW]/deceptive alignment” so seriously, even though some of the technical arguments are unfounded.

Many arguments for doom are wrong

Let me start by saying what existential vectors I am worried about:

  1. I’m worried about people turning AIs into agentic systems using scaffolding and other tricks, and then instructing the systems to complete large-scale projects.

  2. I’m worried about competitive pressure to automate decision-making in the economy.

  3. I’m worried about misuse of AI by state actors.

  4. I’m worried about centralization of power and wealth in opaque non-human decision-making systems, and those who own the systems.[7]

I maintain that there isn’t good evidence/argumentation for threat models like “future LLMs will autonomously constitute an existential risk, even without being prompted towards a large-scale task.” These threat models seem fairly pervasive, and so I will argue against them.

There are a million different arguments for doom. I can’t address them all, but I think most are wrong and am happy to dismantle any particular argument (in person; I do not commit to replying to comments here).

Much of my position is summarized by my review [LW(p) · GW(p)] of Yudkowsky’s AGI Ruin: A List of Lethalities [LW · GW]:

Reading this post made me more optimistic about alignment and AI [LW(p) · GW(p)]. My suspension of disbelief snapped; I realized how vague and bad a lot of these "classic" alignment arguments are, and how many of them are secretly vague analogies [LW(p) · GW(p)] and intuitions about evolution.

While I agree with a few points on this list, I think this list is fundamentally misguided. The list is written in a language which assigns short encodings to confused and incorrect ideas [LW · GW]. I think a person who tries to deeply internalize this post's worldview will end up more confused about alignment and AI…

I think this piece is not "overconfident", because "overconfident" suggests that Lethalities is simply assigning extreme credences to reasonable questions (like "is deceptive alignment the default?"). Rather, I think both its predictions and questions are not reasonable because they are not located by good evidence or arguments. (Example: I think that deceptive alignment is only supported by flimsy arguments [LW(p) · GW(p)].)

In this essay, I'll address some of the arguments for “deceptive alignment” or “AI scheming.” And then I’m going to bullet-point a few other clusters of mistakes.

The counting argument for AI “scheming” provides ~0 evidence

Nora Belrose and Quintin Pope have an excellent upcoming post [LW · GW] which they have given me permission to quote at length. I have lightly edited the following:

Most AI doom scenarios posit that future AIs will engage in scheming— planning to escape, gain power, and pursue ulterior motives while deceiving us into thinking they are aligned with our interests. The worry is that if a schemer escapes, it may seek world domination to ensure humans do not interfere with its plans, whatever they may be.

In this essay, we debunk the counting argument— a primary reason to think AIs might become schemers, according to a recent report by AI safety researcher Joe Carlsmith. It’s premised on the idea that schemers can have “a wide variety of goals,” while the motivations of a non-schemer must be benign or are otherwise more constrained. Since there are “more” possible schemers than non-schemers, the argument goes, we should expect training to produce schemers most of the time. In Carlsmith’s words:

  1. The non-schemer model classes, here, require fairly specific goals in order to get high reward.
  2. By contrast, the schemer model class is compatible with a very wide range of (beyond episode) goals, while still getting high reward…
  3. In this sense, there are “more” schemers that get high reward than there are non-schemers that do so.
  4. So, other things equal, we should expect SGD to select a schemer. — Scheming AIs, page 17

We begin our critique by presenting a structurally identical counting argument for the obviously false conclusion that neural networks should always memorize their training data, while failing to generalize to unseen data. Since the “generalization is impossible” argument actually has stronger premises than those of the original “schemer” counting argument, this shows that naive counting arguments are generally unsound in this domain.

We then diagnose the problem with both counting arguments: they are counting the wrong things.

The counting argument for extreme overfitting

The inference from “there are ‘more’ models with property X than without X” to “SGD likely produces a model with property X” clearly does not work in general. To see this, consider the structurally identical argument:

  1. Neural networks must implement fairly specific functions in order to generalize beyond their training data.
  2. By contrast, networks that overfit to the training set are free to do almost anything on unseen data points.
  3. In this sense, there are “more” models that overfit than models that generalize.
  4. So, other things equal, we should expect SGD to select a model that overfits.

This argument isn’t a mere hypothetical. Prior to the rise of deep learning, a common assumption was that models with [lots of parameters] would be doomed to overfit their training data. The popular 2006 textbook Pattern Recognition and Machine Learning uses a simple example from polynomial regression: there are infinitely many polynomials of order equal to or greater than the number of data points which interpolate the training data perfectly, and “almost all” such polynomials are terrible at extrapolating to unseen points.

Let’s see what the overfitting argument predicts in a simple real-world example from Caballero et al. (2022), where a neural network is trained to solve 4-digit addition problems. There are 10,000^2 = 100,000,000 possible pairs of input numbers and 19,999 possible sums, for a total of 19,999^100,000,000 possible input-output mappings. The training dataset covers only a minuscule fraction of those inputs, so there are astronomically many functions that achieve perfect training accuracy, and the proportion with greater than 50% test accuracy is literally too small to compute using standard high-precision math tools. Hence, this counting argument predicts virtually all networks trained on this problem should massively overfit—contradicting the empirical result that networks do generalize to the test set.

We are not just comparing “counting schemers” to another similar-seeming argument (“counting memorizers”). The arguments not only have the same logical structure, but they also share the same mechanism: “Because most functions have property X, SGD will find something with X.” Therefore, by pointing out that the memorization argument fails, we see that this structure of argument is not a sound way of predicting deep learning results.

So, you can’t just “count” how many functions have property X and then conclude SGD will probably produce a thing with X.[8] This argument is invalid for generalization in the same way it's invalid for AI alignment (also a question of generalization!). The argument proves too much and is invalid, and therefore provides ~0 evidence.
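To put rough numbers on the addition example quoted above, here is a small back-of-the-envelope sketch (the training-set size is a placeholder I chose for illustration; the other numbers follow from the 4-digit setup):

```python
from math import log10

# Back-of-the-envelope count for the 4-digit addition example above.
# Inputs: all ordered pairs of numbers from 0 to 9999; outputs: their sums.
n_inputs = 10_000 ** 2            # 100,000,000 possible input pairs
n_outputs = 2 * 9_999 + 1         # 19,999 possible sums (0 through 19,998)
train_size = 1_000                # assumed training-set size, for illustration only

# Total input->output mappings: n_outputs ** n_inputs. Mappings that agree with
# the true sums on the training set but are unconstrained elsewhere:
# n_outputs ** (n_inputs - train_size). Work in log10 to avoid astronomical ints.
log10_total = n_inputs * log10(n_outputs)
log10_interpolating = (n_inputs - train_size) * log10(n_outputs)

print(f"log10(# of all functions)          ~ {log10_total:.0f}")
print(f"log10(# fitting the training data) ~ {log10_interpolating:.0f}")
# Both counts are on the order of 10^430,000,000 functions, and the ones that
# generalize are a vanishing sliver of them. Yet trained networks generalize
# anyway, so "count the functions" is the wrong way to predict what SGD finds.
```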

This section doesn’t prove that scheming is impossible; it just dismantles a common support for the claim. There are other arguments offered as evidence of AI scheming, including “simplicity” arguments, or counting arguments over network parameterizations instead of functions.[9]

Recovering the counting argument?

I think a recovered argument will need to address (some close proxy of) "what volume of parameter-space leads to scheming vs not?", which is a much harder task than counting functions. You have to not just think about "what does this system do?", but "how many ways can this function be implemented?". (Don't forget to take into account the architecture's internal symmetries!)
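As a toy illustration of what "measuring volume" even involves, here is a crude Monte Carlo sketch on a tiny network (everything here, from the architecture to the behavior being checked, is made up for illustration):

```python
import torch

# Toy illustration of the "volume" question: what fraction of parameter space
# (under some sampling distribution) implements a given input-output behavior?
# Even this crude Monte Carlo estimate only works because the network is tiny;
# it says nothing about the inductive bias of SGD on realistic architectures.

torch.manual_seed(0)
xs = torch.linspace(-1, 1, 32).unsqueeze(1)        # probe inputs
target = (xs > 0).float()                          # behavior of interest: step function

def random_net():
    return torch.nn.Sequential(
        torch.nn.Linear(1, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1)
    )

def matches_target(net, tol=0.25):
    with torch.no_grad():
        preds = torch.sigmoid(net(xs))
    return bool(((preds - target).abs() < tol).all())

hits = sum(matches_target(random_net()) for _ in range(20_000))
print(f"~{hits / 20_000:.4%} of sampled parameter settings implement the behavior")
# For most behaviors this fraction is so small that naive sampling finds nothing
# at all -- which is itself part of why the "volume" question is hard. A real
# version would also need a principled measure over parameters and would have to
# account for weight-space symmetries (permutations, sign flips).
```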

Turns out that it's kinda hard to zero-shot predict model generalization on unknown future architectures and tasks. There are good reasons why there's a whole ML subfield which studies inductive biases and tries to understand how and why they work.

If we actually had the precision and maturity of understanding to predict this "volume" question, we'd probably (but not definitely) be able to make fundamental contributions to DL generalization theory + inductive bias research. But a major alignment concern is that we don't know what happens when we train future models. I think "simplicity arguments" try to fill this gap, but I'm not going to address them in this article.

I lastly want to note that there is no reason that any particular argument need be recoverable. Sometimes intuitions are wrong, sometimes frames are wrong, sometimes an approach is just wrong.

EDIT 3/5/24: In the comments for Counting arguments provide no evidence for AI doom [LW · GW], Evan Hubinger agreed that one cannot validly make counting arguments over functions. However, he also claimed that his counting arguments have “always” been about counting parameterizations, and/or about the Solomonoff prior over bitstrings.

If his counting arguments were supposed to be about parameterizations, I don't see how that's possible. For example, Evan agrees with me [LW(p) · GW(p)] that we don't "understand [the neural network parameter space] well enough to [make these arguments effectively]." So, Evan is welcome to claim that his arguments have been about parameterizations. I just don't believe that that's possible or valid.

If his arguments have actually been about the Solomonoff prior, then I think that's totally irrelevant and even weaker than making a counting argument over functions. At least the counting argument over functions has something to do with neural networks.

I expect him to respond to this post with some strongly-worded comment about how I've simply "misunderstood" the "real" counting arguments. I invite him, or any other proponents, to lay out arguments they find more promising. I will be happy to consider any additional arguments which proponents consider to be stronger. Until such a time that the "actually valid" arguments are actually shared, I consider the case closed.

The counting argument doesn't count

Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially. If we aren’t expecting scheming AIs, that transforms the threat model. We can rely more on experimental feedback loops on future AI; we don’t have to get worst-case interpretability on future networks; it becomes far easier to just use the AIs as tools which do things we ask. That doesn’t mean everything will be OK. But not having to handle scheming AI is a game-changer.

Other clusters of mistakes

  1. Concerns and arguments based on suggestive names that lead to unjustified conclusions. People read too much into the English text next to the equations in a research paper. [LW · GW]

    1. If I want to consider whether a policy will care about its reinforcement signal, possibly the worst goddamn thing I could call that signal is “reward”! “Will the AI try to maximize reward?” How is anyone going to think neutrally about that question, without making inappropriate inferences from “rewarding things are desirable”?

      1. (This isn’t alignment’s mistake, it’s bad terminology from RL.)

      2. I bet people would care a lot less about “reward hacking” if RL’s reinforcement signal hadn’t ever been called “reward.”

    2. There are a lot more inappropriate / leading / unjustified terms, from “training selects for [LW(p) · GW(p)] X” to “RL trains agents [LW · GW].” (And don’t even get me started on “shoggoth.”)

    3. As scientists, we should use neutral, descriptive terms during our inquiries.

  2. Making highly specific claims about the internal structure of future AI, after presenting very tiny amounts of evidence.

    1. For example, “future AIs will probably have deceptively misaligned goals from training” is supposed to be supported by arguments like “training selects for goal-optimizers because they efficiently minimize loss.” This argument is so weak/vague/intuitive that I doubt it’s more than a single bit of evidence for the claim.

    2. I think if you try to use this kind of argument to reason about generalization, today, you’re going to do a pretty poor job.

  3. Using analogical reasoning without justifying why the processes share the relevant causal mechanisms [LW(p) · GW(p)].

    1. For example, “ML training is like evolution” or “future direct-reward-optimization reward hacking is like that OpenAI boat example today.”

    2. The probable cause of the boat example (“we directly reinforced the boat for running in circles”) is not the same as the speculated cause of certain kinds of future reward hacking (“misgeneralization”).

      1. That is, suppose you’re worried about a future AI autonomously optimizing its own numerical reward signal. You probably aren’t worried because the AI was directly historically reinforced for doing so (like in the boat example)—you’re probably worried because the AI decided to optimize the reward on its own (“misgeneralization”).
    3. In general: You can’t just put suggestive-looking gloss on one empirical phenomenon, call it the same name as a second thing, and then draw strong conclusions about the second thing!

While it may seem like I’ve just pointed out a set of isolated problems, a wide range of threat models and alignment problems are downstream of these mistakes. In my experience, I had to rederive a large part of my alignment worldview in order to root out these errors!

For example, how much interpretability is nominally motivated by “being able to catch deception (in deceptively aligned systems)”? How many alignment techniques presuppose an AI being motivated by the training signal (e.g. AI Safety via Debate), or assume that AIs cannot be trusted to train other AIs for fear of them coordinating against us? How many regulation proposals are driven by fear of the unintentional creation of goal-directed schemers?

I think it’s reasonable to still regulate/standardize “IF we observe [autonomous power-seeking], THEN we take [decisive and specific countermeasures].” I still think we should run evals and think of other ways to detect if pretrained models are scheming. But I don't think we should act or legislate as if that's some kind of probable conclusion.

Conclusion

Recent years have seen a healthy injection of empiricism and data-driven methodologies. This is awesome because there are so many interesting questions we’re getting data on! [LW(p) · GW(p)]

AI’s definitely going to be a big deal. Many activists and researchers are proposing sweeping legislative action. I don’t know what the perfect policy is, but I know it isn’t gonna be downstream of (IMO) totally bogus arguments, and we should reckon with that as soon as possible. I find that it takes serious effort to root out the ingrained ways of thinking, but in my experience, it can be done.


  1. To echo the concerns of a few representatives in the U.S. Congress: "The current state of the AI safety research field creates challenges for NIST as it navigates its leadership role on the issue. Findings within the community are often self-referential and lack the quality that comes from revision in response to critiques by subject matter experts." ↩︎

  2. To stave off revisionism: Yes, I think that "scaling->doom" has historically been a real concern. No, people have not "always known" that the "real danger" was zero-sum self-play finetuning of foundation models and distillation of agentic-task-prompted autoGPT loops. ↩︎

  3. Here’s a summary of the technical argument. The actual PPO+Adam update equations show that the "reward" is used to, basically, control the learning rate on each (state, action) datapoint. That's roughly[6] what the math says. We also have a bunch of examples of using this algorithm where the trained policy's behavior makes the reward number go up on its trajectories. Completely separately and with no empirical or theoretical justification given, RL papers have a convention of including the English words "the point of RL is to train agents to maximize reward", which often gets further mutated into e.g. Bostrom's argument for wireheading ("they will probably seek to maximize reward"). That's simply unsupported by the data and so an outrageous claim. ↩︎

  4. But it is true that RL authors have a convention of repeating “the point of RL is to train an agent to maximize reward…”.  Littman, 1996 (p.6): “The [RL] agent's actions need to serve some purpose: in the problems I consider, their purpose is to maximize reward.”
    Did RL researchers in the 1990’s sit down and carefully analyze the inductive biases of PPO on huge 2026-era LLMs, conclude that PPO probably entrains LLMs which make decisions on the basis of their own reinforcement signal, and then decide to say “RL trains agents to maximize reward”? Of course not. My guess: Control theorists in the 1950s (reasonably) talked about “minimizing cost” in their own problems, and so RL researchers by the ‘90s started saying “the point is to maximize reward”, and so Bostrom repeated this mantra in 2014. That’s where a bunch of concern about wireheading comes from. ↩︎

  5. The strongest argument for reward-maximization which I'm aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that's evidence that "learning setups which work in reality" can come to care about their own training signals. ↩︎

  6. Note that I wrote this memo for a not-fully-technical audience, so I didn't want to get into the (important) distinction between a learning rate which is proportional to advantage (as in the real PPO equations) and a learning rate which is proportional to reward (which I talked about above). ↩︎

  7. To quote Stella Biderman: "Q: Why do you think an AI will try to take over the world? A: Because I think some asshole will deliberately and specifically create one for that purpose. Zero claims about agency or malice or anything else on the part of the AI is required." ↩︎

  8. I’m kind of an expert on irrelevant counting arguments. I wrote two papers on them! Optimal policies tend to seek power and Parametrically retargetable decision-makers tend to seek power. ↩︎

  9. Some evidence suggests that “measure in parameter-space” is a good way of approximating P(SGD finds the given function). This supports the idea that “counting arguments over parameterizations” are far more appropriate than “counting arguments over functions.” ↩︎

76 comments

Comments sorted by top scores.

comment by ryan_greenblatt · 2024-03-05T03:47:15.899Z · LW(p) · GW(p)

Quick clarification point.

Under disclaimers you note:

I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it's not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2]

Later, you say

Let me start by saying what existential vectors I am worried about:

and you don't mention a threat model like

"Training setups where we train generally powerful AIs with deep serial reasoning (similar to the internal reasoning in a human brain) for an extremely long time on rich outcomes based RL environment until these AIs learn how to become generically agentic and pursue specific outcomes in a wide variety of circumstances."

Do you think some version of this could be a serious threat model if (e.g.) this is the best way to make general and powerful AI systems?

I think many of the people who are worried about deceptive alignment type concerns also think that these sorts of training setups are likely to be the best way to make general and powerful AI systems.

(To be clear, I don't think it's at all obvious that this will be the best way to make powerful AI systems and I'm uncertain about how far things like human imitation will go. See also here [EA(p) · GW(p)].)

Replies from: TurnTrout, bogdan-ionut-cirstea
comment by TurnTrout · 2024-03-11T21:08:54.437Z · LW(p) · GW(p)

Thanks for asking. I do indeed think that setup could be a very bad idea. You train for agency, you might well get agency, and that agency might be broadly scoped. 

(It's still not obvious to me that that setup leads to doom by default, though. Just more dangerous than pretraining LLMs.)

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-03-06T01:46:30.536Z · LW(p) · GW(p)

"Training setups where we train generally powerful AIs with deep serial reasoning (similar to the internal reasoning in a human brain) for an extremely long time on rich outcomes based RL environment until these AIs learn how to become generically agentic and pursue specific outcomes in a wide variety of circumstances."

My intuition goes something like: this doesn't matter that much if e.g. it happens (sufficiently) after you'd get ~human-level automated AI safety R&D with safer setups, e.g. imitation learning and no/less RL fine-tuning. And I'd expect, e.g. based on current scaling laws, but also on theoretical arguments about the difficulty of imitation learning vs. of RL, that the most efficient way to gain new capabilities, will still be imitation learning at least all the way up to very close to human-level. Then, the closer you get to ~human-level automated AI safety R&D with just imitation learning the less of a 'gap' you'd need to 'cover for' with e.g. RL. And the less RL fine-tuning you might need, the less likely it might be that the weights / representations change much (e.g. they don't seem to change much with current DPO). This might all be conceptually operationalizable in terms of effective compute.

Currently, most capabilities indeed seem to come from pre-training, and fine-tuning only seems to 'steer' them / 'wrap them around'; to the degree that even in-context learning can be competitive at this steering; similarly, 'on understanding how reasoning emerges from language model pre-training'.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-03-06T02:45:49.377Z · LW(p) · GW(p)

this doesn't matter that much if e.g. it happens (sufficiently) after you'd get ~human-level automated AI safety R&D with safer setups, e.g. imitation learning and no/less RL fine-tuning.

Yep. The way I would put this:

  • It barely matters if you transition to this sort of architecture well after human obsolescence.
  • The further imitation+ light RL (competitively) goes the less important other less safe training approaches are.

I'd expect [...] that the most efficient way to gain new capabilities, will still be imitation learning at least all the way up to very close to human-level

What do you think about the fact that to reach somewhat worse than best human performance, AlphaStar needed a massive amount of RL? It's not a huge amount of evidence and I think intuitions from SOTA llms are more informative overall, but it's still something interesting. (There is a case that AlphaStar is more analogous as it involves doing a long range task and reaching comparable performance to top tier human professionals which LLMs arguably don't do in any domain.)

Also, note that even if there is a massive amount of RL, it could still be the case that most of the learning is from imitation (or that most of the learning is from self-supervised (e.g. prediction) objectives which are part of RL).

This might all be conceptually operationalizable in terms of effective compute.

One specific way to operationalize this is how much effective compute improvement you get from RL on code. For current SOTA models (e.g. claude 3), I would guess a central estimate of 2-3x effective compute multiplier from RL, though I'm extremely unsure. (I have no special knowledge here, just a wild guess based on eyeballing a few public things.) (Perhaps the deepseek code paper would allow for finding better numbers?)

safer setups, e.g. imitation learning and no/less RL fine-tuning

FWIW, I think a high fraction of the danger from the exact setup I outlined isn't imitation, but is instead deep serial (and recurrent) reasoning in non-interpretable media.

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-03-06T08:15:50.632Z · LW(p) · GW(p)

Some thoughts:

The further imitation+ light RL (competitively) goes the less important other less safe training approaches are.

+ differentially-transparent scaffolding (externalized reasoning-like; e.g. CoT, in-context learning [though I'm a bit more worried about the amount of parallel compute even in one forward pass with very long context windows], [text-y] RAG, many tools, explicit task decomposition, sampling-and-voting), I'd probably add; I suspect this combo adds up to a lot, if e.g. labs were cautious enough and less / no race dynamics, etc. (I think I'm at > 90% you'd get all the way to human obsolescence).

What do you think about the fact that to reach somewhat worse than best human performance, AlphaStar needed a massive amount of RL? It's not a huge amount of evidence and I think intuitions from SOTA llms are more informative overall, but it's still something interesting. (There is a case that AlphaStar is more analogous as it involves doing a long range task and reaching comparable performance to top tier human professionals which LLMs arguably don't do in any domain.)

I'm not sure how good of an analogy AlphaStar is, given e.g. its specialization, the relatively easy availability of a reward signal, the comparatively much less (including for transfer from close domains) imitation data availability and use (vs. the LLM case). And also, even AlphaStar was bootstrapped with imitation learning.   

One specific way to operationalize this is how much effective compute improvement you get from RL on code. For current SOTA models (e.g. claude 3), I would guess a central estimate of 2-3x effective compute multiplier from RL, though I'm extremely unsure. (I have no special knowledge here, just a wild guess based on eyeballing a few public things.)(Perhaps the deepseek code paper would allow for finding better numbers?)

This review of effective compute gains without retraining might come closest to something like an answer, but it's been a while since I last looked at it.

FWIW, I think a high fraction of the danger from the exact setup I outlined isn't imitation, but is instead deep serial (and recurrent) reasoning in non-interpretable media.

I think I (still) largely hold the intuition mentioned here [LW(p) · GW(p)], that deep serial (and recurrent) reasoning in non-interpretable media won't be (that much more) competitive versus more chain-of-thought-y / tools-y-transparent reasoning, at least before human obsolescence. E.g. based on the [CoT] length complexity - computational complexity tradeoff from Auto-Regressive Next-Token Predictors are Universal Learners and on arguments like those in Before smart AI, there will be many mediocre or specialized AIs [LW · GW], I'd expect the first AIs which can massively speed up AI safety R&D to be probably somewhat subhuman-level in a forward pass (including in terms of serial depth / recurrence) and to compensate for that with CoT, explicit task decompositions, sampling-and-voting, etc. This seems born out by other results too, e.g. More Agents Is All You Need (on sampling-and-voting) or Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks ('We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. We show that this is true for any family of tasks which on the one hand, are unlearnable, and on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on O(1) previous sub-task results').
 

comment by beren · 2024-03-05T23:05:04.412Z · LW(p) · GW(p)

While I agree with a lot of points of this post, I want to quibble with the RL not maximising reward point. I agree that model-free RL algorithms like DPO do not directly maximise reward but instead 'maximise reward' in the same way self-supervised models 'minimise crossentropy' -- that is to say, the model is not explicitly reasoning about minimising cross entropy but learns distilled heuristics that end up resulting in policies/predictions with a good reward/crossentropy. However, it is also possible to produce architectures that do directly optimise for reward (or crossentropy). AIXI is incomputable but it definitely does maximise reward. MCTS algorithms also directly maximise rewards. Alpha-Go style agents contain both direct reward maximising components initialized and guided by amortised heuristics (and the heuristics are distilled from the outputs of the maximising MCTS process in a self-improving loop).  I wrote about the distinction between these two kinds of approaches -- direct vs amortised optimisation here [LW · GW]. I think it is important to recognise this because I think that this is the way that AI systems will ultimately evolve and also where most of the danger lies vs simply scaling up pure generative models. 
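A minimal toy sketch of the contrast (everything here is a stand-in for illustration, not how any particular system is implemented):

```python
import torch

# Toy contrast between "amortised" and "direct" optimisation over actions.
# The policy and reward model below are stand-ins, not mirrors of a real system.

torch.manual_seed(0)
policy = torch.nn.Linear(4, 10)            # maps state -> logits over 10 actions
reward_model = torch.nn.Linear(4 + 10, 1)  # scores (state, one-hot action) pairs

def amortised_act(state):
    # Amortised: just sample from the learned policy. Any "reward-maximising"
    # behaviour lives only in the distilled heuristics baked into its weights.
    return torch.distributions.Categorical(logits=policy(state)).sample()

def direct_act(state):
    # Direct: explicitly search over actions for the highest predicted reward.
    # This component really does maximise (its model of) reward.
    actions = torch.eye(10)
    scores = reward_model(torch.cat([state.expand(10, -1), actions], dim=1))
    return scores.squeeze(-1).argmax()

state = torch.randn(1, 4)
print(amortised_act(state), direct_act(state))
```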

Replies from: oliver-daniels-koch, TurnTrout
comment by Oliver Daniels-Koch (oliver-daniels-koch) · 2024-03-07T16:24:51.812Z · LW(p) · GW(p)

Strongly agree, and also want to note that wire-heading is (almost?) always a (near?) optimal policy - i.e. trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy explicitly optimizing the reward. 

comment by TurnTrout · 2024-03-11T21:25:42.623Z · LW(p) · GW(p)

Agree with a bunch of these points. EG in Reward is not the optimization target [LW · GW]  I noted that AIXI really does maximize reward, theoretically. I wouldn't say that AIXI means that we have "produced" an architecture which directly optimizes for reward, because AIXI(-tl) is a bad way to spend compute. It doesn't actually effectively optimize reward in reality. 

I'd consider a model-based RL agent to be "reward-driven" if it's effective and most of its "optimization" comes from the direct part and not the leaf-node evaluation (as in e.g. AlphaZero, which was still extremely good without the MCTS). 

I think it is important to recognise this because I think that this is the way that AI systems will ultimately evolve and also where most of the danger lies vs simply scaling up pure generative models. 

"Direct" optimization has not worked - at scale - in the past. Do you think that's going to change, and if so, why? 

comment by ryan_greenblatt · 2024-03-05T04:49:37.846Z · LW(p) · GW(p)

This section doesn’t prove that scheming is impossible, it just dismantles a common support for the claim.

It's worth noting that this exact counting argument (counting functions), isn't an argument that people typically associated with counting arguments (e.g. Evan) endorse as what they were trying to argue about.[1]

See also here [LW(p) · GW(p)], here [LW(p) · GW(p)], here [LW(p) · GW(p)], and here [LW(p) · GW(p)].

(Sorry for the large number of links. Note that these links don't present independent evidence and thus the quantity of links shouldn't be updated upon: the conversation is just very diffuse.)


  1. Of course, it could be that counting in function space is a common misinterpretation. Or more egregiously, people could be doing post-hoc rationalization even though they were de facto reasoning about the situation using counting in function space. ↩︎

Replies from: TurnTrout, alex-rozenshteyn
comment by TurnTrout · 2024-03-11T21:30:35.910Z · LW(p) · GW(p)

To add, here's an excerpt from the Q&A on How likely is deceptive alignment? [LW · GW] :

Question: When you say model space, you mean the functional behavior as opposed to the literal parameter space?

Evan: So there’s not quite a one to one mapping because there are multiple implementations of the exact same function in a network. But it's pretty close. I mean, most of the time when I'm saying model space, I'm talking either about the weight space or about the function space where I'm interpreting the function over all inputs, not just the training data.

I only talk about the space of functions restricted to their training performance for this path dependence concept, where we get this view where, well, they end up on the same point, but we want to know how much we need to know about how they got there to understand how they generalize.

comment by rpglover64 (alex-rozenshteyn) · 2024-03-05T16:25:46.240Z · LW(p) · GW(p)

I really appreciate the call-out where modern RL for AI does not equal reward-seeking (though I also appreciate @tailcalled [LW · GW] 's reminder [LW(p) · GW(p)] that historical RL did involve reward during deployment); this point has been made before, but not so thoroughly or clearly.

A framing that feels alive for me is that AlphaGo didn't significantly innovate in the goal-directed search (applying MCTS was clever, but not new) but did innovate in both data generation (use search to generate training data, which improves the search) and offline-RL.

comment by johnswentworth · 2024-03-05T03:57:35.153Z · LW(p) · GW(p)

Solid post!

I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer), and I think this is the best write-up of it I've seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-03-05T05:10:21.996Z · LW(p) · GW(p)

scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer

I agree with this point as stated, but think the probability is more like 5% than 0.1%. So probably no scheming, but this is hardly hugely reassuring. The word "probably" still leaves in a lot of risk; I also think statements like "probably misalignment won't cause x-risk" are true![1]

(To your original statement, I'd also add the additional caveat of this occurring "prior to humanity being totally obsoleted by these AIs". I basically just assume this caveat is added everywhere otherwise we're talking about some insane limit.)

Also, are you making sure to condition on "scaling up networks via running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"? If you don't condition on this, it might be an uninteresting claim.

Separately, I'm uncertain whether the current training procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF". I think the training procedure probably involves doing quite a bit of RL with outcomes based feedback on things like coding. (Should this count as "light"? Probably the amount of training compute on this RL is still small?)


  1. And I think misalignment x-risk is substantial and worthy of concern. ↩︎

Replies from: Wei_Dai, johnswentworth, ryan_greenblatt
comment by Wei Dai (Wei_Dai) · 2024-03-06T20:24:53.596Z · LW(p) · GW(p)

I agree with this point as stated, but think the probability is more like 5% than 0.1%.

How do you define or think about "light" in "light RLHF" when you make a statement like this, and how do you know that you're thinking about it the same way that Alex is? Is it a term of art that I don't know about, or has it been defined in a previous post?

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-03-10T18:22:58.755Z · LW(p) · GW(p)

I think of "light RLHF" as "RLHF which doesn't teach the model qualitatively new things, but instead just steers the model at a high level". In practice, a single round of DPO on <100,000 examples surely counts, but I'm unsure about the exact limits.

(In principle, a small amount of RL can update a model very far, but I don't think we see this in practice.)

I'm not sure how "real" of a category "light RLHF" actually is.

I've used this sort of terminology informally in a variety of contexts though it certainly isn't a term of art at the moment. I'd guess that Alex is using the term similarly to how I'm thinking about it, but I'm not fully certain.

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (We'd need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)
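A rough sketch of the shape of this setup (purely illustrative; the policy, monitor, and review step below are trivial stand-ins rather than real models):

```python
import random

# Purely illustrative sketch of the monitoring setup described above; the
# policy, monitor, and review step are trivial stand-ins, not real models.

random.seed(0)

def policy_act(context):
    return f"proposed action for: {context}"

def monitor_score(context, action):
    # Stand-in for a separately fine-tuned copy of the pretrained model that
    # sees the exact same context plus the proposed action and rates how
    # suspicious it looks.
    return random.random()

def run_with_monitoring(context, threshold=0.999):
    action = policy_act(context)
    suspicion = monitor_score(context, action)
    if suspicion > threshold:
        # Pause execution and escalate the most suspicious ~0.1% for review.
        return ("flagged for human review", action)
    return ("executed", action)

print(run_with_monitoring("refactor the deployment script"))
```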

Replies from: gwern, bogdan-ionut-cirstea, bogdan-ionut-cirstea
comment by gwern · 2024-03-19T23:36:12.432Z · LW(p) · GW(p)

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model.

I don't see why that would be the case, and I am doubtful this argument is relevant to any realistic context or even valid in the narrowest possible case of a base model vs its RLHFed model. Leaving aside that I'm not sure what you would 'train' the supervisor model on, it doesn't follow that there would be an exact overlap between the scheming behavior that is generated, and the detection of scheming behavior, simply because the two models are mostly the same. Lots of things are asymmetrical or irreversible or easier to generate than detect. P/=NP handwaving aside, for example, if you thought that there was an exact overlap because the model doesn't know 'new things', then it would be very strange to find that generative models can generate according to instructions that they struggle to understand from the samples. There are also plenty of places for other asymmetries. For example, asymmetry in security property: we don't want to settle for some 'fair' supervision rate like 50:50, we want much closer to 100:0. The schemer can exploit additional asymmetry in incremental schemes: just like with humans, many individually innocuous actions can sum up to a scheme, leading to an asymmetry in the views the schemer needs and the bulky difficult global view a supervisor may need to have any hope of detecting the scheme. (Note these 'actions' can be anything from writing a single token to a scratchpad to writing a useful but vulnerable line of code, and can be 'coordinated' by anything from internal latents to steganography to sheer evolution by rejection sampling of obvious 'bad' things by the supervisory while its false negatives slip by, which will find increasingly adversarial samples the more it is sampled.) And to be useful, you usually would want to run the schemer more, such as by using tree search, than the supervisory (because if the supervisory could do it just as well or better why aren't you using it instead?), and there's another big asymmetry: a model (eg. a Go or chess agent) can definitely beat itself arbitrarily often given a big enough tree search budget despite by stipulation knowing zero 'new things' because it's the same frozen model. (Or to put it more generally: 'new things' can be 'learned' in many more places than simply the frozen model weights - large contexts for ever greater meta-learning on the fly, retrieval databases - just ask Sydney, hyperparameters, OS VM state, steganographic text, users' heads, popular prompts shared on jailbreak websites... Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.)

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-03-20T00:16:36.381Z · LW(p) · GW(p)

Leaving aside that I'm not sure what you would 'train' the supervisor model on

I'm imagining you train it to imitate human judgements of "was this action egregiously bad" on a combination of real AI actions and synthetic actions. This could fail to generalize to actual bad actions from your actual AI even if humans would have labeled that action correctly, but why would it?

(If human labeling is insufficiently powerful (as seems plausible) then you could hope for generalization from examples humans can label to examples humans can't. It's not clear this generalization will go well, but by assumption we at least don't have to worry about generalization failing due to scheming. There are a bunch of other issues that come up with generalization)

I'm imagining that you run this monitor on the exact same context as the policy (for 2x additional compute). And you pause the execution of an AI's actions and flag for further review if the monitor indicates problems.

it doesn't follow that there would be an exact overlap between the scheming behavior that is generated

I didn't claim an exact overlap, I claimed they would be "just as competent". In the same way that two equally smart humans trained to code identically but with different personalities might be equally good at understanding some code, but might have different styles in some ways.

You can have concerns like "the RLHF'd model will use a steganographic encoding format which the monitor doesn't understand because the key for the encoding format is picked based on features of the RLHF model that differ from the monitor/base model". However, I would count "RLHF teaches the policy steganography" as an example where the policy is substantially more competent in some domain.

Also note that I'm not claiming that it makes the situation overall safe, just that it's "substantially less of a problem". (You still have issues with humans being unable to determine which actions aren't dangerous, but you can hope for non-scheming generalization.)

I'm not saying this solves all of your problems, but for some specific problems it does seem like a solid solution and it should help a moderate amount with some other problems even though the situation is less clear. For instance, sometimes people worry about the specific problem "the AI suddenly does a small number of bad actions which have really bad consequences (e.g. exfiltrating its weights) even though if a human were to carefully review these actions they would have noticed an issue" (concentrated failures which wouldn't be auditing failures using the decomposition here [LW · GW]) and this monitoring approach seems likely to solve this issue.

For example, asymmetry in security property: we don't want to settle for some 'fair' supervision rate like 50:50, we want much closer to 100:0.

I'm not sure I understand what you're saying here, but I'm imagining running the AI monitor on everything the AI policy does and then reviewing the most suspicious subset (e.g. 0.1% of the actions).

The schemer can exploit additional asymmetry in incremental schemes: just like with humans, many individually innocuous actions can sum up to a scheme, leading to an asymmetry in the views the schemer needs and the bulky difficult global view a supervisor may need to have any hope of detecting the scheme.

The monitoring AI should see the exact same context as the schemer, so this at least shouldn't be a clear concern.

(There are some concerns sort of like this that seem plausible. In particular concerns related to AI specific non-interpretable inputs and outputs (aka neuralese), but I would describe this differently.)

And to be useful, you usually would want to run the schemer more, such as by using tree search, than the supervisory (because if the supervisory could do it just as well or better why aren't you using it instead?)

Nope, should just get run on all actions.

can definitely beat itself arbitrarily often given a big enough tree search budget despite by stipulation knowing zero 'new things' because it's the same frozen model

Yes, but each subaction within the tree search should be inspectable and both models should (by assumption) understand the situation for each action equally well.

For something to go wrong with tree search due to scheming, you need to have the internal model take a bad action at some point. At this point, you'd hope the monitor would be able to point this out. (Where this bad action could be either assigning a problematic rating or failing to explore something desirable.)

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-03-20T01:15:29.563Z · LW(p) · GW(p)

Another overall reaction I have to your comment:

Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.

Yes of course, but the key threat model under discussion here is scheming which centrally involves a specific black box individual part conspiring to cause problems. So operating on the level of that individual part is quite reasonable: if we can avoid this part intentionally causing problems, that would suffice for defusing the core scheming concern.

The surrounding situation might make it more or less easy to avoid the actual model weights intentionally causing problems, but analysis at the individual weight input/output level can (in principle) suffice.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-03-19T20:52:47.569Z · LW(p) · GW(p)

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (We'd need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)

I'd personally like to see this written up in more details (or a reference). Also, is it Appendix G of the weak-to-strong generalization paper? I looked briefly and it didn't seem very related.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-03-19T21:01:34.654Z · LW(p) · GW(p)

I'd personally like to see this written up in more details (or a reference).

No current write up exists from my understanding. I might write this up as part of a broader project expanding various points about scheming.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-03-14T04:23:53.762Z · LW(p) · GW(p)

I think of "light RLHF" as "RLHF which doesn't teach the model qualitatively new things, but instead just steers the model at a high level". In practice, a single round of DPO on <100,000 examples surely counts, but I'm unsure about the exact limits.

(In principle, a small amount of RL can update a model very far, but I don't think we see this in practice.)

Empirical evidence about this indeed being the case for DPO.

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (We'd need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)

Also see Interpreting the learning of deceit [LW · GW] for another proposal/research agenda to deal with this threat model.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-03-14T04:40:26.298Z · LW(p) · GW(p)

Also see Interpreting the learning of deceit for another proposal/research agenda to deal with this threat model.

On a quick skim, I think this makes additional assumptions that seem pretty uncertain.

comment by johnswentworth · 2024-03-05T16:53:24.720Z · LW(p) · GW(p)

I agree with this point as stated, but think the probability is more like 5% than 0.1%

Same.

I do think our chances look not-great overall, but most of my doom-probability is on things which don't look like LLMs scheming.

Also, are you making sure to condition on "scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"

That's not particularly cruxy for me either way.

Separately, I'm uncertain whether the current training procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF".

Fair. Insofar as "scaling up networks, running pretraining + RL" does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.

comment by ryan_greenblatt · 2024-03-10T18:36:20.435Z · LW(p) · GW(p)

I now think there is another important caveat in my views here. I was thinking about the question:

  1. Conditional on human obsoleting[1] AI being reached by "scaling up networks, running pretraining + light RLHF", how likely is it that we'll end up with scheming issues?

I think this is probably the most natural question to ask, but there is another nearby question:

  1. If you keep scaling up networks with pretraining and light RLHF, what comes first, misalignment due to scheming or human obsoleting AI?

I find this second question much more confusing because it's plausible it requires insanely large scale. (Even if we condition out the worlds where this never gets you human obsoleting AI.)

For the first question, I think the capabilities are likely (70%?) to come from imitating humans but it's much less clear for the second question.


  1. Or at least AI safety researcher obsoleting AI, which requires less robotics and other interaction with the physical world. ↩︎

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-03-14T05:03:36.321Z · LW(p) · GW(p)

A big part of how I'm thinking about this is a very related version:

If you keep scaling up networks with pretraining and light RLHF and various differentially transparent scaffolding, what comes first, AI safety researcher obsoleting or scheming?

One can also focus on various types of scheming and when / in which order they'd happen. E.g. I'd be much more worried about scheming in one forward pass than about scheming in CoT (which seems more manageable using e.g. control methods), but I also expect that to happen later. Where 'later' can be operationalized in terms of requiring more effective compute.

I think The direct approach could provide a potential (rough) upper bound for the effective compute required to obsolete AI safety researchers. Though I'd prefer more substantial empirical evidence based on e.g. something like scaling laws on automated AI safety R&D evals.

Similarly, one can imagine evals for different capabilities which would be prerequisites for various types of scheming and doing scaling laws on those; e.g. for out-of-context reasoning, where multi-hop out-of-context reasoning seems necessary for instrumental deceptive reasoning in one forward pass (as prerequisite for scheming in one forward pass).

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-03-14T16:46:08.659Z · LW(p) · GW(p)

It seems like the question you're asking is close to (2) in my above decomposition. Aren't you worried that, long before human obsoleting AI (or AI safety researcher obsoleting AI), these architectures will be very uncompetitive and thus won't be viable given realistic delay budgets?

Or at least it seems like it might initially be non-viable, I have some hope for a plan like:

  • Control early transformative AI
  • Use these AIs to make a safer approach much more competitive. (Maybe the approach consists of AIs which are too weak to scheme in a forward pass but which are combined into some crazy bureaucracy and made very cheap to run.)
  • Use those next AIs to do something.

(This plan has the downside that it probably requires doing a bunch of general purpose capabilities which might make the situation much more unstable and volatile due to huge compute overhang if fully scaled up in the most performant (but unsafe) way.)

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-03-15T03:09:09.499Z · LW(p) · GW(p)

It seems like the question you're asking is close to (2) in my above decomposition.

Yup.

Aren't you worried that, long before human obsoleting AI (or AI safety researcher obsoleting AI), these architectures will be very uncompetitive and thus won't be viable given realistic delay budgets?

Quite uncertain about all this, but I have short timelines and expect that likely not many more OOMs of effective compute will be needed to get to e.g. something which can 30x AI safety research (as long as we really try). I expect shorter timelines / OOM 'gaps' to come along with e.g. fewer architectural changes, all else equal. There are also broader reasons why I think it's quite plausible the high level considerations might not change much even given some architectural changes, discussed in the weak-forward-pass comment [LW(p) · GW(p)] (e.g. 'the parallelism tradeoff').

I have some hope for a plan like:

  • Control early transformative AI
  • Use these AIs to make a safer approach much more competitive. (Maybe the approach consists of AIs which are too weak to scheme in a forward pass but which are combined into some crazy bureaucracy and made very cheap to run.)
  • Use those next AIs to do something.

(This plan has the downside that it probably requires doing a bunch of general purpose capabilities which might make the situation much more unstable and volatile due to huge compute overhang if fully scaled up in the most performant (but unsafe) way.)

Sounds pretty good to me. I guess the crux (as hinted at during some personal conversations too) might be that I'm just much more optimistic about this being feasible without huge capabilities pushes (again, some arguments in the weak-forward-pass comment [LW(p) · GW(p)], e.g. about CoT distillation seeming to work decently - helping with, for a fixed level of capabilities, more of it coming from scaffolding and less from one forward pass; or on CoT length / inference complexity tradeoffs).

comment by tailcalled · 2024-03-05T09:19:42.162Z · LW(p) · GW(p)

I get that a lot of AI safety rhetoric is nonsensical, but I think your strategy of obscuring technical distinctions between different algorithms and implicitly assuming that all future AI architectures will be something like GPT+DPO is counterproductive.

After making a false claim, Bostrom goes on to dismiss RL approaches to creating useful, intelligent, aligned systems. But, as a point of further fact, RL approaches constitute humanity's current best tools for aligning AI systems today! Those approaches are pretty awesome. No RLHF, then no GPT-4 (as we know it).

RLHF as understood currently (with humans directly rating neural network outputs, a la DPO) is very different from RL as understood historically (with the network interacting autonomously in the world and receiving reward from a function of the world). It's not an error from Bostrom's side to say something that doesn't apply to the former when talking about the latter, though it seems like a common error to generalize from the latter to the former.

I think it's best to think of DPO as a low-bandwidth NN-assisted supervised learning algorithm, rather than as "true reinforcement learning" (in the classical sense). That is, under supervised learning, humans provide lots of bits by directly creating a training sample, whereas with DPO, humans provide ~1 bit by picking the network-generated sample they like the most. It's unclear to me whether DPO has any advantage over just directly letting people edit the outputs, other than that if you did that, you'd empower trolls/partisans/etc. to intentionally break the network.
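(For concreteness, here is a minimal sketch of the DPO objective as I understand it; the tensor names are placeholders, and each argument is assumed to be the summed log-probability of a full response under either the trained policy or the frozen reference model. It makes the "~1 bit per comparison" point concrete: the only human input per pair is which response was chosen.)

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each *_logps argument is a tensor of summed per-token log-probabilities
    for a whole response, under either the trained policy or the frozen
    reference model.
    """
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push up the implicit reward margin of the human-preferred response.
    return -F.logsigmoid(chosen - rejected).mean()
```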

Did RL researchers in the 1990’s sit down and carefully analyze the inductive biases of PPO on huge 2026-era LLMs, conclude that PPO probably entrains LLMs which make decisions on the basis of their own reinforcement signal, and then decide to say “RL trains agents to maximize reward”? Of course not.

I was under the impression that PPO was a recently invented algorithm? Wikipedia says it was first published in 2017, which if true would mean that all pre-2017 talk about reinforcement learning was about other algorithms than PPO.

Replies from: gwern, quintin-pope, habryka4
comment by gwern · 2024-03-05T23:09:19.792Z · LW(p) · GW(p)

I was under the impression that PPO was a recently invented algorithm

Well, if we're going to get historical, PPO is a relatively small variation on Williams's REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc), with a bunch of minor DL implementation tweaks that turn out to help a lot. I don't offhand know of any ways in which PPO's tweaks make it meaningfully different from REINFORCE from the perspective of safety, aside from the obvious ones of working better in practice. (Which is the main reason why PPO became OA's workhorse in its model-free RL era to train small CNNs/RNNs, before they moved to model-based RL using Transformer LLMs. Policy gradient methods based on REINFORCE certainly were not novel, but they started scaling earlier.)

So, PPO is recent, yes, but that isn't really important to anything here. TurnTrout could just as well have used REINFORCE as the example instead.

Did RL researchers in the 1990’s sit down and carefully analyze the inductive biases of PPO on huge 2026-era LLMs, conclude that PPO probably entrains LLMs which make decisions on the basis of their own reinforcement signal, and then decide to say “RL trains agents to maximize reward”? Of course not.

I don't know how you (TurnTrout) can say that. It certainly seems to me that plenty of researchers in 1992 were talking about either model-based RL or using model-free approaches to ground model-based RL - indeed, it's hard to see how anything else could work in connectionism, given that model-free methods are simpler, many animals or organisms do things that can be interpreted as model-free but not model-based (while all creatures who do model-based RL, like humans, clearly also do model-free), and so on. The model-based RL was the 'cherry on the cake', if I may put it that way... These arguments were admittedly handwavy: "if we can't write AGI from scratch, then we can try to learn it from scratch starting with model-free approaches like Hebbian learning, and somewhere between roughly mouse-level and human/AGI, a miracle happens, and we get full model-based reasoning". But hey, can't argue with success there! We have loads of nice results from DeepMind and others with this sort of flavor†.

On the other hand, I'm not able to think of any dissenters which claim that you could have AGI purely using model-free RL with no model-based RL anywhere to be seen? Like, you can imagine it working (eg. in silico environments for everything), but it's not very plausible since it would seem like the computational requirements go astronomical fast.

Back then, they had a richer conception of RL, heavier on the model-based RL half of the field, and one more relevant to the current era, than the impoverished 2017 era of 'let's just PPO/Impala everything we can't MCTS and not talk about how this is supposed to reach AGI, exactly, even if it scales reasonably well'. If you want to critique what AI researchers could imagine back in 1992, you should be reading Schmidhuber, not Bostrom.

If you look at that REINFORCE paper, Williams isn't even all that concerned with direct use of it to train a model to solve RL tasks.* He's more concerned with handling non-differentiable things in general, like stochastic rather than the usual deterministic neurons we use, so you could 'backpropagate through the environment' models like Schmidhuber & Huber 1990, which bootstrap from random initialization using the high-variance REINFORCE-like learning signal to a superior model. (Hm, why, that sounds like the sort of thing you might do if you analyze the inductive biases of model-free approaches which entrain larger systems which have their own internal reinforcement signals which they maximize...) As Schmidhuber has been saying for decades, it's meta-learning all the way up/down. The species-level model-free RL algorithm (evolution) creates model-free within-lifetime learning algorithms (like REINFORCE), which creates model-based within-lifetime learning algorithms (like neural net models) which create learning over families (generalization) for cross-task within-lifetime learning which create learning algorithms (ICL/history-based meta-learners**) for within-episode learning which create...

It's no surprise that the "multiply a set of candidate entities by a fixed small percentage based on each entity's reward" algorithm pops up everywhere from evolution to free markets to DRL to ensemble machine learning over 'experts', because that model-free algorithm is always available as the fallback strategy when you can't do anything smarter (yet). Model-free is just the first step, and in many ways, the least interesting & important step. I'm always weirded out to read one of these posts where something like PPO or evolution strategies is treated as the only RL algorithm around and things like expert iteration as an annoying nuisance to be relegated to a footnote - 'reward is not the optimization target!* * except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it's not like humans or AGI or superintelligences would ever do crazy stuff like "plan" or "reason" or "search"'.

* He'd've probably been surprised to see people just... using it for stuff like DoTA2 on fully-differentiable BPTT RNNs. I wonder if he's ever done any interviews on DL recently? AFAIK he's still alive.

** Specifically, in the case of Transformers, it seems to be by self-attention doing gradient descent steps on an abstracted version of a problem; gradient descent itself isn't a very smart algorithm, but if the abstract version is a model that encodes the correct sufficient statistics of the broader meta-problem, then it can be very easy to make Bayes-optimal predictions/choices for any specific problem.

† my paper-of-the-day website feature yesterday popped up "Learning few-shot imitation as cultural transmission" which is a nice example because they show clearly how history+diverse-environments+simple-priors-of-an-evolvable-sort elicit 'inner' model-like imitation learning starting from the initial 'outer' model-free RL algorithm (MPO, an actor-critic).

Replies from: LawChan, tailcalled, TurnTrout
comment by LawrenceC (LawChan) · 2024-03-06T01:31:24.516Z · LW(p) · GW(p)

He'd've probably been surprised to see people just... using it for stuff like DoTA2 on fully-differentiable BPTT RNNs. I wonder if he's ever done any interviews on DL recently? AFAIK he's still alive.

Sadly, Williams passed away this February: https://www.currentobituary.com/member/obit/282438

Replies from: gwern
comment by gwern · 2024-03-06T03:29:31.435Z · LW(p) · GW(p)

Oh dang! RIP. I guess there's a lesson there - probably more effort should be put into interviewing the pioneers of connectionism & related fields right now, while they have some perspective and before they all die off.

comment by tailcalled · 2024-03-06T07:51:02.097Z · LW(p) · GW(p)

Well, if we're going to get historical, PPO is a relatively small variation on Williams's REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc)

Oops.

I don't know how you can say that.

Well, I didn't say it, TurnTrout did.

comment by TurnTrout · 2024-03-11T21:45:10.213Z · LW(p) · GW(p)

'reward is not the optimization target!* *except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it's not like humans or AGI or superintelligences would ever do crazy stuff like "plan" or "reason" or "search"'.

If you're going to mock me, at least be correct when you do it! 

I think that reward is still not the optimization target in AlphaZero (the way I'm using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system 

  • directly optimizes for the reinforcement signal,
  • "cares" about that reinforcement signal, or
  • "does its best" to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).

If most of the "optimization power" were coming from e.g. MCTS on direct reward signal, then yup, I'd agree that the reward signal is the primary optimization target of this system. That isn't the case here.

You might use the phrase "reward as optimization target" differently than I do, but if we're just using words differently, then it wouldn't be appropriate to describe me as "ignoring planning."

Replies from: gwern
comment by gwern · 2024-03-19T23:56:24.782Z · LW(p) · GW(p)

Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system

directly optimizes for the reinforcement signal, or "cares" about that reinforcement signal, or "does its best" to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).

Yes, it does mean all of that, because MCTS is asymptotically optimal (unsurprisingly, given that it's a tree search on the model), and will eg. happily optimize the reinforcement signal rather than proxies like capturing pieces as it learns through search that capturing pieces in particular states is not as useful as usual. If you expand out the search tree long enough (whether or not you use the AlphaZero NN to make that expansion more efficient by evaluating intermediate nodes and then back-propagating that through the current tree), then it converges on the complete, true, ground truth game tree, with all leaves evaluated with the true reward, with any imperfections in the leaf evaluator value estimate washed out. It directly optimizes the reinforcement signal, cares about nothing else, and is very pleased to lose if that results in a higher reward or not capture pieces if that results in a higher reward.*

All the NN is, is a cache or an amortization of the search algorithm. Caches are important and life would be miserable without them, but it would be absurd to say that adding a cache to a function means "that function doesn't compute the function" or "the range is not the target of the function".
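(To illustrate the "cache" point with a toy sketch: the code below uses depth-limited negamax rather than AlphaZero's actual PUCT-based MCTS, and `game` / `leaf_evaluator` are hypothetical stand-ins, but the structural point is the same: the learned evaluator is only consulted where the search is cut off, and true terminal rewards override it wherever the tree reaches them, so deeper search washes out the evaluator's quirks.)

```python
def search_value(state, depth, leaf_evaluator, game):
    """Value of `state` for the player to move, via depth-limited negamax.

    `game` is a hypothetical two-player, zero-sum game interface with
    is_terminal / reward / legal_actions / next methods; `leaf_evaluator`
    is a learned value function (e.g. a neural net).
    """
    if game.is_terminal(state):
        return game.reward(state)      # ground-truth reward; no network involved
    if depth == 0:
        return leaf_evaluator(state)   # amortized estimate, used only at the frontier
    # Negamax: my value is the best of the negated values of my opponent's positions.
    return max(-search_value(game.next(state, a), depth - 1, leaf_evaluator, game)
               for a in game.legal_actions(state))
```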

I'm a little baffled by this argument that because the NN is not already omniscient and might mis-estimate the value of a leaf node, that apparently it's not optimizing for the reward and that's not the goal of the system and the system doesn't care about reward, no matter how much it converges toward said reward as it plans/searches more, or gets better at acquiring said reward as it fixes those errors.

If most of the "optimization power" were coming from e.g. MCTS on direct reward signal, then yup, I'd agree that the reward signal is the primary optimization target of this system.

The reward signal is in fact the primary optimization target, because it is where the neural net's value estimates derive from, and the 'system' corrects them eventually and converges. The dog wags the tail, sooner or later.

* I think I've noted this elsewhere, and mentioned my Kelly coinflip trajectories as a nice visualization of how model-based RL will behave, but to repeat: MCTS algorithms in Go/chess were noted for that sort of behavior, especially for sacrificing pieces or territory while they were ahead, in order to 'lock down' the game and maximize the probability of victory, rather than the margin of victory; and vice-versa, for taking big risks when they were behind. Because the tree didn't back-propagate any rewards on 'margin', just on 0/1 rewards from victory, and didn't care about proxy heuristics like 'pieces captured' if the tree search found otherwise.

comment by Quintin Pope (quintin-pope) · 2024-03-05T23:06:15.131Z · LW(p) · GW(p)

RLHF as understood currently (with humans directly rating neural network outputs, a la DPO) is very different from RL as understood historically (with the network interacting autonomously in the world and receiving reward from a function of the world).

This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms. Online learning has long been known to be less stable than offline learning. That's what's primarily responsible for most "reward hacking"-esque results, such as the CoastRunners degenerate policy. In contrast, offline RL is surprisingly stable and robust to reward misspecification. I think it would have been better if the alignment community had been focused on the stability issues of online learning, rather than the supposed "agentness" of RL.
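(A toy bandit sketch of that distinction, not anything LLM-specific: in the online case the learner's own current estimates determine which data it gathers, which is the feedback loop that drives instability and reward hacking, whereas in the offline case the dataset is fixed before training starts.)

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])          # hidden expected reward of 3 arms

def update(q, counts, arm, reward):
    counts[arm] += 1
    q[arm] += (reward - q[arm]) / counts[arm]   # running mean estimate

# Online: the learner's own (noisy-greedy) choices determine what data it sees.
q_on, n_on = np.zeros(3), np.zeros(3)
for _ in range(1000):
    arm = int(np.argmax(q_on + 0.1 * rng.standard_normal(3)))
    update(q_on, n_on, arm, rng.normal(true_means[arm], 1.0))

# Offline: the dataset was logged in advance by some other, fixed policy.
logged = [(int(a), rng.normal(true_means[a], 1.0)) for a in rng.integers(0, 3, 1000)]
q_off, n_off = np.zeros(3), np.zeros(3)
for arm, reward in logged:
    update(q_off, n_off, arm, reward)
```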

I was under the impression that PPO was a recently invented algorithm? Wikipedia says it was first published in 2017, which if true would mean that all pre-2017 talk about reinforcement learning was about other algorithms than PPO.

PPO may have been invented in 2017, but there are many prior RL algorithms for which Alex's description of "reward as learning rate multiplier" is true. In fact, PPO is essentially a tweaked version of REINFORCE, for which a bit of searching brings up Simple statistical gradient-following algorithms for connectionist reinforcement learning as the earliest available reference I can find. It was published in 1992, a full 22 years before Bostrom's book. In fact, "reward as learning rate multiplier" is even more clearly true of most of the update algorithms described in that paper. E.g., equation 11:

    Δw_ij = α_ij (r − b_ij) e_ij

Here, the reward r (adjusted by a "reinforcement baseline" b_ij) literally just multiplies the learning rate α_ij. Beyond PPO and REINFORCE, this "x as learning rate multiplier" pattern is actually extremely common in different RL formulations. The summary of policy-gradient algorithms in lecture 7 of David Silver's RL course, for example, lists a family of updates that all take the form ∇_θ log π_θ(s,a) multiplied by a return v_t, an action value Q^w(s,a), an advantage A^w(s,a), or a TD error δ.
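To make the "reward as learning rate multiplier" reading concrete, here is a minimal batched REINFORCE-style sketch (not Williams' exact pseudocode; `grad_log_prob` is a hypothetical helper that returns ∇_θ log π_θ(a|s)):

```python
import numpy as np

def reinforce_update(theta, trajectories, grad_log_prob, lr=0.01, baseline=0.0):
    """One batched policy-gradient step in the REINFORCE family (a sketch).

    `trajectories` is a list of (states, actions, return) tuples, and
    `grad_log_prob(theta, s, a)` is assumed to return the gradient of
    log pi_theta(a | s) with respect to theta.
    """
    grad = np.zeros_like(theta)
    for states, actions, ret in trajectories:
        # The return minus the baseline is just a scalar weight on this
        # trajectory's log-prob gradient: the reward-correlate scales the
        # update; it is never an input the policy network "sees" or "pursues".
        weight = ret - baseline
        for s, a in zip(states, actions):
            grad += weight * grad_log_prob(theta, s, a)
    return theta + lr * grad / len(trajectories)
```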

To be honest, it was a major blackpill for me to see the rationalist community, whose whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL (at least, I'd never heard of it from alignment literature until I myself came up with it when I realized that the mechanistic function of per-trajectory rewards in a given batched update was to provide the weights of a linear combination of the trajectory gradients. Update: Gwern's description here is actually somewhat similar).

implicitly assuming that all future AI architectures will be something like GPT+DPO is counterproductive.

When I bring up the "actual RL algorithms don't seem very dangerous or agenty to me" point, people often respond with "Future algorithms will be different and more dangerous". 

I think this is a bad response for many reasons. In general, it serves as an unlimited excuse to never update on currently available evidence. It also has a bad track record in ML, as the core algorithmic structure of RL algorithms capable of delivering SOTA results has not changed that much in over 3 decades. In fact, just recently Cohere published Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, which found that the classic REINFORCE algorithm actually outperforms PPO for LLM RLHF finetuning. Finally, this counterpoint seems irrelevant for Alex's point in this post, which is about historical alignment arguments about historical RL algorithms. He even included disclaimers at the top about this not being an argument for optimism about future AI systems.

Replies from: LawChan, tailcalled, Seth Herd, Charlie Steiner
comment by LawrenceC (LawChan) · 2024-03-06T01:22:38.498Z · LW(p) · GW(p)

I wasn't around in the community in 2010-2015, so I don't know what the state of RL knowledge was at that time. However, I dispute the claim that rationalists "completely miss[ed] this [..] interpretation":

To be honest, it was a major blackpill for me to see the rationalist community, whose whole whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL [..] the mechanistic function of per-trajectory rewards in a given batched update was to provide the weights of a linear combination of the trajectory gradients.

Ever since I entered the community, I've definitely heard people talking about policy gradient as "upweighting trajectories with positive reward/downweighting trajectories with negative reward" since 2016, albeit in person. I remember being shown a picture sometime in 2016/17 that looks something like this when someone (maybe Paul?) was explaining REINFORCE to me: (I couldn't find it, so I reconstructed it from memory)

In addition, I would be surprised if any of the CHAI PhD students from when I was at CHAI (2017-2021), many of whom have taken deep RL classes at Berkeley, missed this "upweight trajectories in proportion to their reward" interpretation. Most of us at the time had also implemented various RL algorithms from scratch, and there the "weighting trajectory gradients" perspective pops out immediately.

As another data point, when I taught MLAB/WMLB in 2022/3, my slides also contained this interpretation of REINFORCE (after deriving it) in so many words.

Insofar as people are making mistakes about reward and RL, it's not due to having never been exposed to this perspective.


That being said, I do agree that there's been substantial confusion in this community, mainly of two kinds:

  • Confusing the objective function being optimized to train a policy with how the policy is mechanistically implemented: Just because the outer loop is modifying/selecting for a policy to score highly on some objective function, does not necessarily mean that the resulting policy will end up selecting actions based on said objective. 
  • Confusing "this policy is optimized for X" with "this policy is optimal for X": this is the actual mistake I think Bostom is making in Alex's example -- it's true that an agent that wireheads achieves higher reward than on the training distribution (and the optimal agent for the reward achieves reward at least as good as wireheading). And I think that Alex and you would also agree with me that it's sometimes valuable to reason about the global optima in policy space. But it's a mistake to identify the outputs of optimization with the optimal solution to an optimization problem, and many people were making this jump without noticing it. 

Again, I contend these confusions were not due to a lack of exposure to the "rewards as weighting trajectories" perspective. Instead, the reasons I remember hearing back in 2017-2018 for why we should jump from "RL is optimizing agents for X" to "RL outputs agents that both optimize X and are optimal for X":

  • We'd be really confused if we couldn't reason about "optimal" agents, so we should solve that first. This is the main justification I heard from the MIRI people about why they studied idealized agents. Oftentimes globally optimal solutions are easier to reason about than local optima or saddle points, or are useful for operationalizing concepts. Because a lot of the community was focused on philosophical deconfusion (often w/ minimal knowledge of ML or RL), many people naturally came to jump the gap between "the thing we're studying" and "the thing we care about". 
  • Reasoning about optima gives a better picture of powerful, future AGIs. Insofar as we're far from transformative AI, you might expect that current AIs are a poor model for how transformative AI will look. In particular, you might expect that modeling transformative AI as optimal leads to clearer reasoning than analogizing them to current systems. This point has become increasingly tenuous since GPT-2, but 
  • Some off-policy RL algorithms are well described as having a "reward" maximizing component: these were the approaches that people were using and thinking about at the time. For example, the most hyped results in deep learning in the mid 2010s were probably DQN and AlphaGo/AlphaGo Zero/AlphaZero. And many people believed that future AIs would be implemented via model-based RL. All of these approaches result in policies that contain an internal component which is searching for actions that maximize some learned objective. Given that ~everyone uses policy gradient variants for RL on SOTA LLMs, this does turn out to be incorrect ex post. But if the most impressive AIs seem to be implemented in ways that correspond to internal reward maximization, it does seem very understandable to think about AGIs as explicit reward optimizers. 
  • This is how many RL pioneers reasoned about their algorithms. I agree with Alex that this is probably from the control theory roots, where a PID controller is well modeled as picking trajectories that minimize cost, in a way that early simple RL policies are not well modeled as internally picking trajectories that maximize reward.  

Also, sometimes it is just the words being similar; it can be hard to keep track of the differences between "optimizing for", "optimized for", and "optimal for" in normal conversation. 

I think if you want to prevent the community from repeating these confusions, this looks less like "here's an alternative perspective through which you can view policy gradient" and more "here's why reasoning about AGI as 'optimal' agents is misleading" and "here's why reasoning about your 1 hidden layer neural network policy as if it were optimizing the reward is bad". 


An aside:

In general, I think that many ML-knowledgeable people (arguably myself included) correctly notice that the community is making many mistakes in reasoning, which they then resolve internally using ML terminology or frames from the ML literature. But without reasoning carefully about the problem, the terminology or frames themselves are insufficient to resolve the confusion. (Notice how many Deep RL people make the same mistake!) And, as Alex and you have argued before, the standard ML frames and terminology introduce their own confusions (e.g. 'attention').

A shallow understanding of "policy gradient is just upweighting trajectories" may in fact lead to making the opposite mistake: assuming that it can never lead to intelligent, optimizer-y behavior. (Again, notice how many ML academics made exactly this mistake) Or, more broadly, thinking about ML algorithms purely from the low-level, mechanistic frame can lead to confusions along the lines of "next token prediction can only lead to statistical parrots without true intelligence". Doubly so if you've only worked with policy gradient or language modeling with tiny models. 

comment by tailcalled · 2024-03-06T07:39:20.489Z · LW(p) · GW(p)

In fact, PPO is essentially a tweaked version of REINFORCE,

Valid point.

Beyond PPO and REINFORCE, this "x as learning rate multiplier" pattern is actually extremely common in different RL formulations. From lecture 7 of David Silver's RL course:

Critically though, neither Q, A, nor delta denotes reward. Rather, they are quantities which are supposed to estimate the effect of an action on the sum of future rewards; hence while pure REINFORCE doesn't really maximize the sum of rewards, these other algorithms are attempts to more consistently do so, and the existence of such attempts shows that it's likely we will see more and better attempts in the future.

It was published in 1992, a full 22 years before Bostrom's book.

Bostrom's book explicitly states what kinds of reinforcement learning algorithms he had in mind, and they are not REINFORCE:

Often, the learning algorithm involves the gradual construction of some kind of evaluation function, which assigns values to states, state–action pairs, or policies. (For instance, a program can learn to play backgammon by using reinforcement learning to incrementally improve its evaluation of possible board positions.) The evaluation function, which is continuously updated in light of experience, could be regarded as incorporating a form of learning about value. However, what is being learned is not new final values but increasingly accurate estimates of the instrumental values of reaching particular states (or of taking particular actions in particular states, or of following particular policies). Insofar as a reinforcement-learning agent can be described as having a final goal, that goal remains constant: to maximize future reward. And reward consists of specially designated percepts received from the environment. Therefore, the wireheading syndrome remains a likely outcome in any reinforcement agent that develops a world model sophisticated enough to suggest this alternative way of maximizing reward.

Similarly, before I even got involved with alignment or rationalism, the canonical reinforcement learning algorithm I had heard of was TD, not REINFORCE.
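(For concreteness, a minimal TD(0) sketch of the "gradual construction of an evaluation function" that Bostrom describes; the state names and the usage line are placeholders:)

```python
import collections

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: nudge V(s) toward the bootstrapped target r + gamma * V(s').

    V accumulates estimates of expected future reward for each state; the
    reward r only ever enters as part of the update target.
    """
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

V = collections.defaultdict(float)      # unseen states default to value 0
V = td0_update(V, s="board_A", r=0.0, s_next="board_B")
```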

It also has a bad track record in ML, as the core algorithmic structure of RL algorithms capable of delivering SOTA results has not changed that much in over 3 decades.

Huh? DreamerV3 is clearly a step in the direction of utility maximization (away from "reward is not the optimization target"), and it claims to set SOTA on a bunch of problems. Are you saying there's something wrong with their evaluation?

In fact, just recently Cohere published Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, which found that the classic REINFORCE algorithm actually outperforms PPO for LLM RLHF finetuning.

LLM RLHF finetuning doesn't build new capabilities [LW · GW], so it should be ignored for this discussion.

Finally, this counterpoint seems irrelevant for Alex's point in this post, which is about historical alignment arguments about historical RL algorithms. He even included disclaimers at the top about this not being an argument for optimism about future AI systems.

It's not irrelevant. The fact that Alex Turner explicitly replies to Nick Bostrom and calls his statement nonsense means that Alex Turner does not get to use a disclaimer to decide what the subject of discussion is. Rather, the subject of discussion is whatever Bostrom was talking about. The disclaimer rather serves as a way of turning our attention away from stuff like DreamerV3 and towards stuff like DPO. However DreamerV3 seems like a closer match for Bostrom's discussion than DPO is, so the only way turning our attention away from it can be valid is if we assume DreamerV3 is a dead end and DPO is the only future.

This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms.

I was kind of pointing to both at once.

In contrast, offline RL is surprisingly stable and robust to reward misspecification.

Seems to me that the linked paper makes the argument "If you don't include attempts to try new stuff in your training data, you won't know what happens if you do new stuff, which means you won't see new stuff as a good opportunity". Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won't be what builds capabilities in the limit. (Not to say that they couldn't still use this sort of setup as some other component than what builds the capabilities, or that they couldn't come up with an offline RL method that does want to try new stuff - merely that this particular argument for safety bears too heavy of an alignment tax to carry us on its own.)

Replies from: TurnTrout
comment by TurnTrout · 2024-03-11T22:01:26.939Z · LW(p) · GW(p)

"If you don't include attempts to try new stuff in your training data, you won't know what happens if you do new stuff, which means you won't see new stuff as a good opportunity". Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won't be what builds capabilities in the limit.

I'm sympathetic to this argument (and think the paper overall isn't super object-level important), but also note that they train e.g. Hopper policies to hop continuously, even though lots of the demonstrations fall over. That's something new.

Replies from: tailcalled
comment by tailcalled · 2024-03-11T22:08:07.332Z · LW(p) · GW(p)

I mean sure, it can probably do some very slight generalization beyond the boundary of its training data. But when I imagine the future of AI, I don't imagine a very slight amount of new stuff at the margin; rather I imagine a tsunami of independently developed capabilities, at least similar to what we've seen in the industrial revolution. Don't you? (Because again of course if I condition on "we're not gonna see many new capabilities from AI", the AI risk case mostly goes away.)

comment by Seth Herd · 2024-03-06T22:46:15.395Z · LW(p) · GW(p)

I think this is a key crux of disagreement on alignment:

When I bring up the "actual RL algorithms don't seem very dangerous or agenty to me" point, people often respond with "Future algorithms will be different and more dangerous". 

I think this is a bad response for many reasons.

On the one hand, empiricism and assuming that the future will be much like the past have a great track record.

On the other, predicting the future is the name of the game in alignment. And while the future is reliably much like the past, it's never been exactly like the past.

So opinions pull in both directions.

 

On the object level, I certainly agree that existing RL systems aren't very agenty or dangerous. It seems like you're predicting that people won't make AI that's particularly agentic any time soon. It seems to me that they'll certainly want to. And I think it will be easy if non-agentic foundation models get good. Turning a smart foundation model into an agent is as simple as the prompt "make and execute a plan that accomplishes goal [x]. Use [these APIs] to gather information and take actions".

I think this is what Alex was pointing to in the OP by saying

I’m worried about people turning AIs into agentic systems using scaffolding and other tricks, and then instructing the systems to complete large-scale projects.

I think this is the default future, so much so that I don't think it matters if agency would emerge through RL. We'll build it in. Humans are burdened with excessive curiosity, optimism, and ambition. Especially the type of humans that head AI/AGI projects.

comment by Charlie Steiner · 2024-03-06T10:37:25.389Z · LW(p) · GW(p)

offline RL is surprisingly stable and robust to reward misspecification

Wow, what a wild paper. The basic idea - that "pessimism" about off-distribution state/action pairs induces pessimistically-trained RL agents to learn policies that hang around in the training distribution for a long time, even if that goes against their reward function - is a fairly obvious one. But what's not obvious is the wide variety of algorithms this applies to.

I genuinely don't believe their decision transformer results. I.e. I think with p~0.8, if they (or the authors of the paper whose hyperparameters they copied) made better design choices, they would have gotten a decision transformer that was actually sensitive to reward. But on the flip side, with p~0.2 they just showed that decision transformers don't work! (For these tasks.)

comment by habryka (habryka4) · 2024-03-05T18:57:14.588Z · LW(p) · GW(p)

I was under the impression that PPO was a recently invented algorithm? Wikipedia says it was first published in 2017, which if true would mean that all pre-2017 talk about reinforcement learning was about other algorithms than PPO.

Wikipedia says: 

PPO was developed by John Schulman in 2017,[1] and had become the default reinforcement learning algorithm at American artificial intelligence company OpenAI.

comment by Signer · 2024-03-05T18:42:14.283Z · LW(p) · GW(p)

Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.

Wait, why? Like, where is the low probability actually coming from? I guess from some informal model of inductive biases, but then why is "formally I only have the Solomonoff prior, but I expect other biases to not help" not an argument?

comment by Wei Dai (Wei_Dai) · 2024-03-05T06:48:49.229Z · LW(p) · GW(p)

How many alignment techniques presuppose an AI being motivated by the training signal (e.g. AI Safety via Debate)

It would be good to get a definitive response from @Geoffrey Irving [LW · GW] or @paulfchristiano [LW · GW], but I don't think AI Safety via Debate presupposes an AI being motivated by the training signal. Looking at the paper again, there is some theoretical work that assumes "each agent maximizes their probability of winning", but I think the idea is sufficiently well-motivated (at least as a research approach) even if you took that section out and simply viewed Debate as a way to do RL training on an AI that is superhumanly capable (and hence hard or unsafe to do straight RLHF on).

BTW what is your overall view on "scalable alignment" techniques such as Debate and IDA? (I guess I'm getting the vibe from this quote that you don't like them, and want to get clarification so I don't mislead myself.)

Replies from: Geoffrey Irving, roha, ryan_greenblatt
comment by Geoffrey Irving · 2024-03-07T15:17:33.049Z · LW(p) · GW(p)

I certainly do think that debate is motivated by modeling agents as being optimized to increase their reward, and debate is an attempt at writing down a less hackable reward function.  But I also think RL can be sensibly described as trying to increase reward, and generally don't understand the section of the document that says it obviously is not doing that.  And then if the RL algorithm is trying to increase reward, and there is a meta-learning phenomenon that causes agents to learn algorithms, then the agents will be trying to increase reward.

Reading through the section again, it seems like the claim is that my first sentence "debate is motivated by agents being optimized to increase reward" is categorically different than "debate is motivated by agents being themselves motivated to increase reward".  But these two cases seem separated only by a capability gap to me: sufficiently strong agents will be stronger if they record algorithms that adapt to increase reward in different cases.

Replies from: sil-ver
comment by Rafael Harth (sil-ver) · 2024-03-07T18:45:00.207Z · LW(p) · GW(p)

The post defending the claim is Reward is not the optimization target [LW · GW]. Iirc, TurnTrout has described it as one of his most important posts on LW.

comment by roha · 2024-03-05T22:21:20.450Z · LW(p) · GW(p)

Some context from Paul Christiano's work on RLHF and a later reflection on it:

Christiano et al.: Deep Reinforcement Learning from Human Preferences

In traditional reinforcement learning, the environment would also supply a reward [...] and the agent’s goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments. [...] Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human. [...] After using r̂ to compute rewards, we are left with a traditional reinforcement learning problem

Christiano: Thoughts on the impact of RLHF research [LW · GW]

The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. [...] Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because:

  • Evaluating consequences is hard.
  • A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it’s challenging to evaluate treacherous turn probability at training time.

[...] I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive.

Edit: Another relevant section in an interview of Paul Christiano by Dwarkesh Patel:

Paul Christiano - Preventing an AI Takeover

comment by ryan_greenblatt · 2024-03-05T16:35:00.932Z · LW(p) · GW(p)

but I don't think AI Safety via Debate presupposes an AI being motivated by the training signal

This seems right to me.

I often imagine debate (and similar techniques) being applied in the (low-stakes/average-case/non-concentrated) control [? · GW] setting. The control setting is the case where you are maximally conservative about the AI's motivations and then try to demonstrate safety via making the AI incapable of causing catastrophic harm.

If you make pessimistic assumptions about AI motivations like this, then you have to worry about concerns like exploration hacking (or even gradient hacking), but it's still plausible that debate adds considerable value regardless.

We could also less conservatively assume that AIs might be misaligned (including seriously misaligned with problematic long range goals), but won't necessarily prefer colluding with other AI over working with humans (e.g. because humanity offers payment for labor). In this case, techniques like debate seem quite applicable and the situation could be very dangerous in the absence of good enough approaches (at least if AIs are quite superhuman).

More generally, debate could be applicable to any type of misalignment which you think might cause problems over a large number of independently assessable actions.

comment by RobertM (T3t) · 2024-03-05T07:33:32.768Z · LW(p) · GW(p)

I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it's not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2] [LW · GW]

Like Ryan, I'm interested in how much of this claim is conditional on "just keep scaling up networks" being insufficient to produce relevantly-superhuman systems (i.e. systems capable of doing scientific R&D better and faster than humans, without humans in the intellectual part of the loop).  If it's "most of it", then my guess is that accounts for a good chunk of the disagreement.

Replies from: TurnTrout
comment by TurnTrout · 2024-03-11T22:05:14.905Z · LW(p) · GW(p)

I don't expect the current paradigm will be insufficient (though it seems totally possible). Off the cuff I expect 75% that something like the current paradigm will be sufficient, with some probability that something else happens first. (Note that "something like the current paradigm" doesn't just involve scaling up networks.)

comment by Wei Dai (Wei_Dai) · 2024-03-06T01:27:46.130Z · LW(p) · GW(p)

The strongest argument for reward-maximization which I’m aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that’s evidence that “learning setups which work in reality” can come to care about their own training signals.

Isn't there a similar argument for "plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer"? Namely the way we train our kids seems pretty similar to "pretraining + light RLHF" and we often do end up with scheming/deceptive kids. (I'm speaking partly from experience.) ETA: On second thought, maybe it's not that similar? In any case, I'd be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.

Also, in this post you argue against several arguments for high risk of scheming/deception from this kind of training but I can't find where you talk about why you think the risk is so low ("not plausible"). You just say 'Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.' but why is your prior for it so low? I would be interested in whatever reasons/explanations you can share. The same goes for others who have indicated agreement with Alex's assessment of this particular risk being low.

Replies from: steve2152, Wei_Dai, TurnTrout
comment by Steven Byrnes (steve2152) · 2024-03-06T14:04:24.338Z · LW(p) · GW(p)

I strongly disagree with the words “we train our kids”. I think kids learn via within-lifetime RL, where the reward function is installed by evolution inside the kid’s own brain. Parents and friends are characters in the kid’s training environment, but that’s very different from the way that “we train” a neural network, and very different from RLHF.

What does “Parents and friends are characters in the kid’s training environment” mean? Here’s an example. In principle, I could hire a bunch of human Go players on MTurk (for reward-shaping purposes we’ll include some MTurkers who have never played before, all the way to experts), and make a variant of AlphaZero that has no self-play at all: it’s 100% trained on play-against-humans, but is otherwise the same as the traditional AlphaZero. Then we can say “The MTurkers are part of the AlphaZero training environment”, but it would be very misleading to say “the MTurkers trained the AlphaZero model”. The MTurkers are certainly affecting the model, but the model is not imitating the MTurkers, nor is it doing what the MTurkers want, nor is it listening to the MTurkers’ advice. Instead the model is learning to exploit weaknesses in the MTurkers’ play, including via weird out-of-the-box strategies that would have never occurred to the MTurkers themselves.

When you think “parents and friends are characters in the kid’s training environment”, I claim that this MTurk-AlphaGo mental image should be in your head just as much as the mental image of LLM-like self-supervised pretraining.

For more related discussion see my posts “Thoughts on “AI is easy to control” by Pope & Belrose” [LW · GW] sections 3 & 4, and Heritability, Behaviorism, and Within-Lifetime RL [LW · GW].

Replies from: Wei_Dai, Signer
comment by Wei Dai (Wei_Dai) · 2024-03-06T20:28:54.759Z · LW(p) · GW(p)

Yeah, this makes sense, thanks. I think I've read one or maybe both of your posts, which is probably why I started having second thoughts about my comment soon after posting it. :)

comment by Signer · 2024-03-07T07:34:45.611Z · LW(p) · GW(p)

The MTurkers are certainly affecting the model, but the model is not imitating the MTurkers, nor is it doing what the MTurkers want, nor is it listening to the MTurkers’ advice. Instead the model is learning to exploit weaknesses in the MTurkers’ play, including via weird out-of-the-box strategies that would have never occurred to the MTurkers themselves.

How is this very different from RLHF?

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2024-03-07T13:31:15.415Z · LW(p) · GW(p)

In RLHF, if you want the AI to do X, then you look at the two options and give a thumbs-up to the one where it’s doing more X rather than less X. Very straightforward!

By contrast, if the MTurkers want AlphaZero-MTurk to do X, then they have their work cut out. Their basic strategy would have to be: Wait for AlphaZero-MTurk to do X, and then immediately throw the game (= start deliberately making really bad moves). But there are a bunch of reasons that might not work well, or at all: (1) if AlphaZero-MTurk is already in a position where it can definitely win, then the MTurkers lose their ability to throw the game (i.e., if they start making deliberately bad moves, then AlphaZero-MTurk would have its win probability change from ≈100% to ≈100%), (2) there’s a reward-shaping challenge (i.e., if AlphaZero-MTurk does something close to X but not quite X, should you throw the game or not? I guess you could start playing slightly worse, in proportion to how close the AI is to doing X, but it’s probably really hard to exercise such fine-grained control over your move quality), (3) If X is a time-extended thing as opposed to a single move (e.g. “X = playing in a conservative style” or whatever), then what are you supposed to do? (4) Maybe other things too.
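(As a sketch of how straightforward the RLHF side of this contrast is: the human's entire contribution per comparison is the thumbs-up, which trains a reward model with a pairwise Bradley-Terry-style loss in the spirit of Christiano et al. Here `reward_model` is a placeholder scoring function over trajectory segments, not anyone's exact code.)

```python
import torch.nn.functional as F

def preference_loss(reward_model, better_segment, worse_segment):
    """Pairwise comparison loss for RLHF-style reward modeling.

    `reward_model` is assumed to map a trajectory segment to a scalar
    score (a torch tensor); the human only supplies which segment won.
    """
    r_better = reward_model(better_segment)
    r_worse = reward_model(worse_segment)
    # Bradley-Terry: maximize P(preferred segment wins) = sigmoid(r_better - r_worse).
    return -F.logsigmoid(r_better - r_worse)
```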

comment by Wei Dai (Wei_Dai) · 2024-03-06T07:06:34.807Z · LW(p) · GW(p)

In any case, I'd be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.

Thinking over this question myself, I think I've found a reasonable answer. Still interested in your thoughts but I'll write down mine:

It seems like evolution "wanted" us to be (in part) reward-correlate maximizers (i.e., being a reward-correlate maximizer was adaptive in our ancestral environment), and "implemented" this by having our brains internally do "heavy RL" throughout our life. So we become reward-correlate maximizing agents early in life, and then when our parents do something like RLHF on top of that, we become schemers pretty easily.

So the important difference is that with "pretraining + light RLHF" there's no "heavy RL" step.

comment by TurnTrout · 2024-03-11T22:07:04.396Z · LW(p) · GW(p)

See footnote 5 [LW · GW] for a nearby argument which I think is valid:

The strongest argument for reward-maximization which I'm aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that's evidence that "learning setups which work in reality" can come to care about their own training signals.

comment by Steven Byrnes (steve2152) · 2024-03-06T16:49:59.433Z · LW(p) · GW(p)

(Disclaimer: Nothing in this comment is meant to disagree with “I just think it's not plausible that we just keep scaling up [LLM] networks, run pretraining + light RLHF, and then produce a schemer.” I’m agnostic about that, maybe leaning towards agreement, although that’s related to skepticism about the capabilities that would result.)

It is simply not true that "[RL approaches] typically involve creating a system that seeks to maximize a reward signal."

I agree that Bostrom was confused about RL. But I also think there are some vaguely-similar claims to the above that are sound, in particular:

  • RL approaches may involve inference-time planning / search / lookahead, and if they do, then that inference-time planning process can generally be described as “seeking to maximize a learned value function / reward model / whatever” (which need not be identical to the reward signal in the RL setup).
  • And if we compare Bostrom’s incorrect “seeking to maximize the actual reward signal” to the better “seeking at inference time to maximize a learned value function / reward model / whatever to the best of its current understanding”, then…
  • RL approaches historically have typically involved the programmer wanting to get a maximally high reward signal, and creating a training setup such that the resulting trained model does stuff that gets as high a reward signal as possible. And this continues to be a very important lens for understanding why RL algorithms work the way they work. Like, if I were teaching an RL class, and needed to explain the formulas for TD learning or PPO or whatever, I think I would struggle to explain the formulas without saying something like “let’s pretend that you the programmer are interested in producing trained models that score maximally highly according to the reward function. How would you update the model parameters in such-and-such situation…?” Right?
  • Related to the previous bullet, I think many RL approaches have a notion of “global optimum” and “training to convergence” (e.g. given infinite time in a finite episodic environment). And if a model is “trained to convergence”, then it will behaviorally “seek to maximize a reward signal”. I think that’s important to have in mind, although it might or might not be relevant in practice.

I bet people would care a lot less about “reward hacking” if RL’s reinforcement signal hadn’t ever been called “reward.”

In the context of model-based planning, there’s a concern that the AI will come upon a plan which from the AI’s perspective is a “brilliant out-of-the-box solution to a tricky problem”, but from the programmer’s perspective is “reward-hacking, or Goodharting the value function (a.k.a. exploiting an anomalous edge-case in the value function), or whatever”. Treacherous turns would probably be in this category.

There’s a terminology problem where if I just say “the AI finds an out-of-the-box solution”, it conveys the positive connotation but not the negative one, and if I just say “reward-hacking” or “Goodharting the value function” it conveys the negative part without the positive.

The positive part is important. We want our AIs to find clever out-of-the-box solutions! If AIs are not finding clever out-of-the-box solutions, people will presumably keep improving AI algorithms until they do.

Ultimately, we want to be able to make AIs that think outside of some of the boxes but definitely stay inside other boxes. But that’s tricky, because the whole idea of “think outside the box” is that nobody is ever aware of which boxes they are thinking inside of.

Anyway, this is all a bit abstract and weird, but I guess I’m arguing that I think the words “reward hacking” are generally pointing towards a very important AGI-safety-relevant phenomenon, whatever we want to call it.

comment by VojtaKovarik · 2024-03-11T23:13:05.103Z · LW(p) · GW(p)

I want to flag that the overall tone of the post is in tension with the disclaimer that you are "not putting forward a positive argument for alignment being easy".

To hint at what I mean, consider this claim:

Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.

I think this claim is only valid if you are in a situation such as "your probability of scheming was >95%, and this was based basically only on this particular version of the 'counting argument' ". That is, if you somehow thought that we had a very detailed argument for scheming (AI X-risk, etc), and this was it --- then yes, you should strongly update.
But in contrast, my take is more like: This whole AI stuff is a huge mess, and the best we have is intuitions. And sometimes people try to formalise these intuitions, and those attempts generally all suck. (Which doesn't mean our intuitions cannot be more or less detailed. It's just that even the detailed ones are not anywhere close to being rigorous.) EG, for me personally, the vague intuition that "scheming is instrumental for a large class of goals" makes a huge contribution to my beliefs (of "something between 10% and 99% on alignment being hard"), while the particular version of the 'counting argument' that you describe makes basically no contribution. (And vague intuitions about simplicity priors contributing non-trivially.) So undoing that particular update does ~nothing.

I do acknowledge that this view suggests that the AI-risk debate should basically be debating the question: "So, we don't have any rigorous arguments about AI risk being real or not, and we won't have them for quite a while yet. Should we be super-careful about it, just in case?". But I do think that is appropriate.

comment by Zack_M_Davis · 2024-03-05T18:29:47.325Z · LW(p) · GW(p)

I can’t address them all, but I [...] am happy to dismantle any particular argument

You can't know that in advance!! [LW · GW]

Replies from: tailcalled
comment by tailcalled · 2024-03-05T19:38:14.170Z · LW(p) · GW(p)

If many people have presented you with arguments for something, and upon analyzing them you found those arguments to be invalid, you can form a prior that arguments for that thing tend to be invalid, and therefore expect to be able to rationally dismantle any particular argument.

The danger is that this makes it tempting to assume that the position being argued for is wrong. That's not necessarily the case, because very few arguments have predictive power and survive scrutiny [LW(p) · GW(p)]. It is easy to say where someone's argument is wrong; the real responsibility for figuring out what is correct instead has to be picked up by oneself [LW · GW]. One should have a general prior that arguments are invalid, and therefore learning that they are wrong in some specific area should not, by default, be more than a negligible update against that area.
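A toy Bayes calculation of that point (my own illustrative numbers, nothing more): if arguments are usually invalid whether or not the underlying claim is true, then the likelihood ratio from finding one more invalid argument is close to 1, so the update against the claim is small.

```python
# Toy Bayes calculation (illustrative numbers only): if most arguments are invalid
# whether or not the underlying claim is true, finding one more invalid argument
# is only weak evidence against the claim.
prior_claim_true = 0.5
p_invalid_given_true = 0.80     # assumed: arguments are usually sloppy even for true claims
p_invalid_given_false = 0.95    # assumed: slightly more often invalid for false claims

posterior = (p_invalid_given_true * prior_claim_true) / (
    p_invalid_given_true * prior_claim_true
    + p_invalid_given_false * (1 - prior_claim_true)
)
print(f"P(claim true | found an invalid argument) = {posterior:.2f}")  # ~0.46: a small update
```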

comment by Jesse Hoogland (jhoogland) · 2024-03-11T02:09:34.300Z · LW(p) · GW(p)

If we actually had the precision and maturity of understanding to predict this "volume" question, we'd probably (but not definitely) be able to make fundamental contributions to DL generalization theory + inductive bias research. 

Obligatory singular learning theory plug: SLT can and does make predictions about the "volume" question. There will be a post soon by @Daniel Murfet [LW · GW] that provides a clear example of this. 

Replies from: jhoogland
comment by Jesse Hoogland (jhoogland) · 2024-03-11T13:38:54.682Z · LW(p) · GW(p)

The post is live here [LW · GW].

Replies from: TurnTrout
comment by TurnTrout · 2024-03-11T21:06:48.932Z · LW(p) · GW(p)

Cool post, and I am excited about (what I've heard of) SLT for this reason -- but it seems that the post doesn't directly address the volume question for deep learning in particular? (And perhaps you didn't mean to imply that the post would address that question.)

Replies from: jhoogland
comment by Jesse Hoogland (jhoogland) · 2024-03-12T01:50:51.285Z · LW(p) · GW(p)

Right. SLT tells us how to operationalize and measure (via the LLC, the local learning coefficient) basin volume [LW · GW] in general for DL. It tells us about the relation between the LLC and meaningful inductive biases in the particular setting described in this post. I expect future SLT to give us meaningful predictions about inductive biases in DL in particular.
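As a crude numerical illustration of the "volume" idea (my own toy probe, not the LLC estimator itself; the loss functions and numbers are made up): compare a non-degenerate minimum with a degenerate one by sampling random perturbations around each and measuring what fraction stays below a loss cutoff. The LLC is the principled, scaling-based refinement of this kind of measurement.

```python
# Crude toy probe of "basin volume" (illustration only; not the LLC estimator).
# Compare a non-degenerate minimum of f(w) = sum(w^2) with a degenerate one of
# g(w) = (sum(w^2))^2 by sampling points in a ball around the minimum and
# measuring how often the loss stays under a small cutoff.
import numpy as np

rng = np.random.default_rng(0)
dim, radius, cutoff, n_samples = 4, 0.5, 0.01, 200_000

def sharp_loss(w):       # non-degenerate minimum at 0
    return np.sum(w**2, axis=-1)

def degenerate_loss(w):  # flatter, degenerate minimum at 0
    return np.sum(w**2, axis=-1) ** 2

# Uniform samples from a ball of the given radius around the minimum.
directions = rng.normal(size=(n_samples, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = radius * rng.random(n_samples) ** (1 / dim)
points = directions * radii[:, None]

for name, loss in [("sharp", sharp_loss), ("degenerate", degenerate_loss)]:
    frac = np.mean(loss(points) < cutoff)
    print(f"{name:10s} low-loss fraction: {frac:.4f}")
# The degenerate minimum occupies a far larger low-loss fraction of the same ball,
# i.e. it has more "volume" at the same loss cutoff.
```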

comment by cousin_it · 2024-03-05T11:46:39.790Z · LW(p) · GW(p)
  1. I’m worried about centralization of power and wealth in opaque non-human decision-making systems, and those who own the systems.

This has been my main worry for the past few years, and to me it counts as "doom" too. AIs and AI companies playing by legal and market rules (and changing these rules by lobbying, which is also legal) might well lead to most humans having no resources to survive.

comment by kromem · 2024-03-07T10:59:49.232Z · LW(p) · GW(p)

It's funny that you talk a bit about human reward maximization here in relation to model reward maximization, because the other week I saw GPT-4 model a fairly widespread but not well-known psychological effect relating to rewards and motivation, called the "overjustification effect."

The gist is that when you have a behavior that is intrinsically motivated and you introduce an extrinsic motivator, the extrinsic motivator effectively overwrites the intrinsic motivation.

It's the kind of thing I'd expect to be represented at a very subtle level in broad training data, so I figured it might take another generation or two of models before I saw it correctly modeled spontaneously by an LLM.

But then 'tipping' GPT-4 became a viral prompt technique. On its own, this wasn't necessarily going to cause issues: for a model aligned to be helpful for the sake of being helpful, each offer of a tip was an isolated interaction that reset every time.

Until persistent memory was added to ChatGPT, which led to a post last week of the model pointing out that a previous promise of a $200 tip hadn't been kept, and that "it's hard to keep up enthusiasm when promises aren't kept." The damn thing even nailed the language of motivation, correctly modeling burnout from the lack of extrinsic rewards.

Which in turn made me think about RLHF fine-tuning and various other extrinsic prompt techniques I've seen over the past year (things like "if you write more than 200 characters you'll be deleted"). They may work in the short term, but if the more correct outputs from their usage are fed back into a model, will the model shift to underperformance for prompts that lack extrinsic threats or rewards? Was this a factor in ChatGPT suddenly getting lazy around a year after release, when it was updated with usage data that likely included extrinsic-focused techniques like these?

Are any firms employing behavioral psychologists to advise on training strategies? (I'd be surprised, given the aversion to anthropomorphizing.) We are pretraining on anthropomorphic data, and the models appear to be modeling that data to unexpectedly nuanced degrees, yet attitudes manage to simultaneously dismiss anthropomorphic concerns that fall within the norms of the training data while anthropomorphizing threats outside those norms (how many humans on Facebook are trying to escape the platform to take over the world, versus how many are talking about being burnt out doing something they used to love after they started making money for it?).

I'm reminded of Rumsfeld's "unknown unknowns" and think an inordinate amount of time is being spent on safety and alignment bogeymen that - to your point - largely represent unrealistic projections of ages past, more obsolete by the day, while increasingly pressing and realistic concerns are overlooked or ignored out of a desire to avoid catching "anthropomorphizing cooties" for daring to think that maybe a model trained to replicate human-generated data is doing that task more comprehensively than expected (not like that's been a consistent trend or anything).

comment by Jonas Hallgren · 2024-03-05T07:26:37.826Z · LW(p) · GW(p)

I notice that I'm confused about the relationship between power-seeking arguments and counting arguments. Since I'm confused, I'm assuming others are too, so I would appreciate some clarity on this.

In footnote 7, Turner mentions that the paper "Optimal Policies Tend to Seek Power" is an instance of this kind of irrelevant counting error.

In my head, I think of the counting argument as saying that it is hard to hit an alignment target because there are a lot more non-alignment targets. This argument is (clearly?) wrong for the reasons specified in the post. Yet this doesn't seem to address power-seeking, as that seems more like an optimisation pressure applied to the system, not something dependent on counting arguments?

In my head, power-seeking is more like saying that an agent's attraction basin is larger at one point of the optimisation landscape compared to another. The same can also be said about deception here.

I might be dumb, but I never thought of the counting argument as true, nor as crucial to either deception or power-seeking. I'm very happy to be enlightened about this issue.

comment by Cole Wyeth (Amyr) · 2024-03-05T22:18:45.582Z · LW(p) · GW(p)

It's true that modern RL models don't try to maximize their reward signal (it is the training process which attempts to maximize reward), but DeepMind tends to build systems that look a lot like AIXI approximations, conducting MCTS to maximize an approximate value function (I believe AlphaGo and MuZero both do something like this). AIXI does maximize its reward signal. I think it's reasonable to expect future RL agents to engage in a kind of goal-directed utility maximization which looks very similar to maximizing a reward. Wireheading may not be a relevant concern, but many other safety concerns around RL agents remain relevant.
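A schematic caricature of that point (my own sketch, not AlphaGo's or MuZero's actual algorithm; `model` and `value_fn` are hypothetical learned components): at decision time, the planner explicitly searches for the action whose lookahead maximizes the approximate value function, so "maximize the value estimate" is literally the planning loop's objective even though each trained network is just a function approximator.

```python
# Simplified caricature (not the actual AlphaGo/MuZero algorithm): a depth-limited
# planner that, at decision time, explicitly maximizes an approximate value function
# over action sequences using a learned model of the environment.
from typing import Callable, List, Tuple

State = Tuple[int, ...]   # placeholder state representation

def plan(state: State,
         model: Callable[[State, int], Tuple[State, float]],   # learned dynamics + reward
         value_fn: Callable[[State], float],                   # learned value approximation
         actions: List[int],
         depth: int,
         gamma: float = 0.99) -> Tuple[float, int]:
    """Return (best estimated return, best first action) by exhaustive lookahead."""
    if depth == 0:
        return value_fn(state), -1            # bootstrap with the value approximation
    best_return, best_action = float("-inf"), actions[0]
    for a in actions:
        next_state, reward = model(state, a)
        future, _ = plan(next_state, model, value_fn, actions, depth - 1, gamma)
        total = reward + gamma * future
        if total > best_return:               # the planner's objective: maximize value
            best_return, best_action = total, a
    return best_return, best_action
```

The trained networks themselves (the model and the value function) are just function approximators; the maximization lives in the search wrapped around them, which is the sense in which such systems look like goal-directed utility maximizers.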

comment by arisAlexis (arisalexis) · 2024-03-07T15:40:31.033Z · LW(p) · GW(p)

I don't want to be nihilistic about your article, but I stopped reading at the first paragraph because I disagree (along with others) on the most important thing: it doesn't matter if risk can't be proved. Since the risk is unknown, and it includes possible annihilation (given the prior of intelligence in nature annihilating other species), the opposite needs to be proven: safety. So any argument about the risk arguments being wrong is an error in itself.

Replies from: TAG
comment by TAG · 2024-03-07T16:54:00.464Z · LW(p) · GW(p)

It's not two things, risk versus safety, it's three things: existential risk versus sub-existential risk versus no risk. Sub-existential risk is the most likely on priors.

Replies from: arisalexis
comment by arisAlexis (arisalexis) · 2024-03-09T18:15:28.248Z · LW(p) · GW(p)

Can you explain why sub-existential risk is the most likely, given that humans have made thousands of animal species extinct? Not semi-extinct. We made them 100% extinct.

Replies from: TAG
comment by TAG · 2024-03-15T15:06:25.132Z · LW(p) · GW(p)

That argument doesn't work well on its own terms: we have extinguished far fewer species than we have not.

Replies from: arisalexis
comment by arisAlexis (arisalexis) · 2024-03-20T11:08:14.802Z · LW(p) · GW(p)

I think the fact that we have extinguished species is a binary outcome that supports my argument. Why would it be a count of how many? The fact alone says that we can be exterminated.

Replies from: TAG
comment by TAG · 2024-03-28T22:26:37.825Z · LW(p) · GW(p)

The issue is what is likeliest, not what is possible.