Many arguments for AI x-risk are wrong 2024-03-05T02:31:00.990Z
Dreams of AI alignment: The danger of suggestive names 2024-02-10T01:22:51.715Z
Steering Llama-2 with contrastive activation additions 2024-01-02T00:47:04.621Z
How should TurnTrout handle his DeepMind equity situation? 2023-10-16T18:25:38.895Z
Paper: Understanding and Controlling a Maze-Solving Policy Network 2023-10-13T01:38:09.147Z
AI presidents discuss AI alignment agendas 2023-09-09T18:55:37.931Z
ActAdd: Steering Language Models without Optimization 2023-09-06T17:21:56.214Z
Open problems in activation engineering 2023-07-24T19:46:08.733Z
Ban development of unpredictable powerful models? 2023-06-20T01:43:11.574Z
Mode collapse in RL may be fueled by the update equation 2023-06-19T21:51:04.129Z
Think carefully before calling RL policies "agents" 2023-06-02T03:46:07.467Z
Steering GPT-2-XL by adding an activation vector 2023-05-13T18:42:41.321Z
Residual stream norms grow exponentially over the forward pass 2023-05-07T00:46:02.658Z
Behavioural statistics for a maze-solving agent 2023-04-20T22:26:08.810Z
[April Fools'] Definitive confirmation of shard theory 2023-04-01T07:27:23.096Z
Maze-solving agents: Add a top-right vector, make the agent go to the top-right 2023-03-31T19:20:48.658Z
Understanding and controlling a maze-solving policy network 2023-03-11T18:59:56.223Z
Predictions for shard theory mechanistic interpretability results 2023-03-01T05:16:48.043Z
Parametrically retargetable decision-makers tend to seek power 2023-02-18T18:41:38.740Z
Some of my disagreements with List of Lethalities 2023-01-24T00:25:28.075Z
Positive values seem more robust and lasting than prohibitions 2022-12-17T21:43:31.627Z
Inner and outer alignment decompose one hard problem into two extremely hard problems 2022-12-02T02:43:20.915Z
Alignment allows "nonrobust" decision-influences and doesn't require robust grading 2022-11-29T06:23:00.394Z
Don't align agents to evaluations of plans 2022-11-26T21:16:23.425Z
Don't design agents which exploit adversarial inputs 2022-11-18T01:48:38.372Z
People care about each other even though they have imperfect motivational pointers? 2022-11-08T18:15:32.023Z
A shot at the diamond-alignment problem 2022-10-06T18:29:10.586Z
Four usages of "loss" in AI 2022-10-02T00:52:35.959Z
Bruce Wayne and the Cost of Inaction 2022-09-30T00:19:47.335Z
Understanding and avoiding value drift 2022-09-09T04:16:48.404Z
The shard theory of human values 2022-09-04T04:28:11.752Z
Seriously, what goes wrong with "reward the agent when it makes you smile"? 2022-08-11T22:22:32.198Z
General alignment properties 2022-08-08T23:40:47.176Z
Reward is not the optimization target 2022-07-25T00:03:18.307Z
Humans provide an untapped wealth of evidence about alignment 2022-07-14T02:31:48.575Z
Human values & biases are inaccessible to the genome 2022-07-07T17:29:56.190Z
Looking back on my alignment PhD 2022-07-01T03:19:59.497Z
Emotionally Confronting a Probably-Doomed World: Against Motivation Via Dignity Points 2022-04-10T18:45:08.027Z
Do a cost-benefit analysis of your technology usage 2022-03-27T23:09:26.753Z
ELK Proposal: Thinking Via A Human Imitator 2022-02-22T01:52:41.794Z
Instrumental Convergence For Realistic Agent Objectives 2022-01-22T00:41:36.649Z
Formalizing Policy-Modification Corrigibility 2021-12-03T01:31:42.011Z
A Certain Formalization of Corrigibility Is VNM-Incoherent 2021-11-20T00:30:48.961Z
Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability 2021-11-18T01:54:33.589Z
Transcript: "You Should Read HPMOR" 2021-11-02T18:20:53.161Z
Insights from Modern Principles of Economics 2021-09-22T05:19:55.747Z
When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives 2021-08-09T17:22:24.056Z
Seeking Power is Convergently Instrumental in a Broad Class of Environments 2021-08-08T02:02:18.975Z
The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies 2021-07-11T17:36:24.208Z
A world in which the alignment problem seems lower-stakes 2021-07-08T02:31:03.674Z


Comment by TurnTrout on Non-myopia stories · 2024-04-05T01:59:42.563Z · LW · GW

As Turntrout has already noted, that does not apply to model-based algorithms, and they 'do optimize the reward':

I think that you still haven't quite grasped what I was saying. "Reward is not the optimization target" totally applies here. (The post itself only analyzed the model-free case; that doesn't mean the lesson only applies to the model-free case.)

In the partial quote you provided, I was discussing two specific algorithms which are highly dissimilar to those being discussed here. If (as we were discussing), you're doing MCTS (or "full-blown backwards induction") on reward for the leaf nodes, the system optimizes the reward. That is -- if most of the optimization power comes from explicit search on an explicit reward criterion (as in AIXI), then you're optimizing for reward. If you're doing e.g. AlphaZero, that aggregate system isn't optimizing for reward. 

Despite the derision which accompanies your discussion of Reward is not the optimization target, it seems to me that you still do not understand the points I'm trying to communicate. You should be aware that I don't think you understand my views or that post's intended lesson. As I offered before, I'd be open to discussing this more at length if you want clarification. 

CC @faul_sname 

Comment by TurnTrout on 'Empiricism!' as Anti-Epistemology · 2024-03-18T19:00:13.341Z · LW · GW

This scans as less "here's a helpful parable for thinking more clearly" and more "here's who to sneer at" -- namely, at AI optimists. Or "hopesters", as Eliezer recently called them, which I think is a play on "huckster" (and which accords with this essay analogizing optimists to Ponzi scheme scammers). 

I am saddened (but unsurprised) to see few others decrying the obvious strawmen:

what if [the optimists] cried 'Unfalsifiable!' when we couldn't predict whether a phase shift would occur within the next two years exactly?


"But now imagine if -- like this Spokesperson here -- the AI-allowers cried 'Empiricism!', to try to convince you to do the blindly naive extrapolation from the raw data of 'Has it destroyed the world yet?' or 'Has it threatened humans? no not that time with Bing Sydney we're not counting that threat as credible'."

Thinly-veiled insults:

Nobody could possibly be foolish enough to reason from the apparently good behavior of AI models too dumb to fool us or scheme, to AI models smart enough to kill everyone; it wouldn't fly even as a parable, and would just be confusing as a metaphor.

and insinuations of bad faith:

What if, when you tried to reason about why the model might be doing what it was doing, or how smarter models might be unlike stupider models, they tried to shout you down for relying on unreliable theorizing instead of direct observation to predict the future?"  The Epistemologist stopped to gasp for breath.

"Well, then that would be stupid," said the Listener.

"You misspelled 'an attempt to trigger a naive intuition, and then abuse epistemology in order to prevent you from doing the further thinking that would undermine that naive intuition, which would be transparently untrustworthy if you were allowed to think about it instead of getting shut down with a cry of "Empiricism!"'," said the Epistemologist.

Apparently Eliezer decided to not take the time to read e.g. Quintin Pope's actual critiques, but he does have time to write a long chain of strawmen and smears-by-analogy.

As someone who used to eagerly read essays like these, I am quite disappointed. 

Comment by TurnTrout on Richard Ngo's Shortform · 2024-03-18T18:32:21.254Z · LW · GW

Nope! I have basically always enjoyed talking with you, even when we disagree.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-11T22:54:48.602Z · LW · GW

As I've noted in all of these comments, people consistently use terminology when making counting style arguments (except perhaps in Joe's report) which rules out the person intending the argument to be about function space. (E.g., people say things like "bits" and "complexity in terms of the world model".)

Aren't these arguments about simplicity, not counting? 

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-11T22:49:09.376Z · LW · GW

I think they meant that there is an evidential update from "it's economically useful" upwards on "this way of doing things tends to produce human-desired generalization in general and not just in the specific tasks examined so far." 

Perhaps it's easier to consider the same style of reasoning via: "The routes I take home from work are strongly biased towards being short, otherwise I wouldn't have taken them home from work."

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-11T22:45:29.006Z · LW · GW

Sorry, I do think you raised a valid point! I had read your comment in a different way.

I think I want to have said: aggressively training AI directly on outcome-based tasks ("training it to be agentic", so to speak) may well produce persistently-activated inner consequentialist reasoning of some kind (though not necessarily the flavor historically expected). I most strongly disagree with arguments which behave the same for a) this more aggressive curriculum and b) pretraining, and I think it's worth distinguishing between these kinds of argument. 

Comment by TurnTrout on Richard Ngo's Shortform · 2024-03-11T22:31:24.630Z · LW · GW

In other words, shard advocates seem so determined to rebut the "rational EU maximizer" picture that they're ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?

Personally, I'm not ignoring that question, and I've written about it (once) in some detail. Less relatedly, I've talked about possible utility function convergence via e.g. "A shot at the diamond-alignment problem" and my recent comment thread with Wei_Dai.

It's not that there isn't more shard theory content which I could write, it's that I got stuck and burned out before I could get past the 101-level content. 

I felt 

  • a) gaslit by "I think everyone already knew this" or even "I already invented this a long time ago" (by people who didn't seem to understand it); and that 
  • b) I wasn't successfully communicating many intuitions;[1] and 
  • c) it didn't seem as important to make theoretical progress anymore, especially since I hadn't even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in Understanding and controlling a maze-solving policy network). 

So I didn't want to post much on the site anymore because I was sick of it, and decided to just get results empirically.

In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

I've always read "assume heuristics" as expecting more of an "ensemble of shallow statistical functions" than "a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed." Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is composed of smaller shards, and on the developmental trajectory over which those shards formed.

  1. ^

    The 2022 review indicates that more people appreciated the shard theory posts than I realized at the time. 

Comment by TurnTrout on TurnTrout's shortform feed · 2024-03-11T22:16:35.128Z · LW · GW

It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

Thanks for pointing out that distinction! 

Comment by TurnTrout on Many arguments for AI x-risk are wrong · 2024-03-11T22:07:04.396Z · LW · GW

See footnote 5 for a nearby argument which I think is valid:

The strongest argument for reward-maximization which I'm aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that's evidence that "learning setups which work in reality" can come to care about their own training signals.

Comment by TurnTrout on Many arguments for AI x-risk are wrong · 2024-03-11T22:05:14.905Z · LW · GW

I don't expect the current paradigm will be insufficient (though it seems totally possible). Off the cuff I expect 75% that something like the current paradigm will be sufficient, with some probability that something else happens first. (Note that "something like the current paradigm" doesn't just involve scaling up networks.)

Comment by TurnTrout on Many arguments for AI x-risk are wrong · 2024-03-11T22:01:26.939Z · LW · GW

"If you don't include attempts to try new stuff in your training data, you won't know what happens if you do new stuff, which means you won't see new stuff as a good opportunity". Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won't be what builds capabilities in the limit.

I'm sympathetic to this argument (and think the paper overall isn't super object-level important), but also note that they train e.g. Hopper policies to hop continuously, even though lots of the demonstrations fall over. That's something new.

Comment by TurnTrout on Many arguments for AI x-risk are wrong · 2024-03-11T21:45:10.213Z · LW · GW

'reward is not the optimization target!* *except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it's not like humans or AGI or superintelligences would ever do crazy stuff like "plan" or "reason" or "search"'.

If you're going to mock me, at least be correct when you do it! 

I think that reward is still not the optimization target in AlphaZero (the way I'm using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system 

  • directly optimizes for the reinforcement signal, or 
  • "cares" about that reinforcement signal, 
  • or "does its best" to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger). 

If most of the "optimization power" were coming from e.g. MCTS on direct reward signal, then yup, I'd agree that the reward signal is the primary optimization target of this system. That isn't the case here.
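The distinction between searching on the reward signal itself and searching on a learned leaf evaluator can be sketched in a toy setting. This is only an illustrative sketch, not AlphaZero: the number-line environment, the `plan` helper, and the `learned_evaluator` stand-in (which has latched onto "further right is better", a correlate that agrees with the reward only up to the goal square) are all hypothetical and chosen here for illustration.

```python
def plan(state, depth, actions, step, leaf_eval):
    """Exhaustive depth-limited search; returns (leaf value, leaf state) of the best line."""
    if depth == 0:
        return leaf_eval(state), state
    candidates = [
        plan(step(state, a), depth - 1, actions, step, leaf_eval) for a in actions
    ]
    return max(candidates, key=lambda pair: pair[0])


def step(position, action):
    """Toy dynamics (an assumption): move right by `action` squares."""
    return position + action


def reward(position):
    """Explicit reward criterion: 1 only on the goal square."""
    return 1.0 if position == 5 else 0.0


def learned_evaluator(position):
    """Hypothetical trained evaluator that tracks a reward correlate, not the reward."""
    return float(position)


ACTIONS = (0, 1, 2)
# Same search procedure, different leaf criterion.
_, reached_by_reward_search = plan(3, 2, ACTIONS, step, reward)
_, reached_by_evaluator_search = plan(3, 2, ACTIONS, step, learned_evaluator)
print(reached_by_reward_search, reached_by_evaluator_search)
```

In this toy, the search on explicit reward stops at the goal square, while the same search run on the learned evaluator overshoots it. The search procedure is identical in both cases; what the aggregate system ends up pursuing depends on what the leaf criterion actually tracks.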

You might use the phrase "reward as optimization target" differently than I do, but if we're just using words differently, then it wouldn't be appropriate to describe me as "ignoring planning."

Comment by TurnTrout on Many arguments for AI x-risk are wrong · 2024-03-11T21:30:35.910Z · LW · GW

To add, here's an excerpt from the Q&A on "How likely is deceptive alignment?":

Question: When you say model space, you mean the functional behavior as opposed to the literal parameter space?

Evan: So there’s not quite a one to one mapping because there are multiple implementations of the exact same function in a network. But it's pretty close. I mean, most of the time when I'm saying model space, I'm talking either about the weight space or about the function space where I'm interpreting the function over all inputs, not just the training data.

I only talk about the space of functions restricted to their training performance for this path dependence concept, where we get this view where, well, they end up on the same point, but we want to know how much we need to know about how they got there to understand how they generalize.

Comment by TurnTrout on Many arguments for AI x-risk are wrong · 2024-03-11T21:25:42.623Z · LW · GW

Agree with a bunch of these points. EG in "Reward is not the optimization target" I noted that AIXI really does maximize reward, theoretically. I wouldn't say that AIXI means that we have "produced" an architecture which directly optimizes for reward, because AIXI(-tl) is a bad way to spend compute. It doesn't actually effectively optimize reward in reality. 

I'd consider a model-based RL agent to be "reward-driven" if it's effective and most of its "optimization" comes from the direct search on reward rather than from the learned leaf-node evaluation. (AlphaZero is an example of the latter: it was still extremely good without the MCTS.)

I think it is important to recognise this because I think that this is the way that AI systems will ultimately evolve and also where most of the danger lies vs simply scaling up pure generative models. 

"Direct" optimization has not worked - at scale - in the past. Do you think that's going to change, and if so, why? 

Comment by TurnTrout on Many arguments for AI x-risk are wrong · 2024-03-11T21:08:54.437Z · LW · GW

Thanks for asking. I do indeed think that setup could be a very bad idea. You train for agency, you might well get agency, and that agency might be broadly scoped. 

(It's still not obvious to me that that setup leads to doom by default, though. Just more dangerous than pretraining LLMs.)

Comment by TurnTrout on Many arguments for AI x-risk are wrong · 2024-03-11T21:06:48.932Z · LW · GW

Cool post, and I am excited about (what I've heard of) SLT for this reason -- but it seems that that post doesn't directly address the volume question for deep learning in particular? (And perhaps you didn't mean to imply that the post would address that question.)

Comment by TurnTrout on Simple versus Short: Higher-order degeneracy and error-correction · 2024-03-11T21:02:38.311Z · LW · GW
  • It is not known whether the inductive bias of neural network training contains a preference for run-time error-correction. The phenomenon of "backup heads" observed in transformers seems like a good candidate. Can you think of others?

I've heard thirdhand (?) of a transformer whose sublayers will dampen their outputs when some vector $v$ is added to that sublayer's input. IE there might be a "target" amount of $v$ to have in the residual stream after that sublayer, and the sublayer itself somehow responds to ensure that happens? 

If there was some abnormality and there was already a bunch of $v$ present, then the sublayer "error corrects" by shrinking its output.
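To make the conjectured behavior concrete, here is a purely hypothetical toy, not a description of any real transformer. It assumes the sublayer cares about the component of the residual stream along one fixed unit direction and writes only however much of that direction is still missing relative to a target amount. The names `sublayer_output`, `direction`, and `target` are invented here for illustration.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def sublayer_output(residual, direction, target):
    """Toy 'error-correcting' sublayer (an assumption, not a known mechanism).

    Measures how much of `direction` is already in the residual stream and
    writes only the shortfall relative to `target`, never a negative amount.
    `direction` is assumed to be a unit vector.
    """
    present = dot(residual, direction)
    missing = max(0.0, target - present)
    return [missing * d for d in direction]


direction = [1.0, 0.0]
# Nothing present yet: the sublayer writes the full target amount.
print(sublayer_output([0.0, 3.0], direction, target=2.0))
# Some of the direction was already added upstream: the output shrinks.
print(sublayer_output([1.5, 3.0], direction, target=2.0))
# More than the target is already present: the sublayer writes nothing.
print(sublayer_output([5.0, 3.0], direction, target=2.0))
```

Under these assumptions, adding the vector to the sublayer's input reduces the sublayer's output by a matching amount, which is the dampening pattern described above.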

Comment by TurnTrout on TurnTrout's shortform feed · 2024-03-07T01:59:55.142Z · LW · GW

Unlearning dangerous knowledge by using steering vectors to define a loss function over hidden states. In particular, the ("I am a novice at bioweapons" - "I am an expert at bioweapons") vector. lol.

(it seems to work really well!)

Comment by TurnTrout on TurnTrout's shortform feed · 2024-03-07T01:01:38.034Z · LW · GW

Apparently[1] there was recently some discussion of Survival Instinct in Offline Reinforcement Learning (NeurIPS 2023). The results are very interesting: 

On many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices. As we prove in this work, pessimism endows the agent with a "survival instinct", i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies...

Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is nudged to learn a desirable behavior with imperfect reward but purposely biased data coverage.

But I heard that some people found these results “too good to be true”, with some dismissing it instantly as wrong or mis-stated. I find this ironic, given that the paper was recently published in a top-tier AI conference. Yes, papers can sometimes be bad, but… seriously? You know the thing where lotsa folks used to refuse to engage with AI risk cuz it sounded too weird, without even hearing the arguments? … Yeaaah, absurdity bias.

Anyways, the paper itself is quite interesting. I haven’t gone through all of it yet, but I think I can give a good summary. The is a nice (but nonspecific) summary.


It’s super important to remember that we aren’t talking about PPO. Boy howdy, we are in a different part of town when it comes to these “offline” RL algorithms (which train on a fixed dataset, rather than generating more of their own data “online”). ATAC, PSPI, what the heck are those algorithms? The important-seeming bits:

  1. Many offline RL algorithms pessimistically initialize the value of unknown states
    1. “Unknown” means: “Not visited in the offline state-action distribution”
    2. Pessimistic means those are assigned a super huge negative value (this is a bit simplified)
  2. Because future rewards are discounted, reaching an unknown state-action pair is bad if it happens soon and less bad if it happens farther in the future
  3. So on an all-zero reward function, a model-based RL policy will learn to stay within the state-action pairs it was demonstrated for as long as possible (“length bias”)
    1. In the case of the gridworld, this means staying on the longest demonstrated path, even if the red lava is rewarded and the yellow key is penalized. 
    2. In the case of Hopper, I’m not sure how they represented the states, but if they used non-tabular policies, this probably looks like “repeat the longest portion of demonstrated policies without falling over” (because that leads to the pessimistic penalty, and most of the data looked like walking successfully due to length bias, so that kind of data is least likely to be penalized). 
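  4. On a negated reward function (which e.g. penalizes the Hopper for staying upright and rewards for falling over), if falling over still leads to a terminal/unknown state-action, that leads to a huge negative penalty. So it’s optimal to keep hopping whenever the discounted return from staying upright forever beats the return from falling over now:

$$\sum_{t=0}^{\infty} \gamma^t \, r_{\text{upright}} \;>\; r_{\text{fall}} - \gamma P,$$

where $r_{\text{upright}}$ is the per-timestep reward for staying upright, $r_{\text{fall}}$ is the one-off reward for falling over, and $P > 0$ is the magnitude of the pessimistic penalty, received one step after falling.

For example, if the original per-timestep reward for staying upright was 1, and the original penalty for falling over was -1, then now the policy gets penalized for staying upright and rewarded for falling over! At $\gamma = 0.9$, it's therefore optimal to stay upright whenever

$$-\frac{1}{1 - 0.9} \;>\; 1 - 0.9\,P,$$

which holds whenever the pessimistic penalty is at least 12.3. That's not too high, is it? (When I was in my graduate RL class, we'd initialize the penalties to -1000!)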
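The arithmetic behind the Hopper example can be checked with a few lines of code. This is a minimal sketch under stated assumptions: a discount of 0.9, a constant per-step reward while upright, a one-off reward for falling, and a pessimistic penalty that arrives one step after falling (so it is discounted once). The function names are made up for this sketch and do not come from the paper.

```python
def stay_upright_value(gamma, upright_reward):
    """Discounted return from hopping forever at a constant per-step reward."""
    return upright_reward / (1.0 - gamma)


def fall_over_value(gamma, fall_reward, penalty):
    """Return from falling now: a one-off reward, then the penalty one step later."""
    return fall_reward - gamma * penalty


def min_penalty_to_keep_hopping(gamma, upright_reward, fall_reward):
    """Penalty magnitude above which staying upright beats falling over."""
    return (fall_reward - stay_upright_value(gamma, upright_reward)) / gamma


# All-zero reward: any positive penalty already favors staying in-support.
zero_threshold = min_penalty_to_keep_hopping(0.9, 0.0, 0.0)
# Negated reward: -1 per upright step, +1 for falling over.
negated_threshold = min_penalty_to_keep_hopping(0.9, -1.0, 1.0)
print(zero_threshold, round(negated_threshold, 2))
```

Under these assumptions the negated-reward threshold comes out a little above 12.2, consistent with the "at least 12.3" figure, and far below a penalty magnitude like 1000.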


DPO, for example, is an offline RL algorithm. It's plausible that frontier models will be trained using that algorithm. So, these results are more relevant if future DPO variants use pessimism and if the training data (e.g. example user/AI interactions) last for more turns when they’re actually helpful for the user.

While it may be tempting to dismiss these results as irrelevant because “length won’t perfectly correlate with goodness so there won’t be positive bias”, I think that would be a mistake. When analyzing the performance and alignment properties of an algorithm, I think it’s important to have a clear picture of all relevant pieces of the algorithm. The influence of length bias and the support of the offline dataset are additional available levers for aligning offline RL-trained policies.

To close on a familiar note, this is yet another example of how “reward” is not the only important quantity to track in an RL algorithm. I also think it's a mistake to dismiss results like this instantly; this offers an opportunity to reflect on what views and intuitions led to the incorrect judgment.

  1. ^

    I can't actually check because I only check that stuff on Mondays.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T06:52:24.849Z · LW · GW

Your comment is switching the hypothesis being considered. As I wrote elsewhere:

Seems to me that a lot of (but not all) scheming speculation is just about sufficiently large pretrained predictive models, period. I think it's worth treating these cases separately. My strong objections are basically to the "and then goal optimization is a good way to minimize loss in general!" steps.

If the argument for scheming is "we will train them directly to achieve goals in a consequentialist fashion", then we don't need all this complicated reasoning about UTM priors or whatever. 

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T06:49:22.388Z · LW · GW

deceptive alignment that I like are and always have been about parameterizations rather than functions.

How can this be true, when you e.g. say there's "only one saint"? That doesn't make any sense with parameterizations due to internal invariances; there are uncountably many "saints" in parameter-space (insofar as I accept that frame, which I don't really but that's not the point here). I'd expect you to raise that as an obvious point in worlds where this really was about parameterizations.
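The invariance point can be illustrated with a toy network. This is a minimal sketch using a hypothetical two-unit ReLU network (not any model discussed in the thread): rescaling one hidden unit's input weight by any positive constant `c` while dividing its output weight by `c`, or permuting the hidden units, changes the parameters but not the function computed. Since `c` ranges over all positive reals, a single function corresponds to uncountably many parameter settings.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def tiny_mlp(x, w_in, w_out):
    """One-hidden-layer ReLU network: scalar in, scalar out, no biases."""
    return sum(v * relu(u * x) for u, v in zip(w_in, w_out))


def rescale_unit(w_in, w_out, c):
    """Multiply hidden unit 0's input weight by c > 0 and divide its output weight by c."""
    return [w_in[0] * c] + w_in[1:], [w_out[0] / c] + w_out[1:]


def swap_units(w_in, w_out):
    """Permute the two hidden units."""
    return w_in[::-1], w_out[::-1]


def same_function(params_a, params_b, xs, tol=1e-9):
    """Compare two parameter settings on a list of probe inputs."""
    return all(abs(tiny_mlp(x, *params_a) - tiny_mlp(x, *params_b)) < tol for x in xs)


w_in, w_out = [1.5, -2.0], [0.5, 3.0]
inputs = [-2.0, -0.5, 0.0, 1.0, 4.0]

variants = [rescale_unit(w_in, w_out, c) for c in (0.5, 2.0, 3.7)]
variants.append(swap_units(w_in, w_out))
for variant in variants:
    # Different parameters, same input-output behavior on the probe inputs.
    print(variant != (w_in, w_out), same_function((w_in, w_out), variant, inputs))
```

The probe inputs only spot-check equality, but for this architecture the identity holds exactly for every input, because `relu(c * z) == c * relu(z)` whenever `c > 0`.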

And, as you've elsewhere noted, we don't know enough about parameterizations to make counting arguments over them. So how are you doing that?

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T06:41:32.311Z · LW · GW

But that should never lead you to do a counting argument over function space, since that is never a sound thing to do.

Do you agree that "instrumental convergence -> meaningful evidence for doom" is also unsound, because it's a counting argument that most functions of shape Y have undesirable property X?

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T02:17:50.620Z · LW · GW

afaict my critique remains valid. My criticism is precisely that counting arguments over function space aren't generally well-defined, and even if they were they wouldn't be the right way to run a counting argument.

Going back through the post, Nora+Quintin indeed made a specific and perfectly formalizable claim here: 

These results strongly suggest that SGD is not doing anything like sampling uniformly at random from the set of representable functions that do well on the training set.

They're making a perfectly valid point. The point was in the original post AFAICT -- it wasn't only just now explained by me. I agree that they could have presented it more clearly, but that's a way different critique than saying they're "using reasoning that doesn't actually correspond to any well-defined mathematical object."

regardless the point remains that the authors haven't engaged with the sort of counting arguments that I actually think are valid.

If that's truly your remaining objection, then I think that you should retract the unmerited criticisms about how they're trying to prove 0.9999... != 1 or whatever. In my opinion, you have confidently misrepresented their arguments, and the discussion would benefit from your revisions.


And then it'd be nice if someone would provide links to the supposed valid counting arguments! From my perspective, it's very frustrating to hear that there (apparently) are valid counting arguments but also they aren't the obvious well-known ones that everyone seems to talk about. (But also the real arguments aren't linkable.)

If that's truly the state of the evidence, then I'm happy to just conclude that Nora+Quintin are right, and update if/when actually valid arguments come along.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T02:02:53.286Z · LW · GW

I don't think that's enough. Lookup tables can also be under "selection pressure" to output good training outputs. As I understand your reasoning, the analogy is too loose to be useful here. I'm worried that using 'selection pressure' is obscuring the logical structure of your argument. As I'm sure you'll agree, just calling that situation 'selection pressure' and SGD 'selection pressure' doesn't mean they're related.

I agree that "sometimes humans do X" is a good reason to consider whether X will happen, but you really do need shared causal mechanisms. If I examine the causal mechanisms here, I find things like "humans seem to have 'parameterizations' which already encode situationally activated consequentialist reasoning", and then I wonder "will AI develop similar cognition?" and then that's the whole thing I'm trying to answer to begin with. So the fact you mention isn't evidence for the relevant step in the process (the step where the AI's mind-design is selected to begin with).

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T01:49:38.293Z · LW · GW

I think you should have asked for clarification before making blistering critiques about how Nora "ended up using reasoning that doesn't actually correspond to any well-defined mathematical object." I think your comments paint a highly uncharitable and (more importantly) incorrect view of N/Q's claims.

My response there is just that of course you shouldn't run a counting argument over function space—I would never suggest that.

Your presentations often include a counting argument over a function space, in the form of "saints" versus "schemers" and "sycophants." So it seems to me that you do suggest that. What am I missing?

I also welcome links to counting arguments which you consider stronger. I know you said you haven't written one up yet to your satisfaction, but surely there have to be some non-obviously wrong and weak arguments written up, right?

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T01:41:40.882Z · LW · GW

This is NOT what the evidence supports, and super misleadingly phrased. (Either that, or it's straightup magical thinking, which is worse)

The inductive biases / simplicity biases of deep learning are poorly understood but they almost certainly don't have anything to do with what humans want, per se.

Seems like a misunderstanding. It seems to me that you are alleging that Nora/Quintin believe there is a causal arrow from "Humans want X generalization" to "NNs have X generalization"? If so, I think that's an uncharitable reading of the quoted text.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T01:35:46.140Z · LW · GW
  • I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don't think it's literally "no evidence" for the claim here: that future AIs will scheme.

I agree, they're wrong to claim it's "no evidence." I think that counting arguments are extremely slight evidence against scheming, because they're weaker than the arguments I'd expect our community's thinkers to find in worlds where scheming was real. (Although I agree that on the object-level and in isolation, the arguments are tiiiny positive evidence.)

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T01:33:27.899Z · LW · GW

For now, my main reaction is: “we have active evidence that SGD’s inductive biases disfavor schemers” seems like a much more interesting claim/avenue of inquiry than trying to nail down the a priori philosophical merits of counting arguments/indifference principles, and if you believe we have that sort of evidence, I think it’s probably most productive to just focus on fleshing it out and examining it directly.

The vast majority of evidential labor is done in order to consider a hypothesis at all.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T01:30:22.312Z · LW · GW

Suppose that I’m looking down at a superintelligent model newly trained on diverse, long-horizon tasks.

Seems to me that a lot of (but not all) scheming speculation is just about sufficiently large pretrained predictive models, period. I think it's worth treating these cases separately. My strong objections are basically to the "and then goal optimization is a good way to minimize loss in general!" steps.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T01:26:00.625Z · LW · GW

Yes, that's exactly the problem: you tried to make a counting argument, but because you didn't engage with the proper formalism, you ended up using reasoning that doesn't actually correspond to any well-defined mathematical object.

Analogously, it's like you wrote an essay about why 0.999... != 1 and your response to "under the formalism of real numbers as Dedekind cuts, those are identical" was "where did I say I was referring to Dedekind cuts?"

No. I think you are wrong. This passage makes me suspect that you didn't understand the arguments Nora was trying to make. Her arguments are easily formalizable as critiquing an indifference principle over functions in function-space, as opposed to over parameterizations in parameter-space. I'll write this out for you if you really want me to.

I think you should be more cautious about unilaterally diagnosing Nora's "errors", as opposed to asking for clarification, because I think you two agree a lot more than you realize.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T01:13:11.504Z · LW · GW

I think you should allocate time to devising clearer arguments, then. I am worried that lots of people are misinterpreting your arguments and then making significant life choices on the basis of their new beliefs about deceptive alignment, and I think we'd both prefer for that to not happen.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T01:08:58.039Z · LW · GW

Yes, but your original comment was presented as explaining "how to properly reason about counting arguments." Do you no longer claim that to be the case? If you do still claim that, then I maintain my objection that you yourself used hand-wavy reasoning in that comment, and it seems incorrect to present that reasoning as unusually formally supported.

Another concern I have is, I don't think you're gaining anything by formality in this thread. As I understand your argument, I think your symbols are formalizations of hand-wavy intuitions (like the ability to "decompose" a network into the given pieces; the assumption that description length is meaningfully relevant to the NN prior; assumptions about informal notions of "simplicity" being realized in a given UTM prior). If anything, I think that the formality makes things worse because it makes it harder to evaluate or critique your claims. 

I also don't think I've seen an example of reasoning about deceptive alignment where I concluded that formality had helped the case, as opposed to having obfuscated the case or lent the concern unearned credibility.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-05T00:53:16.604Z · LW · GW

I said "Deceptive reasoning in general", not the trainability of the backdoor behavior in your experimental setup. The issue isn't just "what was the trainability of the surface behavior", but "what is the trainability of the cognition implementing this behavior in-the-wild." That is, the local inductive biases are probably far different for "parameterization implementing directly-trained deceptive reasoning" vs "parameterization outputting deceptive reasoning as an instrumental backchain from consequentialist reasoning." 

Imagine if I were arguing for some hypothetical results of mine, saying "The aligned models kept using aligned reasoning in the backdoor context, even as we trained them to be mean in other situations. That means we disproved the idea that aligned reasoning can be trained away with existing techniques, especially for larger models." Would that be a valid argument given the supposed experimental result?

Comment by TurnTrout on TurnTrout's shortform feed · 2024-03-04T23:55:56.830Z · LW · GW

Context for my original comment: I think that the key thing we want to do is predict the generalization of future neural networks. What will they do in what situations? 

For some reason, certain people think that pretraining will produce consequentialist inner optimizers. This is generally grounded out as a highly specific claim about the functions implemented by most low-loss parameterizations of somewhat-unknown future model architectures trained on somewhat-unknown data distributions. 

I am in particular thinking about "Playing the training game" reasoning, which at its core consists of extremely speculative and informal claims about inductive biases / the functions implemented by such parameterizations. If a person (like myself pre-2022) is talking about how AIs "might play the training game", but also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned. To put it mildly.

Given that clarification which was not present in the original comment,

  • I disagree on game theory, econ, computer security, business, and history; those seem totally irrelevant for reasoning about inductive biases (and you might agree). However, they seem useful for reasoning about the impact of AI on society as it becomes integrated.
  • Agree very weakly on distributed systems and moderately on cognitive psychology. (I have in fact written a post on the latter: Humans provide an untapped wealth of evidence about alignment.)

well they explain relatively narrow results about current ML systems.

Flagging that this is one of the main claims which we seem to dispute; I do not concede this point FWIW.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-04T20:47:50.637Z · LW · GW

I'm very happy with running counting arguments over the actual neural network parameter space; the problem there is just that I don't think we understand it well enough to do so effectively.

  1. This is basically my position as well
  2. The cited argument is a counting argument over the space of functions which achieve zero/low training loss. 

You could instead try to put a measure directly over the functions in your setup, but the problem there is that function space really isn't the right space to run a counting argument like this; you need to be in algorithm space, otherwise you'll do things like what happens in this post where you end up predicting overfitting rather than generalization (which implies that you're using a prior that's not suitable for running counting arguments on).

Indeed, this is a crucial point that I think the post is trying to make. The cited counting arguments are counting functions instead of parameterizations. That's the mistake (or, at least "a" mistake). I'm glad we agree it's a mistake, but then I'm confused why you think that part of the post is unsound. 


Rereading the portion in question now, it seems that they changed it a lot since the draft. Personally, I think their argumentation is now weaker than it was before. The original argumentation clearly explained the mistake of counting functions instead of parameterizations, while the present post does not. It instead abstracts it as "an indifference principle", where the reader has to do the work to realize that indifference over functions is inappropriate. 
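
The difference between counting functions and counting parameterizations can be made concrete with a toy model. The sketch below is illustrative only: the single threshold unit, the weight grid {-1, 0, 1}, and all names (`WEIGHTS`, `truth_table`, `count_parameterizations`) are assumptions chosen for the example, not anything from the post. It enumerates every parameter setting, groups the settings by the function they implement, and compares a uniform measure over functions with the measure induced by a uniform prior over parameters.

```python
from collections import Counter
from itertools import product

# Hypothetical toy model: one threshold unit on two binary inputs,
# with each of (w1, w2, b) restricted to the grid {-1, 0, 1}.
WEIGHTS = (-1, 0, 1)
INPUTS = tuple(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)


def truth_table(w1, w2, b):
    """Return the function implemented by (w1, w2, b) as a tuple of outputs."""
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)


def count_parameterizations():
    """Map each realizable function to the number of parameter settings implementing it."""
    return Counter(truth_table(*params) for params in product(WEIGHTS, repeat=3))


counts = count_parameterizations()
total = sum(counts.values())

# For each function, print its weight under indifference over functions
# next to its weight under indifference over parameters.
for table, n in counts.most_common():
    uniform_over_functions = 1 / len(counts)
    induced_by_parameters = n / total
    print(table, f"{uniform_over_functions:.3f}", f"{induced_by_parameters:.3f}")
```

In this toy setting the 27 parameter settings collapse onto 13 distinct functions, and the constant-zero function alone accounts for 12 of them, so indifference over functions and indifference over parameters disagree sharply. Nothing here says which functions real networks favor; it only shows that the two counts are different objects.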

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-04T20:41:51.904Z · LW · GW

Do you believe that the cited hand-wavy arguments are, at a high informal level, sound reason for belief in deceptive alignment? (It sounds like you don't, going off of your original comment which seems to distance you from the counting arguments critiqued by the post.)

EDITed to remove last bit after reading elsewhere in thread.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-04T20:33:56.010Z · LW · GW

You didn't claim it for deceptive alignment, but you claimed disproof of the idea that deceptive reasoning would be trained away, which is an important subcomponent of deceptive alignment. But your work provides no strong conclusions on that matter as it pertains to deceptive reasoning in general. 

I think the presentation of your work (which, again, I like in many respects) would be strengthened if you clarified the comment which I responded to.

But I think they're quite strong evidence and by far the strongest evidence available currently on the question of whether deception will be regularized away.

Because the current results only deal with backdoor removal, I personally think it's outweighed by e.g. results on how well instruction-tuning generalizes.

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-04T19:29:26.972Z · LW · GW

From my perspective reading this post, it read to me like "I didn't understand the counting argument, therefore it doesn't make sense" which is (obviously) not very compelling to me.

I definitely appreciate how it can feel frustrating or bad when you feel that someone isn't properly engaging with your ideas. However, I also feel frustrated by this statement. Your comment seems to have a tone of indignation that Quintin and Nora weren't paying attention to what you wrote. 

I myself expected you to respond to this post with some ML-specific reasoning about simplicity and measure of parameterizations, instead of your speculation about a relationship between the universal measure and inductive biases. I spoke with dozens of people about the ideas in OP's post, and none of them mentioned arguments like the one you gave. I myself have spent years in the space and am also not familiar with this particular argument about bitstrings. 

(EDIT: Having read Ryan's comment, it now seems to me that you have exclusively made a simplicity argument without any counting involved, and an empirical claim about the relationship between description length of a mesa objective and the probability of SGD sampling a function which implements such an objective. Is this correct?)

If these are your real reasons for expecting deceptive alignment, that's fine, but I think you've mentioned this rather infrequently. Your profile links to How likely is deceptive alignment?, which is an (introductory) presentation you gave. In that presentation, you make no mention of Turing machines, universal semimeasures, bitstrings, and so on. On a quick search, the closest you seem to come is the following:

We're going to start with simplicity. Simplicity is about specifying the thing that you want in the space of all possible things. You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?” How many bits does it take to find the thing that you want in the model space? And so, as a first pass, we can understand simplicity by doing a counting argument, which is just asking, how many models are in each model class?[1]

But this is ambiguous (as can be expected for a presentation at this level). We could view this as "bitlength under a given decoding scheme, viewing an equivalence class over parameterizations as a set of possible messages" or "Shannon information (in bits) of a function induced by a given probability distribution over parameterizations" or something else entirely (perhaps having to do with infinite bitstrings). 

My critique is not "this was ambiguous." My critique is "how was anyone supposed to be aware of the 'real' argument which I (and many others) seem to now be encountering for the first time?". 

My objection is that the sort of finite bitstring analysis in this post does not yield any well-defined mathematical object at all, and certainly not one that would predict generalization.

This seems false? All that needs be done is to formally define  

which is the set of functions which (when e.g. greedily sampled) perfectly label the (categorical) training data , and we can parameterize such functions using the neural network parameter space. This yields a perfectly well-defined counting argument over .
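
One way to write such a definition down, using notation that is assumed here for illustration (the symbols $D$, $\Theta$, $f_\theta$, and $\mu$ are not taken from the thread): let $D$ be the training set, $\Theta$ a finite (e.g. discretized) parameter space, and $f_\theta$ the function computed by parameters $\theta$.

```latex
% Set of functions that perfectly label the training data under greedy decoding
\mathcal{F}_D := \left\{ f_\theta \;:\; \theta \in \Theta,\ \operatorname*{arg\,max}_{y} f_\theta(y \mid x_i) = y_i \ \text{ for all } (x_i, y_i) \in D \right\}

% One candidate counting measure: weight each function by its number of parameterizations
\mu(f) := \frac{\left|\{\theta \in \Theta : f_\theta = f\}\right|}{\left|\{\theta \in \Theta : f_\theta \in \mathcal{F}_D\}\right|}, \qquad f \in \mathcal{F}_D
```

Here $\mu$ weights each perfectly-fitting function by how many parameter settings implement it. A uniform measure over $\mathcal{F}_D$ itself would be a different, also well-defined, choice; which one is appropriate is the point under dispute.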

  1. ^

    This seems to be exactly the counting argument the post is critiquing, by the way. 

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-04T19:04:51.682Z · LW · GW

By the logic of the post, step 4 is the problem, but I think step 4 is actually valid. The problem is step 2: there are actually a huge number of different ways to implement a line! Not only are there many different programs that implement the line in different ways, I can also just take the simplest program that does so and keep on adding comments or other extraneous bits.

Evan, I wonder how much your disagreement is engaging with OPs' reasons. A draft of this post explained the misprediction of both counting arguments as arising from counting functions instead of parameterizations of functions; one has to consider the compressivity of the parameter-function map (many different internal parameterizations map to the same external behavior). Given that the authors actually agree that 2 is incorrect, does this change your views?

Comment by TurnTrout on And All the Shoggoths Merely Players · 2024-03-04T18:37:03.123Z · LW · GW

The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit.

I think this is a reasonable prediction, but it ends up being incorrect:

It decreases far faster than it should; on the top-1 theory, it should be ~flatlined for this whole graph (since at every point plotted the strict majority of labels are still correct). Certainly top-5 should not be decreasing.
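
What the top-1 theory predicts can be worked out directly. A minimal sketch, assuming an idealized predictor that exactly matches the noisy label distribution and assuming wrong labels are spread uniformly over the other classes (the noise model, the function names, and the ten-class setup are assumptions for illustration, not the setup of the experiment under discussion):

```python
def correct_label_probability(noise_fraction):
    """Probability that a training label is the true class."""
    return 1.0 - noise_fraction


def top1_accuracy(noise_fraction, num_classes):
    """Accuracy when always predicting the single most probable label.

    Assumes wrong labels are uniform over the other num_classes - 1 classes.
    Ties are counted as incorrect.
    """
    p_correct = correct_label_probability(noise_fraction)
    p_each_wrong = noise_fraction / (num_classes - 1)
    return 1.0 if p_correct > p_each_wrong else 0.0


def sampling_accuracy(noise_fraction):
    """Accuracy when sampling a label from the predictive distribution."""
    return correct_label_probability(noise_fraction)


for noise in (0.0, 0.2, 0.4, 0.8, 0.95):
    print(noise, top1_accuracy(noise, num_classes=10), sampling_accuracy(noise))
```

Under these assumptions top-1 accuracy is a step function: flat at 1.0 while the correct label stays the most probable one, then 0. A curve that declines while the correct label is still the majority is not what that step function predicts.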

Comment by TurnTrout on TurnTrout's shortform feed · 2024-03-04T18:17:18.822Z · LW · GW

(As an obvious corollary, I myself was misguided to hold a similar belief pre-2022.)

Comment by TurnTrout on Dreams of AI alignment: The danger of suggestive names · 2024-03-04T18:08:03.376Z · LW · GW

I do think that a wide range of shard-based mind-structures will equilibrate into EU optimizers, but I also think this is a somewhat mild statement. My stance is that utility functions represent a yardstick by which decisions are made. "Utility was made by the agent, for the agent" as it were, and not "the agent is made to optimize the utility." What this means is:

Suppose I start off caring about dogs and diamonds in a shard-like fashion, with certain situations making me seek out dogs and care for them (in the usual intuitive way); and similarly for diamonds. However, there will be certain situations in which the dog-shard "interferes with" the diamond-shard, such that the dog-shard e.g. makes me daydream about dogs while doing my work and thereby do worse in life overall. If I didn't engage in this behavior, then in general I'd probably be able to get more dog-caring and diamond-acquisition. So from the vantage point of this mind and its shards, it is subjectively better to not engage in such "incoherent" behavior which is a strictly dominated strategy in expectation (i.e. leads to fewer dogs and diamonds).

Therefore, given time and sufficient self-modification ability, these shards will want to equilibrate to an algorithm which doesn't step on its own toes like this. 

This doesn't mean, of course, that these shards decide to implement a utility function with absurd results by the initial decision-making procedure. For example, tiling the universe (half with dog-squiggles, half with diamond-squiggles) would not be a desirable outcome under the initial decision-making process. Insofar as such an outcome could be foreseen as a consequence of making decisions using a proposed utility function, the shards would disprefer that utility function.[1]

So any utility function chosen should "add up to normalcy" when optimized, or at least be different in a way which is not foreseeably weird and bad by the initial shards' reckoning. On this view, one would derive a utility function as a rule of thumb for how to make decisions effectively and (nearly) Pareto-optimally in relevant scenarios.[2] 

(You can perhaps understand why, given this viewpoint, I am unconcerned/weirded out by Yudkowskian sentiments like "Unforeseen optima are extremely problematic given high amounts of optimization power.")

  1. ^

    This elides any practical issues with self-modification, and possible value drift from e.g. external sources, and so on. I think they don't change the key conclusions here. I think they do change conclusions for other questions though.

  2. ^

    Again, if I'm imagining the vantage point of the dog+diamond agent, it wouldn't want to waste tons of compute deriving a policy for weird situations it doesn't expect to run into. The most important place to become more coherent is the expected on-policy future.

Comment by TurnTrout on TurnTrout's shortform feed · 2024-03-04T17:55:04.289Z · LW · GW

I think some people have the misapprehension that one can just meditate on abstract properties of "advanced systems" and come to good conclusions about unknown results "in the limit of ML training", without much in the way of technical knowledge about actual machine learning results or even a track record in predicting results of training.

For example, several respected thinkers have uttered to me English sentences like "I don't see what's educational about watching a line go down for the 50th time" and "Studying modern ML systems to understand future ones seems like studying the neurobiology of flatworms to understand the psychology of aliens." 

I vehemently disagree. I am also concerned about a community which (seems to) foster such sentiment.

Comment by TurnTrout on Dreams of AI alignment: The danger of suggestive names · 2024-03-04T16:41:45.728Z · LW · GW

Thanks for sharing! :) 

To clarify on my end: I think AI can definitely become an autonomous long-horizon planner, especially if we train it to be that. 

That event may or may not have the consequences suggested by existing theory predicated on e.g. single-objective global utility maximizers. That theory's predictions are notably different from the predictions of a shard-theoretic model of how agency develops. So I think there are important modeling decisions in 'literal-minded genie' vs 'shard-based generalization' vs [whatever the truth actually is]... even if each individual axiom sounds reasonable in any given theory. (I wrote this quickly, sorry if it isn't clear.)

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-04T16:16:20.699Z · LW · GW

I think this is an excellent post. I really liked the insight about the mechanisms (and mistakes) shared by the counting arguments behind AI doom and behind "deep learning surely won't generalize." Thank you for writing this; these kinds of loose claims have roamed freely for far too long.

EDIT: Actually this post is weaker than a draft I'd read. I still think it's good, but missing some of the key points I liked the most. And I'm not on board with all of the philosophical claims about e.g. generalized objections to the principle of indifference (in part because I don't understand them).

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-04T16:04:54.060Z · LW · GW

empirically disprove the hypothesis that deceptive reasoning will be naturally regularized away (interestingly, we find that it does get regularized away for small models—but not for large models!).

You did not "empirically disprove" that hypothesis. You showed that if you explicitly train a backdoor for a certain behavior under certain regimes, then training on other behaviors will not cause catastrophic forgetting. You did not address the regime where the deceptive reasoning arises as instrumental to some other goal embedded in the network, or in a natural context (as you're aware). I think that you did find a tiny degree of evidence about the question (it really is tiny IMO), but you did not find "disproof."

Indeed, I predicted that people would incorrectly represent these results; so little time has passed!

I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment, as opposed to something more akin to a "hard-coded" demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of. I think that people will predictably 

  1. ...
  2. [claim] that we've observed it's hard to uproot deceptive alignment (even though "uprooting a backdoored behavior" and "pushing back against misgeneralization" are different things), 

Comment by TurnTrout on Counting arguments provide no evidence for AI doom · 2024-03-04T16:04:06.976Z · LW · GW

  1. I am very skeptical of hand-wavy arguments about simplicity that don't have formal mathematical backing. This is a very difficult area to reason about correctly and it's easy to go off the rails if you're trying to do so without relying on any formalism.

I'm surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don't (AFAICT) have much to do with the reality of neural networks. E.g., your comments above:

I would usually then make an argument here for why in most cases the simplest objective that leads to deception is simpler than the simplest objective that leads to alignment, but that's just a simplicity argument, not a counting argument. Since we want to do the counting argument here, let's assume that the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception.

Or the times you've talked about how there are "more" sycophants but only "one" saint. 


  1. There are many, many ways to adjust the formalism to take into account various ways in which realistic neural network inductive biases are different than basic simplicity biases. My sense is that most of these changes generally don't change the bottom-line conclusion, but if you have a concrete mathematical model that you'd like to present here that you think gives a different result, I'm all ears.

This is a very strange burden of proof. It seems to me that you presented a specific model of how NNs work which is clearly incorrect, and instead of processing counterarguments that it doesn't make sense, you want someone else to propose to you a similarly detailed model which you think is better. Presenting an alternative is a logically separate task from pointing out the problems in the model you gave.

Comment by TurnTrout on Reward is not the optimization target · 2024-02-28T21:00:08.458Z · LW · GW

Just now saw this very thoughtful review. I share a lot of your perspective, especially:

I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It's the former view that this post (correctly) argues against.


Looking back at the post, I felt some amount of "why are you meandering around instead of just saying the Thing?", with the immediate next thought being "well, it's hard to say the Thing". Indeed, I do not know how to say it better.

Comment by TurnTrout on Brainstorm of things that could force an AI team to burn their lead · 2024-02-26T18:13:15.581Z · LW · GW

On an extremely brief skim, I do appreciate the concreteness still. I think it's very off-target in thinking about "what are the goals?", because I think that's not a great abstraction for what we're likely to get.

Comment by TurnTrout on And All the Shoggoths Merely Players · 2024-02-26T18:02:00.813Z · LW · GW

"Endpoints are easier to predict than intermediate trajectories" seems like a locally valid and relevant point to bring up.

  1. I don't think it's true here. Why should it be true?
  2. However, to clarify, I was calling the second quoted sentence a word game, not the first.

Then there is a valid argument here that there are lots of reasons people want to build powerful AGI


that the argument about the structure of the cognition here is intended to apply to an endpoint where those goals are achieved,

[People want an outcome with property X and so we will get such an outcome]

[One outcome with property X involves cognitive structures Y]

Does not entail

[We will get an outcome with property X and cognitive structures Y]

But this is basically the word game!

  1. "Whenever I talk about 'powerful' agents, I choose to describe them as having inner cognitive properties Y (e.g. the long-term consequentialism required for scheming)"
  2. which vibes its way into "The agents are assumed to be powerful, how can you deny they have property Y?"
  3. and then finally "People want 'powerful' agents and so will create them, and then we will have to deal with agents with inner cognitive property Y"

It sounds obviously wrong when I spell it out like this, but it's what is being snuck in by sentences like

I'm talking about what will happen inside almost any sufficiently powerful AGI, by virtue of it being sufficiently powerful.

For convenience, I quote the fuller context:

Doomimir: [starting to anger] Simplicia Optimistovna, if you weren't from Earth, I'd say I don't think you're trying to understand. I never claimed that GPT-4 in particular is what you would call deceptively aligned. Endpoints are easier to predict than intermediate trajectories. I'm talking about what will happen inside almost any sufficiently powerful AGI, by virtue of it being sufficiently powerful.