Posts

Deconstructing Bostrom's Classic Argument for AI Doom 2024-03-11T05:58:11.968Z
Counting arguments provide no evidence for AI doom 2024-02-27T23:03:49.296Z
My Kind of Pragmatism 2023-05-20T18:58:48.574Z
DeepMind’s generalist AI, Gato: A non-technical explainer 2022-05-16T21:21:24.214Z

Comments

Comment by Nora Belrose (nora-belrose) on What's with all the bans recently? · 2024-04-22T12:23:23.417Z · LW · GW

I don't know what caused it exactly, and it seems like I'm not rate limited anymore.

Comment by Nora Belrose (nora-belrose) on What's with all the bans recently? · 2024-04-20T06:58:39.454Z · LW · GW

If moderators started rate-limiting Nora Belrose or someone else whose work I thought was particularly good

I actually did get rate-limited today, unfortunately.

Comment by Nora Belrose (nora-belrose) on Inducing Unprompted Misalignment in LLMs · 2024-04-19T23:55:29.350Z · LW · GW

Unclear why this is supposed to be a scary result.

"If prompting a model to do something bad generalizes to it being bad in other domains, this is also evidence for the idea that prompting a model to do something good will generalize to it doing good in other domains" - Matthew Barnett

Comment by Nora Belrose (nora-belrose) on Deconstructing Bostrom's Classic Argument for AI Doom · 2024-03-14T16:42:15.883Z · LW · GW

Yeah, I think Evan is basically opportunistically changing his position during that exchange, and has no real coherent argument.

Comment by Nora Belrose (nora-belrose) on Deconstructing Bostrom's Classic Argument for AI Doom · 2024-03-12T00:07:16.382Z · LW · GW

I do think that Solomonoff-flavored intuitions motivate much of the credence people around here put on scheming. Apparently Evan Hubinger puts a decent amount of weight on it, because he kept bringing it up in our discussion in the comments to Counting arguments provide no evidence for AI doom.

Comment by Nora Belrose (nora-belrose) on Deconstructing Bostrom's Classic Argument for AI Doom · 2024-03-11T11:15:43.946Z · LW · GW

The strong version as defined by Yudkowsky... is pretty obvious IMO

I didn't expect you'd say that. In my view it's pretty obviously false. Knowledge and skills are not value-neutral, and some goals are a lot harder to instill into an AI than others because the relevant training data will be harder to come by. Eliezer is just not taking data availability into account whatsoever, because he's still fundamentally thinking about things in terms of GOFAI and brains in boxes in basements rather than deep learning. As Robin Hanson pointed out in the foom debate years ago, the key component of intelligence is "content." And content is far from value-neutral.

Comment by Nora Belrose (nora-belrose) on Deconstructing Bostrom's Classic Argument for AI Doom · 2024-03-11T07:48:12.463Z · LW · GW

As I argue in the video, I actually think the definitions of "intelligence" and "goal" that you need to make the Orthogonality Thesis trivially true are bad, unhelpful definitions. So I think that it's false, and that even if it were true it would be trivial.

I'll also note that Nick Bostrom himself seems to be making a motte-and-bailey argument here, which seems pretty damning considering his book was very influential and changed a lot of people's career paths, including my own.

Edit replying to an edit you made: I mean, the most straightforward reading of Chapters 7 and 8 of Superintelligence is just a possibility-therefore-probability fallacy in my opinion. Without this fallacy, there would be little need to even bring up the orthogonality thesis at all, because it's such a weak claim.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-07T07:16:02.002Z · LW · GW

If it's spontaneous then yeah, I don't expect it to happen ~ever really. I was mainly thinking about cases where people intentionally train models to scheme.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-07T02:59:18.437Z · LW · GW

What do you mean "hugely edited"? What other things would you like us to change? If I were starting from scratch I would of course write the post differently, but I don't think it would be worth my time to make major post hoc edits; I would like to focus on follow-up posts.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-05T20:26:27.092Z · LW · GW

Isn't Evan giving you what he thinks is a valid counting argument i.e. a counting argument over parameterizations? 

Where is the argument? If you run the counting argument in function space, it's at least clear why you might think there are "more" schemers than saints. But if you're going to say there are "more" params that correspond to scheming than there are saint-params, that looks like a substantive empirical claim that could easily turn out to be false.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-05T07:10:42.840Z · LW · GW

It's not clear to me what an "algorithm" is supposed to be here, and I suspect that this might be cruxy. In particular I suspect (40-50% confidence) that:

  • You think there are objective and determinate facts about what "algorithm" a neural net is implementing, where
  • Algorithms are supposed to be something like a Boolean circuit or a Turing machine rather than a neural network, and
  • We can run counting arguments over these objective algorithms, which are distinct both from the neural net itself and the function it expresses.

I reject all three of these premises, but I would consider it progress if I got confirmation that you in fact believe in them.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-05T07:00:30.760Z · LW · GW

I'm sorry to hear that you think the argumentation is weaker now.

the reader has to do the work to realize that indifference over functions is inappropriate

I don't think that indifference over functions in particular is inappropriate. I think indifference reasoning in general is inappropriate.

I'm very happy with running counting arguments over the actual neural network parameter space

I wouldn't call the correct version of this a counting argument. The correct version uses the actual distribution used to initialize the parameters as a measure, and not e.g. the Lebesgue measure. This isn't appealing to the indifference principle at all, and so in my book it's not a counting argument. But this could be terminological.
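To make the measure-dependence concrete, here's a toy sketch (hypothetical parameter count, init scale, and property, purely for illustration): whether "most" parameter vectors satisfy some property depends entirely on whether you count with a flat measure over a box or with the actual initialization distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 500, 0.02                     # toy "parameter count" and init scale
thresh = 1.5 * sigma * np.sqrt(d)        # property P: parameter norm exceeds this

# "Counting" with a flat, Lebesgue-style measure over the box [-0.1, 0.1]^d
flat = rng.uniform(-0.1, 0.1, size=(10_000, d))
print("flat measure of P:", np.mean(np.linalg.norm(flat, axis=1) > thresh))   # ~1.0

# "Counting" with the actual init distribution, N(0, sigma^2) per parameter
init = rng.normal(0.0, sigma, size=(10_000, d))
print("init measure of P:", np.mean(np.linalg.norm(init, axis=1) > thresh))   # ~0.0
```

Under the flat measure essentially every parameter vector has property P; under the init distribution essentially none do. Specifying the measure is doing all the work, which is why I don't think the measure-free "counting" framing adds anything.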

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-05T03:13:41.643Z · LW · GW

Fair enough if you never read any of these comments.

Yeah, I never saw any of those comments. I think it's obvious that the most natural reading of the counting argument is that it's an argument over function space (specifically, over equivalence classes of functions which correspond to "goals.") And I also think counting arguments for scheming over parameter space, or over Turing machines, or circuits, or whatever, are all much weaker. So from my perspective I'm attacking a steelman rather than a strawman.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-05T03:07:00.376Z · LW · GW

I've read every word of all of your comments.

I know that you think your criticism isn't dependent on Solomonoff induction in particular, because you also claim that a counting argument goes through under a circuit prior. It still seems like you view the Solomonoff case as the central one, because you keep talking about "bitstrings." And I've repeatedly said that I don't think the circuit prior works either, and why I think that.

At no point in this discussion have you provided any reason for thinking that in fact, the Solomonoff prior and/or circuit prior do provide non-negligible evidence about neural network inductive biases, despite the very obvious mechanistic disanalogies.

Yes—that's exactly the sort of counting argument that I like!

Then make an NNGP counting argument! I have not seen such an argument anywhere. You seem to be alluding to unpublished, or at least little-known, arguments that did not make their way into Joe's scheming report.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-05T02:47:48.288Z · LW · GW

So today we've learned that:

  1. The real counting argument that Evan believes in is just a repackaging of Paul's argument for the malignity of the Solomonoff prior, and not anything novel.
  2. Evan admits that Solomonoff is a very poor guide to neural network inductive biases.

At this point, I'm not sure why you're privileging the hypothesis of scheming at all.

you want to substitute it out for whatever the prior is that you think is closest to deep learning that you can still reason about theoretically.

I mean, the neural network Gaussian process is literally this, and you can make it more realistic by using the neural tangent kernel to simulate training dynamics, perhaps with some finite-width corrections. There is real literature on this.
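For concreteness, here's a minimal sketch of the kind of computation I have in mind, using the neural-tangents library (the architecture and data below are placeholders, and the exact API may differ across versions):

```python
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# Infinite-width limit of a two-hidden-layer ReLU MLP.
_, _, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

key1, key2 = random.split(random.PRNGKey(0))
x_train = random.normal(key1, (20, 8))   # placeholder data
y_train = random.normal(key2, (20, 1))
x_test = random.normal(key1, (5, 8))

# Closed-form predictions: the exact Bayesian posterior under the NNGP prior,
# and the result of infinite-width gradient-descent training via the NTK.
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
y_nngp = predict_fn(x_test=x_test, get="nngp")
y_ntk = predict_fn(x_test=x_test, get="ntk")
print(y_nngp.shape, y_ntk.shape)
```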

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-05T02:30:47.177Z · LW · GW

What makes you think that's intended to be a counting argument over function space? I usually think of this as a counting argument over infinite bitstrings

I definitely thought you were making a counting argument over function space, and AFAICT Joe also thought this in his report.

The bitstring version of the argument, to the extent I can understand it, just seems even worse to me. You're making an argument about one type of learning procedure, Solomonoff induction, which is physically unrealizable and AFAICT has not even inspired any serious real-world approximations, and then assuming that somehow the conclusions will transfer over to a mechanistically very different learning procedure, gradient descent. The same goes for the circuit prior thing (although FWIW I think you're very likely wrong that minimal circuits can be deceptive).

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-04T19:11:18.704Z · LW · GW

FWIW I object to 2, 3, and 4, and maybe also 1.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-04T11:29:31.870Z · LW · GW

It is trivially easy to modify the formalism to search only over fixed-size algorithms, and in fact that's usually what I do when I run this sort of analysis.

What? Which formalism? I don't see how this is true at all. Please elaborate or send an example of "modifying" Solomonoff so that all the programs have fixed length, or "modifying" the circuit prior so all circuits are the same size.

No, I'm pretty familiar with your writing. I still don't think you're focusing enough on the mainstream ML literature, because you're still putting nonzero weight on these other irrelevant formalisms. Taking that literature seriously would mean ceasing to take the Solomonoff or circuit prior literature seriously.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-04T04:15:24.544Z · LW · GW

Right, and I've explained why I don't think any of those analyses are relevant to neural networks. Deep learning simply does not search over Turing machines or circuits of varying lengths. It searches over the parameters of an arithmetic circuit of fixed structure, size, and runtime. So Solomonoff induction, speed priors, and circuit priors are all inapplicable. There has been a lot of work in the mainstream science-of-deep-learning literature on the generalization behavior of actual neural nets, and I'm pretty baffled as to why you don't pay more attention to that stuff.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-04T02:12:42.781Z · LW · GW

Then show me how! If you think there are errors in the math, please point them out.

I'm not aware of any actual math behind the counting argument for scheming. I've only ever seen handwavy informal arguments about the number of Christs vs Martin Luthers vs Blaise Pascals. There certainly was no formal argument presented in Joe's extensive scheming report, which I assumed would be sufficient context for writing this essay.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-04T01:06:23.616Z · LW · GW

I'm saying <0.1% chance on "world is ended by spontaneous scheming." I'm not saying no AI will ever do anything that might be well-described as scheming, for any reason.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-04T01:02:21.410Z · LW · GW

I obviously don't think the counting argument for overfitting is actually sound; that's the whole point. But I think the counting argument for scheming is just as obviously invalid, and misuses formalisms just as egregiously, if not more so.

I deny that your Kolmogorov framework is anything like "the proper formalism" for neural networks. I also deny that the counting argument for overfitting is appropriately characterized as a "finite bitstring" argument, because that suggests I'm talking about Turing machine programs of finite length, which I'm not: I'm directly enumerating functions over a subset of the natural numbers. Are you saying the set of functions over 1...10,000 is not a well-defined mathematical object?

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-03-04T00:43:33.677Z · LW · GW

I never used any kind of bitstring analysis.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-29T16:44:19.360Z · LW · GW

I think the infinite bitstring case has zero relevance to deep learning.

There does exist a concept you might call "simplicity" which is relevant to deep learning. The neural network Gaussian process describes the prior distribution over functions which is induced by the initialization distribution over neural net parameters. Under weak assumptions about the activation function and initialization variance, the NNGP is biased toward lower-frequency functions. I think this cuts against scheming, and we plan to write up a post on this in the next month or two.
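As a quick illustration of the kind of low-frequency bias I mean (a finite-width sketch with made-up hyperparameters, not the NNGP calculation itself), you can sample randomly initialized MLPs on a 1-D grid and look at where the power in their outputs sits:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 256)[:, None]      # 1-D inputs
n_nets, width, depth = 1000, 256, 3

spectra = []
for _ in range(n_nets):
    h = x
    for _ in range(depth):
        W = rng.normal(0, np.sqrt(2.0 / h.shape[1]), (h.shape[1], width))
        b = rng.normal(0, 0.1, width)
        h = np.maximum(h @ W + b, 0.0)            # ReLU layer, He-style init
    w_out = rng.normal(0, np.sqrt(1.0 / width), (width, 1))
    f = (h @ w_out).ravel()                       # one random draw from the prior over functions
    spectra.append(np.abs(np.fft.rfft(f - f.mean())) ** 2)

power = np.mean(spectra, axis=0)
print(power[:10] / power.sum())                   # most of the power sits at the lowest frequencies
```

The averaged spectrum falls off rapidly with frequency, which is the finite-width analogue of the NNGP statement above.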

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-29T03:46:33.954Z · LW · GW

I'm well aware of how it's derived. I still don't think it makes sense to call that an indifference prior, precisely because enforcing an uncomputable halting requirement induces an exponentially strong bias toward short programs. But this could become a terminological point.

I think relying on an obviously incorrect formalism is much worse than relying on no formalism at all. I also don't think I'm relying on zero formalism. The literature on the frequency/spectral bias is quite rigorous, and is grounded in actual facts about how neural network architectures work.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-29T03:31:11.887Z · LW · GW

Thanks for the reply. A few remarks:

  • "indifference over infinite bitstrings" is a misnomer in an important sense, because it's literally impossible to construct a normalized probability measure over infinite bitstrings that assigns equal probability to each one. What you're talking about is the length weighted measure that assigns exponentially more probability mass to shorter programs. That's definitely not an indifference principle, it's baking in substantive assumptions about what's more likely.
  • I don't see why we should expect any of this reasoning about Turing machines to transfer over to neural networks at all, which is why I didn't cast the counting argument in terms of Turing machines in the post. In the past I've seen you try to run counting or simplicity arguments in terms of parameters. I don't think any of that works, but I at least take it more seriously than the Turing machine stuff.
  • If we're really going to assume the Solomonoff prior here, then I may just agree with you that it's malign in Christiano's sense and could lead to scheming, but I take this to be a reductio of the idea that we can use Solomonoff as any kind of model for real world machine learning. Deep learning does not approximate Solomonoff in any meaningful sense.
  • Terminological point: it seems like you are using the term "simple" as if it has a unique and objective referent, namely Kolmogorov-simplicity. That's definitely not how I use the term; for me it's always relative to some subjective prior. Just wanted to make sure this doesn't cause confusion.
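Here's a tiny simulation of the point in the first bullet, using a made-up prefix-free "programming language" of five programs: sampling bitstrings uniformly at random is equivalent to sampling programs with probability 2^-length, i.e. an exponential preference for short programs rather than any kind of indifference.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy prefix-free set of "programs": no program is a prefix of another,
# and together they cover every possible bitstream (Kraft sum = 1).
programs = ["0", "10", "110", "1110", "1111"]
max_len = max(len(p) for p in programs)

n = 200_000
bits = rng.integers(0, 2, size=(n, max_len))
counts = {p: 0 for p in programs}
for row in bits:
    s = "".join(map(str, row))
    for p in programs:
        if s.startswith(p):        # each uniform bitstream "runs" exactly one program
            counts[p] += 1
            break

for p in programs:
    print(p, round(counts[p] / n, 4), "vs 2^-len =", 2.0 ** -len(p))
```
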
Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T21:37:44.821Z · LW · GW

I'm not actually sure the scheming problems are "compatible" with good performance on these metrics, and even if they are, that doesn't mean they're likely or plausible given good performance on our metrics.

Human brains are way more similar to other natural brains

So I disagree with this, but likely because we are using different conceptions of similarity. In order to continue this conversation we're going to need to figure out what "similar" means, because the term is almost meaningless in controversial cases— you can fill in whatever similarity metric you want. I used the term earlier as a shorthand for a more detailed story about randomly initialized singular statistical models learned with iterative, local update rules. I think both artificial and biological NNs fit that description, and this is an important notion of similarity.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T16:16:18.978Z · LW · GW

This is just an equivocation, though. Of course you could train an AI to "scheme" against people in the sense of selling a fake blood testing service. That doesn't mean that by default you should expect AIs to spontaneously start scheming against you, and in ways you can't easily notice.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T16:11:37.650Z · LW · GW

Could you be more specific? In what way will there be non-mild distribution shifts in the future?

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T16:09:08.289Z · LW · GW

With respect to which measure, though? You have to define a measure; there are infinitely many possible measures you could define on this space. And then we'll have to debate whether your measure is a good one.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T07:36:51.727Z · LW · GW

I just deny that they will update "arbitrarily" far from the prior, and I don't know why you would think otherwise. There are compute tradeoffs, and you're going to run only as many MCTS rollouts as you need to get good performance.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T07:33:13.486Z · LW · GW

they almost certainly don't have anything to do with what humans want, per se. (that would be basically magic)

We are obviously not appealing to literal telepathy or magic. Deep learning generalizes the way we want in part because we designed the architectures to be good, in part because human brains are built on similar principles to deep learning, and in part because we share a world with our deep learning models and are exposed to similar data.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T05:44:58.583Z · LW · GW

Hi, thanks for this thoughtful reply. I don't have time to respond to every point here now, although I did respond to some of them when you first made them as comments on the draft. Let's talk in person about this stuff soon, and after we're sure we understand each other I can "report back" some conclusions.

I do tentatively plan to write a philosophy essay just on the indifference principle soonish, because it has implications for other important issues like the simulation argument and many popular arguments for the existence of God.

In the meantime, here's what I said about the Mortimer case when you first mentioned it:

We're ultimately going to have to cash this out in terms of decision theory. If you're comparing policies for an actual detective in this scenario, the uniform prior policy is going to do worse than the "use demographic info to make a non-uniform prior" policy, and the "put probability 1 on the first person you see named Mortimer" policy is going to do worst of all, as long as your utility function penalizes being confidently wrong 1 - p(Mortimer is the killer) fraction of the time more strongly than it rewards being confidently right p(Mortimer is the killer) fraction of the time.

If we trained a neural net with cross-entropy loss to predict the killer, it would do something like the demographic info thing. If you give the neural net zero information, then with cross-entropy loss it would indeed learn to use an indifference principle over people, but that's only because we've defined our CE loss over people and not some other coarse-graining of the possibility space.
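To make that concrete, here's a toy version with made-up numbers (100 suspects, Mortimer is suspect 0, and suppose the background evidence in fact makes Mortimer the killer 10% of the time): under log score, the demographic policy beats the uniform one, and the probability-1-on-Mortimer policy does worst by far.

```python
import numpy as np

n, mortimer = 100, 0                        # 100 suspects; Mortimer is suspect 0
p_mortimer = 0.1                            # assumed true chance Mortimer did it

uniform = np.full(n, 1.0 / n)               # indifference over suspects
demographic = np.ones(n)
demographic[mortimer] = 10.0                # demographic evidence: ~10x more likely
demographic /= demographic.sum()
point_mass = np.full(n, 1e-9)
point_mass[mortimer] = 1.0 - 1e-9 * (n - 1) # "probability 1 on Mortimer"

def expected_log_score(policy):
    """Expected log probability assigned to the true killer."""
    others = np.delete(np.arange(n), mortimer)
    return (p_mortimer * np.log(policy[mortimer])
            + (1 - p_mortimer) * np.mean(np.log(policy[others])))

for name, policy in [("uniform", uniform), ("demographic", demographic),
                     ("point mass on Mortimer", point_mass)]:
    print(f"{name}: {expected_log_score(policy):.2f}")
```

Being confidently wrong 90% of the time swamps the payoff of being confidently right 10% of the time, which is the point about the utility function above.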

For human epistemology, I think Huemer's restricted indifference principle is going to do better than some unrestricted indifference principle (which can lead to outright contradictions), and I expect my policy of "always scrounge up whatever evidence you have, and/or reason by abduction, rather than by indifference" would do best (wrt my own preference ordering at least).

There are going to be some scenarios where an indifference prior is pretty good decision-theoretically because your utility function privileges a certain coarse graining of the world. Like in the detective case you probably care about individual people more than anything else— making sure individual innocents are not convicted and making sure the individual perpetrator gets caught.

The same reasoning clearly does not apply in the scheming case. It's not like there's a privileged coarse graining of goal-space, where we are trying to minimize the cross-entropy loss of our prediction wrt that coarse graining, each goal-category is indistinguishable from every other, and almost all the goal-categories lead to scheming.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T04:49:32.350Z · LW · GW

I don't think the distinction is important, because in real-world AI systems the train -> deployment shift is quite mild, and we're usually training the model on new trajectories from deployment periodically.

The distinction only matters a lot if you ex ante believe scheming is happening, so that the tiniest difference between train and test distributions will be exploited by the AI to execute a treacherous turn.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T04:46:18.769Z · LW · GW

If networks trained via SGD can't learn scheming

It's not that they can't learn scheming. A sufficiently wide network can learn any continuous function. It's that they're biased strongly against scheming, and they're not going to learn it unless the training data primarily consists of examples of humans scheming against one another, or something.

These bullets seem like plausible reasons for why you probably won't get scheming within a single forward pass of a current-paradigm DL model, but are already inapplicable to the real-world AI systems in which these models are deployed.

Why does chaining forward passes together make any difference? Each forward pass has been optimized to mimic patterns in the training data. Nothing more, nothing less. It'll scheme in context X iff scheming behavior is likely in context X in the training corpus.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T04:40:54.884Z · LW · GW

It depends what you mean by a "way" the model can overfit.

Really we need to bring in measure theory to rigorously talk about this, and an early draft of this post actually did introduce some measure-theoretic concepts. Basically we need to define:

  • What set we're talking about,
  • What measure we're using over that set,
  • And how that measure relates to the probability measure over possible AIs.

The English locution "lots of ways to do X" can be formalized as "the measure of X-networks is high." And that's going to be an empirical claim that we can actually debate.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T02:45:08.048Z · LW · GW

Some brief, incomplete replies:

Huemer... indeed seems confused about all sorts of things

Sure, I was just searching for professional philosopher takes on the indifference principle, and that chapter in Paradox Lost was among the first things I found.

Separately, "reductionism as a general philosophical thesis" does not imply the thing you call "goal reductionism"

Did you see the footnote I wrote on this? I give a further argument for it.

doesn't mean the end-to-end trained system will turn out non-modular.

I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seems useless for alignment or interpretability purposes. If you have a good definition I'm open to hearing it.

There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.

To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field's ability to correctly diagnose bullshit.

That said, I don't think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism is much, much stronger for AI: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.
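As a minimal illustration of what I mean by "directly push up / push down" (a toy categorical policy with made-up rewards, not any particular training setup):

```python
import torch

# Toy REINFORCE step: a categorical policy over 4 actions.
logits = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

actions = torch.tensor([0, 1, 2])                    # sampled actions
rewards = torch.tensor([1.0, -1.0, 0.5])             # their observed rewards

log_probs = torch.log_softmax(logits, dim=-1)[actions]
loss = -(rewards * log_probs).sum()                  # REINFORCE objective
loss.backward()
optimizer.step()

# Positively rewarded actions (0 and 2) get their probability pushed up;
# the negatively rewarded action (1) gets pushed down.
print(torch.softmax(logits, dim=-1))
```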

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T02:31:35.051Z · LW · GW

No, I don't think they are semantically very different. This seems like nitpicking. Obviously "they are likely to encounter" has to have some sort of time horizon attached to it; otherwise it would include times well past the heat death of the universe, or something.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T02:29:07.159Z · LW · GW

You can find my EA forum response here.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T01:20:17.183Z · LW · GW

I'm pleasantly surprised that you think the post is "pretty decent."

I'm curious which parts of the Goal Realism section you find "philosophically confused," because we are trying to correct what we consider to be deep philosophical confusion that is fairly pervasive on LessWrong.

I recall hearing your compression argument for general-purpose search a long time ago, and it honestly seems pretty confused / clearly wrong to me. I would like to see a much more rigorous definition of "search" and why search would actually be "compressive" in the relevant sense for NN inductive biases. My current take is something like "a lot of the references to internal search on LW are just incoherent" and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward "search" in ways that are totally benign.

More generally, I'm quite skeptical of the jump from any mechanistic notion of search to the kind of grabby consequentialism that people tend to be worried about. I suspect there's a double dissociation between these things, where "mechanistic search" is almost always benign, and grabby consequentialism need not be backed by mechanistic search.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T00:57:03.197Z · LW · GW

The point of that section is that "goals" are not ontologically fundamental entities with precise contents; in fact, they could not possibly be so given a naturalistic worldview. So you don't need to "target the inner search," you just need to get the system to act the way you want in all the relevant scenarios.

The modern world is not a relevant scenario for evolution. "Evolution" did not need to, was not "intending to," and could not have designed human brains so that they would do high inclusive genetic fitness stuff even when the environment changes dramatically and culture becomes completely different from the ancestral environment.

Comment by Nora Belrose (nora-belrose) on Counting arguments provide no evidence for AI doom · 2024-02-28T00:33:17.603Z · LW · GW

I doubt there would be much difference, and I think the alignment-relevant comparison is between in-distribution but out-of-sample performance and out-of-distribution performance. We can easily do i.i.d. splits of our data; that's not a problem. You might think it's a problem to directly test the model in scenarios where it could legitimately execute a takeover if it wanted to.

Comment by Nora Belrose (nora-belrose) on Evolution is a bad analogy for AGI: inner alignment · 2024-01-19T03:46:47.826Z · LW · GW

People come to have sparse and beyond-lifetime goals through mechanisms that are unavailable to biological evolution— it took thousands of years of memetic evolution for people to even develop the concept of a long future that we might be able to affect with our short lives. We're in a much better position to instill long-range goals into AIs, if we choose to do so— we can simply train them to imitate human thought processes that give rise to long-term-oriented behaviors.

Comment by Nora Belrose (nora-belrose) on Evolution is a bad analogy for AGI: inner alignment · 2024-01-19T02:41:12.154Z · LW · GW

It's very difficult to get any agent to robustly pursue something like IGF because it's an inherently sparse and beyond-lifetime goal. Human values have been pre-densified for us: they are precisely the kinds of things it's easy to get an intelligence to pursue fairly robustly. We get dense, repeated, in-lifetime feedback about stuff like sex, food, love, revenge, and so on. A priori, if you're an agent built by evolution, you should expect to have values that are easy to learn— it would be surprising if it turned out that evolution did things the hard way. So evolution suggests alignment should be easy.

Comment by Nora Belrose (nora-belrose) on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-15T03:35:36.336Z · LW · GW

I don't think the results you cited matter much, because fundamentally the paper is considering a condition in which the model is ~always pre-prompted with "Current year: XYZ" or something similar in another language (please let me know if that's not true, but that's my best-effort read of the paper).

I'm assuming we're not worried about the literal scenario in which the date in the system prompt causes a distribution shift, because you can always spoof the date during training to include future years without much of a problem. Rather, the AI needs to pick up on subtle cues in its input to figure out whether it has a chance of succeeding at a coup. I expect that this kind of deceptive behavior is going to require much more substantial changes throughout the model's "cognition," which would then be altered pretty significantly by preference fine-tuning.

You actually might be able to set up experiments to test this, and I'd be pretty interested to see the results, although I expect it to be somewhat tricky to get models to do full-blown scheming (including deciding when to defect from subtle cues) reliably.

Comment by Nora Belrose (nora-belrose) on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-14T18:53:07.898Z · LW · GW

So, I think this is wrong.

While our models aren't natural examples of deceptive alignment—so there's still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we've seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn't continue to hold in more natural examples as well.

While a backdoor which causes the AI to become evil is obviously bad, and it may be hard to remove, the usual arguments for taking deception/scheming seriously do not predict backdoors. Rather, they predict that the AI will develop an "inner goal" which it coherently pursues across contexts. That means there's not going to be a single activating context for the bad behavior (like in this paper, where it's just "see text that says the year is 2024" or "special DEPLOYMENT token") but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup. That's how you get the counting argument going— there's a wide range of goals compatible with scheming, etc. But the analogous counting argument for backdoors— there's a wide range of backdoors that might spontaneously appear in the model and most of them are catastrophic, or something— proves way too much and is basically a repackaging of the unsound argument "most neural nets should overfit / fail to generalize."

I think it's far from clear that an AI which had somehow developed a misaligned inner goal— involving thousands or millions of activating contexts— would have all these contexts preserved after safety training. In other words, I think true mesa-optimization is basically an ensemble of a very, very large number of backdoors, making it much easier to locate and remove.

Comment by Nora Belrose (nora-belrose) on Thoughts on “AI is easy to control” by Pope & Belrose · 2024-01-10T23:07:30.322Z · LW · GW

Game theory

Comment by Nora Belrose (nora-belrose) on Thoughts on “AI is easy to control” by Pope & Belrose · 2024-01-02T23:22:36.334Z · LW · GW

because of anthropic arguments, it's meaningless to look at past doom events to compute this proba

I disagree; anthropics is pretty normal (https://www.lesswrong.com/posts/uAqs5Q3aGEen3nKeX/anthropics-is-pretty-normal)

Comment by Nora Belrose (nora-belrose) on Thoughts on “AI is easy to control” by Pope & Belrose · 2024-01-02T12:41:22.685Z · LW · GW

I don't think it makes sense to "revert to a uniform prior" over {doom, not doom} here. Uniform priors are pretty stupid in general, because they're dependent on how you split up the possibility space. So I prefer to stick fairly close to the probabilities I get from induction over human history, which tell me p(doom from unilateral action) << 50%.

I strongly disagree that AGI is "more dangerous" than nukes; I think this equivocates over different meanings of the term "dangerous," and in general is a pretty unhelpful comparison.

I find foom pretty ludicrous, and I don't see a reason to privilege the hypothesis much.

From the linked report:

My best guess is that we go from AGI (AI that can perform ~100% of cognitive tasks as well as a human professional) to superintelligence (AI that very significantly surpasses humans at ~100% of cognitive tasks) in less than a year.

I just agree with this (if "significantly" means like 5x or something), but I wouldn't call it "foom" in the relevant sense. It just seems orthogonal to the whole foom discussion.

Comment by Nora Belrose (nora-belrose) on Thoughts on “AI is easy to control” by Pope & Belrose · 2024-01-02T01:50:23.104Z · LW · GW

Our 1% doom number excludes misuse-flavored failure modes, so I considered it out of scope for my response. I think the fact that good humans have been able to keep rogue bad humans more or less under control for millennia is strong evidence that good AIs will be able to keep rogue AIs under control. I think the evidence is pretty mixed on whether the so-called offense-defense balance will be skewed toward offense or defense— I weakly expect defense will be favored, mainly through centralization-of-power effects.