Posts

The Worst Form Of Government (Except For Everything Else We've Tried) 2024-03-17T18:11:38.374Z
The Parable Of The Fallen Pendulum - Part 2 2024-03-12T21:41:30.180Z
The Parable Of The Fallen Pendulum - Part 1 2024-03-01T00:25:00.111Z
Leading The Parade 2024-01-31T22:39:56.499Z
A Shutdown Problem Proposal 2024-01-21T18:12:48.664Z
Some Vacation Photos 2024-01-04T17:15:01.187Z
Apologizing is a Core Rationalist Skill 2024-01-02T17:47:35.950Z
The Plan - 2023 Version 2023-12-29T23:34:19.651Z
Natural Latents: The Math 2023-12-27T19:03:01.923Z
Talk: "AI Would Be A Lot Less Alarming If We Understood Agents" 2023-12-17T23:46:32.814Z
Principles For Product Liability (With Application To AI) 2023-12-10T21:27:41.403Z
What I Would Do If I Were Working On AI Governance 2023-12-08T06:43:42.565Z
On Trust 2023-12-06T19:19:07.680Z
Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI" 2023-11-21T17:39:17.828Z
On the lethality of biased human reward ratings 2023-11-17T18:59:02.303Z
Some Rules for an Algebra of Bayes Nets 2023-11-16T23:53:11.650Z
Symbol/Referent Confusions in Language Model Alignment Experiments 2023-10-26T19:49:00.718Z
What's Hard About The Shutdown Problem 2023-10-20T21:13:27.624Z
Trying to understand John Wentworth's research agenda 2023-10-20T00:05:40.929Z
Bids To Defer On Value Judgements 2023-09-29T17:07:25.834Z
Inside Views, Impostor Syndrome, and the Great LARP 2023-09-25T16:08:17.040Z
Atoms to Agents Proto-Lectures 2023-09-22T06:22:05.456Z
What's A "Market"? 2023-08-08T23:29:24.722Z
Yes, It's Subjective, But Why All The Crabs? 2023-07-28T19:35:36.741Z
Alignment Grantmaking is Funding-Limited Right Now 2023-07-19T16:49:08.811Z
Why Not Subagents? 2023-06-22T22:16:55.249Z
Lessons On How To Get Things Right On The First Try 2023-06-19T23:58:09.605Z
Algorithmic Improvement Is Probably Faster Than Scaling Now 2023-06-06T02:57:33.700Z
$500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions 2023-05-16T21:31:35.490Z
The Lightcone Theorem: A Better Foundation For Natural Abstraction? 2023-05-15T02:24:02.038Z
Result Of The Bounty/Contest To Explain Infra-Bayes In The Language Of Game Theory 2023-05-09T16:35:26.751Z
How Many Bits Of Optimization Can One Bit Of Observation Unlock? 2023-04-26T00:26:22.902Z
Why Are Maximum Entropy Distributions So Ubiquitous? 2023-04-05T20:12:57.748Z
Shannon's Surprising Discovery 2023-03-30T20:15:54.065Z
A Primer On Chaos 2023-03-28T18:01:30.702Z
$500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory 2023-03-25T17:29:51.498Z
Why Not Just Outsource Alignment Research To An AI? 2023-03-09T21:49:19.774Z
Why Not Just... Build Weak AI Tools For AI Alignment Research? 2023-03-05T00:12:33.651Z
Scarce Channels and Abstraction Coupling 2023-02-28T23:26:03.539Z
Wentworth and Larsen on buying time 2023-01-09T21:31:24.911Z
The Feeling of Idea Scarcity 2022-12-31T17:34:04.306Z
What do you imagine, when you imagine "taking over the world"? 2022-12-31T01:04:02.370Z
Applied Linear Algebra Lecture Series 2022-12-22T06:57:26.643Z
The "Minimal Latents" Approach to Natural Abstractions 2022-12-20T01:22:25.101Z
Verification Is Not Easier Than Generation In General 2022-12-06T05:20:48.744Z
The Plan - 2022 Update 2022-12-01T20:43:50.516Z
Speculation on Current Opportunities for Unusually High Impact in Global Health 2022-11-11T20:47:03.367Z
Why Aren't There More Schelling Holidays? 2022-10-31T19:31:55.964Z
Plans Are Predictions, Not Optimization Targets 2022-10-20T21:17:07.000Z
How To Make Prediction Markets Useful For Alignment Work 2022-10-18T19:01:01.292Z

Comments

Comment by johnswentworth on On Devin · 2024-03-18T15:50:14.465Z · LW · GW

So that example SWE bench problem from the post:

... is that a prototypical problem from that benchmark? Because if so, that is a hilariously easy benchmark. Like, something could ace that task and still be coding at less than a CS 101 level.

(Though to be clear, people have repeatedly told me that a surprisingly high fraction of applicants for programming jobs can't do fizzbuzz, so even a very low level of competence would still put it above many would-be software engineers.)

Comment by johnswentworth on The Parable Of The Fallen Pendulum - Part 2 · 2024-03-13T20:02:20.783Z · LW · GW

Fixed, thanks.

Comment by johnswentworth on Natural Latents: The Math · 2024-03-08T23:52:09.698Z · LW · GW

Yeah, that's right.

The secret handshake is to start with " is independent of  given " and " is independent of  given ", expressed in this particular form:

... then we immediately see that  for all  such that .

So if there are no zero probabilities, then  for all .

That, in turn, implies that  takes on the same value for all Z, which in turn means that it's equal to .  Thus  and  are independent. Likewise for  and . Finally, we leverage independence of  and  given :

(A similar argument is in the middle of this post, along with a helpful-to-me visual.)

Comment by johnswentworth on Natural Latents: The Math · 2024-03-08T19:08:06.806Z · LW · GW

Roughly speaking, all variables completely independent is the only way to satisfy all the preconditions without zero-ish probabilities.

This is easiest to see if we use a "strong invariance" condition, in which each of the  must mediate between  and . Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure () from any little spatially-localized chunk of the gas (). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn't imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e.  only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.

Comment by johnswentworth on Many arguments for AI x-risk are wrong · 2024-03-05T16:53:24.720Z · LW · GW

I agree with this point as stated, but think the probability is more like 5% than 0.1%

Same.

I do think our chances look not-great overall, but most of my doom-probability is on things which don't look like LLMs scheming.

Also, are you making sure to condition on "scaling up networks, running pretraining + light RLHF produces tranformatively powerful AIs which obsolete humanity"

That's not particularly cruxy for me either way.

Separately, I'm uncertain whether the current traning procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF".

Fair. Insofar as "scaling up networks, running pretraining + RL" does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.

Comment by johnswentworth on Many arguments for AI x-risk are wrong · 2024-03-05T03:57:35.153Z · LW · GW

Solid post!

I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer), and I think this is the best write-up of it I've seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.

Comment by johnswentworth on Increasing IQ is trivial · 2024-03-02T01:24:06.061Z · LW · GW

Yup.

Comment by johnswentworth on Increasing IQ is trivial · 2024-03-02T00:27:08.618Z · LW · GW

Mind sharing a more complete description of the things you tried? Like, the sort of description which one could use to replicate the experiment?

Comment by johnswentworth on The Parable Of The Fallen Pendulum - Part 1 · 2024-03-01T20:19:43.645Z · LW · GW

What was your old job?

Comment by johnswentworth on Counting arguments provide no evidence for AI doom · 2024-02-28T03:41:12.059Z · LW · GW

Did you see the footnote I wrote on this? I give a further argument for it.

Ah yeah, I indeed missed that the first time through. I'd still say I don't buy it, but that's a more complicated discussion, and it is at least a decent argument.

I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I'm open to hearing it.

This is another place where I'd say we don't understand it well enough to give a good formal definition or operationalization yet.

Though I'd note here, and also above w.r.t. search, that "we don't know how to give a good formal definition yet" is very different from "there is no good formal definition" or "the underlying intuitive concept is confused" or "we can't effectively study the concept at all" or "arguments which rely on this concept are necessarily wrong/uninformative". Every scientific field was pre-formal/pre-paradigmatic once.

To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field's ability to correctly diagnose bullshit.

That said, I don't think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.

Man, that is one hell of a bullet to bite. Much kudos for intellectual bravery and chutzpah!

That might be a fun topic for a longer discussion at some point, though not right now.

Comment by johnswentworth on Counting arguments provide no evidence for AI doom · 2024-02-28T02:27:21.246Z · LW · GW

I would like to see a much more rigorous definition of "search" and why search would actually be "compressive" in the relevant sense for NN inductive biases. My current take is something like "a lot of the references to internal search on LW are just incoherent" and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward "search" in ways that are totally benign.

More generally, I'm quite skeptical of the jump from any mechanistic notion of search, and the kind of grabby consequentialism that people tend to be worried about. I suspect there's a double dissociation between these things, where "mechanistic search" is almost always benign, and grabby consequentialism need not be backed by mechanistic search.

Some notes on this:

  • I don't think general-purpose search is sufficiently well-understood yet to give a rigorous mechanistic definition. (Well, unless one just gives a very wrong definition.)
  • Likewise, I don't think we understand either search or NN biases well enough yet to make a formal compression argument. Indeed, that sounds like a roughly-agent-foundations-complete problem.
  • I'm pretty skeptical that internal general-purpose search is compressive in current architectures. (And this is one reason why I expect most AI x-risk to come from importantly-different future architectures.) Low confidence, though.
    • Also, current architectures do have at least some "externalized" general-purpose search capabilities, insofar as they can mimic the "unrolled" search process of a human or group of humans thinking out loud. That general-purpose search process is basically AgentGPT. Notably, it doesn't work very well to date.
  • Insofar as I need a working not-very-formal definition of general-purpose search, I usually use a behavioral definition: a system which can take in a representation of a problem in some fairly-broad class of problems (typically in a ~fixed environment), and solve it.
  • The argument that a system which satisfies that behavioral definition will tend to also have an "explicit search-architecture", in some sense, comes from the recursive nature of problems. E.g. humans solve large novel problems by breaking them into subproblems, and then doing their general-purpose search/problem-solving on the subproblems; that's an explicit search architecture.
  • I definitely agree that grabby consequentialism need not be backed by mechanistic search. More skeptical of the claim mechanistic search is usually benign, at least if by "mechanistic search" we mean general-purpose search (though I'd agree with a version of this which talks about a weaker notion of "search").

Also, one maybe relevant deeper point, since you seem familiar with some of the philosophical literature: IIUC the most popular way philosophers ground semantics is in the role played by some symbol/signal in the evolutionary environment. I view this approach as a sort of placeholder: it's definitely not the "right" way to ground semantics, but philosophy as a field is using it as a stand-in until people work out better models of grounding (regardless of whether the philosophers themselves know that they're doing so). This is potentially relevant to the "representation of a problem" part of general-purpose search.

I'm curious which parts of the Goal Realism section you find "philosophically confused," because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.

(I'll briefly comment on each section, feel free to double-click.)

Against Goal Realism: Huemer... indeed seems confused about all sorts of things, and I wouldn't consider either the "goal realism" or "goal reductionism" picture solid grounds for use of an indifference principle (not sure if we agree on that?). Separately, "reductionism as a general philosophical thesis" does not imply the thing you call "goal reductionism" - for instance one could reduce "goals" to some internal mechanistic thing, rather than thinking about "goals" behaviorally, and that would be just as valid for the general philosophical/scientific project of reductionism. (Not that I necessarily think that's the right way to do it.)

Goal Slots Are Expensive: just because it's "generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules" doesn't mean the end-to-end trained system will turn out non-modular. Biological organisms were trained end-to-end by evolution, yet they ended up very modular.

Inner Goals Would Be Irrelevant: I think the point this section was trying to make is something I'd classify as a pointer problem? I.e. the internal symbolic "goal" does not necessarily neatly correspond to anything in the environment at all. If that was the point, then I'm basically on-board, though I would mention that I'd expect evolution/SGD/cultural evolution/within-lifetime learning/etc to drive the internal symbolic "goal" to roughly match natural structures in the world. (Where "natural structures" cashes out in terms of natural latents, but that's a whole other conversation.)

Goal Realism Is Anti-Darwinian: Fodor obviously is deeply confused, but I think you've misdiagnosed what he's confused about. "The physical world has no room for goals with precise contents" is somewhere between wrong and a nonsequitur, depending on how we interpret the claim. "The problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter" is correct, but very incomplete as a response to Fodor.

Goal Reductionism Is Powerful: While most of this section sounds basically-correct as written, the last few sentences seem to be basically arguing for behaviorism for LLMs. There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.

Comment by johnswentworth on Counting arguments provide no evidence for AI doom · 2024-02-28T01:10:33.365Z · LW · GW

This isn't a proper response to the post, but since I've occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:

  • This post basically-correctly refutes a kinda-mediocre (though relatively-commonly-presented) version of the counting argument.
  • There does exist a version of the counting argument which basically works.
  • The version which works routes through compression and/or singular learning theory.
  • In particular, that version would talk about "goal-slots" (i.e. general-purpose search) showing up for exactly the same reasons that neural networks are able to generalize in the overparameterized regime more generally. In other words, if you take the "counting argument for overfitting" from the post, walk through the standard singular-learning-theory-style response to that story, and then translate that response over to general-purpose search as a specific instance of compression, then you basically get the good version of the counting argument.
  • The "Against Goal Realism" section is a wild mix of basically-correct points and thorough philosophical confusion. I would say the overall point it's making is probably mostly-true of LLMs, false of humans, and most of the arguments are confused enough that they don't provide much direct evidence relevant to either of those.

Pretty decent post overall.

Comment by johnswentworth on Natural Latents: The Math · 2024-02-27T17:42:09.673Z · LW · GW

Edited, thanks.

Comment by johnswentworth on And All the Shoggoths Merely Players · 2024-02-27T04:59:23.753Z · LW · GW

Third, the nontrivial prediction of 20 here is about "compactly describable errors. "Mislabelling a large part of the time (but not most of the time)" is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you'd have a meaningful boost in generalization error, but that doesn't happen. Easy Bayes update against #20. (And if we can't agree on this, I don't see what we can agree on.)

I indeed disagree with that, and I see two levels of mistake here. At the object level, there's a mistake of not thinking through the gears. At the epistemic level, it looks like you're trying to apply the "what would I have expected in advance?" technique of de-biasing, in a way which does not actually work well in practice. (The latter mistake I think is very common among rationalists.)

First, object-level: let's walk through the gears of a mental model here. Model: train a model to predict labels for images, and it will learn a distribution of labels for each image (at least that's how we usually train them). If we relabel 1's as 7's 20% of the time, then the obvious guess is that the model will assign about 20% probability (plus its "real underlying uncertainty", which we'd expect to be small for large fully-trained models) to the label 7 when the digit is in fact a 1.

What does that predict about accuracy? That depends on whether the label we interpret our model as predicting is top-1, or sampled from the predictive distribution. If the former (as is usually used, and IIUC is used in the paper) then this concrete model would predict basically the curves we see in the paper: as noise ramps up, accuracy moves relatively little (especially for large fully-trained models), until the incorrect digit is approximately as probable as the correct digit, as which point accuracy plummets to ~50%. And once the incorrect digit is unambiguously more probable than the incorrect digit, accuracy drops to near-0.

The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit. And thinking through the gears of Yudkowsky's #20, the obvious update is that predictable human-labeller-errors which are not the most probable labels are not super relevant (insofar as we use top-1 sampling, i.e. near-zero temperature) whereas human-labeller-errors which are most probable are a problem in basically the way Yudkowsky is saying. (... insofar as we should update at all from this experiment, which we shouldn't very much.)

Second, epistemic-level: my best guess is that you're ignoring these gears because they're not things whose relevance you would have anticipated in advance, and therefore focusing on them in hindsight risks bias[1]. Which, yes, it does risk bias. 

Unfortunately, the first rule of experiments is You Are Not Measuring What You Think You Are Measuring. Which means that, in practice, the large majority of experiments which nominally attempt to test some model/theory in a not-already-thoroughly-understood-domain end up getting results which are mostly determined by things unrelated to the model/theory. And, again in practice, few-if-any people have the skill of realizing in advance which things will be relevant to the outcome of any given experiment. "Which things are we actually measuring?" is itself usually figured out (if it's figured out at all) by looking at data from the experiment.

Now, this is still compatible with using the "what would I have expected in advance?" technique. But it requires that ~all the time, the thing I expect in advance from any given experiment is "this experiment will mostly measure some random-ass thing which has little to do with the model/theory I'm interested in, and I'll have to dig through the details of the experiment and results to figure out what it measured". If one tries to apply the "what would I have expected in advance?" technique, in a not-thoroughly-understood domain, without an overwhelming prior that the experimental outcome is mostly determined by things other than the model/theory of interest, then mostly one ends up updating in basically-random directions and becoming very confused.

  1. ^

    Standard disclaimer about guessing what's going on inside other peoples' heads being hard, you have more data than I on what's in your head, etc.

Comment by johnswentworth on And All the Shoggoths Merely Players · 2024-02-22T06:32:16.079Z · LW · GW

This one is somewhat more Wentworth-flavored than our previous Doomimirs.

Also, I'll write Doomimir's part unquoted this time, because I want to use quote blocks within it.

On to Doomimir!


We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So ... why doesn't that happen?

Let's start with this.

Short answer: because those aren't actually very effective ways to get high ratings, at least within the current capability regime.

Long version: presumably the labeller knows perfectly well that they're working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that... have you ever personally done an exercise where you try to convince someone to do something they don't want to do, or aren't supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner's job was to not open their fist, while the other partner's job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar - not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.

Point of that story: based on my own firsthand experience, if you're not actually going to pay someone right now, then it's far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.

Ultimately, our discussion is using "threats and bribes" as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.

Now, you could reasonably respond: "Isn't it kinda fishy that the supposed failures on which your claim rests are 'illegible'?"

To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:

The iterative design loop hasn't failed yet.

Now that's a very interesting claim. I ask: what do you think you know, and how do you think you know it?

Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the "observe/orient" steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don't necessarily know that it's failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don't know a problem is there, or don't orient toward the right thing, and therefore we don't iterate on the problem.

What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:

  • There is some widely-believed falsehood. The generative model might "know" the truth, from having trained on plenty of papers by actual experts, but the raters don't know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/Deepmind/Anthropic do not employ experts in most of the world's subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
  • There is some even-more-widely-believed falsehood, such that even the so-called "experts" haven't figured out yet that it's false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
  • Neither raters nor developers have time to check the models' citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
  • On various kinds of "which option should I pick" questions, there's an option which results in marginally more slave labor, or factory farming, or what have you - terrible things which a user might strongly prefer to avoid, but it's extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don't reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
    • This is the sort of problem which, in the high-capability regime, especially leads to "Potemkin village world".
  • On various kinds of "which option should I pick" questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don't even attempt to test which options are best long-term, they just read the LLM's response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
  • On various kinds of "which option should I pick" questions, there are things which sound fun or are marketed as fun, but which humans mostly don't actually enjoy (or don't enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)

... and so forth. The unifying theme here is that when these failures occur, it is not obvious that they've occurred.

This makes empirical study tricky - not impossible, but it's easy to be mislead by experimental procedures which don't actually measure the relevant things. For instance, your summary of the Stiennon et al paper just now:

They varied the size of the KL penalty of an LLM RLHF'd for a summarization task, and found about what you'd expect from the vague handwaving: as the KL penalty decreases, the reward model's predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve...

(Bolding mine.) As you say, one could spin that as demonstrating "yet another portent of our impending deaths", but really this paper just isn't measuring the most relevant things in the first place. It's still using human ratings as the evaluation mechanism, so it's not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.

So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?

(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)

.

.

.

So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that's "better than" the labels used for training.

OpenAI's weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a "ground truth" which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven't put that much care into it.)

More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most "coming decoupled from reality" problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That's what tends to happen, in fields where people don't have a deep understanding of the systems they're working with.

And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that's just the experimentalist themselves - this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists' interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure - John Boyd, the guy who introduced the term "OODA loop", talked a lot about that sort of failure.)

To summarize the main points here:

  • Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (... and then things are hard.)
  • AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
  • So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.

Now, I would guess that your next question is: "But how does that lead to extinction?". That is one of the steps which has been least well-explained historically; someone with your "unexpectedly low polygenic scores" can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you... <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?

But I'll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.

Comment by johnswentworth on Monthly Roundup #15: February 2024 · 2024-02-21T05:15:47.633Z · LW · GW

Is the Waldo picture at the end supposed to be Holden, or is that accidental?

Comment by johnswentworth on And All the Shoggoths Merely Players · 2024-02-20T05:46:23.919Z · LW · GW

The linked abstract describes how

[good generalization] holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes.

Reading their experimental procedure and looking at Figures 4 & 5, it looks like their experiments confirm the general story of lethality #20, not disprove it.

The relevant particulars: when they used biased noise, they still ensured that the correct label was the most probable label. Their upper-limit for biased noise made the second-most-probable label equal in probability to the correct one, and in that case the predictor's generalization accuracy plummeted from near-90% (when the correct label was only slightly more probable than the next-most-probable) to only ~50%.

How this relates to lethality #20: part of what "regular, compactly describable, predictable errors" is saying is that there will be (predictable) cases where the label most probably assigned by a human labeller is not correct (i.e. it's not what a smart well-informed human would actually want if they had all the relevant info and reflected on it). What the results of the linked paper predict, in that case, is that the net will learn to assign the "incorrect" label - the one which human labellers do, in fact, choose more often than any other. (Though, to be clear, I think this experiment is not very highly relevant one way or the other.)

As for OpenAI's weak-to-strong results...

I had some back-and-forth about those in a private chat shortly after they came out, and the main thing I remember is that it was pretty tricky to back out the actually-relevant numbers, but it was possible. Going back to the chat log just now, this is the relevant part of my notes:

Rough estimate: on the NLP task the weak model has like 60% accuracy (fig 2).

  • In cases where the weak model is right, the strong student agrees with it in like 90% of cases (fig 8b). So, on ~6% of cases (10% * 60%), the strong student is wrong by "just being dumb".
  • In cases where the weak model is wrong, the strong student's agreement is very compute-dependent, but let's pick a middle number and call it 70% (fig 8c). So, on ~28% of cases (70% * 40%), the strong student is wrong by "overfitting to weak supervision".

So in this particular case, the strong student is wrong about 34% of the time, and 28 of those percentage points are attributable to overfitting to weak supervision.

(Here "overfitting to weak supervision" is the thing where the weak supervisor is predictably wrong, and the stronger model learns to predict those errors.) So in fact what we're seeing in the weak-to-strong paper is that the strong model learning the weak supervisor's errors is already the main bottleneck to better ground-truth performance, in the regime that task and models were in.

So overall, I definitely maintain that the empirical evidence is solidly in favor of Doomimir's story here. (And, separately, I definitely maintain that abstracts in ML tend to be wildly unreliable and misleading about the actual experimental results.)

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-20T01:23:37.638Z · LW · GW

Yes, I am familiar with Levin's work.

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-19T19:03:07.431Z · LW · GW

My hope here would be that a few upstream developmental signals can trigger the matrix softening, re-formation of the chemotactic signal gradient, and whatever other unknown factors are needed, all at once.

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-19T17:01:16.174Z · LW · GW

Fleshing this out a bit more: insofar as development is synchronized in an organism, there usually has to be some high-level signal to trigger the synchronized transitions. Given the scale over which the signal needs to apply (i.e. across the whole brain in this case), it probably has to be one or a few small molecules which diffuse in the extracellular space. As I'm looking into possibilities here, one of my main threads is to look into both general and brain-specific developmental signal molecules in human childhood, to find candidates for the relevant molecular signals.

(One major alternative model I'm currently tracking is that the brain grows to fill the brain vault, and then stops growing. That could in-principle mechanistically work via cells picking up on local physical forces, rather than a small molecule signal. Though I don't think that's the most likely possibility, it would be convenient, since it would mean that just expanding the skull could induce basically-normal new brain growth by itself.)

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-19T06:13:25.260Z · LW · GW

Any particular readings you'd recommend?

Comment by johnswentworth on And All the Shoggoths Merely Players · 2024-02-18T01:28:12.157Z · LW · GW

Doomimir: I'll summarize the story you seem excited about as follows:

  • We train a predictive model on The Whole Internet, so it's really good at predicting text from that distribution.
  • The human end-users don't really want a predictive model. They want a system which can take a natural-language request, and then do what's requested. So, the humans slap a little RL (specifically RLHF) on the predictive model, to get the "request -> do what's requested" behavior.
  • The predictive model serves as a strong baseline for the RL'd system, so the RL system can "only move away from it a little" in some vague handwavy sense. (Also in the KL divergence sense, which I will admit as non-handwavy for exactly those parts of your argument which you can actually mathematically derive from KL-divergence bounds, which is currently zero of the parts of your argument.)
  • The "only move away from The Internet Distribution a little bit" part somehow makes it much less likely that the RL'd model will predict and exploit the simple predictable ways in which humans rate things. As opposed to, say, make it more likely that the RL'd model will predict and exploit the simple predictable ways in which humans rate things.

There's multiple problems in this story.

First, there's the end-users demanding a more agenty system rather than a predictor, which is why people are doing RLHF in the first place rather than raw prompting (which would be better from a safety perspective). Given time, that same demand will drive developers to make models agentic in other ways too (think AgentGPT), or to make the RLHF'd LLMs more agentic and autonomous in their own right. That's not the current center of our discussion, but it's worth a reminder that it's the underlying demand which drives developers to choose more risky methods (like RLHF) over less risky methods (like raw predictive models) in the first place.

Second, there's the vague handwavy metaphor about the RL system "only moving away from the predictive model a little bit". The thing is, we do need more than a handwavy metaphor! "Yes, we don't understand at the level of math how making that KL-divergence small will actually impact anything we actually care about, but my intuition says it's definitely not going to kill everyone. No, I haven't been able to convince relevant experts outside of companies whose giant piles of money are contingent on releasing new AI products regularly, but that's because they're not releasing products and therefore don't have firsthand experience of how these systems behave. No, I'm not willing to subject AI products to a burden-of-proof before they induce a giant disaster" is a non-starter even if it turns out to be true.

Third and most centrally to the current discussion, there's still the same basic problem as earlier: to a system with priors instilled by The Internet, ["I'll give you $100 if you classify this as an apple" -> (predict apple classification)] is still a simple thing to learn. It's not like pretraining on the internet is going to make the system favor models which don't exploit the highly predictable errors made by human raters. If anything, all that pretraining will make it easier for the model to exploit raters. (And indeed, IIUC that's basically what we see in practice.)

As you say: the fact that GPT-4 can do that seems like it's because that kind of reasoning appears on the internet.

(This one's not as well-written IMO, it's mashing a few different things together.)

Comment by johnswentworth on And All the Shoggoths Merely Players · 2024-02-18T00:31:45.100Z · LW · GW

I'd be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes.

See here. I haven't dug into it much, but it does talk about the same general issues specifically in the context of RLHF'd LLMs, not just pure-RL-trained models.

(I'll get around to another Doomimir response later, just dropping that link for now.)

Comment by johnswentworth on And All the Shoggoths Merely Players · 2024-02-17T17:26:10.097Z · LW · GW

Zeroth point: under a Doomimir-ish view, the "modelling the human vs modelling in a similar way to the human" frame is basically right for current purposes, so no frame clash.

On to the main response...

Doomimir: This isn't just an "in the limit" argument. "I'll give you $100 if you classify this as an apple" -> (predict apple classification) is not some incredibly high-complexity thing to figure out. This isn't a jupiter-brain sort of challenge.

For instance, anything with a simplicity prior at all similar to humans' simplicity prior will obviously figure it out, as evidenced by the fact that humans can figure out hypotheses like "it's bribing the classifier" just fine. Even beyond human-like priors, any ML system which couldn't figure out something that basic would apparently be severely inferior to humans in at least one very practically-important cognitive domain.

Even prior to developing a full-blown model of the human rater, models can incrementally learn to predict the systematic errors in human ratings, and we can already see that today. The classic case of the grabber hand is a go-to example:

(A net learned to hold the hand in front of the ball, so that it looks to a human observer like the ball is being grasped. Yes, this actually happened.)

... and anecdotally, I've generally heard from people who've worked with RLHF that as models scale up, they do in fact exploit rater mistakes more and more, and it gets trickier to get them to do what we actually want. This business about "The technology in front of us really does seem like it's 'reasoning with' rather than 'reasoning about'" is empirically basically false, and seems to get more false in practice as models get stronger even within the current relatively-primitive ML regime.

So no, this isn't a "complicated empirical question" (or a complicated theoretical question). The people saying "it's a complicated empirical question, we Just Can't Know" are achieving their apparent Just Not Knowing by sticking their heads in the sand; their lack of knowledge is a fact about them, not a fact about the available evidence.

(I'll flag here that I'm channeling the character of Doomimir and would not necessarily say all of these things myself, especially the harsh parts. Happy to play out another few rounds of this, if you want.)

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-17T06:10:36.893Z · LW · GW

Ever since GeneSmith's post and some discussion downstream of it, I've started actively tracking potential methods for large interventions to increase adult IQ.

One obvious approach is "just make the brain bigger" via some hormonal treatment (like growth hormone or something). Major problem that runs into: the skull plates fuse during development, so the cranial vault can't expand much; in an adult, the brain just doesn't have much room to grow.

BUT this evening I learned a very interesting fact: ~1/2000 infants have "craniosynostosis", a condition in which their plates fuse early. The main treatments involve surgery to open those plates back up and/or remodel the skull. Which means surgeons already have a surprisingly huge amount of experience making the cranial vault larger after plates have fused (including sometimes in adults, though this type of surgery is most common in infants AFAICT)

.... which makes me think that cranial vault remodelling followed by a course of hormones for growth (ideally targeting brain growth specifically) is actually very doable with current technology.

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-17T00:33:26.491Z · LW · GW

Were these supposed to embed as videos? I just see stills, and don't know where they came from.

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-16T17:47:27.533Z · LW · GW

Yeah, those were exactly the two videos which most made me think that the model was mostly trained on video game animation. In the tokyo one, the woman's facial muscles never move at all, even when the camera zooms in on her. And in the SUV one, the dust cloud isn't realistic, but even covering that up the SUV has a Grand Theft Auto look to its motion.

"Can't do both complex motion and photorealism in the same video" is a good hypothesis to track, thanks for putting that one on my radar.

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-16T16:56:49.431Z · LW · GW

I keep seeing news outlets and the like say that SORA generates photorealistic videos, can model how things move in the real world, etc. This seems like blatant horseshit? Every single example I've seen looks like video game animation, not real-world video.

Have I just not seen the right examples, or is the hype in fact decoupled somewhat from the model's outputs?

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-16T00:26:06.777Z · LW · GW

Well, it wasn't just a temporary bump:

... so it's presumably also not just the result of pandemic giveaway fraud, unless that fraud is ongoing.

Presumably the thing to check here would be TFP, but Fred's US TFP series currently only goes to end of 2019, so apparently we're still waiting on that one? Either that or I'm looking at the wrong series.

Comment by johnswentworth on And All the Shoggoths Merely Players · 2024-02-15T17:21:47.841Z · LW · GW

But I'm not sure how to reconcile that with the empirical evidence that deep networks are robust to massive label noise: you can train on MNIST digits with twenty noisy labels for every correct one and still get good performance as long as the correct label is slightly more common than the most common wrong label. If I extrapolate that to the frontier AIs of tomorrow, why doesn't that predict that biased human reward ratings should result in a small performance reduction, rather than ... death?

The conversation didn't quite get to Doomimir actually answering this part, but I'd consider the standard answer to be item #20 on Eliezer's List O'Doom:

20.  Human operators are fallible, breakable, and manipulable.  Human raters make systematic errors - regular, compactly describable, predictable errors.  To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).  If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.  It's a fact about the territory, not the map - about the environment, not the optimizer - that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.

... and yeah, there are definitely nonzero empirical results on that.

Comment by johnswentworth on Disrupting malicious uses of AI by state-affiliated threat actors · 2024-02-14T21:42:46.511Z · LW · GW

Soooo... they caught and disrupted use by "state-affiliated threat actors" associated with a bunch of countries at odds with the US, but not any of the US' allies?

What an interesting coincidence.

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-14T18:02:47.591Z · LW · GW

Don't really need comments which are non-obvious to an expert. Part of what makes LLMs well-suited to building external cognitive tools is that external cognitive tools can create value by just tracking "obvious" things, thereby freeing up the user's attention/working memory for other things.

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-14T17:12:05.939Z · LW · GW

I haven't experimented very much, but here's one example prompt.

Please describe what you mentally picture when reading the following block of text:

"
A Shutdown Problem Proposal

First things first: this is not (yet) aimed at solving the whole corrigibility problem, or even the whole shutdown problem.

The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:

Does not want to manipulate the shutdown button
Does respond to the shutdown button
Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.
"

This one produced basically-decent results from GPT-4.

Although I don't have the exact prompt on hand at the moment, I've also asked GPT-4 to annotate a piece of code line-by-line with a Fermi estimate of its runtime, which worked pretty well.

Comment by johnswentworth on johnswentworth's Shortform · 2024-02-13T23:01:12.945Z · LW · GW

Here's an AI-driven external cognitive tool I'd like to see someone build, so I could use it.

This would be a software tool, and the user interface would have two columns. In one column, I write. Could be natural language (like google docs), or code (like a normal IDE), or latex (like overleaf), depending on what use-case the tool-designer wants to focus on. In the other column, a language and/or image model provides local annotations for each block of text. For instance, the LM's annotations might be:

  • (Natural language or math use-case:) Explanation or visualization of a mental picture generated by the main text at each paragraph
  • (Natural language use-case:) Emotional valence at each paragraph
  • (Natural language or math use-case:) Some potential objections tracked at each paragraph
  • (Code:) Fermi estimates of runtime and/or memory usage

This is the sort of stuff I need to track mentally in order to write high-quality posts/code/math, so it would potentially be very high value to externalize that cognition.

Also, the same product could potentially be made visible to readers (for the natural language/math use-cases) to make more visible the things the author intends to be mentally tracked. That, in turn, would potentially make it a lot easier for readers to follow e.g. complicated math.

Comment by johnswentworth on Natural abstractions are observer-dependent: a conversation with John Wentworth · 2024-02-13T19:23:13.321Z · LW · GW

Consider the consensus genome of some species of tree.

Long before we were able to sequence that genome, we were able to deduce that something-like-it existed. Something had to be carrying whatever information made these trees so similar (inter-breedable). Eventually people isolated DNA as the relevant information-carrier, but even then it was most of a century before we knew the sequence.

That sequence is a natural latent: most of the members of the tree's species are ~independent of each other given the sequence (and some general background info about our world), and the sequence can be estimated pretty well from ~any moderate-sized sample of the trees. Furthermore, we could deduce the existence of that natural latent long before we knew the sequence.

Point of this example: there's a distinction between realizing a certain natural latent variable exists, and knowing the value of that variable. To pick a simpler example: it's the difference between realizing that (P, V, T) mediate between the state of a gas at one time and its state at a later time, vs actually measuring the values of pressure, volume and temperature for the gas.

Comment by johnswentworth on Leading The Parade · 2024-02-13T17:54:56.245Z · LW · GW

Yeah, that sure does seem like a way it should be possible to have a lot of counterfactual impact. I'd be curious for any historical examples of people doing that both successfully and intentionally.

Comment by johnswentworth on Dreams of AI alignment: The danger of suggestive names · 2024-02-12T19:09:48.043Z · LW · GW

My prescription for the problem you're highlighting here: track a prototypical example.

Trying to unpack e.g. "optimization pressure" into a good definition - even an informal definition - is hard. Most people who attempt to do that will get it wrong, in the sense that their proffered definition will not match their own intuitive usage of the term (even in cases where their own intuitive usage is coherent), or their implied usage in an argument. But their underlying intuitions about "optimization pressure" are often still be correct, even if those intuitions are not yet legible. Definitions, though an obvious strategy, are not a very good one for distinguish coherent-word-usage from incoherent-word-usage.

(Note that OP's strategy of unpacking words into longer phrases is basically equivalent to using definitions, for our purposes.)

So: how can we track coherence of word usage, practically?

Well, we can check coherence constructively, i.e. by exhibiting an example which matches the usage. If someone is giving an argument involving "optimization pressure", I can ask for a prototypical example, and then walk through the argument in the context of the example to make sure that it's coherent.

For instance, maybe someone says "unbounded optimization wants to take over the world". I ask for an example. They say "Well, suppose we have a powerful AI which wants to make as many left shoes as possible. If there's some nontrivial pool of resources in the universe which it hasn't turned toward shoe-making, then it could presumably make more shoes by turning those resources toward shoe-making, so it will seek to do so.". And then I can easily pick out a part of the example to drill in on - e.g. maybe I want to focus on the implicit "all else equal" and argue that all else will not be equal, or maybe I want to drill into the "AI which wants" part and express skepticism about whether anything like current AI will "want" things in the relevant way (in which case I'd probably ask for an example of an AI which might want to make as many left shoes as possible), or [whatever else].

The key thing to notice is that I can easily express those counterarguments in the context of the example, and it will be relatively clear to both myself and my conversational partner what I'm saying. Contrast that to e.g. just trying to use very long phrases in place of "unbounded optimization" or "wants", which makes everything very hard to follow.

In one-way "conversation" (e.g. if I'm reading a post), I'd track such a prototypical example in my head. (Well, really a few prototypical examples, but 80% of the value comes from having any.) Then I can relatively-easily tell when the argument given falls apart for my example.

Comment by johnswentworth on Dreams of AI alignment: The danger of suggestive names · 2024-02-12T18:45:56.785Z · LW · GW

There's a difficult problem here.

Personally, when I see someone using the sorts of terms Turner is complaining about, I mentally flag it (and sometimes verbally flag it, saying something like "Not sure if it's relevant yet, but I want to flag that we're using <phrase> loosely here, we might have to come back to that later"). Then I mentally track both my optimistic-guess at what the person is saying, and the thing I would mean if I used the same words internally. If and when one of those mental pictures throws an error in the person's argument, I'll verbally express confusion and unroll the stack.

A major problem with this strategy is that it taxes working memory heavily. If I'm tired, I basically can't do it. I would guess that people with less baseline working memory to spare just wouldn't be able to do it at all, typically. Skill can help somewhat: it helps to be familiar with an argument already, it helps to have the general-purpose skill of keeping at least one concrete example in one's head, it helps to ask for examples... but even with the skills, working memory is a pretty important limiting factor.

So if I'm unable to do the first-best thing at the moment, what should I fall back on? In practice I just don't do a very good job following arguments when tired, but if I were optimizing for that... I'd probably fall back on asking for a concrete example every time someone uses one of the words Turner is complaining about. Wording would be something like "Ok pause, people use 'optimizer' to mean different things, can you please give a prototypical example of the sort of thing you mean so I know what we're talking about?".

... and of course when reading something, even that strategy is a pain in the ass, because I have to e.g. leave a comment asking for clarification and then the turn time is very slow.

Comment by johnswentworth on Predicting Alignment Award Winners Using ChatGPT 4 · 2024-02-08T17:04:11.082Z · LW · GW

Possibly relevant to your results: I was one of the people who judged the Alignment Award competition, and if I remember correctly the Shutdown winner (roughly this post) was head-and-shoulders better than any other submission in any category. So it's not too surprising that GPT had a harder time predicting the Goal Misgeneralization winner; there wasn't as clear a winner in that category.

Comment by johnswentworth on A Shutdown Problem Proposal · 2024-02-07T16:53:19.214Z · LW · GW

Sure. Let's adopt the "petrol/electric cars" thing from Holtman's paper. In timestep 0, the agent has a choice: either create a machine which will create one petrol car every timestep indefinitely, or create a machine which will create one petrol car every timestep until the button is pressed and then switch to electric. The agent does not have any choices after that; its only choice is which successor agent to create at the start.

The utility functions are the same as in Holtman's paper.

My main claim is that the  agent is not indifferent between the two actions; it will actively prefer the one which ignores the button. I expect this also extends to the  agent, but am less confident in that claim.

Comment by johnswentworth on A Shutdown Problem Proposal · 2024-02-03T19:43:30.206Z · LW · GW

The  agent is indifferent between creating stoppable or unstoppable subagents, but the  agent goes back to being corrigible in this way.

I think this is wrong? The  agent actively prefers to create shutdown-resistant agents (before the button is pressed), it is not indifferent.

Intuitive reasoning: prior to button-press, that agent acts-as-though it's an  maximizer and expects to continue being an  maximizer indefinitely. If it creates a successor which will shut down when the button is pressed, then it will typically expect that successor to perform worse under  after the button is pressed than some other successor which does not shut down and instead just keeps optimizing .

Either I'm missing something very major in the definitions, or that argument works and therefore the agent will typically (prior to button-press) prefer successors which don't shut down.

Comment by johnswentworth on A Shutdown Problem Proposal · 2024-02-03T18:58:06.255Z · LW · GW

Holtman has done some impressive work on formalizing the shutdown problem better than Sores et al., or this post we're commenting on. He's given not only rigorous mathematical proofs, but also a nice toy universe simulation which makes the results concrete and testable.

Part of what's feeding into my skepticism here is that I think Holtman's formalization is substantially worse than the 2015 MIRI paper. It's adding unnecessary complexity - e.g. lots of timesteps, which in turn introduces the need for dynamic programming, which in turn requires all the proofs to work through recursive definitions - in a way which does not add any important mechanisms for making corrigibility work or clarify any subproblem. (Also, he's using MDPs, which implicitly means everything is observable at every step - a very big unrealistic assumption!) Sure, the whole thing is wrapped in more formalism, but it's unhelpful formalism which mostly makes it easier for problems to go unnoticed.

As far as I can tell from what I've read so far, he's doing qualitatively the same things the 2015 MIRI paper did, but in a setting which makes the failure modes less clear, and he's communicated it all less understandably.

I don't particularly want to spend a day or two cleaning it all up and simplifying and distilling it back down to the point where the problems (which I strongly expect exist) are obvious. If you're enthusiastic about this, then maybe try to distill it yourself? Like, figure out the core intuitive ideas of the proofs, and present those directly in the simplest-possible setup (maybe two timesteps, maybe not, whatever's simple).

Just as one example of the sort of simplification I have in mind: the definition of  makes it so that, before button-press, the agent acts like it's an  maximizer and expects to continue being an  maximizer indefinitely. After button-press, the agent acts like it's an  maximizer and expects to continue being an  maximizer indefinitely. But it required tens of minutes chasing definitions around in order to see this very intuitive and key fact. One could just as easily define the agent in a way which made that fact obvious right from the get-go.

Ideally, one would also find a similarly-clear expression for what  does, which does not require tens of minutes or more of definition-chasing in order to intuitively understand what it's doing.

Comment by johnswentworth on A Shutdown Problem Proposal · 2024-02-03T18:28:11.901Z · LW · GW

Yup, the full agent (including the two subagents) is happy to copy itself. That is a feature, not a bug. The important thing is that the copy also respects the shutdown signal.

The agent will not manipulate the signal which relays the button state except to make it more accurate, because both subagents want an accurate reading of the button state. This is because both subagents act-as-though the button itself will definitely be in their favor, and therefore any possibility of an inaccurate reading is bad for them.

Comment by johnswentworth on A Shutdown Problem Proposal · 2024-02-03T05:04:28.965Z · LW · GW

Hadn't seen that before. Based on an hour or so of sorting through it, I think it basically doesn't work.

The most legible problem I've been able to identify so far is that it will create shutdown-resistant successor agents (despite Holtman's claims to the contrary). The problem here is that, prior to the shutdown button being pressed, the agent acts-as-though it's just optimizing  and expects to continue optimizing  indefinitely; it has no incentive to maintain the option of switching to  later, because at pre-press timesteps  cancels out everything to do with  and just makes the value function mimic the -maximizing value function. And an -maximizer has no reason to make its successor agents shutdownable.

(Holtman claims that section 10 proves that some "emergent incentive" prevents this. I haven't sorted through the notational mess enough to figure out where that proof goes wrong, but I pretty strongly expect there's a mistake there, with my modal guess being that he forgot to account for the term in  which cancels out all the  contributions to . My second-most-probable guess is that  is doing something weird to make the section 10 proof work, but then it breaks things in the earlier sections where  is used in place of .)

Also, two complaints at a meta level. First, this paper generally operates by repeatedly slapping on not-very-generalizable-looking patches, every time it finds a new failure mode. This is not how one ends up with robustly-generalizable solutions; this is how one ends up with solutions which predictably fail under previously-unconsidered conditions. Second, after coming up with some interesting proof, it is important to go back and distill out the core intuitive story that makes it work; the paper doesn't really do that. Typically, various errors and oversights and shortcomings of formulation become obvious once one does that distillation step.

Comment by johnswentworth on What Failure Looks Like is not an existential risk (and alignment is not the solution) · 2024-02-02T21:13:16.912Z · LW · GW

This is a topic I'd like to see discussed more on current margins, especially for the "Get What We Measure" scenario. But I think this particular post didn't really engage with what I'd consider the central parts of a model under which Getting What We Measure leaves humanity extinct. Even if I condition on each AI being far from takeover by itself, each AI running only a small part of the world, AIs having competing goals and not coordinating well, and AIs going into production at different times... that all seems to have near-zero relevance to what I'd consider the central Getting What We Measure extinction story.

The story I'd consider central goes roughly like:

  • Species don't usually go extinct because they're intentionally hunted to extinction; they go extinct because some much more powerful species (i.e. humans) is doing big things near by which shifts the environment enough that the weaker species dies. That extends to human extinction from AI. (This post opens with a similar idea.)
  • In the Getting What We Measure scenario, humans gradually lose de-facto control/power (but retain the surface-level appearance of some control), until they have zero de-facto ability to prevent their own extinction as AI goes about doing other things.

Picking on one particular line from the post (just as one example where the post diverges from the "central story" above without justification):

If a company run by an AI will use too much oxygen (such as occurs in WFLL), there will be a public outcry, lawsuits, political pressure, and plenty of time to regulate.

This only applies very early on in a Getting What We Measure scenario, before humanity loses de-facto control over media and governance. Later on, matters of oxygen consumption are basically-entirely decided by negotiation between AIs with their own disparate goals, and what humans have to say on the matter is only relevant insofar as any of the AIs care at all what the humans have to say.

Comment by johnswentworth on Leading The Parade · 2024-02-01T19:02:16.936Z · LW · GW

+1, "gain a bunch of skill and then use it to counterfactually influence things" seems very sensible. If the plan is to gain a bunch of skill by leading a parade, then I'm somewhat more skeptical about whether that's really the best strategy for skill gain, but I could imagine situations where that's plausible.

Comment by johnswentworth on Leading The Parade · 2024-02-01T18:52:19.157Z · LW · GW

Here's an example use-case where it matters a lot.

Suppose I'm trying to do a typical Progress Studies thing; I look at a bunch of historical examples of major discoveries and inventions, in hopes of finding useful patterns to emulate. For purposes of my data-gathering, I want to do the sort of credit assignment the post talks about: I want to filter out the parade-leaders, because I expect they're mostly dominated by noise.

I do think you're mostly-right for purposes of a researcher rationally optimizing for their own status. But that's a quite different use-case from an external observer trying to understand what factors drive progress, or even a researcher trying to understand key factors in order to boost their own work.

Comment by johnswentworth on Leading The Parade · 2024-02-01T00:57:35.318Z · LW · GW

The obvious but mediocre way to assign credit in the Newton-Leibniz case would be to give them each half credit.

The better but complicated version of that would be to estimate how many people would have counterfactually figured out calculus (within some timeframe) and then divide credit by that number. Which would still potentially give a fair bit of credit to Newton and Leibniz, since calculus was very impactful.

That sounds sensible, though admittedly complicated.

Comment by johnswentworth on Making every researcher seek grants is a broken model · 2024-01-30T02:14:14.798Z · LW · GW

This is how some of the most effective, transformative labs in the world have been organized, from Bell Labs to the MRC Laboratory of Molecular Biology.

I have a post currently in my drafts pile titled "Leading The Parade". Basic idea is that lots of supposedly-successful scientists, politicians, organizations, etc, were mostly "leading the parade" - walking somewhat in front of everyone else, but not actually counterfactually impacting the trajectory of progress much. It's the opposite of counterfactual impact.

On the other hand, at least some scientists etc were counterfactually impactful. But "how much people talk about how great they were" is not a very reliable proxy for how counterfactually impactful people/organizations were; we need to look at some specific kinds of historical evidence which people usually ignore.

I use two cases from Bell Labs as central examples:

When asking whether a historical scientist or inventor had much counterfactual impact or was just leading the parade, one main type of evidence to look for is simultaneous discovery. If multiple people independently made the discovery around roughly the same time, then counterfactual impact is clearly low; the discovery would have been made even without the parade-leader. On the other hand, if nobody else was even close to figuring it out, then that’s evidence of counterfactual impact.

For example, let’s consider two big breakthroughs made at Bell Labs in the late 1940’s: the transistor, and information theory.

In the case of the transistor, we don’t have evidence quite as clear-cut as simultaneous discovery, but we have evidence almost that clear-cut. First, the people who developed the transistor at Bell Labs (notably Bardeen, Brattain and Shockley) were themselves quite worried that semiconductor researchers elsewhere (e.g. Purdue) would beat them to the punch. So they themselves apparently did not expect their impact to be highly counterfactual (other than to who got the patent).

Second, there were actually two transistor inventions at Bell Labs. Bardeen and Brattain figured out their transistor design first. Shockley, unhappy at being scooped, came along and figured out a totally different design within a few months. It’s not quite simultaneous independent invention, but clearly the problem was relatively tractable and plenty of people were working on it.

So, the inventors of the transistor were mostly “leading the parade”; they didn’t have much counterfactual impact on the technology’s discovery.

Shannon’s invention of information theory is on the other end of the spectrum.

Ask an early twentieth century communications engineer, and they’d probably tell you that different kinds of signals need different hardware. Sure, you could send Morse over a telephone line, but it would be wildly inefficient. To suggest that any signal can be sent with basically-the-same throughput over a given transmission channel would sound crazy to such an engineer; it wasn’t even in the space of things which most technical people considered.

So when Shannon showed up with information theory, ~nobody saw it coming at all. There was (as far as I know) no simultaneous discovery, nor anyone even close to figuring out the core ideas (i.e. fungibility of channel capacity). Shannon was probably not leading the parade; his impact was highly counterfactual.

Summary and upshot of all this:

  • It's pretty straightforward to read histories in order to look for counterfactual impact
  • ... but hardly anyone actually does that.
  • Standard narratives about which people/organizations were "highly impactful" seem pretty poorly correlated with counterfactual impact.
  • So I expect there's a bunch of low-hanging intellectual fruit here, to be picked by going back through histories of lots of people/organizations and actually looking for evidence for/against counterfactual impact (like e.g. simultaneous discovery), and then compiling new lists of famous people/orgs which were/weren't counterfactually impactful.
  • Absent that sort of project, I see claims like "This is how some of the most effective, transformative labs in the world have been organized, from Bell Labs to the MRC Laboratory of Molecular Biology", and I'm like... ummm, no, I'm not buying that these are some of the most effective, transformative labs in the world.
Comment by johnswentworth on A Shutdown Problem Proposal · 2024-01-25T19:45:27.970Z · LW · GW

Re: (2), that depends heavily on how the "shutdown utility function" handles those numbers. An "invest" action which costs 1 utility for subagent 1, and yields 10 utility for subagent 1 in each subsequent step, may have totally unrelated utilities for subagent 2. The subagents have different utility functions, and we don't have many constraints on the relationship between them.

Re: (1), yup, agreed.