Nitpick: you're talking about the discovery of the structure of DNA; it was already known at that time to be the particle which mediates inheritance IIRC.
I buy this argument.
I buy this argument.
I don't buy mathematical equivalence as an argument against, in this case, since the whole point of the path integral formulation is that it's mathematically equivalent but far simpler conceptually and computationally.
Man, that top one was a mess. Fixed now, thank you!
Here are some candidates from Claude and Gemini (Claude Opus seemed considerably better than Gemini Pro for this task). Unfortunately they are quite unreliable: I've already removed many examples from this list which I already knew to have multiple independent discoverers (like e.g. CRISPR and general relativity). If you're familiar with the history of any of these enough to say that they clearly were/weren't very counterfactual, please leave a comment.
- Noether's Theorem
- Mendel's Laws of Inheritance
- Gödel's First Incompleteness Theorem (Claude mentions Von Neumann as an independent discoverer for the Second Incompleteness Theorem)
- Feynman's path integral formulation of quantum mechanics
- Onnes' discovery of superconductivity
- Pauling's discovery of the alpha helix structure in proteins
- McClintock's work on transposons
- Observation of the cosmic microwave background
- Lorenz's work on deterministic chaos
- Prusiner's discovery of prions
- Yamanaka factors for inducing pluripotency
- Langmuir's adsorption isotherm (I have no idea what this is)
I somehow missed that John Wentworth and David Lorell are also in the middle of a sequence on this same topic here.
Yeah, uh... hopefully nobody's holding their breath waiting for the rest of that sequence. That was the original motivator, but we only wrote the one post and don't have any more in development yet.
Point is: please do write a good stat mech sequence, David and I are not really "on that ball" at the moment.
(Didn't read most of the dialogue, sorry if this was covered.)
But the way transformers work is they greedily think about the very next token, and predict that one, even if by conditioning on it you shot yourself in the foot for the task at hand.
That depends on how we sample from the LLM. If, at each "timestep", we take the most-probable token, then yes that's right.
But an LLM gives a distribution over tokens at each timestep, i.e. $P[x_t \mid x_1, \dots, x_{t-1}]$. If we sample from that distribution, rather than take the most-probable token at each timestep, then that's equivalent to sampling non-greedily from the learned distribution over text. It's the chain rule:

$$P[x_1, \dots, x_T] = \prod_{t=1}^T P[x_t \mid x_1, \dots, x_{t-1}]$$
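A toy sketch of the difference, with a made-up two-token "model" (hypothetical numbers throughout, and a one-token context window just to keep it small):

```python
import random

# Toy "LLM": conditional distributions P[next | last token] over vocabulary {a, b}.
# The numbers are made up for illustration.
P = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.9, "b": 0.1},
    ("b",): {"a": 0.2, "b": 0.8},
}

def greedy(length=2):
    # Take the most-probable token at each timestep.
    seq = ()
    for _ in range(length):
        dist = P[seq[-1:]]
        seq += (max(dist, key=dist.get),)
    return seq

def ancestral(length=2):
    # Sample each token from the predicted distribution instead.
    seq = ()
    for _ in range(length):
        dist = P[seq[-1:]]
        (tok,) = random.choices(list(dist), weights=list(dist.values()))
        seq += (tok,)
    return seq

def joint(seq):
    # Chain rule: P[x_1..x_T] = prod_t P[x_t | x_<t].
    p, ctx = 1.0, ()
    for tok in seq:
        p *= P[ctx[-1:]][tok]
        ctx += (tok,)
    return p
```

Greedy decoding always returns the same argmax-per-step sequence, while ancestral sampling reproduces the chain-rule joint distribution over whole sequences in the long run.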
Writing collaboratively is definitely something David and I have been trying to figure out how to do productively.
How sure are we that models will keep tracking Bayesian belief states, and so allow this inverse reasoning to be used, when they don't have enough space and compute to actually track a distribution over latent states?
One obvious guess there would be that the factorization structure is exploited, e.g. independence and especially conditional independence/DAG structure. And then a big question is how distributions of conditionally independent latents in particular end up embedded.
Yup, that was it, thank you!
We're now working through understanding all the pieces of this, and we've calculated an MSP which doesn't quite look like the one in the post:
(Ignore the skew, David's still fiddling with the projection into 2D. The important noticeable part is the absence of "overlap" between the three copies of the main shape, compared to the fractal from the post.)
Specifically, each point in that visual corresponds to a distribution $P[S_t \mid x_1, \dots, x_t]$ over hidden states $S_t$, for some value of the observed symbols $x_1, \dots, x_t$. The image itself is of the points on the probability simplex. From looking at a couple of Crutchfield papers, it sounds like that's what the MSP is supposed to be.
The update equations are:

$$P[S_{t+1} \mid x_1, \dots, x_{t+1}] = \frac{1}{Z} P[x_{t+1} \mid S_{t+1}] \sum_{S_t} P[S_{t+1} \mid S_t]\, P[S_t \mid x_1, \dots, x_t]$$

with $P[S_{t+1} \mid S_t]$ given by the transition probabilities, $P[x_{t+1} \mid S_{t+1}]$ given by the observation probabilities, and $Z$ a normalizer. We generate the image above by initializing some random distribution $P[S_0]$, then iterating the equations and plotting each point.
Off the top of your head, any idea what might account for the mismatch (other than a bug in our code, which we're already checking)? Are we calculating the right thing, i.e. values of $P[S_t \mid x_1, \dots, x_t]$? Are the transition and observation probabilities from the graphic in the post the same parameters used to generate the fractal? Is there some thing which people always forget to account for when calculating these things?
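For concreteness, here's a minimal numpy sketch of the kind of iteration described above, with placeholder transition/observation matrices (not the parameters from the post, and not our actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder 3-state HMM (made-up numbers):
# T[i, j] = P[S_{t+1}=j | S_t=i], O[j, x] = P[x_{t+1}=x | S_{t+1}=j].
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
O = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

def update(belief, x):
    # Propagate the belief through the transitions, weight by the
    # observation likelihood of symbol x, and renormalize.
    b = (belief @ T) * O[:, x]
    return b / b.sum()

# Start from a random distribution, sample symbols from the predictive
# distribution, and log each belief (a point on the probability simplex).
belief = rng.dirichlet(np.ones(3))
points = []
for _ in range(1000):
    predictive = (belief @ T) @ O  # P[x_{t+1} | observations so far]
    x = rng.choice(2, p=predictive)
    belief = update(belief, x)
    points.append(belief)
```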
Can you elaborate on how the fractal is an artifact of how the data is visualized?
I don't know the details of the MSP, but my current understanding is that it's a general way of representing stochastic processes, and the MSP representation typically looks quite fractal. If we take two approximately-the-same stochastic processes, then they'll produce visually-similar fractals.
But the "fractal-ness" is mostly an artifact of the MSP as a representation-method IIUC; the stochastic process itself is not especially "naturally fractal".
(As I said I don't know the details of the MSP very well; my intuition here is instead coming from some background knowledge of where fractals which look like those often come from, specifically chaos games.)
That there is a linear 2D plane in the residual stream such that, when you project onto it, you get that same fractal seems highly non-artifactual, and is what we were testing.
A thing which is highly cruxy for me here, which I did not fully understand from the post: what exactly is the function which produces the fractal visual from the residual activations? My best guess from reading the post was that the activations are linearly regressed onto some kind of distribution, and then the distributions are represented in a particular way which makes smooth sets of distributions look fractal. If there's literally a linear projection of the residual stream into two dimensions which directly produces that fractal, with no further processing/transformation in between "linear projection" and "fractal", then I would change my mind about the fractal structure being mostly an artifact of the visualization method.
[EDIT: I no longer endorse this response, see thread.]
(This comment is mainly for people other than the authors.)
If your reaction to this post is "hot damn, look at that graph", then I think you should probably dial back your excitement somewhat. IIUC the fractal structure is largely an artifact of how the data is visualized, which means the results visually look more striking than they really are.
It is still a cool piece of work, and the visuals are beautiful. The correct amount of excitement is greater than zero.
Yup. Also, I'd add that entropy in this formulation increases exactly when more than one macrostate at time $t$ maps to the same actually-realized macrostate at time $t+1$, i.e. when the macrostate evolution is not time-reversible.
This post was very specifically about a Boltzmann-style approach. I'd also generally consider the Gibbs/Shannon formula to be the "real" definition of entropy, and usually think of Boltzmann as the special case where the microstate distribution is constrained uniform. But a big point of this post was to be like "look, we can get a surprising amount (though not all) of thermo/stat mech without bringing in any actual statistics, just restricting ourselves to the Boltzmann notion of entropy".
Meta: this comment is decidedly negative feedback, so needs the standard disclaimers. I don't know Ethan well, but I don't harbor any particular ill-will towards him. This comment is negative feedback about Ethan's skill in choosing projects in particular, I do not think others should mimic him in that department, but that does not mean that I think he's a bad person/researcher in general. I leave the comment mainly for the benefit of people who are not Ethan, so for Ethan: I am sorry for being not-nice to you here.
When I read the title, my first thought was "man, Ethan Perez sure is not someone I'd point to as an exemplar of choosing good projects".
On reading the relevant section of the post, it sounds like Ethan's project-selection method is basically "forward-chain from what seems quick and easy, and also pay attention to whatever other people talk about". Which indeed sounds like a recipe for very mediocre projects: it's the sort of thing you'd expect a priori to reliably produce publications and be talked about, but have basically-zero counterfactual impact. These are the sorts of projects where someone else would likely have done something similar regardless, and it's not likely to change how people are thinking about things or building things; it's just generally going to add marginal effort to the prevailing milieu, whatever that might be.
From reading, I imagined a memory+cache structure instead of being closer to "cache all the way down".
Note that the things being cached are not things stored in memory elsewhere. Rather, they're (supposedly) outputs of costly-to-compute functions - e.g. the instrumental value of something would be costly to compute directly from our terminal goals and world model. And most of the values in cache are computed from other cached values, rather than "from scratch" - e.g. the instrumental value of X might be computed (and then cached) from the already-cached instrumental values of some stuff which X costs/provides.
Coherence of Caches and Agents goes into more detail on that part of the picture, if you're interested.
Very far through the graph representing the causal model, where we start from one or a few nodes representing the immediate observations.
You were talking about values and preferences in the previous paragraph, then suddenly switched to “beliefs”. Was that deliberate?
Yes.
... man, now that the post has been downvoted a bunch I feel bad for leaving such a snarky answer. It's a perfectly reasonable question, folks!
Overcompressed actual answer: core pieces of a standard doom-argument involve things like "killing all the humans will be very easy for a moderately-generally-smarter-than-human AI" and "killing all the humans (either as a subgoal or a side-effect of other things) is convergently instrumentally useful for the vast majority of terminal objectives". A standard doom counterargument usually doesn't dispute those two pieces (though there are of course exceptions); a standard doom counterargument usually argues that we'll have ample opportunity to iterate, and therefore it doesn't matter that the vast majority of terminal objectives instrumentally incentivize killing humans, we'll iterate until we find ways to avoid that sort of thing.
The standard core disagreement is then mostly about the extent to which we'll be able to iterate, or will in fact iterate in ways which actually help. In particular, cruxy subquestions tend to include:
- How visible will "bad behavior" be early on? Will there be "warning shots"? Will we have ways to detect unwanted internal structures?
- How sharply/suddenly will capabilities increase?
- Insofar as problems are visible, will labs and/or governments actually respond in useful ways?
Militarization isn't very centrally relevant to any of these; it's mostly relevant to things which are mostly not in doubt anyways, at least in the medium-to-long term.
Yes, I mean "mole" as in the unit from chemistry. I used it because I found it amusing.
Every algorithmic improvement is a one-time boost.
It doesn't.
Here's what it would typically look like in a control theory problem.
There's a long-term utility $u_L$ which is a function of the final state $x_T$, and a short-term utility $u_t$ which is a function of the time $t$, the state $x_t$ at time $t$, and the action $a_t$ at time $t$. (Often the problem is formulated with a discount rate $\gamma$, but in this case we're allowing time-dependent short-term utility, so we can just absorb the discount rate into $u_t$.) The objective is then to maximize

$$E\left[u_L(x_T) + \sum_{t<T} u_t(x_t, a_t)\right]$$

In that case, the value function $V_t(x_t)$ is a max over trajectories starting at $(t, x_t)$; it satisfies the Bellman recursion

$$V_t(x_t) = \max_{a_t}\left(u_t(x_t, a_t) + E\left[V_{t+1}(x_{t+1}) \mid x_t, a_t\right]\right), \qquad V_T(x_T) = u_L(x_T)$$

The key thing to notice is that we can solve that equation for $u_t$:

$$u_t(x_t, a_t) = V_t(x_t) - E\left[V_{t+1}(x_{t+1}) \mid x_t, a_t\right]$$

(with which every action achieves the max). So given an arbitrary value function $V$, we can find a short-term utility function $u$ which produces that value function, by using that equation to compute $u_t$ starting from the last timestep and working backwards.

Thus the claim from the post: for any value function, there exists a short-term utility function which induces that value function.

What if we restrict to only consider long-term utility, i.e. set $u_t = 0$? Well, then the value function is no longer so arbitrary. That's the case considered in the post, where we have constraints which the value function must satisfy regardless of $u_L$.
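Since the construction is purely mechanical, here's a minimal numpy sketch of it (transition probabilities and value function are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 4

# Placeholder transition probabilities P[x, a, x'].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Pick a completely arbitrary value function V[t][x]...
V = rng.normal(size=(T + 1, n_states))

# ...then read off a short-term utility which induces it:
#   u[t][x, a] = V[t][x] - E[V[t+1][x'] | x, a]
# With this u, every action achieves the max in the Bellman equation,
# so V is exactly the optimal value function for this u.
u = np.empty((T, n_states, n_actions))
for t in reversed(range(T)):
    u[t] = V[t][:, None] - P @ V[t + 1]

# Verify the Bellman equation V[t][x] = max_a (u[t][x, a] + E[V[t+1][x'] | x, a]):
for t in range(T):
    assert np.allclose((u[t] + P @ V[t + 1]).max(axis=1), V[t])
```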
Did that clarify?
On the matter of software improvements potentially available during recursive self-improvement, we can look at the current pace of algorithmic improvement, which has probably been faster than scaling for some time now. So that's another lower bound on what AI will be capable of, assuming that the extrapolation holds up.
This is definitely a split which I think underlies a lot of differing intuitions about AGI and timelines. That said, the versions of each which are compatible with evidence/constraints generally have similar implications for at least the basics of AI risk (though they differ in predictions about what AI looks like "later on", once it's already far past eclipsing the capabilities of the human species).
Key relevant evidence/constraints, under my usual framing:
- We live in a very high dimensional environment. When doing science/optimization in such an environment, brute-force search is exponentially intractable, so having e.g. ten billion humans running the same basic brute-force algorithm will not be qualitatively better than one human running a brute-force algorithm. The fact that less-than-exponentially-large numbers of humans are able to perform as well as we do implies that there's some real "general intelligence" going on in there somewhere.
- That said, it's still possible-in-principle for whatever general intelligence we have to be importantly distributed across humans. What the dimensionality argument rules out is a model in which humans' capabilities are just about brute-force trying lots of stuff, and then memetic spread of whatever works. The "trying stuff" step has to be doing "most of the work", in some sense, of finding good models/techniques/etc; but whatever process is doing that work could itself be load-bearingly spread across humans.
- Also, memetic spread could still be a bottleneck in practice, even if it's not "doing most of the work" in an algorithmic sense.
- A lower bound for what AI can do is "run lots of human-equivalent minds, and cheaply copy them". Even under a model where memetic spread is the main bottlenecking step for humans, AI will still be ridiculously better at that. You know that problem humans have where we spend tons of effort accumulating "tacit knowledge" which is hard to convey to the next generation? For AI, cheap copy means that problem is just completely gone.
- Humans' own historical progress/experience puts an upper bound on how hard it is to solve novel problems (not solved by society today). Humans have done... rather ridiculously a lot of that, over the past 250 years. That, in turn, lower bounds what AIs will be capable of.
Only if they both predictably painted that part purple, e.g. as part of the overall plan. If they both randomly happened to paint the same part purple, then no.
The main model I know of under which this matters much right now is: we're pretty close to AGI already, it's mostly a matter of figuring out the right scaffolding. Open-sourcing weights makes it a lot cheaper and easier for far more people to experiment with different scaffolding, thereby bringing AGI significantly closer in expectation. (As an example of someone who IIUC sees this as the mainline, I'd point to Connor Leahy.)
Sounds like I've maybe not communicated the thing about circularity. I'll try again, it would be useful to let me know whether or not this new explanation matches what you were already picturing from the previous one.
Let's think about circular definitions in terms of equations for a moment. We'll have two equations: one which "defines" $x$ in terms of $y$, and one which "defines" $y$ in terms of $x$:

$$x = f(y), \qquad y = g(x)$$

Now, if $g = f^{-1}$, then (I claim) that's what we normally think of as a "circular definition". It's "pretending" to fully specify $x$ and $y$, but in fact it doesn't, because one of the two equations is just a copy of the other equation but written differently. The practical problem, in this case, is that $x$ and $y$ are very underspecified by the supposed joint "definition".

But now suppose $g$ is not $f^{-1}$, and more generally the equations are not degenerate. Then our two equations are typically totally fine and useful, and indeed we use equations like this all the time in the sciences and they work great. Even though they're written in a "circular" way, they're substantively non-circular. (They might still allow for multiple solutions, but the solutions will typically at least be locally unique, so there's a discrete and typically relatively small set of solutions.)
That's the sort of thing which clustering algorithms do: they have some equations "defining" cluster-membership in terms of the data points and cluster parameters, and equations "defining" the cluster parameters in terms of the data points and the cluster-membership:
cluster_membership = $f$(data, cluster_params)
cluster_params = $g$(data, cluster_membership)
... where $f$ and $g$ are different (i.e. non-degenerate; $g$ is not just $f^{-1}$ with data held constant). Together, these "definitions" specify a discrete and typically relatively small set of candidate (cluster_membership, cluster_params) values given some data.
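To make the non-degeneracy concrete, here's a toy sketch of solving such a pair of "circular" equations by fixed-point iteration; it's just Lloyd's k-means algorithm on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated blobs.
data = np.concatenate([rng.normal(0, 0.5, (50, 2)),
                       rng.normal(5, 0.5, (50, 2))])

def f(data, centers):
    # Membership "defined" in terms of data + cluster params: nearest center.
    return np.argmin(((data[:, None] - centers[None]) ** 2).sum(-1), axis=1)

def g(data, membership, k=2):
    # Cluster params "defined" in terms of data + membership: mean of members.
    return np.array([data[membership == j].mean(axis=0) for j in range(k)])

# Solve the two "circular" equations jointly by fixed-point iteration.
centers = np.stack([data[0], data[99]])  # one init point from each blob
membership = f(data, centers)
for _ in range(20):
    membership = f(data, centers)
    centers = g(data, membership)
```

Despite the "circular" form, the iteration pins down a small set of (membership, params) solutions rather than leaving them wildly underspecified.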
That, I claim, is also part of what's going on with abstractions like "dog".
(Now, choice of axes is still a separate degree of freedom which has to be handled somehow. And that's where I expect the robustness to choice of axes does load-bearing work. As you say, that's separate from the circularity issue.)
As I mentioned at the end, it's not particularly relevant to my own models either way, so I don't particularly care. But I do think other people should want to run this experiment, based on their stated models.
That's only true if the Bellman equation in question allows for a "current payoff" at every timestep. That's the term which allows for totally arbitrary value functions, and not-coincidentally it's the term which does not reflect long-range goals/planning, just immediate payoff.
If we're interested in long-range goals/planning, then the natural thing to do is check how consistent the policy is with a Bellman equation without a payoff at each timestep - i.e. a value function just backpropagated from some goal at a much later time. That's what would make the check nontrivial: there exist policies which are not consistent with any assignment of values satisfying that Bellman equation. For example, the policy which chooses to transition from state A -> B with probability 1 over the option to stay at A with probability 1 (implying value B > value A for any values consistent with that policy), but also chooses to transition B -> A with probability 1 over the option to stay at B with probability 1 (implying value A > value B for any values consistent with that policy).
(There's still the trivial case where indifference could be interpreted as compatible with any policy, but that's easy to handle by adding a nontriviality requirement.)
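For concreteness, a tiny sketch of why that example policy fails the check (since preferences here are ordinal, enumerating a small integer grid of candidate values covers every order type):

```python
from itertools import product

# The policy from the example: at A it moves to B (revealing value B > value A),
# and at B it moves to A (revealing value A > value B).
constraints = [("B", ">", "A"),  # chose A -> B over staying at A
               ("A", ">", "B")]  # chose B -> A over staying at B

def consistent(values):
    # Strict inequalities, per the nontriviality requirement ruling out indifference.
    return all(values[x] > values[y] for x, _, y in constraints)

# No assignment of values satisfies both revealed preferences.
assert not any(consistent({"A": va, "B": vb})
               for va, vb in product(range(3), repeat=2))
```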
I don't usually think about RL on MDPs, but it's an unusually easy setting in which to talk about coherence and its relationship to long-term-planning/goal-seeking/power-seeking.
Simplest starting point: suppose we're doing RL to learn a value function (i.e. mapping from states to values, or mapping from states x actions to values, whatever your preferred setup), with transition probabilities known. Well, in terms of optimal behavior, we know that the optimal value function for any objective in the far future will locally obey the Bellman equation with zero payoff in the immediate timestep: value of this state is equal to the max over actions of expected next-state value under that action. So insofar as we're interested in long-term goals specifically, there's an easy local check for the extent to which the value function "optimizes for" such long-term goals: just check how well it locally satisfies that Bellman equation.
From there, we can extend to gradually more complicated cases in ways which look similar to typical coherence theorems (like e.g. Dutch Book theorems). For instance, we could relax the requirement of known probabilities: we can ask whether there is any assignment of state-transition probabilities such that the values satisfy the Bellman equation.
As another example, if we're doing RL on a policy rather than value function, we can ask whether there exists any value function consistent with the policy such that the values satisfy the Bellman equation.
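A minimal numpy sketch of that kind of local check, with placeholder transition probabilities (not any particular trained system):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 4, 3, 5

# Placeholder known transition probabilities P[x, a, x'].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# A value function backpropagated from an arbitrary far-future goal,
# with zero payoff at each intermediate timestep:
V = [None] * (horizon + 1)
V[horizon] = rng.normal(size=n_states)
for t in reversed(range(horizon)):
    V[t] = (P @ V[t + 1]).max(axis=1)

# Local coherence check: the zero-payoff Bellman equation
#   V[t][x] = max_a E[V[t+1][x'] | x, a]
# holds exactly for this value function...
for t in range(horizon):
    assert np.allclose(V[t], (P @ V[t + 1]).max(axis=1))

# ...whereas a generic value-function sequence fails the same check.
W = rng.normal(size=(horizon + 1, n_states))
assert not all(np.allclose(W[t], (P @ W[t + 1]).max(axis=1))
               for t in range(horizon))
```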
So that example SWE bench problem from the post:
... is that a prototypical problem from that benchmark? Because if so, that is a hilariously easy benchmark. Like, something could ace that task and still be coding at less than a CS 101 level.
(Though to be clear, people have repeatedly told me that a surprisingly high fraction of applicants for programming jobs can't do fizzbuzz, so even a very low level of competence would still put it above many would-be software engineers.)
Fixed, thanks.
Yeah, that's right.
The secret handshake is to start with "$\Lambda$ is independent of $X_1$ given $X_2$" and "$\Lambda$ is independent of $X_2$ given $X_1$", expressed in this particular form:

$$P[\Lambda \mid X_1, X_2] = P[\Lambda \mid X_1] = P[\Lambda \mid X_2]$$

... then we immediately see that $P[\Lambda \mid X_1 = x_1] = P[\Lambda \mid X_2 = x_2]$ for all $(x_1, x_2)$ such that $P[X_1 = x_1, X_2 = x_2] \neq 0$.

So if there are no zero probabilities, then $P[\Lambda \mid X_1 = x_1] = P[\Lambda \mid X_2 = x_2]$ for all $(x_1, x_2)$.

That, in turn, implies that $P[\Lambda \mid X_1 = x_1]$ takes on the same value for all $x_1$, which in turn means that it's equal to $P[\Lambda]$. Thus $\Lambda$ and $X_1$ are independent. Likewise for $\Lambda$ and $X_2$. Finally, we leverage independence of $X_1$ and $X_2$ given $\Lambda$:

$$P[X_1, X_2] = \sum_\lambda P[\lambda]\, P[X_1 \mid \lambda]\, P[X_2 \mid \lambda] = P[X_1]\, P[X_2]$$
(A similar argument is in the middle of this post, along with a helpful-to-me visual.)
Roughly speaking, all variables completely independent is the only way to satisfy all the preconditions without zero-ish probabilities.
This is easiest to see if we use a "strong invariance" condition, in which each of the $X_i$ must mediate between $\Lambda$ and the rest of the $X_j$. Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure ($\Lambda$) from any little spatially-localized chunk of the gas ($X_i$). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn't imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e. $\Lambda$ only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.
I agree with this point as stated, but think the probability is more like 5% than 0.1%.
Same.
I do think our chances look not-great overall, but most of my doom-probability is on things which don't look like LLMs scheming.
Also, are you making sure to condition on "scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"?
That's not particularly cruxy for me either way.
Separately, I'm uncertain whether the current training procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF".
Fair. Insofar as "scaling up networks, running pretraining + RL" does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.
Solid post!
I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer), and I think this is the best write-up of it I've seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.
Yup.
[EDIT April 5: I do not currently "have the ball" on this, so to anybody reading this who would go test it themselves if-and-only-if they don't see somebody else already on it: I am not on it.]
Mind sharing a more complete description of the things you tried? Like, the sort of description which one could use to replicate the experiment?
What was your old job?
Did you see the footnote I wrote on this? I give a further argument for it.
Ah yeah, I indeed missed that the first time through. I'd still say I don't buy it, but that's a more complicated discussion, and it is at least a decent argument.
I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I'm open to hearing it.
This is another place where I'd say we don't understand it well enough to give a good formal definition or operationalization yet.
Though I'd note here, and also above w.r.t. search, that "we don't know how to give a good formal definition yet" is very different from "there is no good formal definition" or "the underlying intuitive concept is confused" or "we can't effectively study the concept at all" or "arguments which rely on this concept are necessarily wrong/uninformative". Every scientific field was pre-formal/pre-paradigmatic once.
To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field's ability to correctly diagnose bullshit.
That said, I don't think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.
Man, that is one hell of a bullet to bite. Much kudos for intellectual bravery and chutzpah!
That might be a fun topic for a longer discussion at some point, though not right now.
I would like to see a much more rigorous definition of "search" and why search would actually be "compressive" in the relevant sense for NN inductive biases. My current take is something like "a lot of the references to internal search on LW are just incoherent" and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward "search" in ways that are totally benign.
More generally, I'm quite skeptical of the jump from any mechanistic notion of search, and the kind of grabby consequentialism that people tend to be worried about. I suspect there's a double dissociation between these things, where "mechanistic search" is almost always benign, and grabby consequentialism need not be backed by mechanistic search.
Some notes on this:
- I don't think general-purpose search is sufficiently well-understood yet to give a rigorous mechanistic definition. (Well, unless one just gives a very wrong definition.)
- Likewise, I don't think we understand either search or NN biases well enough yet to make a formal compression argument. Indeed, that sounds like a roughly-agent-foundations-complete problem.
- I'm pretty skeptical that internal general-purpose search is compressive in current architectures. (And this is one reason why I expect most AI x-risk to come from importantly-different future architectures.) Low confidence, though.
- Also, current architectures do have at least some "externalized" general-purpose search capabilities, insofar as they can mimic the "unrolled" search process of a human or group of humans thinking out loud. That general-purpose search process is basically AgentGPT. Notably, it doesn't work very well to date.
- Insofar as I need a working not-very-formal definition of general-purpose search, I usually use a behavioral definition: a system which can take in a representation of a problem in some fairly-broad class of problems (typically in a ~fixed environment), and solve it.
- The argument that a system which satisfies that behavioral definition will tend to also have an "explicit search-architecture", in some sense, comes from the recursive nature of problems. E.g. humans solve large novel problems by breaking them into subproblems, and then doing their general-purpose search/problem-solving on the subproblems; that's an explicit search architecture.
- I definitely agree that grabby consequentialism need not be backed by mechanistic search. More skeptical of the claim mechanistic search is usually benign, at least if by "mechanistic search" we mean general-purpose search (though I'd agree with a version of this which talks about a weaker notion of "search").
Also, one maybe relevant deeper point, since you seem familiar with some of the philosophical literature: IIUC the most popular way philosophers ground semantics is in the role played by some symbol/signal in the evolutionary environment. I view this approach as a sort of placeholder: it's definitely not the "right" way to ground semantics, but philosophy as a field is using it as a stand-in until people work out better models of grounding (regardless of whether the philosophers themselves know that they're doing so). This is potentially relevant to the "representation of a problem" part of general-purpose search.
I'm curious which parts of the Goal Realism section you find "philosophically confused," because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.
(I'll briefly comment on each section, feel free to double-click.)
Against Goal Realism: Huemer... indeed seems confused about all sorts of things, and I wouldn't consider either the "goal realism" or "goal reductionism" picture solid grounds for use of an indifference principle (not sure if we agree on that?). Separately, "reductionism as a general philosophical thesis" does not imply the thing you call "goal reductionism" - for instance one could reduce "goals" to some internal mechanistic thing, rather than thinking about "goals" behaviorally, and that would be just as valid for the general philosophical/scientific project of reductionism. (Not that I necessarily think that's the right way to do it.)
Goal Slots Are Expensive: just because it's "generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules" doesn't mean the end-to-end trained system will turn out non-modular. Biological organisms were trained end-to-end by evolution, yet they ended up very modular.
Inner Goals Would Be Irrelevant: I think the point this section was trying to make is something I'd classify as a pointer problem? I.e. the internal symbolic "goal" does not necessarily neatly correspond to anything in the environment at all. If that was the point, then I'm basically on-board, though I would mention that I'd expect evolution/SGD/cultural evolution/within-lifetime learning/etc to drive the internal symbolic "goal" to roughly match natural structures in the world. (Where "natural structures" cashes out in terms of natural latents, but that's a whole other conversation.)
Goal Realism Is Anti-Darwinian: Fodor obviously is deeply confused, but I think you've misdiagnosed what he's confused about. "The physical world has no room for goals with precise contents" is somewhere between wrong and a nonsequitur, depending on how we interpret the claim. "The problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter" is correct, but very incomplete as a response to Fodor.
Goal Reductionism Is Powerful: While most of this section sounds basically-correct as written, the last few sentences seem to be basically arguing for behaviorism for LLMs. There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.
This isn't a proper response to the post, but since I've occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:
- This post basically-correctly refutes a kinda-mediocre (though relatively-commonly-presented) version of the counting argument.
- There does exist a version of the counting argument which basically works.
- The version which works routes through compression and/or singular learning theory.
- In particular, that version would talk about "goal-slots" (i.e. general-purpose search) showing up for exactly the same reasons that neural networks are able to generalize in the overparameterized regime more generally. In other words, if you take the "counting argument for overfitting" from the post, walk through the standard singular-learning-theory-style response to that story, and then translate that response over to general-purpose search as a specific instance of compression, then you basically get the good version of the counting argument.
- Just remembered I walked through basically the good version of the counting argument in this section of What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?
- The "Against Goal Realism" section is a wild mix of basically-correct points and thorough philosophical confusion. I would say the overall point it's making is probably mostly-true of LLMs, false of humans, and most of the arguments are confused enough that they don't provide much direct evidence relevant to either of those.
Pretty decent post overall.
Edited, thanks.
> Third, the nontrivial prediction of 20 here is about "compactly describable errors". "Mislabelling a large part of the time (but not most of the time)" is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you'd have a meaningful boost in generalization error, but that doesn't happen. Easy Bayes update against #20. (And if we can't agree on this, I don't see what we can agree on.)
I indeed disagree with that, and I see two levels of mistake here. At the object level, there's a mistake of not thinking through the gears. At the epistemic level, it looks like you're trying to apply the "what would I have expected in advance?" technique of de-biasing, in a way which does not actually work well in practice. (The latter mistake I think is very common among rationalists.)
First, object-level: let's walk through the gears of a mental model here. Model: train a model to predict labels for images, and it will learn a distribution of labels for each image (at least that's how we usually train them). If we relabel 1's as 7's 20% of the time, then the obvious guess is that the model will assign about 20% probability (plus its "real underlying uncertainty", which we'd expect to be small for large fully-trained models) to the label 7 when the digit is in fact a 1.
What does that predict about accuracy? That depends on whether the label we interpret our model as predicting is top-1, or sampled from the predictive distribution. If the former (as is usually used, and IIUC is used in the paper) then this concrete model would predict basically the curves we see in the paper: as noise ramps up, accuracy moves relatively little (especially for large fully-trained models), until the incorrect digit is approximately as probable as the correct digit, at which point accuracy plummets to ~50%. And once the incorrect digit is unambiguously more probable than the correct digit, accuracy drops to near-0.
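Those gears are easy to simulate. Here's a minimal sketch (all names and parameters are mine, not from the paper): assume a calibrated model whose predicted distribution on a noisy 1-vs-7 label is just the relabelling rate plus a small "real underlying uncertainty" term, and decode top-1. Accuracy stays near-perfect until the noise rate approaches 50%, sits at ~50% right at the crossover, and collapses to near-zero past it.

```python
import numpy as np

rng = np.random.default_rng(0)

def top1_accuracy(noise_rate, n_images=10_000, base_uncertainty=0.02):
    """Toy model: images whose true label is 1 get relabelled as 7 with
    probability `noise_rate` in the training data. A calibrated model then
    predicts roughly that mixture; `base_uncertainty` stands in for the
    model's small 'real underlying uncertainty'. Decode by top-1."""
    correct = 0
    for _ in range(n_images):
        jitter = rng.normal(0, base_uncertainty)
        p7 = noise_rate + jitter       # model's probability on the wrong label
        p1 = 1.0 - p7                  # model's probability on the true label
        if p1 > p7:                    # top-1 decoding: pick the argmax label
            correct += 1
    return correct / n_images

for p in [0.0, 0.2, 0.4, 0.45, 0.5, 0.55, 0.6]:
    print(f"noise={p:.2f}  top-1 accuracy={top1_accuracy(p):.3f}")
```

The cliff at 50% falls out of the argmax alone; nothing about the model's learned distribution degrades gradually. Sampling from the predictive distribution instead of taking top-1 would give accuracy falling linearly with the noise rate, which is the contrast the paragraph above is pointing at.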
The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit. And thinking through the gears of Yudkowsky's #20, the obvious update is that predictable human-labeller-errors which are not the most probable labels are not super relevant (insofar as we use top-1 sampling, i.e. near-zero temperature) whereas human-labeller-errors which are most probable are a problem in basically the way Yudkowsky is saying. (... insofar as we should update at all from this experiment, which we shouldn't very much.)
Second, epistemic-level: my best guess is that you're ignoring these gears because they're not things whose relevance you would have anticipated in advance, and therefore focusing on them in hindsight risks bias[1]. Which, yes, it does risk bias.
Unfortunately, the first rule of experiments is You Are Not Measuring What You Think You Are Measuring. Which means that, in practice, the large majority of experiments which nominally attempt to test some model/theory in a not-already-thoroughly-understood-domain end up getting results which are mostly determined by things unrelated to the model/theory. And, again in practice, few-if-any people have the skill of realizing in advance which things will be relevant to the outcome of any given experiment. "Which things are we actually measuring?" is itself usually figured out (if it's figured out at all) by looking at data from the experiment.
Now, this is still compatible with using the "what would I have expected in advance?" technique. But it requires that ~all the time, the thing I expect in advance from any given experiment is "this experiment will mostly measure some random-ass thing which has little to do with the model/theory I'm interested in, and I'll have to dig through the details of the experiment and results to figure out what it measured". If one tries to apply the "what would I have expected in advance?" technique, in a not-thoroughly-understood domain, without an overwhelming prior that the experimental outcome is mostly determined by things other than the model/theory of interest, then mostly one ends up updating in basically-random directions and becoming very confused.
[1] Standard disclaimer about guessing what's going on inside other people's heads being hard, you have more data than I on what's in your head, etc.
This one is somewhat more Wentworth-flavored than our previous Doomimirs.
Also, I'll write Doomimir's part unquoted this time, because I want to use quote blocks within it.
On to Doomimir!
> We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So ... why doesn't that happen?
Let's start with this.
Short answer: because those aren't actually very effective ways to get high ratings, at least within the current capability regime.
Long version: presumably the labeller knows perfectly well that they're working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that... have you ever personally done an exercise where you try to convince someone to do something they don't want to do, or aren't supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner's job was to not open their fist, while the other partner's job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar - not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.
Point of that story: based on my own firsthand experience, if you're not actually going to pay someone right now, then it's far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.
Ultimately, our discussion is using "threats and bribes" as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.
Now, you could reasonably respond: "Isn't it kinda fishy that the supposed failures on which your claim rests are 'illegible'?"
To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:
> The iterative design loop hasn't failed yet.
Now that's a very interesting claim. I ask: what do you think you know, and how do you think you know it?
Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the "observe/orient" steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don't necessarily know that it's failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don't know a problem is there, or don't orient toward the right thing, and therefore we don't iterate on the problem.
What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:
- There is some widely-believed falsehood. The generative model might "know" the truth, from having trained on plenty of papers by actual experts, but the raters don't know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/Deepmind/Anthropic do not employ experts in most of the world's subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
- There is some even-more-widely-believed falsehood, such that even the so-called "experts" haven't figured out yet that it's false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
- Neither raters nor developers have time to check the models' citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
- On various kinds of "which option should I pick" questions, there's an option which results in marginally more slave labor, or factory farming, or what have you - terrible things which a user might strongly prefer to avoid, but it's extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don't reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
- This is the sort of problem which, in the high-capability regime, especially leads to "Potemkin village world".
- On various kinds of "which option should I pick" questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don't even attempt to test which options are best long-term, they just read the LLM's response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
- On various kinds of "which option should I pick" questions, there are things which sound fun or are marketed as fun, but which humans mostly don't actually enjoy (or don't enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)
... and so forth. The unifying theme here is that when these failures occur, it is not obvious that they've occurred.
This makes empirical study tricky - not impossible, but it's easy to be misled by experimental procedures which don't actually measure the relevant things. For instance, your summary of the Stiennon et al paper just now:
> They varied the size of the KL penalty of an LLM RLHF'd for a summarization task, and found about what you'd expect from the vague handwaving: as the KL penalty decreases, the reward model's predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve...
(Bolding mine.) As you say, one could spin that as demonstrating "yet another portent of our impending deaths", but really this paper just isn't measuring the most relevant things in the first place. It's still using human ratings as the evaluation mechanism, so it's not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.
So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?
(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)
.
.
.
So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that's "better than" the labels used for training.
OpenAI's weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a "ground truth" which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven't put that much care into it.)
More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most "coming decoupled from reality" problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That's what tends to happen, in fields where people don't have a deep understanding of the systems they're working with.
And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that's just the experimentalist themselves - this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists' interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure - John Boyd, the guy who introduced the term "OODA loop", talked a lot about that sort of failure.)
To summarize the main points here:
- Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (... and then things are hard.)
- AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
- So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.
Now, I would guess that your next question is: "But how does that lead to extinction?". That is one of the steps which has been least well-explained historically; someone with your "unexpectedly low polygenic scores" can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you... <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?
But I'll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.
Is the Waldo picture at the end supposed to be Holden, or is that accidental?