Posts

Examples of Highly Counterfactual Discoveries? 2024-04-23T22:19:19.399Z
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer 2024-04-18T00:27:43.451Z
Generalized Stat Mech: The Boltzmann Approach 2024-04-12T17:47:31.880Z
How We Picture Bayesian Agents 2024-04-08T18:12:48.595Z
Coherence of Caches and Agents 2024-04-01T23:04:31.320Z
Natural Latents: The Concepts 2024-03-20T18:21:19.878Z
The Worst Form Of Government (Except For Everything Else We've Tried) 2024-03-17T18:11:38.374Z
The Parable Of The Fallen Pendulum - Part 2 2024-03-12T21:41:30.180Z
The Parable Of The Fallen Pendulum - Part 1 2024-03-01T00:25:00.111Z
Leading The Parade 2024-01-31T22:39:56.499Z
A Shutdown Problem Proposal 2024-01-21T18:12:48.664Z
Some Vacation Photos 2024-01-04T17:15:01.187Z
Apologizing is a Core Rationalist Skill 2024-01-02T17:47:35.950Z
The Plan - 2023 Version 2023-12-29T23:34:19.651Z
Natural Latents: The Math 2023-12-27T19:03:01.923Z
Talk: "AI Would Be A Lot Less Alarming If We Understood Agents" 2023-12-17T23:46:32.814Z
Principles For Product Liability (With Application To AI) 2023-12-10T21:27:41.403Z
What I Would Do If I Were Working On AI Governance 2023-12-08T06:43:42.565Z
On Trust 2023-12-06T19:19:07.680Z
Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI" 2023-11-21T17:39:17.828Z
On the lethality of biased human reward ratings 2023-11-17T18:59:02.303Z
Some Rules for an Algebra of Bayes Nets 2023-11-16T23:53:11.650Z
Symbol/Referent Confusions in Language Model Alignment Experiments 2023-10-26T19:49:00.718Z
What's Hard About The Shutdown Problem 2023-10-20T21:13:27.624Z
Trying to understand John Wentworth's research agenda 2023-10-20T00:05:40.929Z
Bids To Defer On Value Judgements 2023-09-29T17:07:25.834Z
Inside Views, Impostor Syndrome, and the Great LARP 2023-09-25T16:08:17.040Z
Atoms to Agents Proto-Lectures 2023-09-22T06:22:05.456Z
What's A "Market"? 2023-08-08T23:29:24.722Z
Yes, It's Subjective, But Why All The Crabs? 2023-07-28T19:35:36.741Z
Alignment Grantmaking is Funding-Limited Right Now 2023-07-19T16:49:08.811Z
Why Not Subagents? 2023-06-22T22:16:55.249Z
Lessons On How To Get Things Right On The First Try 2023-06-19T23:58:09.605Z
Algorithmic Improvement Is Probably Faster Than Scaling Now 2023-06-06T02:57:33.700Z
$500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions 2023-05-16T21:31:35.490Z
The Lightcone Theorem: A Better Foundation For Natural Abstraction? 2023-05-15T02:24:02.038Z
Result Of The Bounty/Contest To Explain Infra-Bayes In The Language Of Game Theory 2023-05-09T16:35:26.751Z
How Many Bits Of Optimization Can One Bit Of Observation Unlock? 2023-04-26T00:26:22.902Z
Why Are Maximum Entropy Distributions So Ubiquitous? 2023-04-05T20:12:57.748Z
Shannon's Surprising Discovery 2023-03-30T20:15:54.065Z
A Primer On Chaos 2023-03-28T18:01:30.702Z
$500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory 2023-03-25T17:29:51.498Z
Why Not Just Outsource Alignment Research To An AI? 2023-03-09T21:49:19.774Z
Why Not Just... Build Weak AI Tools For AI Alignment Research? 2023-03-05T00:12:33.651Z
Scarce Channels and Abstraction Coupling 2023-02-28T23:26:03.539Z
Wentworth and Larsen on buying time 2023-01-09T21:31:24.911Z
The Feeling of Idea Scarcity 2022-12-31T17:34:04.306Z
What do you imagine, when you imagine "taking over the world"? 2022-12-31T01:04:02.370Z
Applied Linear Algebra Lecture Series 2022-12-22T06:57:26.643Z
The "Minimal Latents" Approach to Natural Abstractions 2022-12-20T01:22:25.101Z

Comments

Comment by johnswentworth on Examples of Highly Counterfactual Discoveries? · 2024-04-24T15:38:56.964Z · LW · GW

Nitpick: you're talking about the discovery of the structure of DNA; it was already known at that time to be the particle which mediates inheritance IIRC.

Comment by johnswentworth on Examples of Highly Counterfactual Discoveries? · 2024-04-24T15:30:12.054Z · LW · GW

I buy this argument.

Comment by johnswentworth on Examples of Highly Counterfactual Discoveries? · 2024-04-24T15:29:55.205Z · LW · GW

I buy this argument.

Comment by johnswentworth on Examples of Highly Counterfactual Discoveries? · 2024-04-24T15:28:47.424Z · LW · GW

I don't buy mathematical equivalence as an argument against, in this case, since the whole point of the path integral formulation is that it's mathematically equivalent but far simpler conceptually and computationally.

Comment by johnswentworth on Some Rules for an Algebra of Bayes Nets · 2024-04-23T22:59:58.350Z · LW · GW

Man, that top one was a mess. Fixed now, thank you!

Comment by johnswentworth on Examples of Highly Counterfactual Discoveries? · 2024-04-23T22:30:53.196Z · LW · GW

Here are some candidates from Claude and Gemini (Claude Opus seemed considerably better than Gemini Pro for this task). Unfortunately they are quite unreliable: I've already removed many examples from this list which I already knew to have multiple independent discoverers (like e.g. CRISPR and general relativity). If you're familiar with the history of any of these enough to say that they clearly were/weren't very counterfactual, please leave a comment.

  • Noether's Theorem
  • Mendel's Laws of Inheritance
  • Gödel's First Incompleteness Theorem (Claude mentions Von Neumann as an independent discoverer for the Second Incompleteness Theorem)
  • Feynman's path integral formulation of quantum mechanics
  • Onnes' discovery of superconductivity
  • Pauling's discovery of the alpha helix structure in proteins
  • McClintock's work on transposons
  • Observation of the cosmic microwave background
  • Lorentz's work on deterministic chaos
  • Prusiner's discovery of prions
  • Yamanaka factors for inducing pluripotency
  • Langmuir's adsorption isotherm (I have no idea what this is)

Comment by johnswentworth on Forget Everything (Statistical Mechanics Part 1) · 2024-04-22T16:37:20.752Z · LW · GW

I somehow missed that John Wentworth and David Lorell are also in the middle of a sequence on this same topic here.

Yeah, uh... hopefully nobody's holding their breath waiting for the rest of that sequence. That was the original motivator, but we only wrote the one post and don't have any more in development yet.

Point is: please do write a good stat mech sequence, David and I are not really "on that ball" at the moment.

Comment by johnswentworth on Goal oriented cognition in "a single forward pass" · 2024-04-22T05:45:09.462Z · LW · GW

(Didn't read most of the dialogue, sorry if this was covered.)

But the way transformers work is they greedily think about the very next token, and predict that one, even if by conditioning on it you shot yourself in the foot for the task at hand.

That depends on how we sample from the LLM. If, at each "timestep", we take the most-probable token, then yes that's right.

But an LLM gives a distribution over tokens at each timestep, i.e. $P[x_t \mid x_1, \dots, x_{t-1}]$. If we sample from that distribution, rather than take the most-probable at each timestep, then that's equivalent to sampling non-greedily from the learned distribution over text. It's the chain rule:

$$P[x_1, \dots, x_T] = \prod_t P[x_t \mid x_1, \dots, x_{t-1}]$$
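Here's a minimal sketch of the distinction, with `next_token_probs` as a stand-in for whatever model is being queried (a hypothetical function, not any particular API):

```python
import random

def generate(next_token_probs, prompt, n_tokens, greedy=False):
    """Generate tokens autoregressively.

    next_token_probs(tokens) -> dict mapping each candidate token to P[x_t | x_{<t}].
    Greedy decoding takes the argmax at each step; sampling draws from the full
    distribution, which by the chain rule is exactly sampling from the model's
    joint distribution over whole continuations.
    """
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = next_token_probs(tokens)
        if greedy:
            next_token = max(probs, key=probs.get)  # most-probable token only
        else:
            candidates, weights = zip(*probs.items())
            next_token = random.choices(candidates, weights=weights)[0]
        tokens.append(next_token)
    return tokens
```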

Comment by johnswentworth on LessOnline Festival Updates Thread · 2024-04-20T03:34:29.112Z · LW · GW

Writing collaboratively is definitely something David and I have been trying to figure out how to do productively.

Comment by johnswentworth on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer · 2024-04-18T15:53:04.022Z · LW · GW

How sure are we that models will keep tracking Bayesian belief states, and so allow this inverse reasoning to be used, when they don't have enough space and compute to actually track a distribution over latent states?

One obvious guess there would be that the factorization structure is exploited, e.g. independence and especially conditional independence/DAG structure. And then a big question is how distributions of conditionally independent latents in particular end up embedded.

Comment by johnswentworth on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T22:25:12.060Z · LW · GW

Yup, that was it, thank you!

Comment by johnswentworth on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T21:04:59.638Z · LW · GW

We're now working through understanding all the pieces of this, and we've calculated an MSP which doesn't quite look like the one in the post:

(Ignore the skew, David's still fiddling with the projection into 2D. The important noticeable part is the absence of "overlap" between the three copies of the main shape, compared to the fractal from the post.)

Specifically, each point in that visual corresponds to a distribution $P[S \mid x_{1:t}]$ over the latent state $S$, for some value of the observed symbols $x_{1:t}$. The image itself is of the points on the probability simplex. From looking at a couple of Crutchfield papers, it sounds like that's what the MSP is supposed to be.

The update equations are:

$$P[S_{t+1} \mid x_{1:t+1}] = \frac{1}{Z} \, P[x_{t+1} \mid S_{t+1}] \sum_{S_t} P[S_{t+1} \mid S_t] \, P[S_t \mid x_{1:t}]$$

with $P[S_{t+1} \mid S_t]$ given by the transition probabilities, $P[x_{t+1} \mid S_{t+1}]$ given by the observation probabilities, and $Z$ a normalizer. We generate the image above by initializing some random distribution $P[S_0]$, then iterating the equations and plotting each point.

Off the top of your head, any idea what might account for the mismatch (other than a bug in our code, which we're already checking)? Are we calculating the right thing, i.e. values of $P[S \mid x_{1:t}]$? Are the transition and observation probabilities from the graphic in the post the same parameters used to generate the fractal? Is there some thing which people always forget to account for when calculating these things?
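For reference, here's a minimal numpy sketch of the iteration described above (the transition/observation matrices are stand-ins rather than the parameters from the post, and one could enumerate all short observation strings instead of sampling a single trajectory):

```python
import numpy as np

def belief_state_points(T, O, n_steps=5000, seed=0):
    """Iterate the Bayesian belief-update equations for a hidden Markov process.

    T[s_next, s] : transition probabilities P[S_{t+1}=s_next | S_t=s]
    O[x, s]      : observation probabilities P[x_{t+1}=x | S_{t+1}=s]
    Returns the visited belief states P[S | x_{1:t}], i.e. points on the simplex.
    """
    rng = np.random.default_rng(seed)
    n_states = T.shape[0]
    belief = rng.dirichlet(np.ones(n_states))     # random initial distribution over S
    points = [belief]
    for _ in range(n_steps):
        predictive = T @ belief                   # P[S_{t+1} | x_{1:t}]
        obs_probs = O @ predictive                # P[x_{t+1} | x_{1:t}]
        obs_probs /= obs_probs.sum()
        x = rng.choice(len(obs_probs), p=obs_probs)
        belief = O[x] * predictive                # Bayes update, then normalize
        belief /= belief.sum()
        points.append(belief)
    return np.array(points)
```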

Comment by johnswentworth on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T02:18:30.450Z · LW · GW

Can you elaborate on how the fractal is an artifact of how the data is visualized?

I don't know the details of the MSP, but my current understanding is that it's a general way of representing stochastic processes, and the MSP representation typically looks quite fractal. If we take two approximately-the-same stochastic processes, then they'll produce visually-similar fractals.

But the "fractal-ness" is mostly an artifact of the MSP as a representation-method IIUC; the stochastic process itself is not especially "naturally fractal".

(As I said I don't know the details of the MSP very well; my intuition here is instead coming from some background knowledge of where fractals which look like those often come from, specifically chaos games.)

That there is a linear 2d plane in the residual stream that when you project onto it you get that same fractal seems highly non-artifactual, and is what we were testing.

A thing which is highly cruxy for me here, which I did not fully understand from the post: what exactly is the function which produces the fractal visual from the residual activations? My best guess from reading the post was that the activations are linearly regressed onto some kind of distribution, and then the distributions are represented in a particular way which makes smooth sets of distributions look fractal. If there's literally a linear projection of the residual stream into two dimensions which directly produces that fractal, with no further processing/transformation in between "linear projection" and "fractal", then I would change my mind about the fractal structure being mostly an artifact of the visualization method.

Comment by johnswentworth on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T01:35:03.240Z · LW · GW

[EDIT: I no longer endorse this response, see thread.]

(This comment is mainly for people other than the authors.)

If your reaction to this post is "hot damn, look at that graph", then I think you should probably dial back your excitement somewhat. IIUC the fractal structure is largely an artifact of how the data is visualized, which means the results visually look more striking than they really are.

It is still a cool piece of work, and the visuals are beautiful. The correct amount of excitement is greater than zero.

Comment by johnswentworth on Generalized Stat Mech: The Boltzmann Approach · 2024-04-13T17:27:39.047Z · LW · GW

Yup. Also, I'd add that entropy in this formulation increases exactly when more than one macrostate at time $t$ maps to the same actually-realized macrostate at time $t+1$, i.e. when the macrostate evolution is not time-reversible.

Comment by johnswentworth on Generalized Stat Mech: The Boltzmann Approach · 2024-04-13T00:00:41.137Z · LW · GW

This post was very specifically about a Boltzmann-style approach. I'd also generally consider the Gibbs/Shannon formula to be the "real" definition of entropy, and usually think of Boltzmann as the special case where the microstate distribution is constrained uniform. But a big point of this post was to be like "look, we can get surprisingly a lot (though not all) of thermo/stat mech even without actually bringing in any actual statistics, just restricting ourselves to the Boltzmann notion of entropy".

Comment by johnswentworth on How I select alignment research projects · 2024-04-12T15:52:20.683Z · LW · GW

Meta: this comment is decidedly negative feedback, so needs the standard disclaimers. I don't know Ethan well, but I don't harbor any particular ill-will towards him. This comment is negative feedback about Ethan's skill in choosing projects in particular, I do not think others should mimic him in that department, but that does not mean that I think he's a bad person/researcher in general. I leave the comment mainly for the benefit of people who are not Ethan, so for Ethan: I am sorry for being not-nice to you here.


When I read the title, my first thought was "man, Ethan Perez sure is not someone I'd point to as an exemplar of choosing good projects".

On reading the relevant section of the post, it sounds like Ethan's project-selection method is basically "forward-chain from what seems quick and easy, and also pay attention to whatever other people talk about". Which indeed sounds like a recipe for very mediocre projects: it's the sort of thing you'd expect a priori to reliably produce publications and be talked about, but have basically-zero counterfactual impact. These are the sorts of projects where someone else would likely have done something similar regardless, and it's not likely to change how people are thinking about things or building things; it's just generally going to add marginal effort to the prevailing milieu, whatever that might be.

Comment by johnswentworth on How We Picture Bayesian Agents · 2024-04-09T23:05:31.402Z · LW · GW

From reading, I imagined a memory+cache structure instead of being closer to "cache all the way down".

Note that the things being cached are not things stored in memory elsewhere. Rather, they're (supposedly) outputs of costly-to-compute functions - e.g. the instrumental value of something would be costly to compute directly from our terminal goals and world model. And most of the values in cache are computed from other cached values, rather than "from scratch" - e.g. the instrumental value of X might be computed (and then cached) from the already-cached instrumental values of some stuff which X costs/provides.
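As a toy illustration of that structure (everything below is made up for the example): each cached value is computed from other cached values, with only the leaves grounding out in anything like a "from scratch" computation.

```python
from functools import lru_cache

# Hypothetical toy world-model: each item is valued for the things it provides.
PROVIDES = {
    "job": ["money"],
    "money": ["food", "shelter"],
    "food": [],      # terminal leaves
    "shelter": [],
}
TERMINAL_VALUE = {"food": 10.0, "shelter": 8.0}

@lru_cache(maxsize=None)   # the cache in question
def value(item: str) -> float:
    """Instrumental value, computed from other *cached* values rather than re-derived
    from the terminal goals and world model every time."""
    if not PROVIDES[item]:
        return TERMINAL_VALUE.get(item, 0.0)
    return sum(value(dep) for dep in PROVIDES[item])

print(value("job"))  # fills the cache for "money", "food", "shelter" along the way
```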

Coherence of Caches and Agents goes into more detail on that part of the picture, if you're interested.

Comment by johnswentworth on How We Picture Bayesian Agents · 2024-04-09T23:00:43.899Z · LW · GW

Very far through the graph representing the causal model, where we start from one or a few nodes representing the immediate observations.

Comment by johnswentworth on How We Picture Bayesian Agents · 2024-04-08T20:47:24.093Z · LW · GW

You were talking about values and preferences in the previous paragraph, then suddenly switched to “beliefs”. Was that deliberate?

Yes.

Comment by johnswentworth on How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)? · 2024-04-08T18:51:08.964Z · LW · GW

... man, now that the post has been downvoted a bunch I feel bad for leaving such a snarky answer. It's a perfectly reasonable question, folks!

Overcompressed actual answer: core pieces of a standard doom-argument involve things like "killing all the humans will be very easy for a moderately-generally-smarter-than-human AI" and "killing all the humans (either as a subgoal or a side-effect of other things) is convergently instrumentally useful for the vast majority of terminal objectives". A standard doom counterargument usually doesn't dispute those two pieces (though there are of course exceptions); a standard doom counterargument usually argues that we'll have ample opportunity to iterate, and therefore it doesn't matter that the vast majority of terminal objectives instrumentally incentivize killing humans, we'll iterate until we find ways to avoid that sort of thing.

The standard core disagreement is then mostly about the extent to which we'll be able to iterate, or will in fact iterate in ways which actually help. In particular, cruxy subquestions tend to include:

  • How visible will "bad behavior" be early on? Will there be "warning shots"? Will we have ways to detect unwanted internal structures?
  • How sharply/suddenly will capabilities increase?
  • Insofar as problems are visible, will labs and/or governments actually respond in useful ways?

Militarization isn't very centrally relevant to any of these; it's mostly relevant to things which are mostly not in doubt anyways, at least in the medium-to-long term.

Comment by johnswentworth on Algorithmic Improvement Is Probably Faster Than Scaling Now · 2024-04-07T00:10:09.415Z · LW · GW

Yes, I mean "mole" as in the unit from chemistry. I used it because I found it amusing.

Comment by johnswentworth on Algorithmic Improvement Is Probably Faster Than Scaling Now · 2024-04-07T00:09:02.989Z · LW · GW

Every algorithmic improvement is a one-time boost.

Comment by johnswentworth on How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)? · 2024-04-06T16:28:11.125Z · LW · GW

It doesn't.

Comment by johnswentworth on Coherence of Caches and Agents · 2024-04-02T18:46:14.763Z · LW · GW

Here's what it would typically look like in a control theory problem.

There's a long-term utility $u_T(x_T)$ which is a function of the final state $x_T$, and a short-term utility $u(t, x_t, a_t)$ which is a function of time $t$, the state $x_t$ at time $t$, and the action $a_t$ at time $t$. (Often the problem is formulated with a discount rate $\gamma$, but in this case we're allowing time-dependent short-term utility, so we can just absorb the discount rate into $u$.) The objective is then to maximize

$$E\left[\sum_{t < T} u(t, x_t, a_t) + u_T(x_T)\right]$$

In that case, the value function $V(t, x_t)$ is a max over trajectories starting at $(t, x_t)$; it satisfies

$$V(t, x_t) = \max_{a_t} \left( u(t, x_t, a_t) + E[V(t+1, x_{t+1}) \mid x_t, a_t] \right), \qquad V(T, x_T) = u_T(x_T)$$

The key thing to notice is that we can solve that equation for $u$:

$$u(t, x_t, a_t) = V(t, x_t) - E[V(t+1, x_{t+1}) \mid x_t, a_t]$$

So given an arbitrary value function $V$, we can find a short-term utility function $u$ which produces that value function by using that equation to compute $u$ starting from the last timestep and working backwards.

Thus the claim from the post: for any value function, there exists a short-term utility function which induces that value function.

What if we restrict to only consider long-term utility, i.e. set $u = 0$? Well, then the value function is no longer so arbitrary. That's the case considered in the post, where we have constraints which the value function must satisfy regardless of $u_T$.
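A minimal numpy sketch of that construction for finite states and actions (the array shapes and names are my own convention):

```python
import numpy as np

def short_term_utility_from_value(V, P):
    """Given an arbitrary value function V[t, x] and transition probabilities
    P[t, x, a, x'] = P[x_{t+1}=x' | x_t=x, a_t=a], construct a short-term utility
    u[t, x, a] such that V satisfies the Bellman equation
        V[t, x] = max_a ( u[t, x, a] + E[V[t+1, x'] | x, a] ).
    This particular choice makes every action optimal, which is one valid solution.
    """
    n_T, n_x = V.shape
    n_a = P.shape[2]
    u = np.zeros((n_T - 1, n_x, n_a))
    for t in reversed(range(n_T - 1)):          # last timestep first, working backwards
        expected_next_V = P[t] @ V[t + 1]       # shape (n_x, n_a): E[V[t+1, x'] | x, a]
        u[t] = V[t][:, None] - expected_next_V  # solve the Bellman equation for u
    return u
```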

Did that clarify?

Comment by johnswentworth on What is the nature of humans general intelligence and it's implications for AGI? · 2024-03-26T20:45:37.721Z · LW · GW

On the matter of software improvements potentially available during recursive self-improvement, we can look at the current pace of algorithmic improvement, which has probably been faster than scaling for some time now. So that's another lower bound on what AI will be capable of, assuming that the extrapolation holds up.

Comment by johnswentworth on What is the nature of humans general intelligence and it's implications for AGI? · 2024-03-26T16:57:05.597Z · LW · GW

This is definitely a split which I think underlies a lot of differing intuitions about AGI and timelines. That said, the versions of each which are compatible with evidence/constraints generally have similar implications for at least the basics of AI risk (though they differ in predictions about what AI looks like "later on", once it's already far past eclipsing the capabilities of the human species).

Key relevant evidence/constraints, under my usual framing:

  • We live in a very high dimensional environment. When doing science/optimization in such an environment, brute-force search is exponentially intractable, so having e.g. ten billion humans running the same basic brute-force algorithm will not be qualitatively better than one human running a brute-force algorithm. The fact that less-than-exponentially-large numbers of humans are able to perform as well as we do implies that there's some real "general intelligence" going on in there somewhere.
    • That said, it's still possible-in-principle for whatever general intelligence we have to be importantly distributed across humans. What the dimensionality argument rules out is a model in which humans' capabilities are just about brute-force trying lots of stuff, and then memetic spread of whatever works. The "trying stuff" step has to be doing "most of the work", in some sense, of finding good models/techniques/etc; but whatever process is doing that work could itself be load-bearingly spread across humans.
    • Also, memetic spread could still be a bottleneck in practice, even if it's not "doing most of the work" in an algorithmic sense.
  • A lower bound for what AI can do is "run lots of human-equivalent minds, and cheaply copy them". Even under a model where memetic spread is the main bottlenecking step for humans, AI will still be ridiculously better at that. You know that problem humans have where we spend tons of effort accumulating "tacit knowledge" which is hard to convey to the next generation? For AI, cheap copy means that problem is just completely gone.
  • Humans' own historical progress/experience puts an upper bound on how hard it is to solve novel problems (not solved by society today). Humans have done... rather ridiculously a lot of that, over the past 250 years. That, in turn, lower bounds what AIs will be capable of.

Comment by johnswentworth on Natural Latents: The Concepts · 2024-03-24T16:10:49.548Z · LW · GW

Only if they both predictably painted that part purple, e.g. as part of the overall plan. If they both randomly happened to paint the same part purple, then no.

Comment by johnswentworth on D0TheMath's Shortform · 2024-03-22T21:45:04.971Z · LW · GW

The main model I know of under which this matters much right now is: we're pretty close to AGI already, it's mostly a matter of figuring out the right scaffolding. Open-sourcing weights makes it a lot cheaper and easier for far more people to experiment with different scaffolding, thereby bringing AGI significantly closer in expectation. (As an example of someone who IIUC sees this as the mainline, I'd point to Connor Leahy.)

Comment by johnswentworth on Natural Latents: The Concepts · 2024-03-22T15:45:11.933Z · LW · GW
Comment by johnswentworth on Natural Latents: The Concepts · 2024-03-22T15:39:35.076Z · LW · GW

Sounds like I've maybe not communicated the thing about circularity. I'll try again, it would be useful to let me know whether or not this new explanation matches what you were already picturing from the previous one.

Let's think about circular definitions in terms of equations for a moment. We'll have two equations: one which "defines" $x$ in terms of $y$, and one which "defines" $y$ in terms of $x$:

$$x = f(y)$$

$$y = g(x)$$

Now, if $g = f^{-1}$, then (I claim) that's what we normally think of as a "circular definition". It's "pretending" to fully specify $x$ and $y$, but in fact it doesn't, because one of the two equations is just a copy of the other equation but written differently. The practical problem, in this case, is that $x$ and $y$ are very underspecified by the supposed joint "definition".

But now suppose $g$ is not $f^{-1}$, and more generally the equations are not degenerate. Then our two equations are typically totally fine and useful, and indeed we use equations like this all the time in the sciences and they work great. Even though they're written in a "circular" way, they're substantively non-circular. (They might still allow for multiple solutions, but the solutions will typically at least be locally unique, so there's a discrete and typically relatively small set of solutions.)

That's the sort of thing which clustering algorithms do: they have some equations "defining" cluster-membership in terms of the data points and cluster parameters, and equations "defining" the cluster parameters in terms of the data points and the cluster-membership:

cluster_membership = f(data, cluster_params)

cluster_params = g(data, cluster_membership)

... where $f$ and $g$ are different (i.e. non-degenerate; $g$ is not just $f^{-1}$ with data held constant). Together, these "definitions" specify a discrete and typically relatively small set of candidate (cluster_membership, cluster_params) values given some data.
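Here's a minimal sketch of such a non-degenerate pair, solved by just alternating the two equations; this particular $f$ and $g$ is basically k-means, chosen purely for illustration:

```python
import numpy as np

def f(data, cluster_params):
    """cluster_membership = f(data, cluster_params): assign each point to the nearest center."""
    dists = np.linalg.norm(data[:, None, :] - cluster_params[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def g(data, cluster_membership, old_params):
    """cluster_params = g(data, cluster_membership): mean of each cluster's points."""
    return np.array([
        data[cluster_membership == i].mean(axis=0)
        if np.any(cluster_membership == i) else old_params[i]
        for i in range(len(old_params))
    ])

def solve(data, k, n_iters=50, seed=0):
    """Jointly solve the two "circular" equations by alternating them to a fixed point."""
    rng = np.random.default_rng(seed)
    params = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iters):
        membership = f(data, params)
        params = g(data, membership, params)
    return membership, params
```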

That, I claim, is also part of what's going on with abstractions like "dog".

(Now, choice of axes is still a separate degree of freedom which has to be handled somehow. And that's where I expect the robustness to choice of axes does load-bearing work. As you say, that's separate from the circularity issue.)

Comment by johnswentworth on Alignment Implications of LLM Successes: a Debate in One Act · 2024-03-20T20:29:58.438Z · LW · GW

As I mentioned at the end, it's not particularly relevant to my own models either way, so I don't particularly care. But I do think other people should want to run this experiment, based on their stated models.

Comment by johnswentworth on Measuring Coherence of Policies in Toy Environments · 2024-03-20T18:27:20.978Z · LW · GW

That's only true if the Bellman equation in question allows for a "current payoff" at every timestep. That's the term which allows for totally arbitrary value functions, and not-coincidentally it's the term which does not reflect long-range goals/planning, just immediate payoff.

If we're interested in long-range goals/planning, then the natural thing to do is check how consistent the policy is with a Bellman equation without a payoff at each timestep - i.e. a value function just backpropagated from some goal at a much later time. That's what would make the check nontrivial: there exist policies which are not consistent with any assignment of values satisfying that Bellman equation. For example, the policy which chooses to transition from state A -> B with probability 1 over the option to stay at A with probability 1 (implying value B > value A for any values consistent with that policy), but also chooses to transition B -> A with probability 1 over the option to stay at B with probability 1 (implying value A > value B for any values consistent with that policy).

(There's still the trivial case where indifference could be interpreted as compatible with any policy, but that's easy to handle by adding a nontriviality requirement.)
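Here's a minimal sketch of that check for deterministic policies over a handful of states, brute-forcing over value orderings (purely illustrative):

```python
from itertools import permutations

def consistent_value_assignment(moves):
    """moves: dict state -> state the policy chooses to move to, given the option to
    stay put. Each such choice implies value(target) > value(current) for any value
    function consistent with the policy (with no per-timestep payoff). Return a
    consistent assignment of values if one exists, else None."""
    states = sorted(set(moves) | set(moves.values()))
    for ranks in permutations(range(len(states))):
        value = dict(zip(states, ranks))
        if all(value[dst] > value[src] for src, dst in moves.items()):
            return value
    return None

print(consistent_value_assignment({"A": "B", "B": "A"}))  # None: no consistent values
print(consistent_value_assignment({"A": "B"}))            # e.g. {'A': 0, 'B': 1}
```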

Comment by johnswentworth on Measuring Coherence of Policies in Toy Environments · 2024-03-20T16:42:50.728Z · LW · GW

I don't usually think about RL on MDPs, but it's an unusually easy setting in which to talk about coherence and its relationship to long-term-planning/goal-seeking/power-seeking.

Simplest starting point: suppose we're doing RL to learn a value function (i.e. mapping from states to values, or mapping from states x actions to values, whatever your preferred setup), with transition probabilities known. Well, in terms of optimal behavior, we know that the optimal value function for any objective in the far future will locally obey the Bellman equation with zero payoff in the immediate timestep: value of this state is equal to the max over actions of expected next-state value under that action. So insofar as we're interested in long-term goals specifically, there's an easy local check for the extent to which the value function "optimizes for" such long-term goals: just check how well it locally satisfies that Bellman equation.

From there, we can extend to gradually more complicated cases in ways which look similar to typical coherence theorems (like e.g. Dutch Book theorems). For instance, we could relax the requirement of known probabilities: we can ask whether there is any assignment of state-transition probabilities such that the values satisfy the Bellman equation.

As another example, if we're doing RL on a policy rather than value function, we can ask whether there exists any value function consistent with the policy such that the values satisfy the Bellman equation.

Comment by johnswentworth on On Devin · 2024-03-18T15:50:14.465Z · LW · GW

So that example SWE bench problem from the post:

... is that a prototypical problem from that benchmark? Because if so, that is a hilariously easy benchmark. Like, something could ace that task and still be coding at less than a CS 101 level.

(Though to be clear, people have repeatedly told me that a surprisingly high fraction of applicants for programming jobs can't do fizzbuzz, so even a very low level of competence would still put it above many would-be software engineers.)

Comment by johnswentworth on The Parable Of The Fallen Pendulum - Part 2 · 2024-03-13T20:02:20.783Z · LW · GW

Fixed, thanks.

Comment by johnswentworth on Natural Latents: The Math · 2024-03-08T23:52:09.698Z · LW · GW

Yeah, that's right.

The secret handshake is to start with "$X_1$ is independent of $\Lambda$ given $X_2$" and "$X_2$ is independent of $\Lambda$ given $X_1$", expressed in this particular form:

$$P[\Lambda \mid X_1, X_2] = P[\Lambda \mid X_2]$$

$$P[\Lambda \mid X_1, X_2] = P[\Lambda \mid X_1]$$

... then we immediately see that $P[\Lambda \mid X_1 = x_1] = P[\Lambda \mid X_2 = x_2]$ for all $(x_1, x_2)$ such that $P[X_1 = x_1, X_2 = x_2] > 0$.

So if there are no zero probabilities, then $P[\Lambda \mid X_1 = x_1] = P[\Lambda \mid X_2 = x_2]$ for all $(x_1, x_2)$.

That, in turn, implies that $P[\Lambda \mid X_1 = x_1]$ takes on the same value for all $x_1$, which in turn means that it's equal to $P[\Lambda]$. Thus $\Lambda$ and $X_1$ are independent. Likewise for $\Lambda$ and $X_2$. Finally, we leverage independence of $X_1$ and $X_2$ given $\Lambda$:

$$P[X_1, X_2] = \sum_\lambda P[\Lambda = \lambda] P[X_1 \mid \Lambda = \lambda] P[X_2 \mid \Lambda = \lambda] = \sum_\lambda P[\Lambda = \lambda] P[X_1] P[X_2] = P[X_1] P[X_2]$$

(A similar argument is in the middle of this post, along with a helpful-to-me visual.)

Comment by johnswentworth on Natural Latents: The Math · 2024-03-08T19:08:06.806Z · LW · GW

Roughly speaking, all variables completely independent is the only way to satisfy all the preconditions without zero-ish probabilities.

This is easiest to see if we use a "strong invariance" condition, in which each of the $X_i$ must mediate between $\Lambda$ and all the other $X_j$'s. Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure ($\Lambda$) from any little spatially-localized chunk of the gas ($X_i$). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn't imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e. $\Lambda$ only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.

Comment by johnswentworth on Many arguments for AI x-risk are wrong · 2024-03-05T16:53:24.720Z · LW · GW

I agree with this point as stated, but think the probability is more like 5% than 0.1%

Same.

I do think our chances look not-great overall, but most of my doom-probability is on things which don't look like LLMs scheming.

Also, are you making sure to condition on "scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"

That's not particularly cruxy for me either way.

Separately, I'm uncertain whether the current training procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF".

Fair. Insofar as "scaling up networks, running pretraining + RL" does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.

Comment by johnswentworth on Many arguments for AI x-risk are wrong · 2024-03-05T03:57:35.153Z · LW · GW

Solid post!

I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer), and I think this is the best write-up of it I've seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.

Comment by johnswentworth on Increasing IQ is trivial · 2024-03-02T01:24:06.061Z · LW · GW

Yup.

[EDIT April 5: I do not currently "have the ball" on this, so to anybody reading this who would go test it themselves if-and-only-if they don't see somebody else already on it: I am not on it.]

Comment by johnswentworth on Increasing IQ is trivial · 2024-03-02T00:27:08.618Z · LW · GW

Mind sharing a more complete description of the things you tried? Like, the sort of description which one could use to replicate the experiment?

Comment by johnswentworth on The Parable Of The Fallen Pendulum - Part 1 · 2024-03-01T20:19:43.645Z · LW · GW

What was your old job?

Comment by johnswentworth on Counting arguments provide no evidence for AI doom · 2024-02-28T03:41:12.059Z · LW · GW

Did you see the footnote I wrote on this? I give a further argument for it.

Ah yeah, I indeed missed that the first time through. I'd still say I don't buy it, but that's a more complicated discussion, and it is at least a decent argument.

I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I'm open to hearing it.

This is another place where I'd say we don't understand it well enough to give a good formal definition or operationalization yet.

Though I'd note here, and also above w.r.t. search, that "we don't know how to give a good formal definition yet" is very different from "there is no good formal definition" or "the underlying intuitive concept is confused" or "we can't effectively study the concept at all" or "arguments which rely on this concept are necessarily wrong/uninformative". Every scientific field was pre-formal/pre-paradigmatic once.

To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field's ability to correctly diagnose bullshit.

That said, I don't think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.

Man, that is one hell of a bullet to bite. Much kudos for intellectual bravery and chutzpah!

That might be a fun topic for a longer discussion at some point, though not right now.

Comment by johnswentworth on Counting arguments provide no evidence for AI doom · 2024-02-28T02:27:21.246Z · LW · GW

I would like to see a much more rigorous definition of "search" and why search would actually be "compressive" in the relevant sense for NN inductive biases. My current take is something like "a lot of the references to internal search on LW are just incoherent" and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward "search" in ways that are totally benign.

More generally, I'm quite skeptical of the jump from any mechanistic notion of search, and the kind of grabby consequentialism that people tend to be worried about. I suspect there's a double dissociation between these things, where "mechanistic search" is almost always benign, and grabby consequentialism need not be backed by mechanistic search.

Some notes on this:

  • I don't think general-purpose search is sufficiently well-understood yet to give a rigorous mechanistic definition. (Well, unless one just gives a very wrong definition.)
  • Likewise, I don't think we understand either search or NN biases well enough yet to make a formal compression argument. Indeed, that sounds like a roughly-agent-foundations-complete problem.
  • I'm pretty skeptical that internal general-purpose search is compressive in current architectures. (And this is one reason why I expect most AI x-risk to come from importantly-different future architectures.) Low confidence, though.
    • Also, current architectures do have at least some "externalized" general-purpose search capabilities, insofar as they can mimic the "unrolled" search process of a human or group of humans thinking out loud. That general-purpose search process is basically AgentGPT. Notably, it doesn't work very well to date.
  • Insofar as I need a working not-very-formal definition of general-purpose search, I usually use a behavioral definition: a system which can take in a representation of a problem in some fairly-broad class of problems (typically in a ~fixed environment), and solve it.
  • The argument that a system which satisfies that behavioral definition will tend to also have an "explicit search-architecture", in some sense, comes from the recursive nature of problems. E.g. humans solve large novel problems by breaking them into subproblems, and then doing their general-purpose search/problem-solving on the subproblems; that's an explicit search architecture.
  • I definitely agree that grabby consequentialism need not be backed by mechanistic search. More skeptical of the claim mechanistic search is usually benign, at least if by "mechanistic search" we mean general-purpose search (though I'd agree with a version of this which talks about a weaker notion of "search").

Also, one maybe relevant deeper point, since you seem familiar with some of the philosophical literature: IIUC the most popular way philosophers ground semantics is in the role played by some symbol/signal in the evolutionary environment. I view this approach as a sort of placeholder: it's definitely not the "right" way to ground semantics, but philosophy as a field is using it as a stand-in until people work out better models of grounding (regardless of whether the philosophers themselves know that they're doing so). This is potentially relevant to the "representation of a problem" part of general-purpose search.

I'm curious which parts of the Goal Realism section you find "philosophically confused," because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.

(I'll briefly comment on each section, feel free to double-click.)

Against Goal Realism: Huemer... indeed seems confused about all sorts of things, and I wouldn't consider either the "goal realism" or "goal reductionism" picture solid grounds for use of an indifference principle (not sure if we agree on that?). Separately, "reductionism as a general philosophical thesis" does not imply the thing you call "goal reductionism" - for instance one could reduce "goals" to some internal mechanistic thing, rather than thinking about "goals" behaviorally, and that would be just as valid for the general philosophical/scientific project of reductionism. (Not that I necessarily think that's the right way to do it.)

Goal Slots Are Expensive: just because it's "generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules" doesn't mean the end-to-end trained system will turn out non-modular. Biological organisms were trained end-to-end by evolution, yet they ended up very modular.

Inner Goals Would Be Irrelevant: I think the point this section was trying to make is something I'd classify as a pointer problem? I.e. the internal symbolic "goal" does not necessarily neatly correspond to anything in the environment at all. If that was the point, then I'm basically on-board, though I would mention that I'd expect evolution/SGD/cultural evolution/within-lifetime learning/etc to drive the internal symbolic "goal" to roughly match natural structures in the world. (Where "natural structures" cashes out in terms of natural latents, but that's a whole other conversation.)

Goal Realism Is Anti-Darwinian: Fodor obviously is deeply confused, but I think you've misdiagnosed what he's confused about. "The physical world has no room for goals with precise contents" is somewhere between wrong and a nonsequitur, depending on how we interpret the claim. "The problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter" is correct, but very incomplete as a response to Fodor.

Goal Reductionism Is Powerful: While most of this section sounds basically-correct as written, the last few sentences seem to be basically arguing for behaviorism for LLMs. There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.

Comment by johnswentworth on Counting arguments provide no evidence for AI doom · 2024-02-28T01:10:33.365Z · LW · GW

This isn't a proper response to the post, but since I've occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:

  • This post basically-correctly refutes a kinda-mediocre (though relatively-commonly-presented) version of the counting argument.
  • There does exist a version of the counting argument which basically works.
  • The version which works routes through compression and/or singular learning theory.
  • In particular, that version would talk about "goal-slots" (i.e. general-purpose search) showing up for exactly the same reasons that neural networks are able to generalize in the overparameterized regime more generally. In other words, if you take the "counting argument for overfitting" from the post, walk through the standard singular-learning-theory-style response to that story, and then translate that response over to general-purpose search as a specific instance of compression, then you basically get the good version of the counting argument.
  • The "Against Goal Realism" section is a wild mix of basically-correct points and thorough philosophical confusion. I would say the overall point it's making is probably mostly-true of LLMs, false of humans, and most of the arguments are confused enough that they don't provide much direct evidence relevant to either of those.

Pretty decent post overall.

Comment by johnswentworth on Natural Latents: The Math · 2024-02-27T17:42:09.673Z · LW · GW

Edited, thanks.

Comment by johnswentworth on And All the Shoggoths Merely Players · 2024-02-27T04:59:23.753Z · LW · GW

Third, the nontrivial prediction of #20 here is about "compactly describable errors". "Mislabelling a large part of the time (but not most of the time)" is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you'd have a meaningful boost in generalization error, but that doesn't happen. Easy Bayes update against #20. (And if we can't agree on this, I don't see what we can agree on.)

I indeed disagree with that, and I see two levels of mistake here. At the object level, there's a mistake of not thinking through the gears. At the epistemic level, it looks like you're trying to apply the "what would I have expected in advance?" technique of de-biasing, in a way which does not actually work well in practice. (The latter mistake I think is very common among rationalists.)

First, object-level: let's walk through the gears of a mental model here. Model: train a model to predict labels for images, and it will learn a distribution of labels for each image (at least that's how we usually train them). If we relabel 1's as 7's 20% of the time, then the obvious guess is that the model will assign about 20% probability (plus its "real underlying uncertainty", which we'd expect to be small for large fully-trained models) to the label 7 when the digit is in fact a 1.

What does that predict about accuracy? That depends on whether the label we interpret our model as predicting is top-1, or sampled from the predictive distribution. If the former (as is usually used, and IIUC is used in the paper), then this concrete model would predict basically the curves we see in the paper: as noise ramps up, accuracy moves relatively little (especially for large fully-trained models), until the incorrect digit is approximately as probable as the correct digit, at which point accuracy plummets to ~50%. And once the incorrect digit is unambiguously more probable than the correct digit, accuracy drops to near-0.

The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit. And thinking through the gears of Yudkowsky's #20, the obvious update is that predictable human-labeller-errors which are not the most probable labels are not super relevant (insofar as we use top-1 sampling, i.e. near-zero temperature) whereas human-labeller-errors which are most probable are a problem in basically the way Yudkowsky is saying. (... insofar as we should update at all from this experiment, which we shouldn't very much.)
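To make that gears-level model concrete, here's a toy calculation (the setup and numbers are illustrative, not taken from the paper):

```python
def toy_accuracies(noise_rate):
    """Toy model: '1' images get labelled '7' with probability noise_rate, and a
    calibrated model learns P['7' | image of a 1] ≈ noise_rate. Score against the
    *true* digit, comparing top-1 decoding vs sampling from the predictive
    distribution."""
    p_true = 1.0 - noise_rate           # predictive mass on the correct digit
    top1 = 1.0 if p_true > 0.5 else (0.5 if p_true == 0.5 else 0.0)
    sampled = p_true                    # sampling is correct exactly this often
    return top1, sampled

for noise in [0.0, 0.2, 0.4, 0.49, 0.51, 0.8]:
    print(noise, toy_accuracies(noise))
# Top-1 accuracy stays flat until the noise crosses 50%, then falls off a cliff;
# sampled accuracy instead degrades linearly with the noise.
```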

Second, epistemic-level: my best guess is that you're ignoring these gears because they're not things whose relevance you would have anticipated in advance, and therefore focusing on them in hindsight risks bias[1]. Which, yes, it does risk bias. 

Unfortunately, the first rule of experiments is You Are Not Measuring What You Think You Are Measuring. Which means that, in practice, the large majority of experiments which nominally attempt to test some model/theory in a not-already-thoroughly-understood-domain end up getting results which are mostly determined by things unrelated to the model/theory. And, again in practice, few-if-any people have the skill of realizing in advance which things will be relevant to the outcome of any given experiment. "Which things are we actually measuring?" is itself usually figured out (if it's figured out at all) by looking at data from the experiment.

Now, this is still compatible with using the "what would I have expected in advance?" technique. But it requires that ~all the time, the thing I expect in advance from any given experiment is "this experiment will mostly measure some random-ass thing which has little to do with the model/theory I'm interested in, and I'll have to dig through the details of the experiment and results to figure out what it measured". If one tries to apply the "what would I have expected in advance?" technique, in a not-thoroughly-understood domain, without an overwhelming prior that the experimental outcome is mostly determined by things other than the model/theory of interest, then mostly one ends up updating in basically-random directions and becoming very confused.

  1. ^

    Standard disclaimer about guessing what's going on inside other peoples' heads being hard, you have more data than I on what's in your head, etc.

Comment by johnswentworth on And All the Shoggoths Merely Players · 2024-02-22T06:32:16.079Z · LW · GW

This one is somewhat more Wentworth-flavored than our previous Doomimirs.

Also, I'll write Doomimir's part unquoted this time, because I want to use quote blocks within it.

On to Doomimir!


We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So ... why doesn't that happen?

Let's start with this.

Short answer: because those aren't actually very effective ways to get high ratings, at least within the current capability regime.

Long version: presumably the labeller knows perfectly well that they're working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that... have you ever personally done an exercise where you try to convince someone to do something they don't want to do, or aren't supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner's job was to not open their fist, while the other partner's job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar - not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.

Point of that story: based on my own firsthand experience, if you're not actually going to pay someone right now, then it's far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.

Ultimately, our discussion is using "threats and bribes" as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.

Now, you could reasonably respond: "Isn't it kinda fishy that the supposed failures on which your claim rests are 'illegible'?"

To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:

The iterative design loop hasn't failed yet.

Now that's a very interesting claim. I ask: what do you think you know, and how do you think you know it?

Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the "observe/orient" steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don't necessarily know that it's failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don't know a problem is there, or don't orient toward the right thing, and therefore we don't iterate on the problem.

What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:

  • There is some widely-believed falsehood. The generative model might "know" the truth, from having trained on plenty of papers by actual experts, but the raters don't know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/Deepmind/Anthropic do not employ experts in most of the world's subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
  • There is some even-more-widely-believed falsehood, such that even the so-called "experts" haven't figured out yet that it's false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
  • Neither raters nor developers have time to check the models' citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
  • On various kinds of "which option should I pick" questions, there's an option which results in marginally more slave labor, or factory farming, or what have you - terrible things which a user might strongly prefer to avoid, but it's extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don't reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
    • This is the sort of problem which, in the high-capability regime, especially leads to "Potemkin village world".
  • On various kinds of "which option should I pick" questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don't even attempt to test which options are best long-term, they just read the LLM's response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
  • On various kinds of "which option should I pick" questions, there are things which sound fun or are marketed as fun, but which humans mostly don't actually enjoy (or don't enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)

... and so forth. The unifying theme here is that when these failures occur, it is not obvious that they've occurred.

This makes empirical study tricky - not impossible, but it's easy to be mislead by experimental procedures which don't actually measure the relevant things. For instance, your summary of the Stiennon et al paper just now:

They varied the size of the KL penalty of an LLM RLHF'd for a summarization task, and found about what you'd expect from the vague handwaving: as the KL penalty decreases, the reward model's predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve...

(Bolding mine.) As you say, one could spin that as demonstrating "yet another portent of our impending deaths", but really this paper just isn't measuring the most relevant things in the first place. It's still using human ratings as the evaluation mechanism, so it's not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.

So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?

(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)

.

.

.

So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that's "better than" the labels used for training.

OpenAI's weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a "ground truth" which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven't put that much care into it.)

More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most "coming decoupled from reality" problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That's what tends to happen, in fields where people don't have a deep understanding of the systems they're working with.

And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that's just the experimentalist themselves - this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists' interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure - John Boyd, the guy who introduced the term "OODA loop", talked a lot about that sort of failure.)

To summarize the main points here:

  • Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (... and then things are hard.)
  • AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
  • So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.

Now, I would guess that your next question is: "But how does that lead to extinction?". That is one of the steps which has been least well-explained historically; someone with your "unexpectedly low polygenic scores" can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you... <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?

But I'll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.

Comment by johnswentworth on Monthly Roundup #15: February 2024 · 2024-02-21T05:15:47.633Z · LW · GW

Is the Waldo picture at the end supposed to be Holden, or is that accidental?