Another class of applications which we discussed at the retreat: person 1 takes the amnesic, person 2 shares private information on them, and then person 1 gives their reaction to the private information. Can be used e.g. for complex negotiations: maybe it is in our mutual best interest to make some deal, but in order for me to know that I'd need some information which you don't want to share with me, so I take the drug, you share the information, and I make some verified record of myself saying "dear future self, you should in fact take this deal".
... which is cool in theory but I would guess not of high immediate value in practice, which is why the post didn't focus on it.
I would love to hear suggestions for other things I could try. If you have any, let me know in a comment!
Do you know what the drug was which did this?
Nitpick: you're talking about the discovery of the structure of DNA; it was already known at that time to be the particle which mediates inheritance IIRC.
I buy this argument.
I don't buy mathematical equivalence as an argument against, in this case, since the whole point of the path integral formulation is that it's mathematically equivalent but far simpler conceptually and computationally.
Man, that top one was a mess. Fixed now, thank you!
Here are some candidates from Claude and Gemini (Claude Opus seemed considerably better than Gemini Pro for this task). Unfortunately they are quite unreliable: I've already removed many examples from this list which I already knew to have multiple independent discoverers (like e.g. CRISPR and general relativity). If you're familiar with the history of any of these enough to say that they clearly were/weren't very counterfactual, please leave a comment.
- Noether's Theorem
- Mendel's Laws of Inheritance
- Gödel's First Incompleteness Theorem (Claude mentions Von Neumann as an independent discoverer for the Second Incompleteness Theorem)
- Feynman's path integral formulation of quantum mechanics
- Onnes' discovery of superconductivity
- Pauling's discovery of the alpha helix structure in proteins
- McClintock's work on transposons
- Observation of the cosmic microwave background
- Lorenz's work on deterministic chaos
- Prusiner's discovery of prions
- Yamanaka factors for inducing pluripotency
- Langmuir's adsorption isotherm (I have no idea what this is)
I somehow missed that John Wentworth and David Lorell are also in the middle of a sequence on this same topic here.
Yeah, uh... hopefully nobody's holding their breath waiting for the rest of that sequence. That was the original motivator, but we only wrote the one post and don't have any more in development yet.
Point is: please do write a good stat mech sequence; David and I are not really "on that ball" at the moment.
(Didn't read most of the dialogue, sorry if this was covered.)
But the way transformers work is they greedily think about the very next token, and predict that one, even if by conditioning on it you shot yourself in the foot for the task at hand.
That depends on how we sample from the LLM. If, at each "timestep", we take the most-probable token, then yes that's right.
But an LLM gives a distribution over tokens at each timestep, i.e. $P(x_t \mid x_1, \dots, x_{t-1})$. If we sample from that distribution, rather than take the most-probable at each timestep, then that's equivalent to sampling non-greedily from the learned distribution over text. It's the chain rule:

$$P(x_1, \dots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})$$
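A toy numerical check of that identity (the conditionals and names here are made up for the demo, not from any actual LLM): sampling token-by-token from the per-step distributions reproduces the joint distribution given by the chain rule.

```python
import numpy as np
from itertools import product
from collections import Counter

rng = np.random.default_rng(0)
VOCAB = [0, 1]

def p_next(prefix):
    """Stand-in for an LLM's per-token distribution P(x_t | x_1..x_{t-1})."""
    p1 = 0.9 if (prefix and prefix[-1] == 1) else 0.3  # arbitrary toy conditionals
    return np.array([1 - p1, p1])

def sample_sequence(n):
    """Sample each token from the conditional (non-greedy sampling)."""
    seq = []
    for _ in range(n):
        seq.append(int(rng.choice(VOCAB, p=p_next(seq))))
    return tuple(seq)

def joint_prob(seq):
    """Chain rule: P(x_1..x_n) = prod_t P(x_t | x_<t)."""
    p, prefix = 1.0, []
    for x in seq:
        p *= p_next(prefix)[x]
        prefix.append(x)
    return p

counts = Counter(sample_sequence(3) for _ in range(100_000))
for seq in product(VOCAB, repeat=3):
    print(seq, round(joint_prob(seq), 4), counts[seq] / 100_000)  # match, up to sampling noise
```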
Writing collaboratively is definitely something David and I have been trying to figure out how to do productively.
How sure are we that models will keep tracking Bayesian belief states, and so allow this inverse reasoning to be used, when they don't have enough space and compute to actually track a distribution over latent states?
One obvious guess there would be that the factorization structure is exploited, e.g. independence and especially conditional independence/DAG structure. And then a big question is how distributions of conditionally independent latents in particular end up embedded.
Yup, that was it, thank you!
We're now working through understanding all the pieces of this, and we've calculated an MSP which doesn't quite look like the one in the post:
(Ignore the skew, David's still fiddling with the projection into 2D. The important noticeable part is the absence of "overlap" between the three copies of the main shape, compared to the fractal from the post.)
Specifically, each point in that visual corresponds to a distribution $P(S_t \mid x_{1..t})$ for some value of the observed symbols $x_{1..t}$. The image itself is of the points on the probability simplex. From looking at a couple of Crutchfield papers, it sounds like that's what the MSP is supposed to be.
The update equations are:

$$P(S_{t+1} \mid x_{1..t}) = \sum_{S_t} P(S_{t+1} \mid S_t)\, P(S_t \mid x_{1..t})$$

$$P(S_{t+1} \mid x_{1..t+1}) = \frac{1}{Z}\, P(x_{t+1} \mid S_{t+1})\, P(S_{t+1} \mid x_{1..t})$$

with $P(S_{t+1} \mid S_t)$ given by the transition probabilities, $P(x_{t+1} \mid S_{t+1})$ given by the observation probabilities, and $Z$ a normalizer. We generate the image above by initializing some random distribution $P(S_0)$, then iterating the equations and plotting each point.
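In code, a minimal version of that iteration looks something like this (a sketch with random parameters; notation mine, not necessarily our actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs = 3, 2
T = rng.dirichlet(np.ones(n_states), size=n_states)  # T[i, j] = P(S'=j | S=i)
O = rng.dirichlet(np.ones(n_obs), size=n_states)     # O[j, x] = P(x | S'=j)

def update(belief, x):
    """P(S_{t+1} | x_{1..t+1}) from P(S_t | x_{1..t}) and the new symbol x."""
    pred = belief @ T          # transition step: P(S_{t+1} | x_{1..t})
    post = pred * O[:, x]      # times the observation probability P(x | S_{t+1})
    return post / post.sum()   # post.sum() is the normalizer Z

belief = rng.dirichlet(np.ones(n_states))  # random initial distribution P(S_0)
points = []
for _ in range(10_000):
    p_x = (belief @ T) @ O        # P(next symbol | symbols so far)
    x = rng.choice(n_obs, p=p_x)  # sample the next symbol from the process
    belief = update(belief, x)
    points.append(belief)         # each belief is a point on the probability simplex
```

(To trace out the full MSP rather than one sampled path, one would enumerate all observation sequences instead of sampling.)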
Off the top of your head, any idea what might account for the mismatch (other than a bug in our code, which we're already checking)? Are we calculating the right thing, i.e. values of $P(S_t \mid x_{1..t})$? Are the transition and observation probabilities from the graphic in the post the same parameters used to generate the fractal? Is there some thing which people always forget to account for when calculating these things?
Can you elaborate on how the fractal is an artifact of how the data is visualized?
I don't know the details of the MSP, but my current understanding is that it's a general way of representing stochastic processes, and the MSP representation typically looks quite fractal. If we take two approximately-the-same stochastic processes, then they'll produce visually-similar fractals.
But the "fractal-ness" is mostly an artifact of the MSP as a representation-method IIUC; the stochastic process itself is not especially "naturally fractal".
(As I said I don't know the details of the MSP very well; my intuition here is instead coming from some background knowledge of where fractals which look like those often come from, specifically chaos games.)
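(For concreteness, here's the classic chaos-game example, the Sierpinski gasket; this is just an illustration of the general mechanism, not anything MSP-specific:)

```python
import numpy as np

# Chaos game: repeatedly jump halfway toward a randomly chosen vertex of a
# triangle. The visited points trace out the Sierpinski gasket.
rng = np.random.default_rng(0)
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
point = rng.random(2)
points = []
for _ in range(20_000):
    point = (point + vertices[rng.integers(3)]) / 2  # one randomly-chosen contraction map
    points.append(point)
# Scatter-plotting `points` shows the fractal: iterating a few contracting maps
# at random generically produces images like these.
```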
That there is a linear 2d plane in the residual stream such that projecting onto it yields that same fractal seems highly non-artifactual, and is what we were testing.
A thing which is highly cruxy for me here, which I did not fully understand from the post: what exactly is the function which produces the fractal visual from the residual activations? My best guess from reading the post was that the activations are linearly regressed onto some kind of distribution, and then the distributions are represented in a particular way which makes smooth sets of distributions look fractal. If there's literally a linear projection of the residual stream into two dimensions which directly produces that fractal, with no further processing/transformation in between "linear projection" and "fractal", then I would change my mind about the fractal structure being mostly an artifact of the visualization method.
[EDIT: I no longer endorse this response, see thread.]
(This comment is mainly for people other than the authors.)
If your reaction to this post is "hot damn, look at that graph", then I think you should probably dial back your excitement somewhat. IIUC the fractal structure is largely an artifact of how the data is visualized, which means the results visually look more striking than they really are.
It is still a cool piece of work, and the visuals are beautiful. The correct amount of excitement is greater than zero.
Yup. Also, I'd add that entropy in this formulation increases exactly when more than one macrostate at time $t$ maps to the same actually-realized macrostate at time $t+1$, i.e. when the macrostate evolution is not time-reversible.
This post was very specifically about a Boltzmann-style approach. I'd also generally consider the Gibbs/Shannon formula to be the "real" definition of entropy, and usually think of Boltzmann as the special case where the microstate distribution is constrained uniform. But a big point of this post was to be like "look, we can get a surprising amount (though not all) of thermo/stat mech without bringing in any actual statistics, just restricting ourselves to the Boltzmann notion of entropy".
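(For concreteness, the reduction is just: plug a uniform distribution over $W$ microstates into the Gibbs/Shannon formula,

$$S = -\sum_{i=1}^{W} p_i \log p_i = -\sum_{i=1}^{W} \frac{1}{W} \log \frac{1}{W} = \log W,$$

which is the Boltzmann entropy, up to the $k_B$ factor.)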
Meta: this comment is decidedly negative feedback, so needs the standard disclaimers. I don't know Ethan well, but I don't harbor any particular ill-will towards him. This comment is negative feedback about Ethan's skill in choosing projects in particular; I do not think others should mimic him in that department, but that does not mean that I think he's a bad person/researcher in general. I leave the comment mainly for the benefit of people who are not Ethan, so for Ethan: I am sorry for being not-nice to you here.
When I read the title, my first thought was "man, Ethan Perez sure is not someone I'd point to as an exemplar of choosing good projects".
On reading the relevant section of the post, it sounds like Ethan's project-selection method is basically "forward-chain from what seems quick and easy, and also pay attention to whatever other people talk about". Which indeed sounds like a recipe for very mediocre projects: it's the sort of thing you'd expect a priori to reliably produce publications and be talked about, but have basically-zero counterfactual impact. These are the sorts of projects where someone else would likely have done something similar regardless, and it's not likely to change how people are thinking about things or building things; it's just generally going to add marginal effort to the prevailing milieu, whatever that might be.
From reading, I imagined a memory+cache structure instead of being closer to "cache all the way down".
Note that the things being cached are not things stored in memory elsewhere. Rather, they're (supposedly) outputs of costly-to-compute functions - e.g. the instrumental value of something would be costly to compute directly from our terminal goals and world model. And most of the values in cache are computed from other cached values, rather than "from scratch" - e.g. the instrumental value of X might be computed (and then cached) from the already-cached instrumental values of some stuff which X costs/provides.
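As a toy illustration of that structure (names and numbers here are entirely made up):

```python
# Instrumental values computed from other cached instrumental values, rather
# than looked up from some separate ground-truth store.
cache = {"money": 1.0, "free_time": 0.8}  # previously-computed (cached) values

def instrumental_value(thing, provides=(), costs=()):
    """Costly to compute from scratch in general; cheap given cached inputs."""
    if thing not in cache:
        cache[thing] = (sum(instrumental_value(p) for p in provides)
                        - sum(instrumental_value(c) for c in costs))
    return cache[thing]

instrumental_value("job", provides=["money"], costs=["free_time"])  # 0.2, now cached
```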
Coherence of Caches and Agents goes into more detail on that part of the picture, if you're interested.
Very far through the graph representing the causal model, where we start from one or a few nodes representing the immediate observations.
You were talking about values and preferences in the previous paragraph, then suddenly switched to “beliefs”. Was that deliberate?
Yes.
... man, now that the post has been downvoted a bunch I feel bad for leaving such a snarky answer. It's a perfectly reasonable question, folks!
Overcompressed actual answer: core pieces of a standard doom-argument involve things like "killing all the humans will be very easy for a moderately-generally-smarter-than-human AI" and "killing all the humans (either as a subgoal or a side-effect of other things) is convergently instrumentally useful for the vast majority of terminal objectives". A standard doom counterargument usually doesn't dispute those two pieces (though there are of course exceptions); a standard doom counterargument usually argues that we'll have ample opportunity to iterate, and therefore it doesn't matter that the vast majority of terminal objectives instrumentally incentivize killing humans, we'll iterate until we find ways to avoid that sort of thing.
The standard core disagreement is then mostly about the extent to which we'll be able to iterate, or will in fact iterate in ways which actually help. In particular, cruxy subquestions tend to include:
- How visible will "bad behavior" be early on? Will there be "warning shots"? Will we have ways to detect unwanted internal structures?
- How sharply/suddenly will capabilities increase?
- Insofar as problems are visible, will labs and/or governments actually respond in useful ways?
Militarization isn't very centrally relevant to any of these; it's mostly relevant to things which are mostly not in doubt anyways, at least in the medium-to-long term.
Yes, I mean "mole" as in the unit from chemistry. I used it because I found it amusing.
Every algorithmic improvement is a one-time boost.
It doesn't.
Here's what it would typically look like in a control theory problem.
There's a long-term utility $u_{\text{long}}$ which is a function of the final state $x_T$, and a short-term utility $u_t$ which is a function of time $t$, the state $x_t$ at time $t$, and the action $a_t$ at time $t$. (Often the problem is formulated with a discount rate $\gamma$, but in this case we're allowing time-dependent short-term utility, so we can just absorb the discount rate into $u_t$.) The objective is then to maximize

$$\mathbb{E}\left[\sum_{t=0}^{T-1} u_t(x_t, a_t) + u_{\text{long}}(x_T)\right]$$

In that case, the value function $V_t$ is a max over trajectories starting at $x_t$:

$$V_t(x_t) = \max_{a_t, \dots, a_{T-1}} \mathbb{E}\left[\sum_{t'=t}^{T-1} u_{t'}(x_{t'}, a_{t'}) + u_{\text{long}}(x_T) \,\middle|\, x_t\right]$$

which satisfies the Bellman recursion $V_t(x_t) = \max_{a_t}\left(u_t(x_t, a_t) + \mathbb{E}[V_{t+1}(x_{t+1}) \mid x_t, a_t]\right)$.

The key thing to notice is that we can solve that equation for $u_t$:

$$u_t(x_t, a_t) = V_t(x_t) - \mathbb{E}[V_{t+1}(x_{t+1}) \mid x_t, a_t]$$

(with this choice, every action attains the max, so the Bellman recursion recovers $V_t$ exactly). So given an arbitrary value function $V$, we can find a short-term utility function $u$ which produces that value function, by using that equation to compute $u_t$ starting from the last timestep and working backwards.
Thus the claim from the post: for any value function, there exists a short-term utility function which induces that value function.
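A minimal numerical sketch of that backward construction (setup, names, and random values all mine): given known transitions and a completely arbitrary value function, the induced $u_t$ makes the Bellman equation hold exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 4, 3, 5
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
V = rng.normal(size=(T + 1, n_states))  # an arbitrary value function V[t, s]

# u_t(s, a) = V_t(s) - E[V_{t+1}(s') | s, a], computed backwards from t = T-1
u = np.empty((T, n_states, n_actions))
for t in reversed(range(T)):
    EV = P @ V[t + 1]          # EV[s, a] = sum_s' P[s, a, s'] * V[t+1, s']
    u[t] = V[t][:, None] - EV

# Check: every action attains the Bellman max, which equals V_t
for t in range(T):
    assert np.allclose((u[t] + P @ V[t + 1]).max(axis=1), V[t])
```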
What if we restrict to only consider long-term utility, i.e. set $u_t = 0$? Well, then the value function is no longer so arbitrary. That's the case considered in the post, where we have constraints which the value function must satisfy regardless of $u_{\text{long}}$.
Did that clarify?
On the matter of software improvements potentially available during recursive self-improvement, we can look at the current pace of algorithmic improvement, which has probably been faster than scaling for some time now. So that's another lower bound on what AI will be capable of, assuming that the extrapolation holds up.
This is definitely a split which I think underlies a lot of differing intuitions about AGI and timelines. That said, the versions of each which are compatible with evidence/constraints generally have similar implications for at least the basics of AI risk (though they differ in predictions about what AI looks like "later on", once it's already far past eclipsing the capabilities of the human species).
Key relevant evidence/constraints, under my usual framing:
- We live in a very high dimensional environment. When doing science/optimization in such an environment, brute-force search is exponentially intractable, so having e.g. ten billion humans running the same basic brute-force algorithm will not be qualitatively better than one human running a brute-force algorithm. The fact that less-than-exponentially-large numbers of humans are able to perform as well as we do implies that there's some real "general intelligence" going on in there somewhere.
- That said, it's still possible-in-principle for whatever general intelligence we have to be importantly distributed across humans. What the dimensionality argument rules out is a model in which humans' capabilities are just about brute-force trying lots of stuff, and then memetic spread of whatever works. The "trying stuff" step has to be doing "most of the work", in some sense, of finding good models/techniques/etc; but whatever process is doing that work could itself be load-bearingly spread across humans.
- Also, memetic spread could still be a bottleneck in practice, even if it's not "doing most of the work" in an algorithmic sense.
- A lower bound for what AI can do is "run lots of human-equivalent minds, and cheaply copy them". Even under a model where memetic spread is the main bottlenecking step for humans, AI will still be ridiculously better at that. You know that problem humans have where we spend tons of effort accumulating "tacit knowledge" which is hard to convey to the next generation? For AI, cheap copy means that problem is just completely gone.
- Humans' own historical progress/experience puts an upper bound on how hard it is to solve novel problems (not solved by society today). Humans have done... rather ridiculously a lot of that, over the past 250 years. That, in turn, lower bounds what AIs will be capable of.
Only if they both predictably painted that part purple, e.g. as part of the overall plan. If they both randomly happened to paint the same part purple, then no.
The main model I know of under which this matters much right now is: we're pretty close to AGI already, it's mostly a matter of figuring out the right scaffolding. Open-sourcing weights makes it a lot cheaper and easier for far more people to experiment with different scaffolding, thereby bringing AGI significantly closer in expectation. (As an example of someone who IIUC sees this as the mainline, I'd point to Connor Leahy.)
Sounds like I've maybe not communicated the thing about circularity. I'll try again, it would be useful to let me know whether or not this new explanation matches what you were already picturing from the previous one.
Let's think about circular definitions in terms of equations for a moment. We'll have two equations: one which "defines" $x$ in terms of $y$, and one which "defines" $y$ in terms of $x$:

$$x = f(y)$$

$$y = g(x)$$

Now, if $g = f^{-1}$, then (I claim) that's what we normally think of as a "circular definition". It's "pretending" to fully specify $x$ and $y$, but in fact it doesn't, because one of the two equations is just a copy of the other equation but written differently. The practical problem, in this case, is that $x$ and $y$ are very underspecified by the supposed joint "definition".

But now suppose $g$ is not $f^{-1}$, and more generally the equations are not degenerate. Then our two equations are typically totally fine and useful, and indeed we use equations like this all the time in the sciences and they work great. Even though they're written in a "circular" way, they're substantively non-circular. (They might still allow for multiple solutions, but the solutions will typically at least be locally unique, so there's a discrete and typically relatively small set of solutions.)
That's the sort of thing which clustering algorithms do: they have some equations "defining" cluster-membership in terms of the data points and cluster parameters, and equations "defining" the cluster parameters in terms of the data points and the cluster-membership:
cluster_membership = $f$(data, cluster_params)

cluster_params = $g$(data, cluster_membership)

... where $f$ and $g$ are different (i.e. non-degenerate; $g$ is not just $f^{-1}$ with data held constant). Together, these "definitions" specify a discrete and typically relatively small set of candidate (cluster_membership, cluster_params) values given some data.
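K-means is the cleanest concrete instance (a minimal sketch, with $f$ = nearest-center assignment and $g$ = per-cluster mean; toy data mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two blobs, centered near (-3, -3) and (3, 3)
data = rng.normal(size=(200, 2)) + rng.choice([-3.0, 3.0], size=(200, 1))
k = 2
cluster_params = data[rng.choice(len(data), size=k, replace=False)]  # initial centers

for _ in range(20):
    # cluster_membership = f(data, cluster_params): assign each point to nearest center
    dists = np.linalg.norm(data[:, None, :] - cluster_params[None, :, :], axis=-1)
    cluster_membership = dists.argmin(axis=1)
    # cluster_params = g(data, cluster_membership): recompute each center as the mean
    cluster_params = np.stack([
        data[cluster_membership == j].mean(axis=0)
        if np.any(cluster_membership == j) else cluster_params[j]  # guard empty clusters
        for j in range(k)])
# At a fixed point both "definitions" hold simultaneously. f and g are genuinely
# different functions, so the pair is substantively non-circular, and only a small
# discrete set of (membership, params) pairs satisfies both.
```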
That, I claim, is also part of what's going on with abstractions like "dog".
(Now, choice of axes is still a separate degree of freedom which has to be handled somehow. And that's where I expect the robustness to choice of axes does load-bearing work. As you say, that's separate from the circularity issue.)
As I mentioned at the end, it's not particularly relevant to my own models either way, so I don't particularly care. But I do think other people should want to run this experiment, based on their stated models.
That's only true if the Bellman equation in question allows for a "current payoff" at every timestep. That's the term which allows for totally arbitrary value functions, and not-coincidentally it's the term which does not reflect long-range goals/planning, just immediate payoff.
If we're interested in long-range goals/planning, then the natural thing to do is check how consistent the policy is with a Bellman equation without a payoff at each timestep - i.e. a value function just backpropagated from some goal at a much later time. That's what would make the check nontrivial: there exist policies which are not consistent with any assignment of values satisfying that Bellman equation. For example, the policy which chooses to transition from state A -> B with probability 1 over the option to stay at A with probability 1 (implying value B > value A for any values consistent with that policy), but also chooses to transition B -> A with probability 1 over the option to stay at B with probability 1 (implying value A > value B for any values consistent with that policy).
(There's still the trivial case where indifference could be interpreted as compatible with any policy, but that's easy to handle by adding a nontriviality requirement.)
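A minimal sketch of that check (representation mine): encode each revealed strict preference as a constraint "value of destination > value of source", and ask whether any value assignment satisfies all the constraints.

```python
import itertools

def consistent_values_exist(strict_prefs, states):
    """strict_prefs: pairs (s, s') meaning the policy reveals value(s') > value(s).
    Brute-force over orderings; fine for tiny state spaces."""
    for perm in itertools.permutations(states):
        rank = {s: i for i, s in enumerate(perm)}
        if all(rank[b] > rank[a] for a, b in strict_prefs):
            return True
    return False

print(consistent_values_exist([("A", "B")], ["A", "B"]))               # True
print(consistent_values_exist([("A", "B"), ("B", "A")], ["A", "B"]))   # False: the cycle above
```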
I don't usually think about RL on MDPs, but it's an unusually easy setting in which to talk about coherence and its relationship to long-term-planning/goal-seeking/power-seeking.
Simplest starting point: suppose we're doing RL to learn a value function (i.e. mapping from states to values, or mapping from states x actions to values, whatever your preferred setup), with transition probabilities known. Well, in terms of optimal behavior, we know that the optimal value function for any objective in the far future will locally obey the Bellman equation with zero payoff in the immediate timestep: value of this state is equal to the max over actions of expected next-state value under that action. So insofar as we're interested in long-term goals specifically, there's an easy local check for the extent to which the value function "optimizes for" such long-term goals: just check how well it locally satisfies that Bellman equation.
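Concretely, that local check might look like this (a sketch with made-up transition probabilities):

```python
import numpy as np

def bellman_residual(V, P):
    """Deviation of V from the zero-immediate-payoff Bellman equation
    V(s) = max_a E[V(s') | s, a], with known transitions P[s, a, s']."""
    EV = P @ V                 # EV[s, a] = sum_s' P[s, a, s'] * V[s']
    return V - EV.max(axis=1)  # zero wherever the equation holds exactly

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
V = rng.normal(size=n_states)
print(bellman_residual(V, P))  # large entries flag states where V isn't
                               # "optimizing for" any long-term goal
```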
From there, we can extend to gradually more complicated cases in ways which look similar to typical coherence theorems (like e.g. Dutch Book theorems). For instance, we could relax the requirement of known probabilities: we can ask whether there is any assignment of state-transition probabilities such that the values satisfy the Bellman equation.
As another example, if we're doing RL on a policy rather than value function, we can ask whether there exists any value function consistent with the policy such that the values satisfy the Bellman equation.
So that example SWE bench problem from the post:
... is that a prototypical problem from that benchmark? Because if so, that is a hilariously easy benchmark. Like, something could ace that task and still be coding at less than a CS 101 level.
(Though to be clear, people have repeatedly told me that a surprisingly high fraction of applicants for programming jobs can't do fizzbuzz, so even a very low level of competence would still put it above many would-be software engineers.)
Fixed, thanks.
Yeah, that's right.
The secret handshake is to start with "$X_1$ is independent of $\Lambda$ given $X_2$" and "$X_2$ is independent of $\Lambda$ given $X_1$", expressed in this particular form:

$$P(\Lambda \mid X_1, X_2) = P(\Lambda \mid X_2)$$

$$P(\Lambda \mid X_1, X_2) = P(\Lambda \mid X_1)$$

... then we immediately see that $P(\Lambda \mid X_1 = x_1) = P(\Lambda \mid X_2 = x_2)$ for all $(x_1, x_2)$ such that $P(x_1, x_2) \neq 0$.

So if there are no zero probabilities, then $P(\Lambda \mid X_1 = x_1) = P(\Lambda \mid X_2 = x_2)$ for all $(x_1, x_2)$.

That, in turn, implies that $P(\Lambda \mid X_1 = x_1)$ takes on the same value for all $x_1$, which in turn means that it's equal to $P(\Lambda)$. Thus $\Lambda$ and $X_1$ are independent. Likewise for $\Lambda$ and $X_2$. Finally, we leverage independence of $X_1$ and $X_2$ given $\Lambda$:

$$P(X_1, X_2) = \sum_{\Lambda} P(\Lambda)\, P(X_1 \mid \Lambda)\, P(X_2 \mid \Lambda) = \sum_{\Lambda} P(\Lambda)\, P(X_1)\, P(X_2) = P(X_1)\, P(X_2)$$
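A quick numerical companion to the argument (all setup mine): a checker for the two conditions in the "particular form" above, plus a full-independence check. A fully independent, full-support joint passes both conditions; by the argument above, any full-support joint passing both conditions must also pass the independence check.

```python
import numpy as np

rng = np.random.default_rng(0)

def redundancy_holds(P, tol=1e-9):
    """P has shape (lam, x1, x2). Checks P(lam|x1,x2) = P(lam|x1) = P(lam|x2)."""
    P_l_x1x2 = P / P.sum(axis=0, keepdims=True)
    P_lx1 = P.sum(axis=2)
    P_l_x1 = P_lx1 / P_lx1.sum(axis=0, keepdims=True)
    P_lx2 = P.sum(axis=1)
    P_l_x2 = P_lx2 / P_lx2.sum(axis=0, keepdims=True)
    return (np.allclose(P_l_x1x2, P_l_x1[:, :, None], atol=tol)       # lam indep of X2 given X1
            and np.allclose(P_l_x1x2, P_l_x2[:, None, :], atol=tol))  # lam indep of X1 given X2

def fully_independent(P, tol=1e-9):
    p_l, p_1, p_2 = P.sum(axis=(1, 2)), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))
    return np.allclose(P, np.einsum('a,b,c->abc', p_l, p_1, p_2), atol=tol)

# A full-support, fully-independent joint satisfies both conditions...
P = np.einsum('a,b,c->abc', rng.dirichlet(np.ones(3)),
              rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(5)))
assert redundancy_holds(P) and fully_independent(P)

# ...while a generic full-support joint satisfies neither condition.
Q = rng.dirichlet(np.ones(3 * 4 * 5)).reshape(3, 4, 5)
assert not redundancy_holds(Q)
```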
(A similar argument is in the middle of this post, along with a helpful-to-me visual.)
Roughly speaking, all variables completely independent is the only way to satisfy all the preconditions without zero-ish probabilities.
This is easiest to see if we use a "strong invariance" condition, in which each of the $X_i$ must mediate between $\Lambda$ and the rest of $X$. Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure ($\Lambda$) from any little spatially-localized chunk of the gas ($X_i$). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn't imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e. $\Lambda$ only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.
I agree with this point as stated, but think the probability is more like 5% than 0.1%.
Same.
I do think our chances look not-great overall, but most of my doom-probability is on things which don't look like LLMs scheming.
Also, are you making sure to condition on "scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"?
That's not particularly cruxy for me either way.
Separately, I'm uncertain whether the training procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF".
Fair. Insofar as "scaling up networks, running pretraining + RL" does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.
Solid post!
I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer), and I think this is the best write-up of it I've seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.
Yup.
[EDIT April 5: I do not currently "have the ball" on this, so to anybody reading this who would go test it themselves if-and-only-if they don't see somebody else already on it: I am not on it.]
Mind sharing a more complete description of the things you tried? Like, the sort of description which one could use to replicate the experiment?
What was your old job?
Did you see the footnote I wrote on this? I give a further argument for it.
Ah yeah, I indeed missed that the first time through. I'd still say I don't buy it, but that's a more complicated discussion, and it is at least a decent argument.
I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I'm open to hearing it.
This is another place where I'd say we don't understand it well enough to give a good formal definition or operationalization yet.
Though I'd note here, and also above w.r.t. search, that "we don't know how to give a good formal definition yet" is very different from "there is no good formal definition" or "the underlying intuitive concept is confused" or "we can't effectively study the concept at all" or "arguments which rely on this concept are necessarily wrong/uninformative". Every scientific field was pre-formal/pre-paradigmatic once.
To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field's ability to correctly diagnose bullshit.
That said, I don't think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.
Man, that is one hell of a bullet to bite. Much kudos for intellectual bravery and chutzpah!
That might be a fun topic for a longer discussion at some point, though not right now.
I would like to see a much more rigorous definition of "search" and why search would actually be "compressive" in the relevant sense for NN inductive biases. My current take is something like "a lot of the references to internal search on LW are just incoherent" and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward "search" in ways that are totally benign.
More generally, I'm quite skeptical of the jump from any mechanistic notion of search, and the kind of grabby consequentialism that people tend to be worried about. I suspect there's a double dissociation between these things, where "mechanistic search" is almost always benign, and grabby consequentialism need not be backed by mechanistic search.
Some notes on this:
- I don't think general-purpose search is sufficiently well-understood yet to give a rigorous mechanistic definition. (Well, unless one just gives a very wrong definition.)
- Likewise, I don't think we understand either search or NN biases well enough yet to make a formal compression argument. Indeed, that sounds like a roughly-agent-foundations-complete problem.
- I'm pretty skeptical that internal general-purpose search is compressive in current architectures. (And this is one reason why I expect most AI x-risk to come from importantly-different future architectures.) Low confidence, though.
- Also, current architectures do have at least some "externalized" general-purpose search capabilities, insofar as they can mimic the "unrolled" search process of a human or group of humans thinking out loud. That general-purpose search process is basically AgentGPT. Notably, it doesn't work very well to date.
- Insofar as I need a working not-very-formal definition of general-purpose search, I usually use a behavioral definition: a system which can take in a representation of a problem in some fairly-broad class of problems (typically in a ~fixed environment), and solve it.
- The argument that a system which satisfies that behavioral definition will tend to also have an "explicit search-architecture", in some sense, comes from the recursive nature of problems. E.g. humans solve large novel problems by breaking them into subproblems, and then doing their general-purpose search/problem-solving on the subproblems; that's an explicit search architecture.
- I definitely agree that grabby consequentialism need not be backed by mechanistic search. More skeptical of the claim that mechanistic search is usually benign, at least if by "mechanistic search" we mean general-purpose search (though I'd agree with a version of this which talks about a weaker notion of "search").
Also, one maybe relevant deeper point, since you seem familiar with some of the philosophical literature: IIUC the most popular way philosophers ground semantics is in the role played by some symbol/signal in the evolutionary environment. I view this approach as a sort of placeholder: it's definitely not the "right" way to ground semantics, but philosophy as a field is using it as a stand-in until people work out better models of grounding (regardless of whether the philosophers themselves know that they're doing so). This is potentially relevant to the "representation of a problem" part of general-purpose search.
I'm curious which parts of the Goal Realism section you find "philosophically confused," because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.
(I'll briefly comment on each section, feel free to double-click.)
Against Goal Realism: Huemer... indeed seems confused about all sorts of things, and I wouldn't consider either the "goal realism" or "goal reductionism" picture solid grounds for use of an indifference principle (not sure if we agree on that?). Separately, "reductionism as a general philosophical thesis" does not imply the thing you call "goal reductionism" - for instance one could reduce "goals" to some internal mechanistic thing, rather than thinking about "goals" behaviorally, and that would be just as valid for the general philosophical/scientific project of reductionism. (Not that I necessarily think that's the right way to do it.)
Goal Slots Are Expensive: just because it's "generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules" doesn't mean the end-to-end trained system will turn out non-modular. Biological organisms were trained end-to-end by evolution, yet they ended up very modular.
Inner Goals Would Be Irrelevant: I think the point this section was trying to make is something I'd classify as a pointer problem? I.e. the internal symbolic "goal" does not necessarily neatly correspond to anything in the environment at all. If that was the point, then I'm basically on-board, though I would mention that I'd expect evolution/SGD/cultural evolution/within-lifetime learning/etc to drive the internal symbolic "goal" to roughly match natural structures in the world. (Where "natural structures" cashes out in terms of natural latents, but that's a whole other conversation.)
Goal Realism Is Anti-Darwinian: Fodor obviously is deeply confused, but I think you've misdiagnosed what he's confused about. "The physical world has no room for goals with precise contents" is somewhere between wrong and a non sequitur, depending on how we interpret the claim. "The problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter" is correct, but very incomplete as a response to Fodor.
Goal Reductionism Is Powerful: While most of this section sounds basically-correct as written, the last few sentences seem to be basically arguing for behaviorism for LLMs. There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.
This isn't a proper response to the post, but since I've occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:
- This post basically-correctly refutes a kinda-mediocre (though relatively-commonly-presented) version of the counting argument.
- There does exist a version of the counting argument which basically works.
- The version which works routes through compression and/or singular learning theory.
- In particular, that version would talk about "goal-slots" (i.e. general-purpose search) showing up for exactly the same reasons that neural networks are able to generalize in the overparameterized regime more generally. In other words, if you take the "counting argument for overfitting" from the post, walk through the standard singular-learning-theory-style response to that story, and then translate that response over to general-purpose search as a specific instance of compression, then you basically get the good version of the counting argument.
- Just remembered I walked through basically the good version of the counting argument in this section of What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?
- The "Against Goal Realism" section is a wild mix of basically-correct points and thorough philosophical confusion. I would say the overall point it's making is probably mostly-true of LLMs, false of humans, and most of the arguments are confused enough that they don't provide much direct evidence relevant to either of those.
Pretty decent post overall.
Edited, thanks.