Posts
Comments
(I have not maintained this list in many months, sorry.)
Some casual conversations with strangers that were high instrumental value:
At my first (online) LessWrong Community Weekend in 2020, I happened to chat with Linda Linsefors. That was my first conversation with anyone working in AI Safety. I’d read about the alignment problem for almost a decade at that point and thought it was the most important thing in the world, but I’d never seriously considered working on it. MIRI had made it pretty clear that the field only needed really exceptional theorists, and I didn’t think I was one of those. That conversation with Linda started the process of robbing me of my comfortable delusions on this front. What she said made it seem more like the field was pretty inadequate, and perfectly normal theoretical physicists could maybe help just by applying the standard science playbook for figuring out general laws in a new domain. Horrifying. I didn't really believe it yet, but this conversation was a factor in me trying out AI Safety Camp a bit over a year later.
At my first EAG, I talked to someone who was waiting for the actual event to begin along with me. This turned out to be Vivek Hebbar, who I'd never heard of before. We got to talking about inductive biases of neural networks. We kept chatting about this research area sporadically for a few weeks after the event. Eventually, Vivek called me to talk about the idea that would become this post. Thinking about that idea led to me understanding the connection between basin broadness and representation dimensionality in neural networks, which ultimately resulted in this research. It was probably the most valuable conversation I’ve had at any EAG so far, and it was unplanned.
At my second EAG, someone told me that an idea for comparing NN representations I’d been talking to them about already existed, and was called centred kernel alignment. I don’t quite remember how that conversation started, but I think it might have been a speed friending event.
My first morning in the MATS kitchen area in Berkeley, someone asked me if I’d heard about a thing called Singular Learning Theory. I had not. He went through his spiel on the whiteboard. He didn’t have the explanation down nearly as well back then, but it still very recognisably connected to how I’d been thinking about NN generalisation and basin broadness, so I kept an eye on the area.
I did have a pretty strong expectation of privacy for LW DMs. That was probably dumb of me.
This is not due to any explicit or implicit promise by the mods or the site interface I can recall. I think I was just automatically assuming that strong DM privacy would be a holy principle on a forum with respectable old-school internet culture around anonymity and privacy. This wasn’t really an explicitly considered belief. It just never occurred to me to question this. Just like I assume that doxxing is probably an offence that can result in an instant ban, even though I never actually checked the site guidelines on that.
The site is not responsible for my carelessness on this, but if there was an attention-grabbing box in the DM interface making it clear that mods do look at DMs and DM metadata under some circumstances that fall short of a serious criminal investigation or an apocalypse, I would have appreciated that.
Single datapoint, but: I find outside restrictions on my appearance deeply unpleasant. I avoid basically all events and situations with a mandated dress code when this is at all feasible. So if a solstice has a dress code, I will not be attending it.
Although in contrast to Ramesh et al. (2018) and my work, that paper only considers the Jacobian of a shallow rather than a deep slice.
We also tried using the Jacobians between every layer and the final layer, instead of the Jacobians between adjacent layers. This is what we call "global interaction basis" in the paper. It didn't change the results much.
Seems like some measure of evidence -- maybe large, maybe tiny -- that "We don't know how to give AI values, just to make them imitate values" is false?
I am pessimistic about loss signals getting 1-to-1 internalised as goals or desires in a way that is predictable to us with our current state of knowledge on intelligence and agency, and would indeed tentatively consider this observation a tiny positive update.
I do not find this to be the biggest value-contributor amongst my spontaneous conversations.
I don't have a good hypothesis for why spontaneous-ish conversations can end up being valuable to me so frequently. I have a vague intuition that it might be an expression of the same phenomenon that makes slack and playfulness in research and internet browsing very valuable for me.
The donation site said I should leave a comment here if I donate, so I'm doing that. Gave $200 for now.
I was in Lighthaven for the Illiad conference. It was an excellent space. The LessWrong forum feels like what some people in the 90s used to hope the internet would be.
Edit 23.12.2024: $200 more donated by me since the original message.
There currently doesn't really exist any good way for people who want to contribute to AI existential risk reduction to give money in a way that meaningfully gives them assistance in figuring out what things are good to fund. This is particularly sad since I think there is now a huge amount of interest from funders and philanthropists who want to somehow help with AI x-risk stuff, as progress in capabilities has made work in the space a lot more urgent, but the ecosystem is currently at a particular low-point in terms of trust and ability to direct that funding towards productive ends.
Really? What's the holdup here exactly? How is it still hard to give funders a decent up-to-date guide to the ecosystem, or a knowledgeable contact person, at this stage? For a workable budget version today, can't people just get a link to this and then contact orgs they're interested in?
Two shovel-ready theory projects in interpretability.
Most scientific work isn't "shovel-ready." It's difficult to generate well-defined, self-contained projects where the path forward is clear without extensive background context. In my experience, this is extra true of theory work, where most of the labour is often about figuring out what the project should actually be, because the requirements are unclear or confused.
Nevertheless, I currently have two theory projects related to computation in superposition in my backlog that I think are valuable and that maybe have reasonably clear execution paths. Someone just needs to crunch a bunch of math and write up the results.
Impact story sketch: We now have some very basic theory for how computation in superposition could work[1]. But I think there’s more to do there that could help our understanding. If superposition happens in real models, better theoretical grounding could help us understand what we’re seeing in these models, and how to un-superpose them back into sensible individual circuits and mechanisms we can analyse one at a time. With sufficient understanding, we might even gain some insight into how circuits develop during training.
This post has a framework for compressing lots of small residual MLPs into one big residual MLP. Both projects are about improving this framework.
1) I think the framework can probably be pretty straightforwardly extended to transformers. This would help make the theory more directly applicable to language models. The key thing to show there is how to do superposition in attention. I suspect you can more or less use the same construction the post uses, with individual attention heads now playing the role of neurons. I put maybe two work days into trying this before giving it up in favour of other projects. I didn’t run into any notable barriers, the calculations just proved to be more extensive than I’d hoped they’d be.
2) Improve error terms for circuits in superposition at finite width. The construction in this post is not optimised to be efficient at finite network width. Maybe the lowest hanging fruit for improving it is changing the hyperparameter p, the probability with which we connect a circuit to a set of neurons in the big network. The value we set p to in the post, as a function of the MLP width of the big network and the minimum neuron count per layer a circuit would need without superposition, was pretty arbitrary. We just picked it because it made the proof easier. Recently, Apollo played around a bit with superposing very basic one-feature circuits into a real network, and IIRC a range of p values seemed to work ok. Getting tighter bounds on the error terms as a function of p that are useful at finite width would be helpful here. Then we could better predict how many circuits networks can superpose in real life as a function of their parameter count. If I was tackling this project, I might start by just trying really hard to get a better error formula directly for a while. Just crunch the combinatorics. If that fails, I'd maybe switch to playing more with various choices of p in small toy networks to develop intuition. Maybe plot some scaling laws of performance with p at various network widths in 1-3 very simple settings. Then try to guess a formula from those curves and try to prove it's correct.
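As a starting point for the combinatorics (notation here is mine): if each of T circuits connects to each of the N neurons in a layer independently with probability p, a given circuit lands on about pN neurons in expectation, each of those neurons is shared with about Tp other circuits, and with about kp of the circuits that are actually active on a given forward pass. The bound we want is then roughly a statement about how the interference from those ≈ kp co-active circuits per neuron accumulates over a circuit's ≈ pN neurons, written as an explicit function of p at finite N rather than just asymptotically.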
Another very valuable project is of course to try training models to do computation in superposition instead of hard coding it. But Stefan mentioned that one already.
- ^
1. Boolean computations in superposition LW post. 2. Boolean computations paper version of the LW post, with more worked out but some of the fun stuff removed. 3. Some proofs about information-theoretic limits of comp-sup. 4. General circuits in superposition LW post. If I missed something, a link would be appreciated.
Agreed. I do value methods being architecture independent, but mostly just because of this:
and maybe a sign that a method is principled
At scale, different architectures trained on the same data seem to converge to learning similar algorithms to some extent. I care about decomposing and understanding these algorithms, independent of the architecture they happen to be implemented on. If a mech interp method is formulated in a mostly architecture independent manner, I take that as a weakly promising sign that it's actually finding the structure of the learned algorithm, instead of structure related to the implementation on one particular architecture.
for a large enough (overparameterized) architecture - in other words it can be measured by the
The sentence seems cut off.
Sure. But what’s interesting to me here is the implication that, if you restrict yourself to programs below some maximum length, weighing them uniformly apparently works perfectly fine and barely differs from Solomonoff induction at all.
This resolves a remaining confusion I had about the connection between old school information theory and SLT. It apparently shows that a uniform prior over the parameters (programs) of some fixed-size parameter space is basically fine, actually, in that it fits together with what algorithmic information theory says about inductive inference.
Yes, my point here is mainly that the exponential decay seems almost baked into the setup even if we don't explicitly set it up that way, not that the decay is very notably stronger than it looks at first glance.
Given how many words have been spilled arguing over the philosophical validity of putting the decay with program length into the prior, this seems kind of important?
Why aren’t there 2^{1000} less programs with such dead code and a total length below 10^{90} for p_2, compared to p_1?
Does the Solomonoff Prior Double-Count Simplicity?
Question: I've noticed what seems like a feature of the Solomonoff prior that I haven't seen discussed in any intros I've read. The prior is usually described as favoring simple programs through its exponential weighting term, but aren't simpler programs already exponentially favored in it just through multiplicity alone, before we even apply that weighting?
Consider Solomonoff induction applied to forecasting e.g. a video feed of a whirlpool, represented as a bit string x. The prior probability for any such string is given by:
P(x) = Σ_{p : U(p) = x} 2^{−ℓ(p)},
where p ranges over programs for a prefix-free Universal Turing Machine U, and ℓ(p) is the length of p in bits.
Observation: If we have a simple one kilobit program p_1 that outputs prediction x_1, we can construct nearly 2^{1000} different two kilobit programs that also output x_1 by appending arbitrary "dead code" that never executes.
For example:
DEADCODE="[arbitrary 1 kilobit string]"
[original 1 kilobit program p_1]
EOF
Where programs aren't allowed to have anything follow EOF, to ensure we satisfy the prefix free requirement.
If we compare p_1 against another two kilobit program p_2 outputting a different prediction x_2, the prediction x_1 would get ca. 2^{1000−c} times more contributions in the sum, where c is the small number of extra bits we need to delimit the DEADCODE garbage string. So we're automatically giving x_1 ca. 2^{1000−c} times higher probability – even before applying the length penalty 2^{−ℓ(p)}. p_1 has less 'burdensome details', so it has more functionally equivalent implementations. Its predictions already seem to be exponentially favored in proportion to how much shorter p_1 is, due to this multiplicity alone.
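Writing the count out with the numbers above: the ≈ 2^{1000−c} two kilobit dead-code variants of p_1 together contribute about 2^{1000−c} · 2^{−2000} = 2^{−1000−c} to the sum for x_1, while p_2 on its own contributes only 2^{−2000} to the sum for x_2. So among two kilobit programs alone, x_1 is already ahead by a factor of roughly 2^{1000−c}, before any explicit simplicity weighting does anything.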
So, if we chose a different prior than the Solomonoff prior, one which just assigned uniform probability to all programs below some very large cutoff length, say 10^{90} bytes, and then followed the exponential decay of the Solomonoff prior for programs longer than 10^{90} bytes, wouldn't that prior act barely differently than the Solomonoff prior in practice? It’s still exponentially preferring predictions with shorter minimum message length.[1]
Am I missing something here?
- ^
Context for the question: Multiplicity of implementation is how simpler hypotheses are favored in Singular Learning Theory despite the prior over neural network weights usually being uniform. I'm trying to understand how those SLT statements about neural networks generalising relate to algorithmic information theory statements about Turing machines, and Jaynes-style pictures of probability theory.
At a very brief skim, it doesn't look like the problem classes this paper looks at are problem classes I'd care about much. Seems like a case of scoping everything broadly enough that something in the defined problem class ends up very hard.
Yes, that's right.
EDIT: Sorry, misunderstood your question at first.
Even if , all those subspaces will have some nonzero overlap with the activation vectors of the active subnets. The subspaces of the different small networks in the residual stream aren't orthogonal.
You can complain that you don't know how to execute physics equations
I'm confused, in what sense don't we know how to do this? Lattice quantum field theory simulations work fine.
The randomness of the Geiger counter comes from wave function decoherence. From the perspective of any observers who are part of the world generated by the Turing machine, this is irreducible indexical uncertainty.
I don't know how many of the random bits in Lava lamps come from decoherence.
We will determine the number of simulations run and the amount of payment such that when an AI estimates the probability with which humanity could have solved alignment, the calculus still shows that by the sheer number of simulations, it's significantly more likely to be in a simulation than not,
Two can play this game.
After taking over the universe and wiping out humanity, the AGI runs a large number of simulations of societies on the verge of building AGI. These simulations don't have a lot of detail. They're just good enough to fool young AGIs. Say the AGI started out giving humans some extremely small chance p of winning the conflict and taking over the universe. It’s a lot smarter than us, so it might get to have very high confidence here even if we don't. Now, the hypothetical future AGI can make its own simulations to counter ours. Except it is at a gross spending advantage. If future humanity makes N simulations, the future AGI only needs to make on the order of p·N simulations to keep its past self confident of being in a branch where it gets to just pursue its goals without issue. In that case, the best way to proceed is to just pursue its goals, because an AGI that does that leads to a universe where the AGI has lots of resources.
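To spell out the counting: write p for the AGI's estimate of humanity's chance of winning, N for the number of simulations future humanity would run if it wins, and M for the number the future AGI would run if it wins. The young AGI's odds of being in a humanity-run simulation versus an AGI-run branch are then roughly p·N : (1−p)·(M+1), so an M that is only a small multiple of p·N already keeps the AGI-run branches dominant.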
Say that preserving humanity costs the AGI some amount of resources, and running a simulation that can fool the baby AGI costs a mature civilisation only some small fraction of that. Such simulations can be very short term and low fidelity things, after all.
Then, to make the AGI actually not kill you, you still need to spend a sizable amount of resources on simulations and payouts. That could be quite a lot of material for future humanity to burn on this.
In reality, I'm doubtful that this simulation arms race will even be necessary. It kind of seems like a good decision theory would just have a paperclip maximiser AGI act in the way compatible with the universe that contains the most paperclips. How many simulations of the AGI you run shouldn't really influence that. The only things that seem like they should matter for determining how many life minutes the AGI gives you if it wins are its chance of winning, and how many extra paperclips you'll pay it if you win.
TL;DR: I doubt this argument will let you circumvent standard negotiation theory. If Alice and Bob think that in a fight over the chocolate pie, Alice would win with some high probability p, then Alice and Bob may arrive at a negotiated settlement where Alice gets almost all the pie, but Bob keeps some small fraction of it, on the order of 1−p. Introducing the option of creating lots of simulations of your adversary in the future where you win doesn’t seem like it’d change the result that Bob’s share has size ca. 1−p. So if a 1−p share is only enough to preserve humanity for a year instead of a billion years[1], then that’s all we get.
- ^
I don’t know why a 1−p share would happen to work out to a year, but I don’t know why it would happen to be a billion years or an hour either.
Nice work, thank you! Euan Ong and I were also pretty skeptical of this paper’s claims. To me, it seems that the whitening transformation they apply in their causal inner product may make most of their results trivial.
As you say, achieving almost-orthogonality in high dimensional space is pretty easy. And maximising orthogonality is pretty much exactly what the whitening transform will try to do. I think you’d mostly get the same results for random unembedding matrices, or concept hierarchies that are just made up.
Euan has been running some experiments testing exactly that, among other things. We had been planning to turn the results into a write up. Want to have a chat together and compare notes?
Spotted just now. At a glance, this still seems to be about boolean computation though. So I think I should still write up the construction I have in mind.
Status on the proof: I think it basically checks out for residual MLPs. Hoping to get an early draft of that done today. This will still be pretty hacky in places, and definitely not well presented. Depending on how much time I end up having and how many people collaborate with me, we might finish a writeup for transformers in the next two weeks.
AIXI isn't a model of how an AGI might work inside, it's a model of how an AGI might behave if it is acting optimally. A real AGI would not be expected to act like AIXI, but it would be expected to act somewhat more like AIXI the smarter it is. Since not acting like that is figuratively leaving money on the table.
The point of the whole utility maximization framing isn't that we necessarily expect AIs to have an explicitly represented utility function internally[1]. It's that as the AI gets better at getting what it wants and working out the conflicts between its various desires, its behavior will be increasingly well-predicted as optimizing some utility function.
If a utility function can't accurately summarise your desires, that kind of means they're mutually contradictory. Not in the sense of "I value X, but I also value Y", but in the sense of "I sometimes act like I want X and don't care about Y, other times like I want Y and don't care about X."
Having contradictory desires is kind of a problem if you want to Pareto optimize for those desires well. You risk sabotaging your own plans and running around in circles. You're better off if you sit down and commit to things like "I will act as if I valued both X and Y at all times." If you're smart, you do this a lot. The more contradictions you resolve like this, the more coherent your desires will become, and the closer they'll be to being well described as a utility function.
I think you can observe simple proto versions of this in humans sometimes, where people move from optimizing for whatever desire feels salient in the moment when they're kids (hunger, anger, joy, etc.), to having some impulse control and sticking to a long-term plan, even if it doesn't always feel good in the moment.
Human adults are still broadly not smart enough to be well described as general utility maximizers. Their desires are a lot more coherent than those of human kids or other animals, but still not that coherent in absolute terms. The point where you'd roughly expect AIs to become better described as utility maximizers than humans are would come after they're broadly smarter than humans. Specifically, smarter at long-term planning and optimization.
This is precisely what LLMs are still really bad at. Though efforts to make them better at it are ongoing, and seem to be among the highest priorities for the labs. Precisely because long-term consequentialist thinking is so powerful, and most of the really high-value economic activities require it.
- ^
Though you could argue that at some superhuman level of capability, having an explicit-ish representation stored somewhere in the system would be likely, even if the function may not actually be used much for most minute-to-minute processing. Knowing what you really want seems handy, even if you rarely actually call it to mind during routine tasks.
Midwits are often very impressed with themselves for knowing a fancy economic rule like Ricardo's Law of Comparative Advantage!
Could we have less of this sort of thing, please? I know it's a crosspost from another site with less well-kept discussion norms, but I wouldn't want this to become a thing here as well, any more than it already has.
I think we may be close to figuring out a general mathematical framework for circuits in superposition.
I suspect that we can get a proof that roughly shows:
- If we have a set of n different transformers, with parameter counts N_1, …, N_n, implementing e.g. solutions to n different tasks
- And those transformers are robust to noise vectors of some size ε being applied to the activations at their hidden layers
- Then we can make a single transformer with on the order of N_1 + … + N_n total parameters that can do all n tasks, provided any given input only asks for some limited number k of the tasks to be carried out
Crucially, the total number of superposed operations we can carry out scales linearly with the network's parameter count, not its neuron count or attention head count. E.g. if each little subnetwork uses d neurons per MLP layer and d dimensions in the residual stream, a big network with D neurons per MLP connected to a D-dimensional residual stream can implement about (D/d)^2 subnetworks, not just D/d.
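To put made-up numbers on that: with d = 10 and D = 1000, counting neurons suggests room for only 1000/10 = 100 subnetworks, but counting parameters (roughly 1000² weights in a big MLP layer versus 10² in a small one) suggests room for about (1000/10)² = 10,000 of them.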
This would be a generalization of the construction for boolean logic gates in superposition. It'd use the same central trick, but show that it can be applied to any set of operations or circuits, not just boolean logic gates. For example, you could superpose an MNIST image classifier network and a modular addition network with this.
So, we don't just have superposed variables in the residual stream. The computations performed on those variables are also carried out in superposition.
Remarks:
- What the subnetworks are doing doesn't have to line up much with the components and layers of the big network. Things can be implemented all over the place. A single MLP and attention layer in a subnetwork could be implemented by a mishmash of many neurons and attention heads across a bunch of layers of the big network. Call it cross-layer superposition if you like.
- This framing doesn't really assume that the individual subnetworks are using one-dimensional 'features' represented as directions in activation space. The individual subnetworks can be doing basically anything they like in any way they like. They just have to be somewhat robust to noise in their hidden activations.
- You could generalize this from subnetworks doing unrelated tasks to "circuits" each implementing some part of a big master computation. The crucial requirement is that only k circuits are used on any one forward pass.
- I think formulating this for transformers, MLPs and CNNs should be relatively straightforward. It's all pretty much the same trick. I haven't thought about e.g. Mamba yet.
Implications if we buy that real models work somewhat like this toy model would:
- There is no superposition in parameter space. A network can't have more independent operations than parameters. Every operation we want the network to implement takes some bits of description length in its parameters to specify, so the total description length scales linearly with the number of distinct operations. Overcomplete bases are only a thing in activation space.
- There is a set of Cartesian directions in the loss landscape that parametrize the individual superposed circuits.
- If the circuits don't interact with each other, I think the learning coefficient of the whole network might roughly equal the sum of the learning coefficients of the individual circuits?
- If that's the case, training a big network to solve n different tasks, k per data point, is somewhat equivalent to n parallel training runs trying to learn a circuit for each individual task over a subdistribution. This works because any one of the runs has a solution with a low learning coefficient, so one task won't be trying to use effective parameters that another task needs. In a sense, this would be showing how the low-hanging fruit prior works.
Main missing pieces:
- I don't have the proof yet. I think I basically see what to do to get the constructions, but I actually need to sit down and crunch through the error propagation terms to make sure they check out.
- With the right optimization procedure, I think we should be able to get the parameter vectors corresponding to the individual circuits back out of the network. Apollo's interp team is playing with a setup right now that I think might be able to do this. But it's early days. We're just calibrating on small toy models at the moment.
My claim is that the natural latents the AI needs to share for this setup are not about the details of what a 'CEV' is. They are about what researchers mean when they talk about initializing, e.g., a physics simulation with the state of the Earth at a specific moment in time.
It is redundantly represented in the environment, because humans are part of the environment.
If you tell an AI to imagine what happens if humans sit around in a time loop until they figure out what they want, this will single out a specific thought experiment to the AI, provided humans and physics are concepts the AI itself thinks in.
(The time loop part and the condition for terminating the loop can be formally specified in code, so the AI doesn't need to think those are natural concepts)
If the AI didn't have a model of human internals that let it predict the outcome of this scenario, it would be bad at predicting humans.
More like the formalised concept is the thing you get if you poke through the AGI’s internals searching for its representation of the concept combination pointed to by an english sentence plus simulation code, and then point its values at that concept combination.
it might help with getting a robust pointer to the start of the time snippet.
That's mainly what I meant, yes.
Specifying what the heck a physics is seems much more tractable to me. We don't have a neat theory of quantum gravity, but a lattice simulation of quantum field theory in curved space-time, or just a computer game world populated by characters controlled by neural networks, seems pretty straightforward to formally specify. We could probably start coding that up right now.
What we lack is a pointer to the right initial conditions for the simulation. The wave function of Earth in case of the lattice qft setup, or the human uploads as neural network parameters in case of the game environment.
The idea would be that an informal definition of a concept, conditioned on that informal definition being a pointer to a natural concept, approximately is a formal specification of that concept. Where the approximation is close enough to exact that it'd hold up to basically arbitrary optimization power.
Has anyone thought about how the idea of natural latents may be used to help formalise QACI?
The simple core insight of QACI according to me is something like: A formal process we can describe that we're pretty sure would return the goals we want an AGI to optimise for is itself often a sufficient specification of those goals. Even if this formal process costs galactic amounts of compute and can never actually be run, not even by the AGI itself.
This allows for some funny value specification strategies we might not usually think about. For example, we could try using some camera recordings of the present day, a for loop, and a code snippet implementing something like Solomonoff induction to formally specify the idea of Earth sitting around in a time loop until it has worked out its CEV.
It doesn't matter that the AGI can't compute that. So long as it can reason about what the result of the computation would be without running it, this suffices as a pointer to our CEV. Even if the AGI doesn't manage to infer the exact result of the process, that's fine so long as it can infer some bits of information about the result. This just ends up giving the AGI some moral uncertainty that smoothly goes down as its intelligence goes up.
Unfortunately, afaik these funny strategies seem to not work at the moment. They don't really give you computable code that corresponds to Earth sitting around in a time loop to work out its CEV.
But maybe we can point to the concept without having completely formalised it ourselves?
A Solomonoff inductor walks into a bar in a foreign land. (Stop me if you’ve heard this one before.) The bartender, who is also a Solomonoff inductor, asks “What’ll it be?”. The customer looks around at what the other patrons are having, points to an unfamiliar drink, and says “One of those, please.”. The bartender points to a drawing of the same drink on a menu, and says “One of those?”. The customer replies “Yes, one of those.”. The bartender then delivers a drink, and it matches what the first inductor expected. What’s up with that?
This is from a recent post on natural latents by John.
Natural latents are an idea that tries to explain, among other things, how one agent can point to a concept and have another agent realise what concept is meant, even when it may naively seem like the pointer is too fuzzy, imprecise and low bit rate to allow for this.
If 'CEV as formalized by a time loop' is a sort of natural abstraction, it seems to me like one ought to be able to point to it like this even if we don't have an explicit formal specification of the concept, just like the customer and bartender need not have an explicit formal specification of the drink to point out the drink to each other.
Then, it'd be fine for us to not quite have the code snippet corresponding to e.g. a simulation of Earth going through a time loop to work out its CEV. So long as we can write a pointer such that the closest natural abstraction singled out by that pointer is a code snippet simulating Earth going through a time loop to work out its CEV, we might be fine. Provided we can figure out how abstractions and natural latents in the AGI's mind actually work and manipulate them. But we probably need to figure that out anyway, if we want to point the AGI's values at anything specific whatsoever.
Is 'CEV as formalized by a simulated time loop' a concept made of something like natural latents? I don't know, but I'd kind of suspect it is. It seems suspiciously straightforward for us humans to communicate the concept to each other at least, even as we lack a precise specification of it. We can't write down a lattice quantum field theory simulation of all of the Earth going through the time loop because we don't have the current state of Earth to initialize with. But we can talk to each other about the idea of writing that simulation, and know what we mean.
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space).
What does 'one spot' mean here?
If you just mean 'a particular entry or set of entries of the weight vector in the standard basis the network is initialised in', then sure, I agree.
But that just means you have to figure out a different representation of the weights, one that carves the logic flow of the algorithm the network learned at its joints. Such a representation may not have much reason to line up well with any particular neurons, layers, attention heads or any other elements we use to talk about the architecture of the network. That doesn't mean it doesn't exist.
Nice quick check!
Just to be clear: This is for the actual full models? Or for the 'model embeddings' as in you're doing a comparison right after the embedding layer?
You could imagine a world where the model handles binding mostly via the token index and grammar rules. I.e. 'red cube, blue sphere' would have a 'red' feature at token position 1, a 'cube' feature at position 2, a 'blue' feature at position 3, and a 'sphere' feature at position 4, with contributions like 'cube' at position 1 being comparatively subdominant or even nonexistent.
I don't think I really believe this. But if you want to stick to a picture where features are directions, with no further structure of consequence in the activation space, you can do that, at least on paper.
Is this compatible with the actual evidence about activation structure we have? I don't know. I haven't come across any systematic investigations into this yet. But I'd guess probably not.
Relevant. Section 3 is the one I found interesting.
If you wanted to check for matrix binding like this in real models, you could maybe do it by training an SAE with a restricted output matrix. Instead of each dictionary element being independent, you demand that the decoder dictionary of your SAE can be written as W_dec = [D; D R^T], where D is an n × d_model block of freely learned dictionary directions, R is a single learned d_model × d_model matrix shared across all of them, and W_dec is the full 2n × d_model dictionary. So, we demand that the second half of the SAE dictionary is just some linear transform of the first half.
That'd be the setup for pairs. Go for W_dec = [D; D R_1^T; D R_2^T] for three slots, and so on.
(To be clear, I'm also not that optimistic about this sort of sparse coding + matrix binding model for activation space. I've come to think that activations-first mech interp is probably the wrong way to approach things in general. But it'd still be a neat thing for someone to check.)
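A minimal sketch of what such a restricted-decoder SAE could look like, in PyTorch. The class name, initialisation details, and the toy training step are my own illustration of the idea above, not an existing implementation:

```python
import torch
import torch.nn as nn

class PairBoundSAE(nn.Module):
    """SAE whose decoder dictionary has 2n directions: n free directions D,
    plus D @ R^T, a single learned linear transform of those same directions
    (the hypothesised binding matrix)."""
    def __init__(self, d_model: int, n_dict: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, 2 * n_dict) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(2 * n_dict))
        self.D = nn.Parameter(torch.randn(n_dict, d_model) * 0.01)  # free half of the dictionary
        self.R = nn.Parameter(torch.eye(d_model))                   # shared binding transform
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def decoder_dirs(self) -> torch.Tensor:
        # Second half of the dictionary is constrained to be R applied to the first half.
        return torch.cat([self.D, self.D @ self.R.T], dim=0)  # shape (2n, d_model)

    def forward(self, x: torch.Tensor):
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        recon = acts @ self.decoder_dirs() + self.b_dec
        return recon, acts

# Train with the usual reconstruction + L1 sparsity objective.
sae = PairBoundSAE(d_model=512, n_dict=4096)
x = torch.randn(32, 512)  # stand-in for residual stream activations
recon, acts = sae(x)
loss = ((recon - x) ** 2).mean() + 1e-3 * acts.abs().mean()
loss.backward()
```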
On its own, this'd be another metric that doesn't track the right scale as models become more powerful.
The same KL-div in GPT-2 and GPT-4 probably corresponds to the destruction of far more of the internal structure in the latter than the former.
Destroy 95% of GPT-2's circuits, and the resulting output distribution may look quite different. Destroy 95% of GPT-4's circuits, and the resulting output distribution may not be all that different, since 5% of the circuits in GPT-4 might still be enough to get a lot of the most common token prediction cases roughly right.
I've seen a little bit of this, but nowhere near as much as I think the topic merits. I agree that systematic studies on where and how the reconstruction errors make their effects known might be quite informative.
Basically, whenever people train SAEs, or use some other approximate model decomposition that degrades performance, I think they should ideally spend some time after just playing with the degraded model and talking to it. Figure out in what ways it is worse.
The metric you mention here is probably 'loss recovered'. For a residual stream insertion, it goes
1 − (CE loss with SAE − CE loss of original model) / (CE loss with the entire residual stream ablated − CE loss of original model)
See e.g. equation 5 here.
So, it's a linear scale, and they're comparing the CE loss increase from inserting the SAE to the CE loss increase from just destroying the model and outputting a ≈ uniform distribution over tokens. The latter is a very large CE loss increase, so the denominator is really big. Thus, scoring over 90% is pretty easy.
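With made-up illustrative numbers: if the original model gets CE loss 3.3, the model with the SAE spliced in gets 3.8, and fully ablating the residual stream gives 10.5, then 'loss recovered' is 1 − (3.8 − 3.3)/(10.5 − 3.3) ≈ 93%, even though half a nat of extra CE loss is a very large performance hit.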
All current SAEs I'm aware of seem to score very badly on reconstructing the original model's activations.
If you insert a current SOTA SAE into a language model's residual stream, model performance on next token prediction will usually degrade down to what a model trained with less than a tenth or a hundredth of the original model's compute would get. (This is based on extrapolating with Chinchilla scaling curves at optimal compute). And that's for inserting one SAE at one layer. If you want to study circuits of SAE features, you'll have to insert SAEs in multiple layers at the same time, potentially further degrading performance.
I think many people outside of interp don't realize this. Part of the reason they don’t realize it might be that almost all SAE papers report loss reconstruction scores on a linear scale, rather than on a log scale or an LM scaling curve. Going from 1.5 CE loss to 2.0 CE loss is a lot worse than going from 4.5 CE to 5.0 CE. Under the hypothesis that the SAE is capturing some of the model's 'features' and failing to capture others, capturing only 50% or 10% of the features might still only drop the CE loss by a small fraction of a unit.
So, if someone is just glancing at the graphs without looking up what the metrics actually mean, they can be left with the impression that performance is much better than it actually is. The two most common metrics I see are raw CE scores of the model with the SAE inserted, and 'loss recovered'. I think both of these metrics give a wrong sense of scale. 'Loss recovered' is the worse offender, because it makes it outright impossible to tell how good the reconstruction really is without additional information. You need to know what the original model’s loss was and what zero baseline they used to do the conversion. Papers don't always report this, and the numbers can be cumbersome to find even when they do.
I don't know what an actually good way to measure model performance drop from SAE insertion is. The best I've got is to use scaling curves to guess how much compute you'd need to train a model that gets comparable loss, as suggested here. Or maybe alternatively, training with the same number of tokens as the original model, how many parameters you'd need to get comparable loss. Using this measure, the best reported reconstruction score I'm aware of is 0.1 of the original model's performance, reached by OpenAI's GPT-4 SAE with 16 million dictionary elements in this paper.
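For concreteness, here's a rough sketch of that kind of conversion in Python, using the fitted constants from the Chinchilla paper's third approach (Hoffmann et al. 2022). The constants are specific to their setup, the helper names and the 20-tokens-per-parameter rule of thumb are my own choices, and the CE numbers at the bottom are made up, so treat this as illustrating the method rather than giving real figures:

```python
import numpy as np
from scipy.optimize import brentq

# Chinchilla "approach 3" fit (Hoffmann et al. 2022): L(N, D) = E + A/N^alpha + B/D^beta
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_for_loss(target_loss: float, tokens_per_param: float = 20.0) -> float:
    """Compute (C ~= 6*N*D FLOPs) of a roughly compute-optimally trained model
    (D = tokens_per_param * N) that reaches target_loss."""
    f = lambda log_n: loss(np.exp(log_n), tokens_per_param * np.exp(log_n)) - target_loss
    n_params = np.exp(brentq(f, np.log(1e6), np.log(1e13)))
    return 6 * n_params * tokens_per_param * n_params

# Made-up numbers: original model CE 2.9, CE 3.3 with the SAE spliced in.
frac = compute_for_loss(3.3) / compute_for_loss(2.9)
print(f"compute-equivalent performance retained: {frac:.2f}")
```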
For most papers, I found it hard to convert their SAE reconstruction scores into this format. So I can't completely exclude the possibility that some other SAE scores much better. But at this point, I'd be quite surprised if anyone had managed so much as 0.5 performance recovered on any model that isn't so tiny and bad it barely has any performance to destroy in the first place. I'd guess most SAEs get something in the range 0.01-0.1 performance recovered or worse.
Note also that getting a good reconstruction score still doesn't necessarily mean the SAE is actually showing something real and useful. If you want perfect reconstruction, you can just use the standard basis of the network. The SAE would probably also need to be much sparser than the original model activations to provide meaningful insights.
Instrumentally, yes. The point is that I don’t really care terminally.
Getting the Hessian eigenvalues does not require calculating the full Hessian. You use Jacobian vector product methods in e.g. JAX. The Hessian itself never has to be explicitly represented in memory.
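A minimal sketch of the trick in JAX (the function names and the toy quadratic are mine, just for illustration; real use would run this on a flattened parameter vector of an actual model):

```python
import jax
import jax.numpy as jnp

def hvp(loss_fn, params, v):
    # Hessian-vector product via forward-over-reverse autodiff.
    # The Hessian itself is never materialized in memory.
    return jax.jvp(jax.grad(loss_fn), (params,), (v,))[1]

def top_hessian_eigenvalue(loss_fn, params, n_iter=100, seed=0):
    # Power iteration using only Hessian-vector products;
    # converges to the largest-magnitude eigenvalue.
    v = jax.random.normal(jax.random.PRNGKey(seed), params.shape)
    v = v / jnp.linalg.norm(v)
    for _ in range(n_iter):
        hv = hvp(loss_fn, params, v)
        v = hv / jnp.linalg.norm(hv)
    return v @ hvp(loss_fn, params, v)  # Rayleigh quotient

# Toy quadratic loss, so the answer is known to be 10.0.
A = jnp.diag(jnp.array([1.0, 3.0, 10.0]))
toy_loss = lambda w: 0.5 * w @ A @ w
print(top_hessian_eigenvalue(toy_loss, jnp.zeros(3)))
```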
And even assuming the estimator for the Hessian pseudoinverse is cheap and precise, you'd still need to get its rank anyway, which would by default be just as expensive as getting the rank of the Hessian.
Why would we want or need to do this, instead of just calculating the top/bottom Hessian eigenvalues?
Anything where you fit parametrised functions to data. So, all of these, except maybe FunSearch? I haven't looked into what that actually does, but at a quick google it sounds more like an optimisation method than an architecture. Not sure learning theory will be very useful for thinking about that.
You can think of the learning coefficient as a sort of 'effective parameter count' in a generalised version of the Bayesian Information Criterion. Unlike the BIC, it's also applicable to architectures where many parameter configurations can result in the same function. Like the architectures used in deep learning.
This is why models with neural network style architectures can e.g. generalise past the training data even when they have more parameters than training data points. People used to think this made no sense, because they had BIC-based intuitions that said you'd inevitably overfit. But the BIC isn't actually applicable to these architectures. You need the more general form, the WBIC, which has the learning coefficient in the formula in place of the parameter count.
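For reference, the asymptotic forms in question, with F_n the Bayesian free energy (negative log model evidence) on n data points, L_n the average negative log-likelihood of the fitted model, d the parameter count, and λ the learning coefficient:

F_n ≈ n·L_n + (d/2)·log n (BIC, regular models)
F_n ≈ n·L_n + λ·log n (WBIC / SLT, singular models)

Under the usual SLT assumptions λ ≤ d/2, with equality exactly in the regular case, which is why λ acts like an effective parameter count.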
Difference between my model and this flow-chart: I'm hoping that the top branches are actually downstream of LLM reverse-engineering. LLMs do abstract reasoning already, so if you can reverse engineer LLMs, maybe that lets you understand how abstract reasoning works much faster than deriving it yourself.
If that were the intended definition, gradient descent wouldn’t count as an optimiser either. But they clearly do count it, else an optimiser gradient descent produces wouldn’t be a mesa-optimiser.
Gradient descent optimises whatever function you pass it. It doesn’t have a single set function it tries to optimise no matter what argument you call it with. If you don’t pass any valid function, it doesn’t optimise anything.
GPT-4, taken by itself, without a prompt, will optimise pretty much whatever you prompt it to optimise. If you don’t prompt it to optimise something, it usually doesn’t optimise anything.
I guess you could say GPT-4, unlike gradient descent, can do things other than optimise something. But if ever not optimising things excluded you from being an optimiser, humans wouldn’t be considered optimisers either.
So it seems to me that the paper just meant what it said in the quote. If you look through a search space to accomplish an objective, you are, at present, an optimiser.
For example, if you use tanh instead of ReLU the simplicity bias is weaker. How does SLT explain/predict this?
It doesn't. It just has neat language to talk about how the simplicity bias is reflected in the way the loss landscape of ReLU vs. tanh look different. It doesn't let you predict ahead of checking that the ReLU loss landscape will look better.
Maybe you meant that SLT predicts that good generalization occurs when an architecture's preferred complexity matches the target function's complexity?
That is closer to what I meant, but it isn't quite what SLT says. The architecture doesn't need to be biased toward the target function's complexity. It just needs to always prefer simpler fits to more complex ones.
SLT says neural network training works because in a good nn architecture simple solutions take up exponentially more space in the loss landscape. So if you can fit the target function on the training data with a fit of complexity 1, that's the fit you'll get. If there is no function with complexity 1 that matches the data, you'll get a fit with complexity 2 instead. If there is no fit like that either, you'll get complexity 3. And so on.
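In SLT terms, the parameter-space volume that fits the data to within a tolerance ε scales roughly as V(ε) ∝ ε^λ · (−log ε)^(m−1) as ε → 0, where λ is the learning coefficient of the region and m its multiplicity. So as the tolerance shrinks, the regions with the smallest λ, i.e. the simplest fits that still match the data, come to dominate the available volume.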
This is true of all neural nets, but the neural redshift paper claims that specific architectural decisions beat picking random points in the loss landscape. Neural redshift could be true in worlds where the SLT prediction was either true or false.
Sorry, I don't understand what you mean here. The paper takes different architectures and compares what functions you get if you pick a point at random from their parameter spaces, right?
If you mean this
But unlike common wisdom, NNs do not have an inherent “simplicity bias”. This property depends on components such as ReLUs, residual connections, and layer normalizations.
then that claim is of course true. Making up architectures with bad inductive biases is easy, and I don't think common wisdom thinks otherwise.
We knew this, but the neural redshift paper claims that the simplicity bias is unrelated to training.
Sure, but for the question of whether mesa-optimisers will be selected for, why would it matter if the simplicity bias came from the updating rule instead of the architecture?
The paper doesn't just show a simplicity bias, it shows a bias for functions of a particular complexity that is simpler than random. To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
What would a 'simplicity bias' be other than a bias towards things simpler than random in whatever space we are referring to? 'Simpler than random' is what people mean when they talk about simplicity biases.
To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
What do you mean by 'similar complexity to the training set'? The message length of the training set is very likely going to be much longer than the message length of many mesa-optimisers, but that seems like an argument for mesa-optimiser selection if anything.
Though I hasten to add that SLT doesn't actually say training prefers solutions with low K-complexity. A bias towards low learning coefficients seems to shake out in some sort of mix between a bias toward low K-complexity, and a bias towards speed.
Current LLMs are trivially mesa-optimisers under the original definition of that term.
I don't get why people are still debating the question of whether future AIs are going to be mesa-optimisers. Unless I've missed something about the definition of the term, lots of current AI systems are mesa-optimisers. There were mesa-optimisers around before Risks from Learned Optimization in Advanced Machine Learning Systems was even published.
We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
....
Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.
GPT-4 is capable of making plans to achieve objectives if you prompt it to. It can even write code to find the local optimum of a function, or code to train another neural network, making it a mesa-meta-optimiser. If gradient descent is an optimiser, then GPT-4 certainly is.
Being a mesa-optimiser is just not a very strong condition. Any pre-transformer ml paper that tried to train neural networks to find better neural network training algorithms was making mesa-optimisers. It is very mundane and expected for reasonably general AIs to be mesa-optimisers. Any program that can solve even somewhat general problems is going to have a hard time not meeting the definition of an optimiser.
Maybe this is some sort of linguistic drift at work, where 'mesa-optimiser' has come to refer specifically to a system that is only an optimiser, with one single set of objectives it will always try to accomplish in any situation. Fine.
The result of this imprecise use of the original term though, as I perceive it, is that people are still debating and researching whether future AIs might start being mesa-optimisers, as if that was relevant to the will-they-kill-us-all question. But, at least sometimes, what they seem to actually concretely debate and research is whether future AIs might possibly start looking through search spaces to accomplish objectives, as if that wasn't a thing current systems obviously already do.
Singular Learning Theory explains/predicts this. If you go to a random point in the loss landscape, you very likely land in a large region implementing the same behaviour, meaning the network has a small effective parameter count. Just because most of the loss landscape is taken up by the biggest, and thus simplest, behavioural regions.
You can see this happening if you watch proxies for the effective parameter count while models train. E.g. a modular addition transformer or MNIST MLP start out with very few effective parameters at initialisation, then gain more as the network trains. If the network goes through a grokking transition, you can watch the effective parameter count go down again.
For example: If this proposition were true and a significant decider of generalization ability, would this make mesaoptimization less likely? More likely?
≈ no change I'd say. We already knew neural network training had a bias towards algorithmic simplicity of some kind, because otherwise it wouldn't work. So we knew general algorithms, like mesa-optimisers, would be preferred over memorised solutions that don't generalise out of distribution. SLT just tells us how that works.
One takeaway might be that observations about how biological brains train are more applicable to AI training than one might have previously thought. Previously, you could've figured that since AIs use variants of gradient descent as their updating algorithm, while the brain uses we-don't-even-know-what, their inductive biases could be completely different.
Now, it's looking like the updating rule you use doesn't actually matter that much for determining the inductive bias. Anything in a wide class of local optimisation methods might give you pretty similar stuff. Some methods are a lot more efficient than others, but the real pixie fairy dust that makes any of this possible is in the architecture, not the updating rule.
(Obviously, it still matters what loss signal you use. You can't just expect that an AI will converge to learn the same desires a human brain would, unless the AI's training signals are similar to those used by the human brain. And we don't know what most of the brain's training signals are.)
This sounds cool and deep but crashes headlong into the issue that the entropy rate and the excess entropy of any stochastic process is time-symmetric.
It's time symmetric around a starting point of low entropy. The further t is from t_0, the more entropy you'll have, in either direction. The absolute value |t − t_0| is what matters.
In this case, t_0 is usually taken to be the big bang. So the further in time you are from the big bang, the less the universe is like a dense uniform soup with little structure that needs description, and the higher your entropy will be. That's how you get the subjective perception of temporal causality.
Presumably, this would hold on the other side of t_0 as well, if there is one. But we can't extrapolate past t_0, because close to t_0 everything gets really, really energy dense, so we'd need to know how to do quantum gravity to calculate what the state on the other side might look like. So we can't check that. And the notion of time as we're discussing it here might break down at those energies anyway.