Posts
Comments
Agreed. I do value methods being architecture independent, but mostly just because of this:
and maybe a sign that a method is principled
At scale, different architectures trained on the same data seem to converge to learning similar algorithms to some extent. I care about decomposing and understanding these algorithms, independent of the architecture they happen to be implemented on. If a mech interp method is formulated in a mostly architecture independent manner, I take that as a weakly promising sign that it's actually finding the structure of the learned algorithm, instead of structure related to the implmentation on one particular architecture.
for a large enough (overparameterized) architecture - in other words it can be measured by the
The sentence seems cut off.
Sure. But what’s interesting to me here is the implication that, if you restrict yourself to programs below some maximum length, weighing them uniformly apparently works perfectly fine and barely differs from Solomonoff induction at all.
This resolves a remaining confusion I had about the connection between old school information theory and SLT. It apparently shows that a uniform prior over parameters (programs) of some fixed size parameter space is basically fine, actually, in that it fits together with what algorithmic information theory says about inductive inference.
Yes, my point here is mainly that the exponential decay seems almost baked into the setup even if we don't explicitly set it up that way, not that the decay is very notably stronger than it looks at first glance.
Given how many words have been spilled arguing over the philosophical validity of putting the decay with program length into the prior, this seems kind of important?
Why aren’t there 2^{1000} less programs with such dead code and a total length below 10^{90} for p_2, compared to p_1?
Does the Solomonoff Prior Double-Count Simplicity?
Question: I've noticed what seems like a feature of the Solomonoff prior that I haven't seen discussed in any intros I've read. The prior is usually described as favoring simple programs through its exponential weighting term, but aren't simpler programs already exponentially favored in it just through multiplicity alone, before we even apply that weighting?
Consider Solomonoff induction applied to forecasting e.g. a video feed of a whirlpool, represented as a bit string . The prior probability for any such string is given by:
where ranges over programs for a prefix-free Universal Turing Machine.
Observation: If we have a simple one kilobit program that outputs prediction , we can construct nearly different two kilobit programs that also output by appending arbitrary "dead code" that never executes.
For example:
DEADCODE="[arbitrary 1 kilobit string]"
[original 1 kilobit program ]
EOF
Where programs aren't allowed to have anything follow EOF, to ensure we satisfy the prefix free requirement.
If we compare against another two kilobit program outputting a different prediction , the prediction from would get more contributions in the sum, where is the very small number of bits we need to delimit the DEADCODE garbage string. So we're automatically giving ca. higher probability – even before applying the length penalty . has less 'burdensome details', so it has more functionally equivalent implementations. Its predictions seem to be exponentially favored in proportion to its length already due to this multiplicity alone.
So, if we chose a different prior than the Solomonoff prior which just assigned uniform probability to all programs below some very large cutoff, say bytes:
and then followed the exponential decay of the Solomonoff prior for programs longer than bytes, wouldn't that prior act barely differently than the Solomonoff prior in practice? It’s still exponentially preferring predictions with shorter minimum message length.[1]
Am I missing something here?
- ^
Context for the question: Multiplicity of implementation is how simpler hypotheses are favored in Singular Learning Theory despite the prior over neural network weights usually being uniform. I'm trying to understand how those SLT statements about neural networks generalising relate to algorithmic information theory statements about Turing machines, and Jaynes-style pictures of probability theory.
At a very brief skim, it doesn't look like the problem classes this paper looks at are problem classes I'd care about much. Seems like a case of scoping everything broadly enough that something in the defined problem class ends up very hard.
Yes, that's right.
EDIT: Sorry, misunderstood your question at first.
Even if , all those subspaces will have some nonzero overlap with the activation vectors of the active subnets. The subspaces of the different small networks in the residual stream aren't orthogonal.
You can complain that you don't know how to execute physics equations
I'm confused, in what sense don't we know how to do this? Lattice quantum field theory simulations work fine.
The randomness of the Geiger counter comes from wave function decoherence. From the perspective of any observers who are part of the world generated by the Turing machine, this is irreducible indexical uncertainty.
I don't know how many of the random bits in Lava lamps come from decoherence.
We will determine the number of simulations run and the amount of payment such that when an AI estimates the probability with which humanity could have solved alignment, the calculus still shows that by the sheer number of simulations, it's significantly more likely to be in a simulation than not,
Two can play this game.
After taking over the universe and wiping out humanity, the AGI runs a large number of simulations of societies on the verge of building AGI. These simulations don't have a lot of detail. They're just good enough to fool young AGIs. Say the AGI started out giving humans an extremely small chance of winning the conflict and taking over the universe. It’s a lot smarter than us, it might get to have very high confidence here even if we don't. Now, the hypothetical future AGI can make its own simulations to counter ours. Except it is at a gross spending advantage. If future humanity makes simulations, the future AGI only needs to make simulations to keep its past self ca. confident of being in a branch where it gets to just pursue its goals without issue. In that case, the best way to proceed is to just pursue its goals, because an AGI that does that leads to a universe where the AGI has lots of resources.
Say that preserving humanity costs the AGI resources, and running a simulation that can fool the baby AGI costs a mature civilisation some small fraction of that. Such simulations can be very short term and low fidelity things, after all.
Then, to make the AGI actually not kill you, you need to spend ca. resources on simulations and payouts. That could be quite a lot of material for future humanity to burn on this.
In reality, I'm doubtful that this simulation arms race will even be necessary. It kind of seems like a good decision theory would just have a paperclip maximiser AGI act in the way compatible with the universe that contains the most paperclips. How many simulations of the AGI you run shouldn't really influence that. The only things that seem like they should matter for determining how many life minutes the AGI gives you if it wins are its chance of winning, and how many extra paperclips you'll pay it if you win.
TL;DR: I doubt this argument will let you circumvent standard negotiation theory. If Alice and Bob think that in a fight over the chocolate pie, Alice would win with some high probability , then Alice and Bob may arrive at a negotiated settlement where Alice gets almost all the pie, but Bob keeps some small fraction of it. Introducing the option of creating lots of simulations of your adversary in the future where you win doesn’t seem like it’d change the result that Bob’s share has size . So if is only enough to preserve humanity for a year instead of a billion years[1], then that’s all we get.
- ^
I don’t know why would happen to work out to a year, but I don’t know why it would happen be a billion years or an hour either.
Nice work, thank you! Euan Ong and me were also pretty skeptical of this paper’s claims. To me, it seems that the whitening transformation they apply in their causal inner product may make most of their results trivial.
As you say, achieving almost-orthogonality in high dimensional space is pretty easy. And maximising orthogonality is pretty much exactly what the whitening transform will try to do. I think you’d mostly get the same results for random unembedding matrices, or concept hierarchies that are just made up.
Euan has been running some experiments testing exactly that, among other things. We had been planning to turn the results into a write up. Want to have a chat together and compare notes?
Spotted just now. At a glance, this still seems to be about boolean computation though. So I think I should still write up the construction I have in mind.
Status on the proof: I think it basically checks out for residual MLPs. Hoping to get an early draft of that done today. This will still be pretty hacky in places, and definitely not well presented. Depending on how much time I end up having and how many people collaborate with me, we might finish a writeup for transformers in the next two weeks.
AIXI isn't a model of how an AGI might work inside, it's a model of how an AGI might behave if it is acting optimally. A real AGI would not be expected to act like AIXI, but it would be expected to act somewhat more like AIXI the smarter it is. Since not acting like that is figuratively leaving money on the table.
The point of the whole utility maximization framing isn't that we necessarily expect AIs to have an explicitly represented utility function internally[1]. It's that as the AI gets better at getting what it wants and working out the conflicts between its various desires, its behavior will be increasingly well-predicted as optimizing some utility function.
If a utility function can't accurately summarise your desires, that kind of means they're mutually contradictory. Not in the sense of "I value X, but I also value Y", but in the sense of "I sometimes act like I want X and don't care about Y, other times like I want Y and don't care about X."
Having contradictory desires is kind of a problem if you want to Pareto optimize for those desires well. You risk sabotaging your own plans and running around in circles. You're better off if you sit down and commit to things like "I will act as if I valued both X and Y at all times." If you're smart, you do this a lot. The more contradictions you resolve like this, the more coherent your desires will become, and the closer the'll be to being well described as a utility function.
I think you can observe simple proto versions of this in humans sometimes, where people move from optimizing for whatever desire feels salient in the moment when they're kids (hunger, anger, joy, etc.), to having some impulse control and sticking to a long-term plan, even if it doesn't always feel good in the moment.
Human adults are still broadly not smart enough to be well described as general utility maximizers. Their desires are a lot more coherent than those of human kids or other animals, but still not that coherent in absolute terms. The point where you'd roughly expect AIs to become well-described as utility maximizers more than humans are would come after they're broadly smarter than humans are. Specifically, smarter at long-term planning and optimization.
This is precisely what LLMs are still really bad at. Though efforts to make them better at it are ongoing, and seem to be among the highest priorities for the labs. Precisely because long-term consequentialist thinking is so powerful, and most of the really high-value economic activities require it.
- ^
Though you could argue that at some superhuman level of capability, having an explicit-ish representation stored somewhere in the system would be likely, even if the function may no actually be used much for most minute-to-minute processing. Knowing what you really want seems handy, even if you rarely actually call it to mind during routine tasks.
Midwits are often very impressed with themselves for knowing a fancy economic rule like Ricardo's Law of Comparative Advantage!
Could we have less of this sort of thing, please? I know it's a crosspost from another site with less well-kept discussion norms, but I wouldn't want this to become a thing here as well, any more than it already has.
I think we may be close to figuring out a general mathematical framework for circuits in superposition.
I suspect that we can get a proof that roughly shows:
- If we have a set of different transformers, with parameter counts implementing e.g. solutions to different tasks
- And those transformers are robust to size noise vectors being applied to the activations at their hidden layers
- Then we can make a single transformer with total parameters that can do all tasks, provided any given input only asks for tasks to be carried out
Crucially, the total number of superposed operations we can carry out scales linearly with the network's parameter count, not its neuron count or attention head count. E.g. if each little subnetwork uses neurons per MLP layer and dimensions in the residual stream, a big network with neurons per MLP connected to a -dimensional residual stream can implement about subnetworks, not just .
This would be a generalization of the construction for boolean logic gates in superposition. It'd use the same central trick, but show that it can be applied to any set of operations or circuits, not just boolean logic gates. For example, you could superpose an MNIST image classifier network and a modular addition network with this.
So, we don't just have superposed variables in the residual stream. The computations performed on those variables are also carried out in superposition.
Remarks:
- What the subnetworks are doing doesn't have to line up much with the components and layers of the big network. Things can be implemented all over the place. A single MLP and attention layer in a subnetwork could be implemented by a mishmash of many neurons and attention heads across a bunch of layers of the big network. Call it cross-layer superposition if you like.
- This framing doesn't really assume that the individual subnetworks are using one-dimensional 'features' represented as directions in activation space. The individual subnetworks can be doing basically anything they like in any way they like. They just have to be somewhat robust to noise in their hidden activations.
- You could generalize this from subnetworks doing unrelated tasks to "circuits" each implementing some part of a big master computation. The crucial requirement is that only circuits are used on any one forward pass.
- I think formulating this for transformers, MLPs and CNNs should be relatively straightforward. It's all pretty much the same trick. I haven't thought about e.g. Mamba yet.
Implications if we buy that real models work somewhat like this toy model would:
- There is no superposition in parameter space. A network can't have more independent operations than parameters. Every operation we want the network to implement takes some bits of description length in its parameters to specify, so the total description length scales linearly with the number of distinct operations. Overcomplete bases are only a thing in activation space.
- There is a set of Cartesian directions in the loss landscape that parametrize the individual superposed circuits.
- If the circuits don't interact with each other, I think the learning coefficient of the whole network might roughly equal the sum of the learning coefficients of the individual circuits?
- If that's the case, training a big network to solve different tasks, per data point, is somewhat equivalent to parallel training runs trying to learn a circuit for each individual task over a subdistribution. This works because any one of the runs has a solution with a low learning coefficient, so one task won't be trying to use effective parameters that another task needs. In a sense, this would be showing how the low-hanging fruit prior works.
Main missing pieces:
- I don't have the proof yet. I think I basically see what to do to get the constructions, but I actually need to sit down and crunch through the error propagation terms to make sure they check out.
- With the right optimization procedure, I think we should be able to get the parameter vectors corresponding to the individual circuits back out of the network. Apollo's interp team is playing with a setup right now that I think might be able to do this. But it's early days. We're just calibrating on small toy models at the moment.
My claim is that the natural latents the AI needs to share for this setup are not about the details of what a 'CEV' is. They are about what researchers mean when they talk about initializing, e.g., a physics simulation with the state of the Earth at a specific moment in time.
It is redundantly represented in the environment, because humans are part of the environment.
If you tell an AI to imagine what happens if humans sit around in a time loop until they figure out what they want, this will single out a specific thought experiment to the AI, provided humans and physics are concepts the AI itself thinks in.
(The time loop part and the condition for terminating the loop can be formally specified in code, so the AI doesn't need to think those are natural concepts)
If the AI didn't have a model of human internals that let it predict the outcome of this scenario, it would be bad at predicting humans.
More like the formalised concept is the thing you get if you poke through the AGI’s internals searching for its representation of the concept combination pointed to by an english sentence plus simulation code, and then point its values at that concept combination.
it might help with getting a robust pointer to the start of the time snippet.
That's mainly what I meant, yes.
Specifying what the heck a physics is seems much more tractable to me.We don't have a neat theory of quantum gravity, but a lattice simulation of quantum field theory in curved space-time, or just a computer game world populated by characters controlled by neural networks, seems pretty straightforward to formally specify. We could probably start coding that up right now.
What we lack is a pointer to the right initial conditions for the simulation. The wave function of Earth in case of the lattice qft setup, or the human uploads as neural network parameters in case of the game environment.
The idea would be that an informal definition of a concept conditioned on that informal definition being a pointer to a natural concept, is a formal specification of that concept. Where the is close enough to a that it'd hold up to basically arbitrary optimization power.
Has anyone thought about how the idea of natural latents may be used to help formalise QACI?
The simple core insight of QACI according to me is something like: A formal process we can describe that we're pretty sure would return the goals we want an AGI to optimise for is itself often a sufficient specification of those goals. Even if this formal process costs galactic amounts of compute and can never actually be run, not even by the AGI itself.
This allows for some funny value specification strategies we might not usually think about. For example, we could try using some camera recordings of the present day, a for loop, and a code snippet implementing something like Solomonof induction to formally specify the idea of Earth sitting around in a time loop until it has worked out its CEV.
It doesn't matter that the AGI can't compute that. So long as it can reason about what the result of the computation would be without running it, this suffices as a pointer to our CEV. Even if the AGI doesn't manage to infer the exact result of the process, that's fine so long as it can infer some bits of information about the result. This just ends up giving the AGI some moral uncertainty that smoothly goes down as its intelligence goes up.
Unfortunately, afaik these funny strategies seem to not work at the moment. They don't really give you computable code that corresponds to Earth sitting around in a time loop to work out its CEV.
But maybe we can point to the concept without having completely formalised it ourselves?
A Solomonoff inductor walks into a bar in a foreign land. (Stop me if you’ve heard this one before.) The bartender, who is also a Solomonoff inductor, asks “What’ll it be?”. The customer looks around at what the other patrons are having, points to an unfamiliar drink, and says “One of those, please.”. The bartender points to a drawing of the same drink on a menu, and says “One of those?”. The customer replies “Yes, one of those.”. The bartender then delivers a drink, and it matches what the first inductor expected.What’s up with that?
This is from a recent post on natural latents by John.
Natural latents are an idea that tries to explain, among other things, how one agent can point to a concept and have another agent realise what concept is meant, even when it may naively seem like the pointer is too fuzzy, impresice and low bit rate to allow for this.
If 'CEV as formalized by a time loop' is a sort of natural abstraction, it seems to me like one ought to be able to point to it like this even if we don't have an explicit formal specification of the concept, just like the customer and bartender need not have an explicit formal specification of the drink to point out the drink to each other.
Then, it'd be fine for us to not quite have the code snippet corresponding to e,g. a simulation of Earth going through a time loop to work out its CEV. So long as we can write a pointer such that the closest natural abstraction singled out by that pointer is a code snippet simulating Earth going through a time loop to work out its CEV, we might be fine. Provided we can figure out how abstractions and natural latents in the AGI's mind actually work and manipulate them. But we probably need to figure that out anyway, if we want to point the AGI's values at anything specific whatsoever.
Is 'CEV as formalized by a simulated time loop' a concept made of something like natural latents? I don't know, but I'd kind of suspect it is. It seems suspiciously straightforward for us humans to communicate the concept to each other at least, even as we lack a precise specification of it. We can't write down a lattice quantum field theory simulation of all of the Earth going through the time loop because we don't have the current state of Earth to initialize with. But we can talk to each other about the idea of writing that simulation, and know what we mean.
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space).
What does 'one spot' mean here?
If you just mean 'a particular entry or set of entries of the weight vector in the standard basis the network is initalised in', then sure, I agree.
But that just means you have to figure out a different representation of the weights, one that carves the logic flow of the algorithm the network learned at its joints. Such a representation may not have much reason to line up well with any particular neurons, layers, attention heads or any other elements we use to talk about the architecture of the network. That doesn't mean it doesn't exist.
Nice quick check!
Just to be clear: This is for the actual full models? Or for the 'model embeddings' as in you're doing a comparison right after the embedding layer?
You could imagine a world where the model handles binding mostly via the token index and grammar rules. I.e. 'red cube, blue sphere' would have a 'red' feature at token , 'cube' feature at token , 'blue' feature at , and 'sphere' feature at , with contributions like 'cube' at being comparatively subdominant or even nonexistent.
I don't think I really believe this. But if you want to stick to a picture where features are directions, with no further structure of consequence in the activation space, you can do that, at least on paper.
Is this compatible with the actual evidence about activation structure we have? I don't know. I haven't come across any systematic investigations into this yet. But I'd guess probably not.
Relevant. Section 3 is the one I found interesting.
If you wanted to check for matrix binding like this in real models, you could maybe do it by training an SAE with a restricted output matrix. Instead of each dictionary element being independent, you demand that for your SAE can be written as , where , , . So, we demand that the second half of the SAE dictionary is just some linear transform of the first half.
That'd be the setup for pairs. Go for three slots, and so on.
(To be clear, I'm also not that optimistic about this sort of sparse coding + matrix binding model for activation space. I've come to think that activations-first mech interp is probably the wrong way to approach things in general. But it'd still be a neat thing for someone to check.)
On its own, this'd be another metric that doesn't track the right scale as models become more powerful.
The same KL-div in GPT-2 and GPT-4 probably corresponds to the destruction of far more of the internal structure in the latter than the former.
Destroy 95% of GPT-2's circuits, and the resulting output distribution may look quite different. Destroy 95% of GPT-4's circuits, and the resulting output distribution may not be all that different, since 5% of the circuits in GPT-4 might still be enough to get a lot of the most common token prediction cases roughly right.
I've seen a little bit of this, but nowhere near as much as I think the topic merits. I agree that systematic studies on where and how the reconstruction errors make their effects known might be quite informative.
Basically, whenever people train SAEs, or use some other approximate model decomposition that degrades performance, I think they should ideally spend some time after just playing with the degraded model and talking to it. Figure out in what ways it is worse.
The metric you mention here is probably 'loss recovered'. For a residual stream insertion, it goes
1-(CE loss with SAE- CE loss of original model)/(CE loss if the entire residual stream is ablated-CE loss of original model)
See e.g. equation 5 here.
So, it's a linear scale, and they're comparing the CE loss increase from inserting the SAE to the CE loss increase from just destroying the model and outputting a ≈ uniform distribution over tokens. The latter is a very large CE loss increase, so the denominator is really big. Thus, scoring over 90% is pretty easy.
All current SAEs I'm aware of seem to score very badly on reconstructing the original model's activations.
If you insert a current SOTA SAE into a language model's residual stream, model performance on next token prediction will usually degrade down to what a model trained with less than a tenth or a hundredth of the original model's compute would get. (This is based on extrapolating with Chinchilla scaling curves at optimal compute). And that's for inserting one SAE at one layer. If you want to study circuits of SAE features, you'll have to insert SAEs in multiple layers at the same time, potentially further degrading performance.
I think many people outside of interp don't realize this. Part of the reason they don’t realize it might be that almost all SAE papers report loss reconstruction scores on a linear scale, rather than on a log scale or an LM scaling curve. Going from 1.5 CE loss to 2.0 CE loss is a lot worse than going from 4.5 CE to 5.0 CE. Under the hypothesis that the SAE is capturing some of the model's 'features' and failing to capture others, capturing only 50% or 10% of the features might still only drop the CE loss by a small fraction of a unit.
So, if someone is just glancing at the graphs without looking up what the metrics actually mean, they can be left with the impression that performance is much better than it actually is. The two most common metrics I see are raw CE scores of the model with the SAE inserted, and 'loss recovered'. I think both of these metrics give a wrong sense of scale. 'Loss recovered' is the worse offender, because it makes it outright impossible to tell how good the reconstruction really is without additional information. You need to know what the original model’s loss was and what zero baseline they used to do the conversion. Papers don't always report this, and the numbers can be cumbersome to find even when they do.
I don't know what an actually good way to measure model performance drop from SAE insertion is. The best I've got is to use scaling curves to guess how much compute you'd need to train a model that gets comparable loss, as suggested here. Or maybe alternatively, training with the same number of tokens as the original model, how many parameters you'd need to get comparable loss. Using this measure, the best reported reconstruction score I'm aware of is 0.1 of the original model's performance, reached by OpenAI's GPT-4 SAE with 16 million dictionary elements in this paper.
For most papers, I found it hard to convert their SAE reconstruction scores into this format. So I can't completely exclude the possibility that some other SAE scores much better. But at this point, I'd be quite surprised if anyone had managed so much as 0.5 performance recovered on any model that isn't so tiny and bad it barely has any performance to destroy in the first place. I'd guess most SAEs get something in the range 0.01-0.1 performance recovered or worse.
Note also that getting a good reconstruction score still doesn't necessarily mean the SAE is actually showing something real and useful. If you want perfect reconstruction, you can just use the standard basis of the network. The SAE would probably also need to be much sparser than the original model activations to provide meaningful insights.
Instrumentally, yes. The point is that I don’t really care terminally.
Getting the Hessian eigenvalues does not require calculating the full Hessian. You use Jacobian vector product methods in e.g. JAX. The Hessian itself never has to be explicitly represented in memory.
And even assuming the estimator for the Hessian pseudoinverse is cheap and precise, you'd still need to get its rank anyway, which would by default be just as expensive as getting the rank of the Hessian.
Why would we want or need to do this, instead of just calculating the top/bottom Hessian eigenvalues?
Anything where you fit parametrised functions to data. So, all of these, except maybe FunSearch? I haven't looked into what that actually does, but at a quick google it sounds more like an optimisation method than an architecture. Not sure learning theory will be very useful for thinking about that.
You can think of the learning coefficient as a sort of 'effective parameter count' in a generalised version of the Bayesian Information Criterion. Unlike the BIC, it's also applicable to architectures where many parameter configurations can result in the same function. Like the architectures used in DeepLearning.
This is why models with neural network style architectures can e.g. generalise past the training data even when they have more parameters than training data points. People used to think this made no sense, because they had BIC-based intuitions that said you'd inevitably overfit. But the BIC isn't actually applicable to these architectures. You need the more general form, the WBIC, which has the learning coefficient in the formula in place of the parameter count.
Difference between my model and this flow-chart: I'm hoping that the top branches are actually downstream of LLM reverse-engineering. LLMs do abstract reasoning already, so if you can reverse engineer LLMs, maybe that lets you understand how abstract reasoning works much faster than deriving it yourself.
If that were the intended definition, gradient descent wouldn’t count as an optimiser either. But they clearly do count it, else an optimiser gradient descent produces wouldn’t be a mesa-optimiser.
Gradient descent optimises whatever function you pass it. It doesn’t have a single set function it tries to optimise no matter what argument you call it with. If you don’t pass any valid function, it doesn’t optimise anything.
GPT-4, taken by itself, without a prompt, will optimise pretty much whatever you prompt it to optimise. If you don’t prompt it to optimise something, it usually doesn’t optimise anything.
I guess you could say GPT-4, unlike gradient descent, can do things other than optimise something. But if ever not optimising things excluded you from being an optimiser, humans wouldn’t be considered optimisers either.
So it seems to me that the paper just meant what it said in the quote. If you look through a search space to accomplish an objective, you are, at present, an optimiser.
For example, if you use tanh instead of ReLU the simplicity bias is weaker. How does SLT explain/predict this?
It doesn't. It just has neat language to talk about how the simplicity bias is reflected in the way the loss landscape of ReLU vs. tanh look different. It doesn't let you predict ahead of checking that the ReLU loss landscape will look better.
Maybe you meant that SLT predicts that good generalization occurs when an architecture's preferred complexity matches the target function's complexity?
That is closer to what I meant, but it isn't quite what SLT says. The architecture doesn't need to be biased toward the target function's complexity. It just needs to always prefer simpler fits to more complex ones.
SLT says neural network training works because in a good nn architecture simple solutions take up exponentially more space in the loss landscape. So if you can fit the target function on the training data with a fit of complexity 1, that's the fit you'll get. If there is no function with complexity 1 that matches the data, you'll get a fit with complexity 2 instead. If there is no fit like that either, you'll get complexity 3. And so on.
This is true of all neural nets, but the neural redshift paper claims that specific architectural decisions beat picking random points in the loss landscape. Neural redshift could be true in worlds where the SLT prediction was either true or false.
Sorry, I don't understand what you mean here. The paper takes different architectures and compares what functions you get if you pick a point at random from their parameter spaces, right?
If you mean this
But unlike common wisdom, NNs do not have an inherent “simplicity bias”. This property depends on components such as ReLUs, residual connections, and layer normalizations.
then that claim is of course true. Making up architectures with bad inductive biases is easy, and I don't think common wisdom thinks otherwise.
We knew this, but the neural redshift paper claims that the simplicity bias is unrelated to training.
Sure, but for the question of whether mesa-optimisers will be selected for, why would it matter if the simplicity bias came from the updating rule instead of the architecture?
The paper doesn't just show a simplicity bias, it shows a bias for functions of a particular complexity that is simpler than random. To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
What would a 'simplicity bias' be other than a bias towards things simpler than random in whatever space we are referring to? 'Simpler than random' is what people mean when they talk about simplicity biases.
To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
What do you mean by 'similar complexity to the training set'? The message length of the training set is very likely going to be much longer than the message length of many mesa-optimisers, but that seems like an argument for mesa-optimiser selection if anything.
Though I hasten to add that SLT doesn't actually say training prefers solutions with low K-complexity. A bias towards low learning coefficients seems to shake out in some sort of mix between a bias toward low K-complecity, and a bias towards speed.
Current LLMs are trivially mesa-optimisers under the original definition of that term.
I don't get why people are still debating the question of whether future AIs are going to be mesa-optimisers. Unless I've missed something about the definition of the term, lots of current AI systems are mesa-optimisers. There were mesa-opimisers around before Risks from Learned Optimization in Advanced Machine Learning Systems was even published.
We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
....
Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.
GPT-4 is capable of making plans to achieve objectives if you prompt it to. It can even write code to find the local optimum of a function, or code to train another neural network, making it a mesa-meta-optimiser. If gradient descent is an optimiser, then GPT-4 certainly is.
Being a mesa-optimiser is just not a very strong condition. Any pre-transformer ml paper that tried to train neural networks to find better neural network training algorithms was making mesa-optimisers. It is very mundane and expected for reasonably general AIs to be mesa-optimisers. Any program that can solve even somewhat general problems is going to have a hard time not meeting the definition of an optimiser.
Maybe this is some sort of linguistic drift at work, where 'mesa-optimiser' has come to refer specifically to a sysytem that is only an optimiser, with one single set of objectives it will always try to accomplish in any situation. Fine.
The result of this imprecise use of the original term though, as I perceive it, is that people are still debating and researching whether future AI's might start being mesa-optimisers, as if that was relevant to the will-they-kill-us-all question. But, at least sometimes, what they seem to actually concretely debate and research is whether future AIs might possibly start looking through search spaces to accomplish objectives, as if that wasn't a thing current systems obviously already do.
Singular Learning Theory explains/predicts this. If you go to a random point in the loss landscape, you very likely land in a large region implementing the same behaviour, meaning the network has a small effective parameter count. Just because most of the loss landscape is taken up by the biggest, and thus simplest, behavioural regions.
You can see this happening if you watch proxies for the effective parameter count while models train. E.g. a modular addition transformer or MNIST MLP start out with very few effective parameters at initialisation, then gain more as the network trains. If the network goes through a grokking transition, you can watch the effective parameter count go down again.
For example: If this proposition were true and a significant decider of generalization ability, would this make mesaoptimization less likely? More likely?
≈ no change I'd say. We already knew neural network training had a bias towards algorithmic simplicity of some kind, because otherwise it wouldn't work. So we knew general algorithms, like mesa-optimisers, would be preferred over memorised solutions that don't generalise out of distribution. SLT just tells us how that works.
One takeaway might be that observations about how biological brains train are more applicable to AI training than one might have previously thought. Previously, you could've figured that since AIs use variants of gradient descent as their updating algorithm, while the brain uses we-don't-even-know-what, their inductive biases could be completely different.
Now, it's looking like the updating rule you use doesn't actually matter that much for determining the inductive bias. Anything in a wide class of local optimisation methods might give you pretty similar stuff. Some methods are a lot more efficient than others, but the real pixie fairy dust that makes any of this possible is in the architecture, not the updating rule.
(Obviously, it still matters what loss signal you use. You can't just expect that an AI will converge to learn the same desires a human brain would, unless the AI's training signals are similar to those used by the human brain. And we don't know what most of the brain's training signals are.)
This sounds cool and deep but crashes headlong into the issue that the entropy rate and the excess entropy of any stochastic process is time-symmetric.
It's time symmetric around a starting point of low entropy. The further is from , the more entropy you'll have, in either direction. The absolute value is what matters.
In this case, is usually taken to be the big bang. So the further in time you are from the big bang, the less the universe is like a dense uniform soup with little structure that needs description, and the higher your entropy will be. That's how you get the subjective perception of temporal causality.
Presumably, this would hold to the other side of as well, if there is one. But we can't extrapolate past , because close to everything gets really really energy dense, so we'd need to know how to do quantum gravity to calculate what the state on the other side might look like. So we can't check that. And the notion of time as we're discussing it here might break down at those energies anyway.
Toy example of what I would consider pretty clear-cut cross-layer superposition:
We have a residual MLP network. The network implements a single UAND gate (universal AND, calculating the pairwise ANDs of sparse boolean input features using only neurons), as described in Section 3 here.
However, instead of implementing this with a single MLP, the network does this using all the MLPs of all the layers in combination. Simple construction that achieves this:
- Cut the residual stream into two subspaces, reserving one subspace for the input features and one subspace for the output features.
- Take the construction from the paper, and assign each neuron in it to a random MLP layer in the residual network.
- Since the input and output spaces are orthogonal, there's no possibility of one MLP's outputs interfering with another MLP's inputs. So this network will implement UAND, as if all the neurons lived in a single large MLP layer.
Now we've made a network that computes boolean circuits in superposition, without the boolean gates living in any particular MLP. To read out the value of one of the circuit outputs before it shows up in the residual stream, you'll need to look at a direction that's a linear combination of neurons in all of the MLPs. And if you use an SAE to look at a single residual stream position in this network before the very final MLP layer, it'll probably show you a bunch of half-computed nonsense.
In a real network, the most convincing evidence to me would be a circuit involving sparse coded variables or operations that cannot be localized to any single MLP.
A prior that doesn't assume independence should give you a sparsity penalty that isn't a sum of independent penalties for each activation.
Would you predict that SAE features corresponding to input tokens would have low FT-LLCs, since there's no upstream circuits needed to compute them?
It's not immediately obvious to me that we'd expect random directions to have lower FT-LLCs than 'feature directions', actually. If my random read-off direction is a sum of many features belonging to different circuits, breaking any one of those circuits may change the activations of that random read-off. Whereas an output variable of a single circuit might stay intact so long as that specific circuit is preserved.
Have you also tried this in some toy settings where you know what FT-LLCs you should get out? Something where you'd be able to work out in advance on paper how much the FT-LLC along some direction should roughly differ from another direction ?
Asking because last time I had a look at these numeric LLC samplers, they didn't exactly seem reliable yet, to put it mildly. The numbers they spit out seemed obviously nonsense in some cases. About the most positive thing you could say about them was that they at least appeared to get the ordering of LLC values between different networks right. In a few test cases. But that's not exactly a ringing endorsement. Just counting Hessian zero eigenvalues can often do that too. That was a while ago though.
I think this is particularly incorrect for alignment, relative to a more typical STEM research field. Alignment is very young[1]. There's a lot less existing work worth reading than in a field like, say, lattice quantum field theory. Due to this, the time investment required to start contributing at the research frontier is very low, relatively speaking.
This is definitely changing. There's a lot more useful work than there was when I started dipping my toe into alignment three years ago. But compared to something like particle physics, it's still very little.
- ^
In terms of # total smart people hours invested
The reason I often bring up human evolution is because that's our only example of an outer optimization loop producing an inner general intelligence
There's also human baby brains training minds from something close to random initialisation at birth into a general intelligence. That example is plausibly a lot closer to how we might expect AGI training to go, because human brains are neural nets too and presumably have strictly-singular flavoured learning dynamics just like our artificial neural networks do. Whereas evolution acts on genes, which to my knowledge don't have neat NN-style loss landscapes heavily biased towards simplicity.
Evolution is more like if people used classic genetic optimisation to blindly find neural network architectures, optimisers, training losses, and initialisation schemes, that are in turn evaluated by actually training the networks.
Not that I think this ultimately ends up weakening Doomimir's point all that much. Humans don't seem to end up with terminal goals that are straightforward copies of the reward circuits pre-wired into our brains either. I sure don't care much about predicting sensory inputs super accurately, which was probably a very big part of the training signal that build my mind.
Many people in interpretability currently seem interested in ideas like enumerative safety, where you describe every part of a neural network to ensure all the parts look safe. Those people often also talk about a fundamental trade-off in interpretability between the completeness and precision of an explanation for a neural network's behavior and its description length.
I feel like, at the moment, these sorts of considerations are all premature and beside the point.
I don't understand how GPT-4 can talk. Not in the sense that I don't have an accurate, human-intuitive description of every part of GPT-4 that contributes to it talking well. My confusion is more fundamental than that. I don't understand how GPT-4 can talk the way a 17th-century scholar wouldn't understand how a Toyota Corolla can move. I have no gears-level model for how anything like this could be done at all. I don't want a description of every single plate and cable in a Toyota Corolla, and I'm not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.
What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don't have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself, without a numeric optimizer as an intermediary, that would be able to talk.
I kind of expect that things-people-call-their-values-that-are-not-their-revealed-preferences would be a concept that a smart AI that predicts systems coupled to humans would think in as well. It doesn't matter whether these stated values are 'incoherent' in the sense of not being in tune with actual human behavior, they're useful for modelling humans because humans use them to model themselves, and these self-models couple to their behavior. Even if they don't couple in the sense of being the revealed-preferences in an agentic model of the humans' actions.
Every time a human tries and mostly fails to explain what things they'd like to value if only they were more internally coherent and thought harder about things, a predictor trying to forecast their words and future downstream actions has a much easier time of it if they have a crisp operationalization of the endpoint the human is failing to operationalize.
An analogy: If you're trying to predict what sorts of errors a diverse range of students might make while trying to solve a math problem, it helps to know what the correct answer is. Or if there isn't a single correct answer, what the space of valid answers looks like.
Corrigibility and actual human values are both heavily reflective concepts. If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment -- which of course most people can't do because they project the category boundary onto the environment
Actual human values depend on human internals, but predictions about systems that strongly couple to human behavior depend on human internals as well. I thus expect efficient representations of systems that strongly couple to human behavior to include human values as somewhat explicit variables. I expect this because humans seem agent-like enough that modeling them as trying to optimize for some set of goals is a computationally efficient heuristic in the toolbox for predicting humans.
At lower confidence, I also think human expected-value-trajectory-under-additional-somewhat-coherent-reflection would show up explicitly in the thoughts of AIs that try to predict systems strongly coupled to humans. I think this because humans seem to change their values enough over time in a sufficiently coherent fashion that this is a useful concept to have. E.g., when watching my cousin grow up, I find it useful and possible to have a notion in advance of what they will come to value when they are older and think more about what they want.
I do not think there is much reason by default for the representations of these human values and human value trajectories to be particularly related to the AI's values in a way we like. But that they are in there at all sure seems like it'd make some research easier, compared to the counterfactual. For example, if you figure out how to do good interpretability, you can look into an AI and get a decent mathematical representation of human values and value trajectories out of it. This seems like a generally useful thing to have.
If you separately happen to have developed a way to point AIs at particular goals, perhaps also downstream of you having figured out how to do good interpretability[1], then having explicit access to a decent representation of human values and human expected-value-trajectories-under-additional-somewhat-coherent-reflection might be a good starting point for research on making superhuman AIs that won't kill everyone.
- ^
By 'good interpretability', I don't necessarily mean interpretability at the level where we understand a forward pass of GPT-4 so well that we can code our own superior LLM by hand in Python like a GOFAI. It might need to be better interpretability than that. This is because an AI's goals, by default, don't need to be explicitly represented objects within the parameter structure of a single forward pass.
But here we are, and the idea of the USA govt nationalizing OpenAI seems a million miles outside the Overton window.
Registering that it does not seem that far out the Overton window to me anymore. My own advance prediction of how much governments would be flipping out around this capability level has certainly been proven a big underestimate.
If there's a legal ceiling on AI capabilities, that reduces the short term economic incentive to improve algorithms. If improving algorithms gets you categorised as uncool at parties, that might also reduce the short term incentive to improve algorithms.
It is thus somewhat plausible to me that an enforced legal limit on AI capabilities backed by high-status-cool-party-attending-public opinion would slow down algorithmic progress significantly.