But they'd be too unchanged: the "afraid of mice" circuit would still be checking for "grey and big and mammal and ..." as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for "grey and big and mammal and ... and high-scrabble-scoring". Any interpretability tool that told you that "grey and big and mammal and ..." was "elephant" in the first model is now going to have difficulty representing the situation.
Thank you, this is a good example of a type-of-thing to watch out for in circuit interpretation. I had not thought of this before. I agree that an interpretability tool that rounded those two circuits off to taking in the 'same' feature would be a bad interpretability tool. It should just show you that those two circuits exist, and have some one dimensional features they care about, and those features are related but non-trivially distinct.
But this is not at all unique to the sort of model used in the counterexample. A 'normal' model can still have one embedding direction $v_{\text{elephant}}$ for elephant at one point, used by a circuit $C_1$, then in fine-tuning switch to a slightly different embedding direction $v'_{\text{elephant}}$. Maybe it learned more features in fine-tuning, some of those features are correlated with elephants and ended up a bit too close in cosine similarity to $v_{\text{elephant}}$, and so interference can be lowered by moving the embedding around a bit. A circuit $C_2$ learned in fine-tuning would then be reading from this $v'_{\text{elephant}}$ and not match $C_1$, which is still reading in $v_{\text{elephant}}$. You might argue that $C_1$ will surely want to adjust to start using $v'_{\text{elephant}}$ as well to lower the loss, but that would seem to apply equally well to your example. So I don't see how this is showing that the model used in the original counterexample has no notion of an elephant in a sense that does not also apply to the sort of models people might tend to imagine when they think in the conventional SDL paradigm.
EDIT: On a second read, I think I misunderstood you here. You seem to think the crucial difference is that the delta between $v_{\text{elephant}}$ and $v'_{\text{elephant}}$ is mostly 'unstructured', whereas the difference between "grey and big and mammal and ..." and "grey and big and mammal and ... and high-scrabble-scoring" is structured. I don't see why that should matter though. So long as our hypothetical interpretability tool is precise enough to notice the size of the discrepancy between those features and not throw them into the same pot, we should be fine. For that, it wouldn't seem to me to really matter much whether the discrepancy is 'meaningful' to the model or not.
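A tiny numpy sketch of the embedding-drift picture above (the dimension, drift size, and directions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
v_elephant = rng.normal(size=d)
v_elephant /= np.linalg.norm(v_elephant)

# Fine-tuning nudges the embedding a little, e.g. to reduce interference with new features.
delta = rng.normal(size=d)
delta /= np.linalg.norm(delta)
v_elephant_ft = v_elephant + 0.15 * delta
v_elephant_ft /= np.linalg.norm(v_elephant_ft)

# An old circuit reads the old direction, a circuit learned in fine-tuning reads the new one:
# the two read-off directions are closely related but not identical.
print(v_elephant @ v_elephant_ft)   # ~0.99
```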
I'm with @chanind: If elephant is fully represented by a sum of its attributes, then it's quite reasonable to say that the model has no fundamental notion of an elephant in that representation.
...
This is not a load-bearing detail of the example. If you like, you can instead imagine a model that embeds 1000 animals in an e.g. 100-dimensional subspace, with a 50-dimensional sub-sub-space where the embedding directions correspond to 50 attributes, and a 50-dimensional sub-sub-space where the embeddings are just random.
This should still get you basically the same issues the original example did I think? For any dictionary decomposition of the activations you pick, some of the circuits will end up looking like a horrible mess, even though they're secretly taking in a very low-rank subspace of the activations that'd make sense to us if we looked at it. I should probably double check that when I'm more awake though.[1]
I think the central issue here is mostly just having some kind of non-random, 'meaningful' feature embedding geometry that the circuits care about, instead of random feature embeddings.
- ^
EDIT: I am now more awake. I still think this is right.
The kind of 'alignment technique' that successfully points a dumb model in the rough direction of doing the task you want in early training does not necessarily straightforwardly connect to the kind of 'alignment technique' that will keep a model pointed quite precisely in the direction you want after it gets smart and self-reflective.
For a maybe not-so-great example, human RL reward signals in the brain used to successfully train and aim human cognition from infancy to point at reproductive fitness. Before the distributional shift, our brains usually neither got completely stuck in reward-hack loops, nor used their cognitive labour for something completely unrelated to reproductive fitness. After the distributional shift, our brains still don't get stuck in reward-hack loops that much and we successfully train to intelligent adulthood. But the alignment with reproductive fitness is gone, or at least far weaker.
How much money would you guess was lost on this?
Yes.
Technically you didn't specify that $f_i(x)$ can't be an arbitrary function, so you'd be able to reconstruct activations combining different bases, but it'd be horribly convoluted in practice.
I wouldn't even be too fussed about 'horribly convoluted' here. I'm saying it's worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.
The problem here isn't that we can't make a dictionary that includes all the feature directions as dictionary elements. We can do that. For example, while we can't write
$\vec{a}(x) = \sum_{i=1}^{1000} f_{\text{animal},i}(x)\,\vec{v}_{\text{animal},i} + \sum_{j=1}^{50} f_{\text{attribute},j}(x)\,\vec{v}_{\text{attribute},j}$
because those sums each already equal $\vec{a}(x)$ on their own, we can write
$\vec{a}(x) = \frac{1}{2}\sum_{i=1}^{1000} f_{\text{animal},i}(x)\,\vec{v}_{\text{animal},i} + \frac{1}{2}\sum_{j=1}^{50} f_{\text{attribute},j}(x)\,\vec{v}_{\text{attribute},j}$.
The problem is instead that we can't make a dictionary that has the feature activations as the coefficients in the dictionary. This is bad because it means our dictionary activations cannot equal the scalar variables the model's own circuits actually care about. They cannot equal the 'features of the model' in the sense defined at the start, the scalar features comprising its ontology. As a result, if we were to look at a causal graph of the model, using the half-size dictionary feature activations we picked as the graph nodes, a circuit taking in the feature $f_{\text{elephant}}(x)$ through a linear read-off along the direction $\vec{v}_{\text{elephant}}$ would have edges in our graph connecting it to both the elephant direction, making up about 50% of the total contribution, and the fifty attribute directions, making up the remaining 50%. Same the other way around, any circuit reading in even a single attribute feature will have edges connecting to all of the animal features[1], making up 50% of the total contribution. It's the worst of both worlds. Every circuit looks like a mess now. (A small numerical sketch of this is below.)
- ^
Since the animals are sparse, in practice this usually means edges to a small set of different animals for every data point. Whichever ones happen to be active at the time.
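A small numpy sketch of the 50/50 accounting problem described above (exactly orthonormal attributes and a random unit 'elephant' vector are simplifying assumptions of this sketch, not of the original example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_animals = 50, 1000
V_attr = np.eye(d)                                       # attribute directions
A = rng.normal(size=(n_animals, d))
V_animal = A / np.linalg.norm(A, axis=1, keepdims=True)  # animal directions in the attribute span

f_elephant = 1.0
a = f_elephant * V_animal[0]                      # data point: only 'elephant' active

# Half-size dictionary: all 1050 directions, each with half the true feature value.
animal_coeffs = np.zeros(n_animals); animal_coeffs[0] = 0.5 * f_elephant
attr_coeffs = 0.5 * (V_attr @ a)
recon = animal_coeffs @ V_animal + attr_coeffs @ V_attr
print(np.allclose(recon, a))                      # True: the dictionary reconstructs a

print(animal_coeffs[0], f_elephant)               # 0.5 vs 1.0: coefficient != feature value

# A circuit reading the elephant feature along V_animal[0]: its input, attributed to the
# graph nodes, splits roughly 50/50 between the elephant node and the attribute nodes.
print(animal_coeffs[0] * (V_animal[0] @ V_animal[0]), attr_coeffs @ (V_attr @ V_animal[0]))
```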
E.g. it's not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction.
You can't represent elephants along with arbitrary combinations of attributes. You can't do that in a scheme where feature directions are fully random with no geometry either though. There, only a small number of features can have non-zero values at the same time, so you still only get on the order of fifty non-zero attribute features at once maximum.[1]
We would want the dictionary to learn the attributes, not arbitrary combinations of attributes, since these are the true "base units" that can vary freely.
You can call them the "base units" if you like. But that won't change the fact that some directions in the space spanned by those "base units" are special, with associated circuits that care about those directions in particular, and understanding or even recognising those circuits in a causal graph made of the "base units" will be pretty darned hard. For the same reason trying to understand the network in the neuron basis is hard.
Put another way, there's no way to represent an "elephant" in this scheme without also attaching attributes to it.
Yes.
Likewise, it's not possible to differentiate between an elephant with the set of attributes x y and z and a rabbit with identical attributes x y and z, since the sum of attributes are what you're calling an elephant or rabbit.
Not quite. You cannot specify a rabbit and simultaneously specify the rabbit having arbitrary numerical attribute values for attributes differing from normal rabbits. You can have a rabbit, and some attributes treated as sparse boolean-ish features at the same time. E.g. $\vec{a} = \vec{v}_{\text{rabbit}} + \vec{v}_{\text{cute}}$ works. Circuits downstream that store facts about rabbits will still be triggered by this $\vec{a}$. Circuits downstream that do something with the 'cute' attribute will be reading in a 'cute'-attribute value of $1$ plus the 'cute'-coefficient of rabbits.
A consequence of this is that 'cute rabbit' is a bit cuter than either 'cute' or 'rabbit' on their own. But that doesn't seem particularly strange to me. Associations in my own mind sure seem to work like that.
- ^
Less, if you want to be able to perform computation in superposition.
Similarly, for people wanting to argue from the other direction, who might think a low current valuation is case-closed evidence against their success chances
To be clear: I think the investors would be wrong to think that AGI/ASI soon-ish isn't pretty likely.
OpenAI's valuation is very much reliant on being on a path to AGI in the not-too-distant future.
Really? I'm mostly ignorant on such matters, but I'd thought that their valuation seemed comically low compared to what I'd expect if their investors thought that OpenAI was likely to create anything close to a general superhuman AI system in the near future.[1] I considered this evidence that they think all the AGI/ASI talk is just marketing.
- ^
Well ok, if they actually thought OpenAI would create superintelligence as I think of it, their valuation would plummet because giving people money to kill you with is dumb. But there's this space in between total obliviousness and alarm, occupied by a few actually earnest AI optimists. And, it seems to me, not occupied by the big OpenAI investors.
If I understand correctly, it sounds like you're saying there is a "label" direction for each animal that's separate from each of the attributes.
No, the animal vectors are all fully spanned by the fifty attribute features.
I'm confused why a dictionary that consists of a feature direction for each attribute and each animal label can't explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal.
The animal features are sparse. The attribute features are not sparse.[1]
In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.
The magnitudes in a dictionary seeking to decompose the activation vector into these 1050 features will not be able to match the actual magnitudes of the features as seen by linear probes and the network's own circuits.
Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components?
No, that is not the idea.
- ^
Relative to the animal features at least. They could still be sparse relative to the rest of the network if this 50-dimensional animal subspace is rarely used.
'elephant' would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of $1/\sqrt{50}$, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, 'elephant' and 'tiny' would be expected to have read-off interference on the order of $1/\sqrt{50}$. Alternatively, you could instead encode a new animal 'tiny elephant' as its own point in the fifty-dimensional space. Those are actually distinct things here. If this is confusing, maybe it helps to imagine that the name for 'tiny elephant' is 'exampledon', and exampledons just happen to look like tiny elephants.
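Roughly, the scaling works out as follows (assuming unit-norm animal vectors and orthonormal attribute directions):

```latex
\vec{v}_{\text{elephant}} = \sum_{j=1}^{50} c_j\,\vec{v}_{\text{attr},j},
\qquad \sum_j c_j^2 = 1 \;\Rightarrow\; |c_j| \sim \tfrac{1}{\sqrt{50}},
\qquad
\langle \vec{v}_{\text{tiny}}, \vec{v}_{\text{elephant}} \rangle = c_{\text{tiny}} \sim \tfrac{1}{\sqrt{50}} \approx 0.14.
```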
E.g. the concept of a "furry elephant" or a "tiny elephant" would be unrepresentable in this scheme
It's representable. E.g. the model can learn a circuit reading in a direction that is equal to the sum of the furry attribute direction and the elephant direction, or the tiny direction and the elephant direction respectively. This circuit can then store facts about furry elephants or tiny elephants.
I feel like in this scheme, it's not really the case that there's 1000 animal directions, since the base unit is the attributes
In what sense? If you represent the network computations in terms of the attribute features, you will get a very complicated computational graph with lots of interaction lines going all over the place. So clearly, the attributes on their own are not a very good basis for understanding the network.
Similarly, you can always represent any neural network in the standard basis of the network architecture. Trivially, all features can be seen as mere combinations of these architectural 'base units'. But if you try to understand what the network is doing in terms of interactions in the standard basis, you won't get very far.
For there to be a true "elephant" direction, then it should be possible to have any set of arbitrary attributes attached to an elephant (small, furry, pink, etc...), and this would require that there is a "label" direction that indicates "elephant" that's mostly orthogonal to every other feature so it can be queried uniquely via projection.
The 'elephant' feature in this setting is mostly-orthogonal to every other feature in the ontology, including the features that are attributes. So it can be read out with a linear projection. 'elephant' and 'pink' shouldn't have substantially higher cosine similarity than 'elephant' and 'parrot'.
you mean it does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Yes.
I don't think I am very good at explaining my thoughts on this in text. Some prior writings that have informed my models here are the MIRI dialogues, and the beginning parts of Steven Byrnes' sequence on brain-like AGI, which sketch how the loss functions human minds train on might look and gave me an example apart from evolution to think about.
Some scattered points that may or may not be of use:
- There is something here about path dependence. Late in training at high capability levels, very many things the system might want are compatible with scoring very well on the loss, because the system realises that doing things that score well on the loss is instrumentally useful. Thus, while many aspects of how the system thinks are maybe nailed down quite definitively and robustly by the environment, what it wants does not seem nailed down in this same robust way. Desires thus seem like they can be very chaotically dependent on dynamics in early training, what the system reflected on when, which heuristics it learned in what order, and other low level details like this that are very hard to precisely control.
- I feel like there is something here about our imaginations, or at least mine, privileging the hypothesis. When I imagine an AI trained to say things a human observer would rate as 'nice', and to not say things a human observer rates as 'not nice', my imagination finds it natural to suppose that this AI will generalise to wanting to be a nice person. But when I imagine an AI trained to respond in English, rather than French or some other language, I do not jump to supposing that this AI will generalise to terminally valuing the English language.
Every training signal we expose the AI to reinforces very many behaviours at the same time. The human raters that may think they are training the AI to be nice are also training it to respond in English (because the raters speak English), to respond to queries at all instead of ignoring them, to respond in English that is grammatically correct enough to be understandable, and a bunch of other things. The AI is learning things related to 'niceness', 'English grammar' and 'responsiveness' all at the same time. Why would it generalise in a way that entangles its values with one of these concepts, but not the others?
What makes us single out the circuits responsible for giving nice answers to queries as special, as likely to be part of the circuit ensemble that will cohere into the AI's desires when it is smarter? Why not circuits for grammar or circuits for writing in the style of 1840s poets or circuits for research taste in geology?
We may instinctively think of our constitution that specifies $x$ as equivalent to some sort of monosemantic $x$-reinforcing training signal. But it really isn't. The concept of $x$ sticks out to us when we look at the text of the constitution, because the presence of concept $x$ is a thing that makes this text different from a generic text. But the constitution, and even more so any training signal based on the constitution, will by necessity be entangled with many concepts besides just $x$, and the training will reinforce those concepts as well. Why then suppose that the AI's nascent shards of value are latching on to $x$, but are not in the same way latching on to all the other stuff its many training signals are entangled with?
It seems to me that there is no good reason to suppose this. Niceness is part of my values, so when I see it in the training signal I find it natural to imagine that the AI's values would latch on to it. But I do not as readily register all the other concepts in the training signal the AI's values might latch on to, because to my brain that does not value these things, they do not seem value-related.
- There is something here about phase changes under reflection. If the AI gets to the point of thinking about itself and its own desires, the many shards of value it may have accumulated up to this point are going to amalgamate into something that may be related to each of the shards, but not necessarily in a straightforwardly human-intuitive way. For example, sometimes humans that have value shards related to empathy reflect on themselves, and emerge being negative utilitarians that want to kill everyone. For another example, sometimes humans reflect on themselves and seem to decide that they don't like the goals they have been working towards, and they'd rather work towards different goals and be different people. There, the relationship between values pre-reflection and post-reflection can be so complicated that it can seem to an outside observer and the person themselves like they just switched values non-deterministically, by a magical act of free will. So it's not enough to get some value shards that are kind of vaguely related to human values into the AI early in training. You may need to get many or all of the shards to be more than just vaguely right, and you need the reflection process to proceed in just the right way.
Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can't get the dictionary activations to equal the feature activations $f_{\text{animal},i}(x)$, $f_{\text{attribute},j}(x)$.
I did not know about this already.
For the same reasons training an agent on a constitution that says to care about $x$ does not, at arbitrary capability levels, produce an agent that cares about $x$.
If you think that doing this does produce an agent that cares about $x$ even at arbitrary capability levels, then I guess in your world model it would indeed be consistent for that to work for inducing corrigibility as well.
The features a model thinks in do not need to form a basis or dictionary for its activations.
Three assumptions people in interpretability often make about the features that comprise a model’s ontology:
- Features are one-dimensional variables.
- Meaning, the value of feature $i$ on data point $x$ can be represented by some scalar number $f_i(x)$.
- Features are ‘linearly represented’.
- Meaning, the value of feature $i$ can be read off[1] from the model's activations by a linear projection onto an associated feature direction[2] $\vec{v}_i$.
- Features form a 'basis' for activation space.[3]
- Meaning, the model’s activations $\vec{a}(x)$ at a given layer can be decomposed into a sum over all the features of the model represented in that layer[4]: $\vec{a}(x) = \sum_i f_i(x)\,\vec{v}_i$.
It seems to me that a lot of people are not tracking that 3) is an extra assumption they are making. I think they think that assumption 3) is a natural consequence of assumptions 1) and 2), or even just of assumption 2) alone. It's not.
Counterexample
Model setup
Suppose we have a language model that has a thousand sparsely activating scalar, linearly represented features for different animals. So, "elephant", "giraffe", "parrot", and so on, all with their own associated feature directions $\vec{v}_{\text{elephant}}, \vec{v}_{\text{giraffe}}, \vec{v}_{\text{parrot}}, \dots$. The model embeds those one thousand animal features in a fifty-dimensional sub-space of the activations. This subspace has a meaningful geometry: It is spanned by a set of fifty directions corresponding to different attributes animals have. Things like “furriness”, “size”, “length of tail” and such. So, each animal feature can equivalently be seen as either one of a thousand sparsely activating scalar features, or just as a particular setting of those fifty not-so-sparse scalar attributes.
Some circuits in the model act on the animal directions $\vec{v}_{\text{animal},i}$. E.g. they have query-key lookups for various facts about elephants and parrots. Other circuits in the model act on the attribute directions $\vec{v}_{\text{attribute},j}$. They’re involved in implementing logic like ‘if there’s a furry animal in the room, people with allergies might have problems’. Sometimes they’re involved in circuits that have nothing to do with animals whatsoever. The model’s "size" attribute is the same one used for houses and economies for example, so that direction might be read-in to a circuit storing some fact about economic growth.
So, both the one thousand animal features and the fifty attribute features are elements of the model’s ontology, variables along which small parts of its cognition are structured. But we can’t make a basis for the model activations out of those one thousand and fifty features of the model. We can write either $\vec{a}(x) = \sum_{i=1}^{1000} f_{\text{animal},i}(x)\,\vec{v}_{\text{animal},i}$, or $\vec{a}(x) = \sum_{j=1}^{50} f_{\text{attribute},j}(x)\,\vec{v}_{\text{attribute},j}$. But $\sum_{i} f_{\text{animal},i}(x)\,\vec{v}_{\text{animal},i} + \sum_{j} f_{\text{attribute},j}(x)\,\vec{v}_{\text{attribute},j}$ does not equal the model activation vector $\vec{a}(x)$, it’s too large.
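A minimal numpy sketch of this accounting. The random animal coefficients here are just for illustration; the post's animal directions are meaningful settings of the attributes, but any unit vectors in the attribute span show the same problem:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_animals = 50, 1000

V_attr = np.eye(d)                                       # 50 attribute directions (assumed orthonormal)
C = rng.normal(size=(n_animals, d))
V_animal = C / np.linalg.norm(C, axis=1, keepdims=True)  # 1000 animal directions in their span

# Activation on a data point where only the 'elephant' feature (animal 0) is active, value 1:
a = 1.0 * V_animal[0]

# Decomposition in terms of animal features (only f_elephant = 1 is non-zero):
a_from_animals = 1.0 * V_animal[0]
# Decomposition in terms of attribute features (f_attr_j = <v_attr_j, a>):
f_attr = V_attr @ a
a_from_attrs = f_attr @ V_attr

print(np.allclose(a_from_animals, a), np.allclose(a_from_attrs, a))   # True True
# But summing both decompositions with the true feature values gives 2a, not a:
print(np.allclose(a_from_animals + a_from_attrs, 2 * a))              # True
```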
Doing interp on this model
Say we choose the one thousand animal directions $\{\vec{v}_{\text{animal},i}\}$ as our basis for this subspace of the example model's activations, and then go on to make a causal graph of the model’s computation, with each basis element being a node in the graph, and lines between nodes representing connections. Then the circuits dealing with query-key lookups for animal facts will look neat and understandable at a glance, with few connections and clear logic. But the circuits involving the attributes will look like a mess. A circuit reading in the size direction will have a thousand small but collectively significant connections to all of the animals.
If we choose the fifty attribute directions $\{\vec{v}_{\text{attribute},j}\}$ as our basis for the graph instead, circuits that act on some of the fifty attributes will look simple and sensible, but now the circuits storing animal facts will look like a mess. A circuit implementing "space" AND "cat" => [increase association with rainbows] is going to have fifty connections to features like “size” and “furriness”.
The model’s ontology does not correspond to either the animal basis or the attribute basis. It just does not correspond to any basis of activation space at all, not even in a loose sense. Different circuits in the model can just process the activations in different bases, and they are under no obligation to agree with each other. Not even if they are situated right next to each other, in the same model layer.
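Continuing the same toy construction (repeated here so the snippet is self-contained), the edge counts in the two candidate graph bases come out as described:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_animals = 50, 1000
V_attr = np.eye(d)
C = rng.normal(size=(n_animals, d))
V_animal = C / np.linalg.norm(C, axis=1, keepdims=True)

v_size = V_attr[0]          # direction read by a circuit that cares about 'size'
v_elephant = V_animal[0]    # direction read by a circuit storing elephant facts

# Nodes = the 1000 animal directions: the 'size' circuit connects weakly to all of them.
print(np.abs(V_animal @ v_size).mean(), 1 / np.sqrt(d))    # ~0.11 vs 0.14, 1000 edges

# Nodes = the 50 attribute directions: the elephant-facts circuit connects to all fifty.
print(np.abs(V_attr @ v_elephant).mean(), 1 / np.sqrt(d))  # ~0.11 vs 0.14, 50 edges
```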
Note that for all of this, we have not broken assumption 1) or assumption 2). The features this model makes use of are all linearly represented and scalar. We also haven’t broken the secret assumption 0) I left out at the start, that the model can be meaningfully said to have an ontology comprised of elementary features at all.
Takeaways
I’ve seen people call out assumptions 1) and 2), and at least think about how we can test whether they hold, and how we might need to adjust our interpretability techniques if and when they don't hold. I have not seen people do this for assumption 3). Though I might just have missed it, of course.
My current dumb guess is that assumption 2) is mostly correct, but assumptions 1) and 3) are both incorrect.
The reason I think assumption 3) is incorrect is that the counterexample I sketched here seems to me like it'd be very common. LLMs seem to be made of lots of circuits. Why would these circuits all share a basis? They don't seem to me to have much reason to.
I think a way we might find the model’s features without assumption 3) is to focus on the circuits and computations first. Try to directly decompose the model weights or layer transitions into separate, simple circuits, then infer the model’s features from looking at the directions those circuits read and write to. In the counterexample above, this would have shown us both the animal features and the attribute features.
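For a cartoon of what 'infer the features from the directions the circuits read and write to' could look like in the simplest possible case, here is a single rank-1 circuit whose read/write directions fall out of its weights via an SVD. This is just an illustrative sketch, not a method proposed in the post, and the directions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
v_read = rng.normal(size=d_model);  v_read /= np.linalg.norm(v_read)    # e.g. 'elephant'
v_write = rng.normal(size=d_model); v_write /= np.linalg.norm(v_write)  # e.g. some fact direction
W = np.outer(v_write, v_read)     # rank-1 circuit: W @ a = v_write * <v_read, a>

# Decompose the weights first, then read the feature directions off the circuit:
U, S, Vt = np.linalg.svd(W)
read_direction, write_direction = Vt[0], U[:, 0]
print(abs(read_direction @ v_read), abs(write_direction @ v_write))   # both ~1.0 (up to sign)
```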
- ^
Potentially up to some small noise. For a nice operationalisation, see definition 2 on page 3 of this paper.
- ^
It's a vector because we've already assumed that features are all scalar. If a feature was two-dimensional instead, this would be a projection into an associated two-dimensional subspace.
- ^
I'm using the term basis loosely here, this also includes sparse overcomplete 'bases' like those in SAEs. The more accurate term would probably be 'dictionary', or 'frame'.
- ^
Or if the computation isn't layer aligned, the activations along some other causal cut through the network can be written as a sum of all the features represented on that cut.
I think the value proposition of AI 2027-style work lies largely in communication. Concreteness helps people understand things better. The details are mostly there to provide that concreteness, not to actually be correct.
If you imagine the set of possible futures that people like Daniel, you or I think plausible as big distributions, with high entropy and lots of unknown latent variables, the point is that the best way to start explaining those distributions to people outside the community is to draw a sample from them and write it up. This is a lot of work, but it really does seem to help. My experience matches habryka's here. Most people really want to hear concrete end-to-end scenarios, not abstract discussion of the latent variables in my model and their relationships.
The bound is the same one you get for normal Solomonoff induction, except restricted to the set of programs the cut-off induction runs over. It’s a bound on the total expected error in terms of CE loss that the predictor will ever make, summed over all datapoints.
Look at the bound for cut-off induction in that post I linked, maybe? Hutter might also have something on it.
Can also discuss on a call if you like.
Note that this doesn’t work in real life, where the programs are not in fact restricted to outputting bit string predictions and can e.g. try to trick the hardware they’re running on.
You also want one that generalises well, and doesn't do performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?
Because we have the prediction error bounds.
When we compare theories, we don't consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself.
E.g. the theory of evolution isn't quite code for a costly simulation. But it can be viewed as set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn't involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.
Yes.
That’s fine. I just want a computable predictor that works well. This one does.
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is about finding more effective approximations for stuff.
Edit: Actually, I don’t think this would yield you a different general predictor as the program dominating the posterior. A general inductor program running some other program $p$ is pretty much never going to be the shortest implementation of $p$.
If you make an agent by sticking together cut-off Solomonoff induction and e.g. causal decision theory, I do indeed buy that this agent will have problems. Because causal decision theory has problems.
Thank you for this summary.
I still find myself unconvinced by all the arguments against the Solomonoff prior I have encountered. For this particular argument, as you say, there's still many ways the conjectured counterexample of adversaria could fail if you actually tried to sit down and formalise it. Since the counterexample is designed to break a formalism that looks and feels really natural and robust to me, my guess is that the formalisation will indeed fall to one of these obstacles, or a different one.
In a way, that makes perfect sense; Solomonoff induction really can't run in our universe! Any robot we could build to "use Solomonoff induction" would have to use some approximation, which the malign prior argument may or may not apply to.
You can just reason about Solomonoff induction with cut-offs instead. If you render the induction computable by giving it a uniform prior over all programs of some finite length $l_{\text{max}}$[1] with runtime $\leq t_{\text{max}}$, it still seems to behave sanely. As in, you can derive analogs of the key properties of normal Solomonoff induction for this cut-off induction. E.g. the induction will not make more than roughly $K(p)$ bits worth of prediction mistakes compared to any 'efficient predictor' program $p$ with runtime $\leq t_{\text{max}}$ and K-complexity $K(p) \leq l_{\text{max}}$, it's got a rough invariance to what Universal Turing Machine you run it on, etc.
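To make the flavour of that bound concrete, here is a toy numerical stand-in: a Bayes mixture with a uniform prior over a small finite class of constant next-bit predictors, instead of actual programs on a UTM. The class, grid values, and data source are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

grid = np.linspace(0.05, 0.95, 16)                   # each 'program' predicts P(next bit = 1) = const
log_post = np.full(len(grid), -np.log(len(grid)))    # uniform prior, kept in log space

true_p = 0.7
best = np.argmin(np.abs(grid - true_p))              # the best 'efficient predictor' in the class
total_loss_mix, total_loss_best = 0.0, 0.0

for _ in range(2000):
    bit = rng.random() < true_p
    w = np.exp(log_post - log_post.max()); w /= w.sum()
    p_mix = w @ grid
    total_loss_mix += -np.log(p_mix if bit else 1.0 - p_mix)
    total_loss_best += -np.log(grid[best] if bit else 1.0 - grid[best])
    log_post += np.log(np.where(bit, grid, 1.0 - grid))   # Bayesian update on the new bit

# Total regret vs. the best predictor stays below -log(prior weight) = log(#predictors) nats,
# no matter how long we run: the analog of the K-complexity bound above.
print(total_loss_mix - total_loss_best, np.log(len(grid)))
```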
Since finite, computable things are easier for me to reason about, I mostly use this cut-off induction in my mental toy models of AGI these days.
EDIT: Apparently, this exists in the literature under the name AIXI-tl. I didn't know that. Neat.
- ^
So, no prefix-free requirement.
- A quick google search says the male is primary or exclusive breadwinner in a majority of married couples. Ass-pull number: the monetary costs alone are probably ~50% higher living costs. (Not a factor of two higher, because the living costs of two people living together are much less than double the living costs of one person. Also I'm generally considering the no-kids case here; I don't feel as confused about couples with kids.)
But remember that you already conditioned on 'married couples without kids'. My guess would be that in the subset of man-woman married couples without kids, the man being the exclusive breadwinner is a lot less common than in the set of all man-woman married couples. These properties seem like they'd be heavily anti-correlated.
In the subset of man-woman married couples without kids that get along, I wouldn't be surprised if having a partner effectively works out to more money for both participants, because you've got two incomes, but less than 2x living expenses.
- I was picturing an anxious attachment style as the typical female case (without kids). That's unpleasant on a day-to-day basis to begin with, and I expect a lack of sex tends to make it a lot worse.
I am ... not ... picturing that as the typical case? Uh, I don't know what to say here really. That's just not an image that comes to mind for me when I picture 'older hetero married couple'. Plausibly I don't know enough normal people to have a good sense of what normal marriages are like.
- Eyeballing Aella's relationship survey data, a bit less than a third of respondents in 10-year relationships reported fighting multiple times a month or more. That was somewhat-but-not-dramatically less than I previously pictured. Frequent fighting is very prototypically the sort of thing I would expect to wipe out more-than-all of the value of a relationship, and I expect it to be disproportionately bad in relationships with little sex.
I think for many of those couples that fight multiple times a month, the alternative isn't separating and finding other, happier relationships where there are never any fights. The typical case I picture there is that the relationship has some fights because both participants aren't that great at communicating or understanding emotions, their own or other people's. If they separated and found new relationships, they'd get into fights in those relationships as well.
It seems to me that lots of humans are just very prone to getting into fights. With their partners, their families, their roommates etc., to the point that they have accepted having lots of fights as a basic fact of life. I don't think the correct takeaway from that is 'Most humans would be happier if they avoided having close relationships with other humans.'
- Less legibly... conventional wisdom sure sounds like most married men find their wife net-stressful and unpleasant to be around a substantial portion of the time, especially in the unpleasant part of the hormonal cycle, and especially especially if they're not having much sex. For instance, there's a classic joke about a store salesman upselling a guy a truck, after upselling him a boat, after upselling him a tackle box, after [...] and the punchline is "No, he wasn't looking for a fishing rod. He came in looking for tampons, and I told him 'dude, your weekend is shot, you should go fishing!'".
Conventional wisdom also has it that married people often love each other so much they would literally die for their partner. I think 'conventional wisdom' is just a very big tent that has room for everything under the sun. If even 5-10% of married couples have bad relationships where the partners actively dislike each other, that'd be many millions of people in the English speaking population alone. To me, that seems like more than enough people to generate a subset of well-known conventional wisdoms talking about how awful long-term relationships are.
Case in point, I feel like I hear those particular conventional wisdoms less commonly these days in the Western world. My guess is this is because long-term heterosexual marriage is no longer culturally mandatory, so there's less unhappy couples around generating conventional wisdoms about their plight.
So, next question for people who had useful responses (especially @Lucius Bushnaq and @yams): do you think the mysterious relationship stuff outweighs those kinds of costs easily in the typical case, or do you imagine the costs in the typical case are not all that high?
So, in summary, both I think? I feel like the 'typical' picture of a hetero marriage you sketch is more like my picture of an 'unusually terrible' marriage. You condition on a bad sexual relationship and no children and the woman doesn't earn money and the man doesn't even like her, romantically or platonically. That subset of marriages sure sounds like it'd have a high chance of the man just walking away, barring countervailing cultural pressures. But I don't think most marriages where the sex isn't great are like that.
Sure, I agree that, as we point out in the post
Yes, sorry I missed that. The section is titled 'Conclusions' and comes at the end of the post, so I guess I must have skipped over it because I thought it was the post conclusion section rather than the high-frequency latents conclusion section.
As long as your evaluation metrics measure the thing you actually care about...
I agree with this. I just don't think those autointerp metrics robustly capture what we care about.
Removing High Frequency Latents from JumpReLU SAEs
On a first read, this doesn't seem principled to me? How do we know those high-frequency latents aren't, for example, basis directions for dense subspaces or common multi-dimensional features? In that case, we'd expect them to activate frequently and maybe appear pretty uninterpretable at a glance. Modifying the sparsity penalty to split them into lower frequency latents could then be pathological, moving us further away from capturing the features of the model even though interpretability scores might improve.
That's just one illustrative example. More centrally, I don't understand how this new penalty term relates to any mathematical definition that isn't ad-hoc. Why would the spread of the distribution matter to us, rather than simply the mean? If it does matter to us, why does it matter in roughly the way captured by this penalty term?
The standard SAE sparsity loss relates to minimising the description length of the activations. I suspect that isn't the right metric to optimise for understanding models, but it is at least a coherent, non-ad-hoc mathematical object.
EDIT: Oops, you address all that in the conclusion, I just can't read.
Forgot to tell you this when you showed me the draft: The comp in sup paper actually had a dense construction for UAND included already. It works differently than the one you seem to have found though, using Gaussian weights rather than binary weights.
I will continue to do what I love, which includes reading and writing and thinking about biosecurity and diseases and animals and the end of the world and all that, and I will scrape out my existence one way or another.
Thank you. As far as I'm aware we don't know each other at all, but I really appreciate you working to do good.
I don't think the risks of talking about the culture war have gone down. If anything, it feels like it's yet again gotten worse. What exactly is risky to talk about has changed a bit, but that's it. I'm more reluctant than ever to involve myself in culture war adjacent discussions.
This comment by Carl Feynman has a very crisp formulation of the main problem as I see it.
They’re measuring a noisy phenomenon, yes, but that’s only half the problem. The other half of the problem is that society demands answers. New psychology results are a matter of considerable public interest and you can become rich and famous from them. In the gap between the difficulty of supply and the massive demand grows a culture of fakery. The same is true of nutrition— everyone wants to know what the healthy thing to eat is, and the fact that our current methods are incapable of discerning this is no obstacle to people who claim to know.
For a counterexample, look at the field of planetary science. Scanty evidence dribbles in from occasional spacecraft missions and telescopic observations, but the field is intellectually sound because public attention doesn’t rest on the outcome.
So, the recipe for making a broken science you can't trust is
- The public cares a lot about answers to questions that fall within the science's domain.
- The science currently has no good attack angles on those questions.
As you say, if a field is exposed to these incentives for a while, you get additional downstream problems like all the competent scientists who care about actual progress leaving. But I think that's a secondary effect. If you replaced all the psychology grads with physics and electrical engineering grads overnight, I'd expect you'd at best get a very brief period of improvement before the incentive gradient brought the field back to the status quo. On the other hand, if the incentives suddenly changed, I think reforming the field might become possible.
This suggests that if you wanted to found new parallel fields of nutrition, psychology etc. you could trust, you should consider:
- Making it rare for journalists to report on your new fields. Maybe there's just a cultural norm against talking to the press and publishing on Twitter. Maybe people have to sign contracts about it if they want to get grants. Maybe the research is outright siloed because it is happening inside some company.
- Finding funders who won't demand answers if answers can't be had. Seems hard. This might exclude most companies. The usual alternative is government&charity, but those tend to care too much about what the findings are. My model of how STEM manages to get useful funding out of them is that funding STEM is high-status, but STEM results are mostly too boring and removed from the public interest for the funders to get invested in them.
Relationship ... stuff?
I guess I feel kind of confused by the framing of the question. I don't have a model under which the sexual aspect of a long-term relationship typically makes up the bulk of its value to the participants. So, if a long-term relationship isn't doing well on that front, and yet both participants keep pursuing the relationship, my first guess would be that it's due to the value of everything that is not that. I wouldn't particularly expect any one thing to stick out here. Maybe they have a thing where they cuddle and watch the sunrise together while they talk about their problems. Maybe they have a shared passion for arthouse films. Maybe they have so much history and such a mutually integrated life with partitioned responsibilities that learning to live alone again would be a massive labour investment, practically and emotionally. Maybe they admire each other. Probably there's a mixture of many things like that going on. Love can be fed by many little sources.
So, this I suppose:
Their romantic partner offering lots of value in other ways. I'm skeptical of this one because female partners are typically notoriously high maintenance in money, attention, and emotional labor. Sure, she might be great in a lot of ways, but it's hard for that to add up enough to outweigh the usual costs.
I don't find it hard at all to see how that'd add up to something that vastly outweighs the costs, and this would be my starting guess for what's mainly going on in most long-term relationships of this type.
This data seems to be for sexual satisfaction rather than romantic satisfaction or general relationship satisfaction.
How sub-light? I was mostly just guessing here, but if it’s below like 0.95c I’d be surprised.
It expands at light speed. That's fast enough that no computational processing can possibly occur before we're dead. Sure there's branches where it maims us and then stops, but these are incredibly subdominant compared to branches where the tunneling doesn't happen.
Yes, you can make suicide machines very reliable and fast. I claim that whether your proposed suicide machine actually is reliable does in fact matter for determining whether you are likely to find yourself maimed. Making suicide machines that are synchronised earth-wide seems very difficult with current technology.
This. The struggle is real. My brain has started treating publishing a LessWrong post almost the way it'd treat publishing a paper. An acquaintance got upset at me once because they thought I hadn't provided sufficient discussion of their related Lesswrong post in mine. Shortforms are the place I still feel safe just writing things.
It makes sense to me that this happened. AI Safety doesn't have a journal, and training programs heavily encourage people to post their output on LessWrong. So part of it is slowly becoming a journal, and the felt social norms around posts are morphing to reflect that.
I don't think anything in the linked passage conflicts with my model of anticipated experience. My claim is not that the branch where everyone dies doesn't exist. Of course it exists. It just isn't very relevant for our future observations.
To briefly factor out the quantum physics here, because they don't actually matter much:
If someone tells me that they will create a copy of me while I'm anesthetized and unconscious, and put one of me in a room with red walls, and another of me in a room with blue walls, my anticipated experience is that I will wake up to see red walls with $p=0.5$ and blue walls with $p=0.5$. Because the set of people who will wake up and remember being me and getting anesthetized has size 2 now, and until I look at the walls I won't know which of them I am.
If someone tells me that they will create a copy of me while I'm asleep, but they won't copy the brain, making it functionally just a corpse, then put the corpse in a room with red walls, and me in a room with blue walls, my anticipated experience is that I will wake up to see blue walls with p=1.0. Because the set of people who will wake up and remember being me and going to sleep has size 1. There is no chance of me 'being' the corpse any more than there is a chance of me 'being' a rock. If the copy does include a brain, but the brain gets blown up with a bomb before the anaesthesia wears off, that doesn't change anything. I'd see blue walls with $p=1.0$, not see blue walls with $p=0.5$ and 'not experience anything' with $p=0.5$.
The same basic principle applies to the copies of you that are constantly created as the wavefunction decoheres. The probability math in that case is slightly different because you're dealing with uncertainty over a vector space rather than uncertainty over a set, so what matters is the squares of the amplitudes of the branches that contain versions of you. E.g. if there's three branches, one in which you die, amplitude $c_{\text{dead}}$, one in which you wake up to see red walls, amplitude $c_{\text{red}}$, and one in which you wake up to see blue walls, amplitude $c_{\text{blue}}$, you'd see blue walls with probability $|c_{\text{blue}}|^2/(|c_{\text{red}}|^2+|c_{\text{blue}}|^2)$ and red walls with probability $|c_{\text{red}}|^2/(|c_{\text{red}}|^2+|c_{\text{blue}}|^2)$.[1]
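A worked instance with made-up amplitudes (not the original numbers):

```latex
|c_{\text{dead}}|^2 = 0.5,\quad |c_{\text{red}}|^2 = 0.3,\quad |c_{\text{blue}}|^2 = 0.2
\;\;\Rightarrow\;\;
P(\text{blue}) = \frac{0.2}{0.3 + 0.2} = 0.4,
\qquad
P(\text{red}) = \frac{0.3}{0.3 + 0.2} = 0.6.
```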
- ^
If you start making up scenarios that involve both wave function decoherence and having classical copies of you created, you're dealing with probabilities over vector spaces and probabilities over sets at the same time. At that point, you probably want to use density matrices to do calculations.
There may be a sense in which amplitude is a finite resource. Decay your branch enough, and your future anticipated experience might come to be dominated by some alien with higher amplitude simulating you, or even just by your inner product with quantum noise in a more mainline branch of the wave function. At that point, you lose pretty much all ability to control your future anticipated experience. Which seems very bad. This is a barrier I ran into when thinking about ways to use quantum immortality to cheat heat death.
I don't think so. You only need one alien civilisation in our light cone to have preferences about the shape of the universal wave function rather than their own subjective experience for our light cone to get eaten. E.g. a paperclip maximiser might want to do this.
Also, the Fermi paradox isn't really a thing.
No, because getting shot has a lot of outcomes that do not kill you but do cripple you. Vacuum decay should tend to have extremely few of those. It’s also instant, alleviating any lingering concerns about identity one might have in a setup where death is slow and gradual. It’s also synchronised to split off everyone hit by it into the same branch, whereas, say, a very high-yield bomb wired to a random number generator that uses atmospheric noise would split you off into a branch away from your friends.[1]
I’m not unconcerned about vacuum decay, mind you. It’s not like quantum immortality is all confirmed and the implications worked out well in math.[2]
- ^
They’re still there for you of course, but you aren’t there for most of them. Because in the majority of their anticipated experience, you explode.
- ^
Sometimes I think about the potential engineering applications of quantum immortality in a mature civilisation for fun. Controlled, synchronised civilisation-wide suicide seems like a neat way to transform many engineering problems into measurement problems.
Since I didn't see it brought up on a skim: One reason me and some of my physicist friends aren't that concerned about vacuum decay is many-worlds. Since the decay is triggered by quantum tunneling and propagates at light speed, it'd be wiping out earth in one wavefunction branch that has amplitude roughly equal to the amplitude of the tunneling, while the decay just never happens in the other branches. Since we can't experience being dead, this wouldn't really affect our anticipated future experiences in any way. The vacuum would just never decay from our perspective.
So, if the vacuum were confirmed to be very likely meta-stable, and the projected base rate of collapses was confirmed to be high enough that it ought to have happened a lot already, we'd have accidentally stumbled into a natural and extremely clean experimental setup for testing quantum immortality.
I disagreed with Gwern at first. I'm increasingly forced to admit there's something like bipolar going on here
What changed your mind? I don't know any details about the diagnostic criteria for bipolar besides those you and Gwern brought up in that debate. But looking at the points you made back then, it's unclear to me which of them you'd consider to be refuted or weakened now.
Musk’s ordinary behavior - intense, risk-seeking, hard-working, grandiose, emotional - does resemble symptoms of hypomania (full mania would usually involve psychosis, and even at his weirdest Musk doesn’t meet the clinical definition for this).
But hypomania is usually temporary and rare. A typical person with bipolar disorder might have hypomania for a week or two, once every few years. Musk is always like this. Bipolar disorder usually starts in one’s teens. But Musk was like this even as a child.
....
His low periods might meet criteria for a mixed episode. But a bipolar disorder that starts in childhood, continues all the time, has no frank mania, and has only mixed episodes instead of depression - doesn’t really seem like bipolar disorder to me. I’m not claiming there’s nothing weird about him, or that he doesn’t have extreme mood swings. I’m just saying it is not exactly the kind of weirdness and mood swings I usually associate with bipolar.
...
I notice the non-psychiatrists (including very smart people I usually trust) lining up on one side, and the psychiatrists on the other. I think this is because Musk fits a lot of the explicit verbally described symptoms of the condition, but doesn’t resemble real bipolar patients.
...
This isn't how I expect bipolar to work. There is no "switch flipping" (except very occasionally when a manic episode follows directly after a depressive one). A patient will be depressed for weeks or months, then gradually come out of it, and after weeks or months of coming out of it, get back to normal. Being "moody" in the sense of having mood swings is kind of the opposite of bipolar; I would associate it more with borderline or PTSD.
Based on my understanding of what you are doing, the statement in the OP that $\lambda$ in your setting is "sort of" K-complexity is a bit misleading?
Yes, I guess it is. In my (weak) defence, I did put a '(sort of)' in front of that.
In my head, the relationship between the learning coefficient and the K-complexity here seems very similar-ish to the relationship between the K-complexities of a hypothesis expressed on two different UTMs.
If we have a UTM $M_1$ and a different UTM $M_2$, we know that $K_{M_2}(h) \leq K_{M_1}(h) + \text{const}$, because if nothing else we can simulate UTM $M_1$ on UTM $M_2$ and compute $h$ on the simulated $M_1$. But in real life, we'd usually expect the actual shortest program that implements $h$ on $M_2$ to not involve jumping through hoops like this.
In the case of translating between a UTM and a different sort of Turing-complete model of computation, namely a recurrent neural network[1], I was expecting a similar sort of dynamic: If nothing else, we can always implement $h$ on the NN by simulating a UTM, and running $h$ on that simulated UTM. So the lowest-LLC parameter configuration that implements $h$ on the NN has to have an LLC that is as small as or smaller than the LLC of a parameter configuration that implements $h$ through this simulation route. Or that was the intuition I had starting out anyway.
If I understand correctly you are probably doing something like:
Seems broadly right to me except:
- Third bullet point: I don't know what you mean by a "smooth relaxation" precisely. So while this sounds broadly correct to me as a description of what I do, I can't say for sure.
- Sixth bullet point: You forgot the offset term for simulating the UTM on the transformer. Also, I think I'd get a constant prefactor before $K(h)$. Even if I'm right that the prefactor I have right now could be improved, I'd still expect some constant larger than 1 here.
I'd caution that the exact relation to the learning coefficient and the LLC is the part of this story I'm still the least confident about at the moment. As the intro said
This post is my current early-stage sketch of the proof idea. Don't take it too seriously yet. I’m writing this out mostly to organise my own thoughts.
I've since gotten proof sketches for most of the parts here, including the upper bound on the LLC, so I am a bit more confident now. But they're still hasty scrawlings.
you are treating the iid case
I am not sure whether I am? I'm a bit unclear on what you mean by iid in this context exactly. The setup does not seem to me to require different inputs to be independent of each other. It does assume that each label is a function of its corresponding input rather than some other input. So, label $y_i$ can depend on input $x_i$, but it can only depend on another input $x_j$ in a manner mediated by $x_i$. In other words, the joint probability distribution over inputs can be anything, but the labels must be iid conditioned on their inputs. I think. Is that what you meant?
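In symbols, that conditional structure reads:

```latex
p(x_{1:n}, y_{1:n}) \;=\; p(x_{1:n}) \prod_{i=1}^{n} p(y_i \mid x_i).
```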
From your message it seems like you think the global learning coefficient might be lower than $K(h)$, but that locally at a code $w$ the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short that the LLC can be lower than "the number of bits used" when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).
I have been brooding over schemes to lower the bound I sketched above using activation error-correction blocks. Still unclear to me at this stage whether this will work or not. I'd say this and the workability of other schemes to get rid of the prefactor on $K(h)$ in the bound are probably the biggest sources of uncertainty about this at the moment.
If schemes like this work, the story here probably ends up as something more like '$\lambda$ is related to the number of bits in the parameters we need to fix to implement $h$ on the transformer.'
In that case, you'd be right, and the LLC would be lower, because in the continuum limit we can store an arbitrary number of bits in a single parameter.
I think I went into this kind of expecting that to be true. Then I got surprised when using less than one effective parameter per bit of storage in the construction turned out to be less straightforward than I'd thought once I actually engaged with the details. Now, I don't know what I'll end up finding.
- ^
Well, transformers are not actually Turing complete in real life where parameters aren't real numbers, because if you want an unbounded context window to simulate unbounded tape, you eventually run out of space for positional encodings. But the amount of bits they can hold in memory does grow exponentially with the residual stream width, which seems good enough to me. Real computers don't have infinite memory either.
Kind of? I'd say the big differences are
- Experts are pre-wired to have a certain size, while components can vary in size from a tiny query-key lookup for a single fact to large modules.
- IIRC, MOE networks use a gating function to decide which experts to query. If you ignored this gating and just used all the experts, I think that'd break the model. In contrast, you can use all APD components on a forward pass if you want. Most of them just won't affect the result much.
MOE experts don't completely ignore 'simplicity' as we define it in the paper though. A single expert is simpler than the whole MOE network in that it has lower rank/ fewer numbers are required to describe its state on any given forward pass.
Why would this be restricted to cyber attacks? If the CCP believed that ASI was possible, even if they didn't believe in the alignment problem, the US developing an ASI would plausibly constitute an existential threat to them. It'd mean they lose the game of geopolitics completely and permanently. I don't think they'd necessarily restrict themselves to covert sabotage in such a situation.
The possibility of stability through dynamics like mutually assured destruction has been where a lot of my remaining hope on the governance side has come from for a while now.
A big selling point of this for me is that it does not strictly require countries to believe that ASI is possible and that the alignment problem is real. Just believing that ASI is possible is enough.
Because it’s actually not very important in the limit. The dimensionality of V is what matters. A 3-dimensional sphere in the loss landscape always takes up more of the prior than a 2-dimensional circle, no matter how large the area of the circle is and how small the volume of the sphere is.
In real life, parameters are finite precision floats, and so this tends to work out to an exponential rather than infinite size advantage. So constant prefactors can matter in principle. But they have to be really really big.
Yes. I think this may apply to basically all somewhat general minds.
Doesn't exist.[1] If the float precision is finite, you can insert AIT-style inequalities into the posterior to get bounds like the one I wrote above. This is neat if you e.g. have more datapoints than the size of that bound.
If float precision is infinite, you probably want to expand in $n$ instead. I haven't done that yet, but I expect to get a bound that looks a lot like the standard free energy formula, with the K-complexity terms in the bound I wrote above showing up where the learning coefficient would usually be. The $b$ prefactor probably gets swapped out for a $\log n$.
It'd still be an upper bound, not an equality, just as in AIT. The learning coefficient can still be smaller than this. This makes sense to me. There might be less complicated ways for the transformer to make an efficient prediction than simulating a UTM and running some program on it.
- ^
Except for the implicit dependence in the total prediction error terms, since those are the KL-divergences summed over datapoints.
You either think of the NN weights as a countable set (by e.g. truncating precision "as in real life") in which case you get something like a bound in terms of $-\log \varphi(w)$ for whatever prior $\varphi$ you chose, but this is sort of weak sauce: you get this for any prior you want to put over your discrete set of NN weights, no implied connection to K-complexity unless you put one in by hand by taking $\varphi(w) \propto 2^{-K(w)}$.
No, you don't need to put one in by hand. A uniform prior over NN weights does the job.[1]
The trick is that a transformer run in recurrent mode can
- Simulate a (time and space bounded) UTM in a few transformer blocks
- Use the other transformer blocks to store program code to feed that UTM as input.
A uniform prior over neural network parameters then effectively implies a uniform prior over programs to run on the simulated UTM, modulo the bit specification cost of the UTM simulator and the storage setup. Because for every bit of program code we don't need to store, we free up degrees of freedom in the weights.
Since induction with a uniform prior on the input strings to a plain monotone UTM effectively gets us a weighting of hypotheses that's exponentially small in their K-complexity, we'll get an error bound with a term proportional to $K(p)$, plus an offset term for specifying the UTM and storage setup in the transformer weights.
For the sake of concreteness: If I partially adapted your notation, and went to the special case where the data-generating process is exactly realisable in the weights of the transformer[2], I'd currently seem to get a concrete bound of that form.[3]
The quantities involved: $b$ is the number of bits per neural network parameter[4], $N_{\mathrm{UTM}}$ is the number of parameters needed to implement the UTM on the recurrent transformer architecture, $K(p)$ is the K-complexity of the data-generating program $p$ on the UTM in bits, and $d_{\text{resid}}$ is the width of the residual stream.
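Schematically, the sort of counting argument this relies on looks as follows, with an illustrative constant $c$ standing in for the storage overhead per program bit (this is a sketch of the generic argument, not the exact bound of the construction):

```latex
\Pr_{\text{uniform prior}}\!\big[\text{weights implement } p \text{ via the simulated UTM}\big]
\;\gtrsim\; 2^{-\,b\,(N_{\mathrm{UTM}} \,+\, c\,K(p))}
\quad\Longrightarrow\quad
\sum_{\text{datapoints}} \mathbb{E}\!\left[D_{\mathrm{KL}}\!\left(\text{truth}\,\Vert\,\text{prediction}\right)\right]
\;\le\; \ln 2\; b\,\big(N_{\mathrm{UTM}} + c\,K(p)\big).
```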
The prefactor on the K-complexity term is there because my current construction is stupidly inefficient at storing program code in the weights. I think it ought to be possible to do better, and get it down to a 1. Don't quote me on that though, I don't have a proof yet.
If we don't assume realisability, we can instead take any 'efficient predictor' program $\tilde{p}$ that is realisable on the transformer, and get the same kind of bound with $K(\tilde{p})$ in place of $K(p)$, plus the total prediction error that $\tilde{p}$ itself makes.
So to summarise
K-complexity usually enters the theory via a choice of prior, and in continuous model classes priors show up in the constant order terms of asymptotic expansions in $n$.
The result here is exactly that we don't need to put in the K-complexity[5] via choice of prior. If we're using a recurrent neural network, the K-complexity is in the prior already, just as it is on a plain monotone UTM. The architecture itself is what implements the bias toward simplicity.
Note also that in the case of continuous parameters, so bits per float going to infinity, the K-complexity terms in the bound do not become constant order terms, because they have $b$ as a prefactor. This is one way to start seeing that the K-complexity and the learning coefficient are pretty directly related quantities in the setting of recurrent neural networks.
- ^
I expect a Gaussian prior or anything else of the sort probably works as well, and yields a nigh-identical bound. But I haven’t shown that yet.[6]
- ^
My actual bound doesn't need that assumption. Getting rid of the realisability assumption is what the effective predictor stuff is all about.
- ^
Some of the terms in the bound become increasingly irrelevant as float precision gets larger. Basically, I'm using large negative biases to zero out storage neurons that are not needed. In the continuum limit, this would make the weights connecting to those neurons degenerate, and we could integrate them out of the measure. But since we're in the discrete setting, we have to keep track of the fact that very large magnitudes of the weights that overwhelm the negative biases and switch the neuron on again aren't allowed. This makes our volume of allowed parameter configurations just a little bit smaller.
- ^
So, $b = 8$ for 8-bit floats, $b = 16$ for 16-bit floats, etc.
- ^
Defined relative to a time and space bounded universal Turing machine.
- ^
EDIT: As in I haven't shown it in the case of finite float precision NN parameters yet. It of course straightforwardly follows in the SLT setting where NN parameters are real numbers and we consider the limit of number of datapoints going to infinity. The shape of the prior can't matter much there, as you say.