Posts

Attribution-based parameter decomposition 2025-01-25T13:12:11.031Z
Activation space interpretability may be doomed 2025-01-08T12:49:38.421Z
Intricacies of Feature Geometry in Large Language Models 2024-12-07T18:10:51.375Z
Deep Learning is cheap Solomonoff induction? 2024-12-07T11:00:56.455Z
Circuits in Superposition: Compressing many small neural networks into one 2024-10-14T13:06:14.596Z
The Hessian rank bounds the learning coefficient 2024-08-08T20:55:36.960Z
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team 2024-07-18T14:15:50.248Z
Lucius Bushnaq's Shortform 2024-07-06T09:08:43.607Z
Apollo Research 1-year update 2024-05-29T17:44:32.484Z
Interpretability: Integrated Gradients is a decent attribution method 2024-05-20T17:55:22.893Z
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks 2024-05-20T17:53:25.985Z
Charbel-Raphaël and Lucius discuss interpretability 2023-10-30T05:50:34.589Z
Announcing Apollo Research 2023-05-30T16:17:19.767Z
Basin broadness depends on the size and number of orthogonal features 2022-08-27T17:29:32.508Z
What Is The True Name of Modularity? 2022-07-01T14:55:12.446Z
Ten experiments in modularity, which we'd like you to run! 2022-06-16T09:17:28.955Z
Project Intro: Selection Theorems for Modularity 2022-04-04T12:59:19.321Z
Theories of Modularity in the Biological Literature 2022-04-04T12:48:41.834Z
Welcome to the SSC Dublin Meetup 2020-07-30T18:56:36.627Z

Comments

Comment by Lucius Bushnaq (Lblack) on eggsyntax's Shortform · 2025-01-29T14:23:52.438Z · LW · GW

I wonder whether this is due to the fact that he's used to thinking about human brains, where we're (AFAIK) nowhere near being able to identify the representation of specific concepts, and so we might as well use the most philosophically convenient description.

I don't think this description is philosophically convenient. Believing $p$ and believing things that imply $p$ are genuinely different states of affairs in a sensible theory of mind. Thinking through concrete mech interp examples of the former vs. the latter makes the sense in which they differ less abstract, but I think I would have objected to Chalmers's definition even back before we knew anything about mech interp. It would just have been harder for me to articulate what exactly is wrong with it.

Comment by Lucius Bushnaq (Lblack) on eggsyntax's Shortform · 2025-01-29T12:16:13.927Z · LW · GW

(Abstract) I argue for the importance of propositional interpretability, which involves interpreting a system’s mechanisms and behavior in terms of propositional attitudes
...
(Page 5) Propositional attitudes can be divided into dispositional and occurrent. Roughly speaking, occurrent attitudes are those that are active at a given time. (In a neural network, these would be encoded in neural activations.) Dispositional attitudes are typically inactive but can be activated. (In a neural network, these would be encoded in the weights.) For example, I believe Paris is the capital of France even when I am asleep and the belief is not active. That is a dispositional belief. On the other hand, I may actively judge France has won more medals than Australia. That is an occurrent mental state, sometimes described as an “occurrent belief”, or perhaps better, as a “judgment” (so judgments are active where beliefs are dispositional). One can make a similar distinction for desires and other attitudes.

I don't like it. It does not feel like a clean natural concept in the territory to me.

Case in point:

(Page 9) Now, it is likely that a given AI system may have an infinite number of propositional attitudes, in which case a full log will be impossible. For example, if a system believes a proposition p, it arguably dispositionally believes p-or-q for all q. One could perhaps narrow down to a finite list by restricting the log to occurrent propositional attitudes, such as active judgments. Alternatively, we could require the system to log the most significant propositional attitudes on some scale, or to use a search/query process to log all propositional attitudes that meet a certain criterion.

I think what this is showing is that Chalmers's definition of "dispositional attitudes" has a problem: It lacks any notion of the amount and kind of computational labour required to turn 'dispositional' attitudes into 'occurrent' ones. That's why he ends up with AI systems having an uncountably infinite number of dispositional attitudes.

One could try to fix up Chalmers's definition by making up some notion of computational cost, or circuit complexity or something of the sort, that's required to convert a dispositional attitude into an occurrent attitude, and then only list dispositional attitudes up to some cost cutoff we are free to pick as applications demand.

But I don't feel very excited about that. At that point, what is this notion of "dispositional attitudes" really still providing us that wouldn't be less cumbersome to describe in the language of circuits? There, you don't have this problem. An AI can have a query-key lookup for proposition $p$ and just not have a query-key lookup for the proposition $p \lor q$. Instead, if someone asks whether $p \lor q$ is true, it first performs the lookup for $p$, then uses some general circuits for evaluating simple propositional logic to calculate that $p \lor q$ is true. This is an importantly different computational and mental process from having a query-key lookup for $p \lor q$ in the weights and just directly performing that lookup, so we ought to describe a network that does the former differently from a network that does the latter. It does not seem like Chalmers's proposed log of 'propositional attitudes' would do this. It'd describe both of these networks the same way, as having a propositional attitude of believing $p \lor q$, discarding a distinction between them that is important for understanding the models' mental state in a way that will let us do things such as successfully predicting the models' behavior in a different situation.
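To make the contrast concrete, here's a tiny toy sketch (mine, not Chalmers's; the dictionary-lookup framing is only a cartoon of a query-key circuit):

```python
# Toy contrast (a cartoon, not real mech interp): two systems that both answer
# "p or q" correctly, but via different mechanisms.

def network_a(query: str) -> bool:
    # Stores "p or q" directly in its "weights" and just looks it up.
    facts = {"p": True, "p or q": True}
    return facts[query]

def network_b(query: str) -> bool:
    # Only stores "p"; answers "p or q" by doing a lookup for "p" and then
    # running a general propositional-logic step on top of that lookup.
    facts = {"p": True}
    if query in facts:
        return facts[query]
    if query == "p or q":
        return facts["p"] or facts.get("q", False)
    raise KeyError(query)

# Chalmers's log would describe both as "believes p-or-q", even though the
# computations (and the predictions we'd make about each system) differ.
assert network_a("p or q") == network_b("p or q")
```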

I'm all for trying to come up with good definitions for model macro-states which throw away tiny implementation details that don't matter, but this definition does not seem to me to carve the territory in quite the right way. It throws away details that do matter.

Comment by Lucius Bushnaq (Lblack) on Attribution-based parameter decomposition · 2025-01-29T06:22:40.598Z · LW · GW

The idea behind the motivation is indeed that you want to encode the attribution of each rank-1 piece separately. In practice, computing the attribution of a component as a whole actually does involve calculating the attributions of all rank-1 pieces and summing them up, though you're correct that nothing we do requires storing those intermediary results.
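As a minimal sketch of what I mean (using a generic gradient-times-contribution attribution; the names and exact formula here are placeholders, not the paper's implementation):

```python
import numpy as np

# If a parameter component is a sum of rank-1 pieces u_k v_k^T, then a
# gradient-times-contribution attribution of the whole component decomposes
# as a sum of per-rank-1-piece attributions.
rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 6, 3
U = rng.normal(size=(d_out, rank))   # columns u_k
V = rng.normal(size=(d_in, rank))    # columns v_k
x = rng.normal(size=d_in)            # input activation
g = rng.normal(size=d_out)           # gradient of some output w.r.t. the layer output

# Attribution of each rank-1 piece separately: g . (u_k * (v_k . x))
per_piece = np.array([g @ (U[:, k] * (V[:, k] @ x)) for k in range(rank)])

# Attribution of the component as a whole, from W = U V^T
W = U @ V.T
whole = g @ (W @ x)

assert np.allclose(per_piece.sum(), whole)  # summing the pieces recovers the whole
```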

While it technically works out, you are pointing at a part of the math that I think is still kind of unsatisfying. If Bob calculates the attributions and sends them to Alice, why would Alice care about getting the attribution of each rank-1 piece separately if she doesn't need them to tell what component to activate? Why can't Bob just sum them before he sends them? It kind of vaguely makes sense to me that Alice would want the state of a multi-dimensional object on the forward pass described with multiple numbers, but what exactly are we assuming she wants that state for? It seems that she has to be doing something with it that isn't just running her own sparser forward pass.

I'm brooding over variations of this at the moment, trying to find something for Alice to do that connects better to what we actually want to do. Maybe she is trying to study the causal traces of some forward passes, but pawned the cost of running those traces off to Bob, and now she wants to get the shortest summary of the traces for her investigation under the constraint that uncompressing the summary shouldn't cost her much compute. Or maybe Alice wants something else. I don't know yet.

Comment by Lucius Bushnaq (Lblack) on Lucius Bushnaq's Shortform · 2025-01-28T12:01:54.848Z · LW · GW

This paper claims to sample the Bayesian posterior of NN training, but I think it's wrong.

"What Are Bayesian Neural Network Posteriors Really Like?" (Izmailov et al. 2021) claims to have sampled the Bayesian posterior of some neural networks conditional on their training data (CIFAR-10, MNIST, IMDB type stuff) via Hamiltonian Monte Carlo sampling (HMC). A grand feat if true! Actually crunching Bayesian updates over a whole training dataset for a neural network that isn't incredibly tiny is an enormous computational challenge. But I think they're mistaken and their sampler actually isn't covering the posterior properly.

They find that neural network ensembles trained by Bayesian updating, approximated through their HMC sampling, generalise worse than neural networks trained by stochastic gradient descent (SGD). This would have been incredibly surprising to me if it were true. Bayesian updating is prohibitively expensive for real world applications, but if you can afford it, it is the best way to incorporate new information. You can't do better.[1] 

This is kind of in the genre of a lot of papers and takes that I think were around a few years back, which argued that the then still quite mysterious ability of deep learning to generalise was primarily due to some advantageous bias introduced by SGD. Or momentum, or something along these lines. In the sense that SGD/momentum/whatever were supposedly diverging from Bayesian updating in a way that was better rather than worse.

I think these papers were wrong, and the generalisation ability of neural networks actually comes from their architecture, which assigns exponentially more weight configurations to simple functions than complex functions. So, most training algorithms will tend to favour making simple updates, and tend to find simple solutions that generalise well, just because there's exponentially more weight settings for simple functions than complex functions. This is what Singular Learning Theory talks about. From an algorithmic information theory perspective, I think this happens for reasons similar to why exponentially more binary strings correspond to simple programs than complex programs in Turing machines.

This picture of neural network generalisation predicts that SGD and other training algorithms should all generalise worse than Bayesian updating, or at best do similarly. They shouldn't do better.

So, what's going on in the paper? How are they finding that neural network ensembles updated on the training data with Bayes rule make predictions that generalise worse than predictions made by neural networks trained the normal way?

My guess: Their Hamiltonian Monte Carlo (HMC) sampler isn't actually covering the Bayesian posterior properly. They try to check that it's doing a good job by comparing inter-chain and intra-chain variance in the functions learned. 

We apply the classic Gelman et al. (1992) "$\hat{R}$" potential-scale-reduction diagnostic to our HMC runs. Given two or more chains, $\hat{R}$ estimates the ratio between the between-chain variance (i.e., the variance estimated by pooling samples from all chains) and the average within-chain variance (i.e., the variances estimated from each chain independently). The intuition is that, if the chains are stuck in isolated regions, then combining samples from multiple chains will yield greater diversity than taking samples from a single chain.

They seem to think a good $\hat{R}$ in function space implies that the chains are doing a good job of covering the important parts of the space. But I don't think that's true. You need to mix in weight space, not function space, because weight space is where the posterior lives. The map from weight space to function space is not a bijection; that's why it's even possible for simpler functions to have exponentially more prior than complex functions. So good mixing in function space does not necessarily imply good mixing in weight space, which is what we actually need. The chains could be jumping from basin to basin very rapidly instead of spending more time in the bigger basins corresponding to simpler solutions like they should.
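For reference, here's a minimal version of the diagnostic they use (the standard Gelman–Rubin estimator; the paper presumably uses a more careful split/rank-normalised variant), which you can point either at function-space quantities or at the raw weights:

```python
import numpy as np

def r_hat(samples: np.ndarray) -> np.ndarray:
    """Basic Gelman-Rubin potential scale reduction.

    samples: array of shape (n_chains, n_draws, dim). Returns R-hat per dimension.
    """
    n_chains, n_draws, _ = samples.shape
    chain_means = samples.mean(axis=1)                   # (n_chains, dim)
    W = samples.var(axis=1, ddof=1).mean(axis=0)         # mean within-chain variance
    B = n_draws * chain_means.var(axis=0, ddof=1)        # between-chain variance
    var_plus = (n_draws - 1) / n_draws * W + B / n_draws
    return np.sqrt(var_plus / W)

# The same chains can look converged in one space but not the other:
# r_hat(softmax_predictions)  -> mostly < 1.1 in the paper (function space)
# r_hat(raw_weights)          -> some parameters with very large R-hat (weight space)
```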

And indeed, they test their chains' weight-space $\hat{R}$ as well, and find that it's much worse:

Figure 2. Log-scale histograms of $\hat{R}$ convergence diagnostics. Function-space $\hat{R}$s are computed on the test-set softmax predictions of the classifiers and weight-space $\hat{R}$s are computed on the raw weights. About 91% of CIFAR-10 and 98% of IMDB posterior-predictive probabilities get an $\hat{R}$ less than 1.1. Most weight-space $\hat{R}$ values are quite small, but enough parameters have very large $\hat{R}$s to make it clear that the chains are sampling from different distributions in weight space.
...
(From section 5.1) In weight space, although most parameters show no evidence of poor mixing, some have very large $\hat{R}$s, indicating that there are directions in which the chains fail to mix.

...
(From section 5.2) The qualitative differences between (a) and (b) suggest that while each HMC chain is able to navigate the posterior geometry, the chains do not mix perfectly in the weight space, confirming our results in Section 5.1.

So I think they aren't actually sampling the Bayesian posterior. Instead, their chains jump between modes a lot and thus unduly prioritise low-volume minima compared to high volume minima. And those low-volume minima are exactly the kind of solutions we'd expect to generalise poorly.

I don't blame them here. It's a paper from early 2021, back when very few people understood the importance of weight space degeneracy properly aside from some math professor in Japan whom almost nobody in the field had heard of. For the time, I think they were trying something very informative and interesting. But since the paper has 300+ citations and seems like a good central example of the SGD-beats-Bayes genre, I figured I'd take the opportunity to comment on it now that we know so much more about this. 

The subfield of understanding neural network generalisation has come a long way in the past four years.

Thanks to Lawrence Chan for pointing the paper out to me. Thanks also to Kaarel Hänni and Dmitry Vaintrob for sparking the argument that got us all talking about this in the first place.

  1. ^

     See e.g. the first chapters of Jaynes for why.

Comment by Lucius Bushnaq (Lblack) on Attribution-based parameter decomposition · 2025-01-26T19:40:25.274Z · LW · GW
  • curious about your optimism regarding learned masks as attribution method - seems like the problem of learning mechanisms that don't correspond to model mechanisms is real for circuits (see Interp Bench) and would plausibly bite here too (though should be able to resolve this with benchmarks on downstream tasks once ADP is more mature)

We think this may not be a problem here, because the definition of parameter component 'activity' is very constraining. See Appendix section A.1. 

To count as inactive, it's not enough for components to not influence the output if you turn them off: every point on every possible monotonic trajectory between 'all components on' and 'only the components deemed active on' has to give the same output. If you (approximately) check for this condition, I think the function that picks the learned masks can kind of be as expressive as it likes, because the sparse forward pass can't rely on the mask to actually perform any useful computational labor.
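A rough sketch of the kind of check I mean (all names here are hypothetical; the real definition and its attribution-based approximation are in Appendix A.1 of the paper). It samples a few monotonic paths from "everything on" down to "only the active components on" and demands the output never changes along them:

```python
import numpy as np

def forward(weights: np.ndarray, x: np.ndarray) -> np.ndarray:
    return weights @ x  # stand-in for the real model's forward pass

def inactive_components_ok(components, active_mask, x, n_paths=20, n_steps=10, tol=1e-6):
    components = np.stack(components)  # (n_components, d_out, d_in), summing to the model weights
    base_out = forward(components[active_mask].sum(axis=0), x)  # only active components on
    rng = np.random.default_rng(0)
    for _ in range(n_paths):
        # random monotone-decreasing schedules taking inactive coefficients from 1 (on) to 0 (off)
        powers = rng.uniform(0.2, 5.0, size=components.shape[0])
        for t in np.linspace(0.0, 1.0, n_steps):
            coeffs = np.where(active_mask, 1.0, (1.0 - t) ** powers)
            out = forward(np.tensordot(coeffs, components, axes=1), x)
            if not np.allclose(out, base_out, atol=tol):
                return False  # an "inactive" component actually mattered somewhere on the path
    return True
```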

Conceptually, this is maybe one of the biggest differences between APD and something like, say, a transcoder or a crosscoder. It's why it doesn't seem to me like there'd be an analog to 'feature splitting' in APD. If you train a transcoder on a $d$-dimensional linear transformation, it will learn ever sparser approximations of this transformation the larger you make the transcoder dictionary, with no upper limit. If you train APD on a $d$-dimensional linear transformation, provided it's tuned right, I think it should learn a single $d$-dimensional component, regardless of how much larger than $d$ you make the component dictionary. Because if it tried to learn more components than that to get a sparser solution, it wouldn't be able to make the components sum to the original model weights anymore.

Despite this constraint on its structure, I think APD plausibly has all the expressiveness it needs, because even when there is an overcomplete basis of features in activation space, circuits in superposition math and information theory both suggest that you can't have an overcomplete basis of mechanisms in parameter space. So it seems to me that you can just demand that components must compose linearly, without that restricting their ability to represent the structure of the target model. And that demand then really limits the ability to sneak in any structure that wasn't originally in the target model.

Comment by Lucius Bushnaq (Lblack) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-24T21:06:02.482Z · LW · GW

Yes, I don't think this will let you get away with no specification bits in goal space at the top level like John's phrasing might suggest. But it may let you get away with much less precision? 

The things we care about aren't convergent instrumental goals for all terminal goals; the kitchen chef's constraints aren't doing that much to keep the kitchen liveable for cockroaches. But it seems to me that this maybe does gesture at a method to get away with pointing at a broad region of goal space instead of a near-pointlike region.

Comment by Lucius Bushnaq (Lblack) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-24T20:43:35.459Z · LW · GW

On first read the very rough idea of it sounds ... maybe right? It seems to perhaps actually centrally engage with the source of my mind's intuition that something like corrigibility ought to exist?

Wow. 

I'd love to get a spot check for flaws from a veteran of the MIRI corrigibility trenches.

Comment by Lucius Bushnaq (Lblack) on We don't want to post again "This might be the last AI Safety Camp" · 2025-01-24T12:54:19.632Z · LW · GW

It's disappointing that you wrote me off as a crank in one sentence. I expect more care, including that you also question your own assumptions.

I think it is very fair that you are disappointed. But I don't think I can take it back. I probably wouldn’t have introduced the word crank myself here. But I do think there’s a sense in which Oliver’s use of it was accurate, if maybe needlessly harsh. It does vaguely point at the right sort of cluster in thing-space.

It is true that we discussed this and you engaged with a lot of energy and in good faith. But I did not think Forrest's arguments were convincing at all, and I couldn't manage to communicate to you why I thought that. Eventually, I felt like I wasn't getting through to you, Quintin Pope also wasn't getting through to you, and continuing started to feel draining and pointless to me.

I emerged from this still liking you and respecting you, but thinking that you are wrong about this particular technical matter in a way that does seem like the kind of thing people imagine when they hear ‘crank’.

Comment by Lucius Bushnaq (Lblack) on Mechanisms too simple for humans to design · 2025-01-23T12:11:46.594Z · LW · GW

This. Though I don't think the interpretation algorithm is the source of most of the specification bits here.

To make an analogy with artificial neural networks, the human genome needs to contain a specification of the architecture, the training signal and update algorithm, and some basic circuitry that has to work from the start, like breathing. Everything else can be learned. 

I think the point maybe holds up slightly better for non-brain animal parts, but there's still a difference between storing a blueprint for what proteins cells are supposed to make and when, and storing the complete body plan of the resulting adult organism. The latter seems like a closer match to a Microsoft Word file.

If you took the adult body plans of lots of butterflies and separated all the information in an adult butterfly body plan into the bits common to all of the butterflies and the bits specifying the exact way things happened to grow in each individual butterfly, the former is more or less[1] what would need to fit into the butterfly genome, not the former plus the latter.

EDIT: Actually, maybe that'd be overcounting what the genome needs to store as well. How individual butterfly bodies grow might be determined by the environment, meaning some of their complexity would actually be specified by the environment, just as in the case of adult butterfly brains. Since this could be highly systematic (the relevant parts of the environment are nigh-identical for all butterflies), those bits would not be captured in our sample of butterfly variation. 

  1. ^

    Up to the bits of genome description length that vary between individual butterflies, which I'd guess would be small compared to both the bits specifying the butterfly species and the bits specifying details of the procedural generation outcome in individual butterflies?

Comment by Lucius Bushnaq (Lblack) on We don't want to post again "This might be the last AI Safety Camp" · 2025-01-23T10:03:49.189Z · LW · GW

I have heard from many people near AI Safety camp that they also have judged AI safety camp to have gotten worse as a result of this.

Hm. This does give me serious pause. I think I'm pretty close to the camps but I haven't heard this. If you'd be willing to share some of what's been relayed to you here or privately, that might change my decision. But what I've seen of the recent camps still just seemed very obviously good to me? 

I don't think Remmelt has gone more crank on the margin since I interacted with him in AISC6. I thought AISC6 was fantastic and everything I've heard about the camps since then still seemed pretty great.

I am somewhat worried about how it'll do without Linda. But I think there's a good shot Robert can fill the gap. I know he has good technical knowledge, and from what I hear integrating him as an organiser seems to have worked well. My edition didn't have Linda as organiser either.

I think I'd rather support this again than hope something even better will come along to replace it when it dies. Value is fragile. 

Comment by Lucius Bushnaq (Lblack) on The Case Against AI Control Research · 2025-01-23T07:54:42.267Z · LW · GW

That's not clear to me? Unless they have a plan to ensure future ASIs are aligned with them or meaningfully negotiate with them, ASIs seem just as likely to wipe out any earlier non-superhuman AGIs as they are to wipe out humanity. 

I can come up with specific scenarios where they'd be more interested in sabotaging safety research than capabilities research, as well as the reverse, but it's not evident to me that the combined probability mass of the former outweighs the latter or vice-versa. 

If someone has an argument for this I would be interested in reading it.

Comment by Lucius Bushnaq (Lblack) on Jesse Hoogland's Shortform · 2025-01-22T07:39:09.549Z · LW · GW

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process

This line caught my eye while reading. I don't know much about RL on LLMs, is this a common failure mode these days? If so, does anyone know what such reward hacks tend to look like in practice? 

Comment by Lucius Bushnaq (Lblack) on The Case Against AI Control Research · 2025-01-22T07:00:06.911Z · LW · GW

Yes, I am reinforcing John's point here. I think the case for control being a useful stepping stone for solving alignment of ASI seems to rely on a lot of conditionals that I think are unlikely to hold.

I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and included somehow getting a pause on ASI development of some kind. Then they would at least be actively trying to make one of the conditionals I consider most key for control substantially reducing doom actually hold.

I am additionally leery on AI control beyond my skepticism of its value in reducing doom because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror. The situation is sufficiently desperate that I am willing in principle to stomach some moral horror (unaligned ASI would likely kill any other AGIs we made before it as well), but not if it isn't even going to save the world.

Comment by Lucius Bushnaq (Lblack) on The Case Against AI Control Research · 2025-01-22T06:37:54.864Z · LW · GW

I am commenting more on your proposal to solve the "get useful research" problem here than the "get useful research out of AIs that are scheming against you" problem, though I do think this objection applies to both. I can see a world in which misalignment and scheming of early AGI is an actual blocker to their usefulness in research and other domains with sparse feedback in a very obvious and salient way. In that world, solving the "get useful research out of AIs that are scheming against you" problem ramps up the economic incentives for making smarter AIs even further.

I think that this is a pretty general counterargument to most game plans for alignment that don't include a step of "And then we get a pause on ASI development somehow" at some point in the plan.

Comment by Lucius Bushnaq (Lblack) on The Case Against AI Control Research · 2025-01-21T20:48:43.019Z · LW · GW

Note that I consider the problem of "get useful research out of AIs that are scheming against you" to be in scope for AI control research. We've mostly studied the "prevent scheming AIs from causing concentrated catastrophes" problem in the past, because it's simpler for various technical reasons. But we're hoping to do more on the "get useful research" problem in future (and Benton et al is an initial step in this direction). (I'm also excited for work on "get useful research out of AIs that aren't scheming against you"; I think that the appropriate techniques are somewhat different depending on whether the models are scheming or not, which is why I suggest studying them separately.)

What would you say to the objection that people will immediately try to use such techniques to speed up ASI research just as much as they will try to use them to speed up alignment research if not more? Meaning they wouldn't help close the gap between alignment research and ASI development and might even make it grow larger faster?

If we were not expecting to solve the alignment problem before ASI is developed in a world where nobody knows how to get very useful research out of AIs, why would techniques for getting useful research out of AIs speed up alignment research more than ASI research and close the gap? 

Comment by Lucius Bushnaq (Lblack) on The Case Against AI Control Research · 2025-01-21T20:22:45.884Z · LW · GW

And also on optimism that people are not using these controlled AIs that can come up with new research agendas and new ideas to speed up ASI research just as much. 

Without some kind of pause agreement, you are at best keeping the gap between alignment and ASI research from growing even larger, even faster, than it would in the counterfactual where capabilities researchers adopt AIs that 10x general science speed and alignment researchers don't. You are not actually closing the gap and making alignment research finish before ASI development when it counterfactually wouldn't have in a world where nobody used pre-ASI AIs to speed up any kind of research at all.

Comment by Lucius Bushnaq (Lblack) on What Is The Alignment Problem? · 2025-01-20T07:34:04.128Z · LW · GW

Many accounts of cognition are impossible (eg AIXI, VNM rationality, or anything utilizing utility functions, many AIT concepts), since they include the impossible step of considering all possible worlds. I think people normally consider this to be something like a “God’s eye view” of intelligence—ultimately correct, but incomputable—which can be projected down to us bounded creatures via approximation, but I think this is the wrong sort of in-principle to real-world bridge. Like, it seems to me that intelligence is fundamentally about ~“finding and exploiting abstractions,” which is something that having limited resources forces you to do. I.e., intelligence comes from the boundedness.

I used to think this, but now I don't quite think it anymore. The largest barrier I saw here was that the search had to prioritise simple hypotheses over complex ones. I had no idea how to do this. It seemed like it might require very novel search algorithms, such that models like AIXI were eliding basically all of the key structure of intelligence by not specifying this very special search process.

I no longer think this. Privileging simple hypotheses in the search seems way easier than I used to think. It is a feature so basic you can get it almost by accident. Many search setups we already know about do it by default. I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximated Solomonoff induction. Both in the sense that the training itself is kind of like approximated Solomonoff induction, and in the sense that the learned network algorithms may be making use of what is basically approximated Solomonoff induction in specialised hypothesis spaces to perform 'general pattern recognition' on their forward passes.

I still think “abstraction-based-cognition” is an important class of learned algorithms that we need to understand, but a picture of intelligence that doesn't talk about abstraction and just refers to concepts like AIXI no longer seems to me to be so incomplete as to not be saying much of value about the structure of intelligence at all.

Comment by Lucius Bushnaq (Lblack) on Linkpost: Rat Traps by Sheon Han in Asterisk Mag · 2025-01-16T22:33:24.877Z · LW · GW

Come to LessWrong Community Weekend in Europe, we still have 'weird' people around.

I don't know how we stack up to the pre-MoR crowd and I've never seen anyone who looked like they just got out of prison, but it's definitely not a bunch of people talking about normal politics or trying to make career connections.

Comment by Lucius Bushnaq (Lblack) on Cole Wyeth's Shortform · 2025-01-13T23:31:23.070Z · LW · GW

I used to, as a child. I did accept a lawful universe, but I thought my perception of free will was in tension with that, so that perception must be "an illusion". 

My mother kept trying to explain to me that there was no tension between these things, because it was correct that my mind made its own decisions rather than some outside force. I didn't understand what she was saying though. I thought she was just redefining 'free will' from a claim that human brains effectively had a magical ability to spontaneously ignore the laws of physics to a boring tautological claim that human decisions are made by humans rather than something else.

I changed my mind on this as a teenager. I don't quite remember how, it might have been the sequences or HPMOR again. I realised that my imagination had still been partially conceptualising the "laws of physics" as some sort of outside force, a set of strings pulling my atoms around, rather than as a predictive description of me and the universe. Saying "the laws of physics make my decisions, not me" made about as much sense as saying "my fingers didn't move, my hand did." That was what my mother had been trying to tell me.

Comment by Lucius Bushnaq (Lblack) on Fabien's Shortform · 2025-01-11T12:17:10.908Z · LW · GW

I think I mostly agree with this for current model organisms, but it seems plausible to me that well chosen studies conducted on future systems that are smarter in an agenty way, but not superintelligent, could yield useful insights that do generalise to superintelligent systems. 

Not directly generalise mind you, but maybe you could get something like "Repeated intervention studies show that the formation of coherent self-protecting values in these AIs works roughly like $X$, with properties $Y$. Combined with other things we know, this maybe suggests that the general math for how training signals relate to values is a bit like $Z$, and that suggests what we thought of as 'values' is a thing with type signature $T$."

And then maybe type signature $T$ is actually a useful building block for a framework which does generalise to superintelligence.

I am not particularly hopeful here. Even if we do get enough time to study agenty AIs that aren't superintelligent, I have an intuition that this sort of science could turn out to be pretty intractable for reasons similar to why psychology turned out to be pretty intractable. I do think it might be worth a try though.

Comment by Lucius Bushnaq (Lblack) on Activation space interpretability may be doomed · 2025-01-10T13:57:27.540Z · LW · GW

I agree issue 3 seems like a potential problem with methods that optimise for sparsity too much, but it doesn't seem that directly related to the main thesis? At least in the example you give, it should be possible in principle to notice that the space can be factored as a direct sum without having to look to future layers.

Sure, it's possible in principle to notice that there is a subspace that can be represented in factored form as a direct sum. But how do you tell whether you in fact ought to represent it in that way, rather than as composed features, to match the features of the model? Just because the compositional structure is present in the activations doesn't mean the model cares about it.
 

I don't think your post contains any knockdown arguments that this approach is doomed (do you agree?), but it is maybe suggestive.

I agree that it is not a knockdown argument. That is why the title isn't "Activation space interpretability is doomed." 

Comment by Lucius Bushnaq (Lblack) on Alexander Gietelink Oldenziel's Shortform · 2025-01-09T16:12:04.276Z · LW · GW

Hm, feels off to me. What privileges the original representation of the uncompressed file as the space in which locality matters? I can buy the idea that understanding is somehow related to a description that can separate the whole into parts, but why do the boundaries of those parts have to live in the representation of the file I'm handed? Why can't my explanation have parts in some abstract space instead? Lots of explanations of phenomena seem to work like that.

Comment by Lucius Bushnaq (Lblack) on Activation space interpretability may be doomed · 2025-01-09T15:39:21.484Z · LW · GW

Thank you. Yes, our claim isn't that SAEs only find composed features. Simple counterexample: Make a product space of two spaces with $n$ dictionary elements each, with an average of $m$ features active at a time in each factor space. Then the dictionary of $n^2$ composed features has a higher $L_0$ than the dictionary of $2n$ factored features, so a well-tuned SAE will learn the factored set of features. Note however that just because the dictionary of $2n$ factored features is sparser doesn't mean that those are the features of the model. The model could be using the $n^2$ composed features instead, because that's more convenient for the downstream computations somehow, or for some other reason.
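Spelled out, under the reading that a 'composed' dictionary has one element per co-occurring pair of factor features (the exact numbers don't matter, only the scaling):

$$|D_{\text{composed}}| = n^2,\quad L_0^{\text{composed}} \approx m^2, \qquad |D_{\text{factored}}| = 2n,\quad L_0^{\text{factored}} \approx 2m,$$

so for $m > 2$ the factored dictionary is the sparser one.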

Our claim is that an SAE trained on the activations at a single layer cannot tell whether the features of the model are in composed representation or factored representation, because the representation the model uses need not be the representation with the lowest $L_0$.

Comment by Lucius Bushnaq (Lblack) on Activation space interpretability may be doomed · 2025-01-09T14:04:10.308Z · LW · GW

The third term in that. Though it was in a somewhat different context related to the weight partitioning project mentioned in the last paragraph, not SAE training.

Yes, brittle in hyperparameters. It was also just very painful to train in general. I wouldn't straightforwardly extrapolate our experience to a standard SAE setup though, we had a lot of other things going on in that optimisation. 

Comment by Lucius Bushnaq (Lblack) on Activation space interpretability may be doomed · 2025-01-09T13:41:30.345Z · LW · GW

In my limited experience, attribution-patching style attributions tend to be a pain to optimise for sparsity. Very brittle. I agree it seems like a good thing to keep poking at though. 

Comment by Lucius Bushnaq (Lblack) on Activation space interpretability may be doomed · 2025-01-09T13:07:47.814Z · LW · GW

See the second to last paragraph. The gradients of downstream quantities with respect to the activations contain information and structure that is not part of the activations. So in principle, there could be a general way to analyse the right gradients in the right way on top of the activations to find the features of the model. See e.g. this for an attempt to combine PCAs of activations and gradients together.
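As a hedged illustration of the general idea (not the method from the linked work): one could, for instance, run a PCA/SVD on per-sample products of activations and gradients instead of on the activations alone.

```python
import numpy as np

def top_directions(acts: np.ndarray, grads: np.ndarray, k: int = 10) -> np.ndarray:
    """acts, grads: (n_samples, d_model) activations and d(output)/d(activation).

    Returns the top-k directions of an attribution-style PCA; this is only a
    sketch of how gradient information can be folded in, not the linked method.
    """
    attributions = acts * grads                       # elementwise, per sample
    attributions = attributions - attributions.mean(axis=0)
    _, _, vt = np.linalg.svd(attributions, full_matrices=False)
    return vt[:k]                                     # directions in activation space
```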

Comment by Lucius Bushnaq (Lblack) on Nina Panickssery's Shortform · 2025-01-07T09:11:14.309Z · LW · GW

Space has resources people don't own. The earth's mantle a couple thousand feet down potentially has resources people don't own. More to the point maybe, I don't think humans will be able to continue enforcing laws barring a hostile takeover in the way you seem to think.

Imagine we find out that aliens are headed for earth and will arrive in a few years. Just from the light emissions their probes and expanding civilisation give off, we can infer that they're obviously more technologically mature than us, probably already engineered themselves to be much smarter than us, and can basically do whatever they want with the atoms that make up our solar system and there's nothing we can do about it. We don't know what they want yet though. Maybe they're friendly?

I think guessing that the aliens will be friendly and share human morality to an extent seems like a pretty specific guess about their minds to be making, and is maybe false more likely than not. But guessing that they don't care about human preferences or well-being but do care about human legal structures, that they won't at all help you or gift you things, also won't disassemble you and your property for its atoms[1], but will try to buy atoms from those whom the atoms belong to according to human legal records, now that strikes me as a really really really specific guess to be making that is very likely false.

Superintelligent AGIs don't start out having giant space infrastructure, but qualitatively, I think they'd very quickly overshadow the collective power of humanity in a similar manner. They can see paths through the future to accomplish their goals much better than we can, routing around attempts by us to oppose them. The force that backs up our laws does not bind them. If you somehow managed to align them, they might want to follow some of our laws, because they care about them. But if someone managed to make them care about the legal system, they probably also managed to make them care about your well-being. Few humans, I think, would not at all care about other humans' welfare, but would care about the rule of law, when choosing what to align their AGI with. That's not a kind of value system that shows up in humans much.

So in that scenario, you don't need a legal claim to part of the pre-existing economy to benefit from the superintelligences' labours. They will gift some of their labour to you. Say the current value of the world economy is $X$, owned by humans roughly in proportion to how much money they have, and two years after superintelligence the value of the economy is some large multiple of $X$, with the overwhelming majority of the new surplus owned by aligned superintelligences[2] because they created most of that value, and a small remainder owned by rich humans who sold the superintelligence valuable resources and infrastructure to get the new industrial base started faster[3]. The superintelligence will then probably distribute its gains among humans according to some system that either treats conscious minds pretty equally, or follows the idiosyncratic preferences of the faction that aligned it, not according to how large a fraction of the total economy they used to own two years ago. So someone who started out with much more money than you two years ago doesn't have much more money in expectation now than you do.

  1. ^

    For its conserved quantum numbers really

  2. ^

    Or owned by whomever the superintelligences take orders from.

  3. ^

You can't just demand super high share percentages from the superintelligence in return for that startup capital. It's got all the resource owners in the world as potential bargaining partners to compete with you. And really, the only reason it wouldn't be steering the future into a deal where you get almost nothing, or just steal all your stuff, is to be nice to you. Decision theoretically, this is a handout with extra steps, not a negotiation between equals.

Comment by Lucius Bushnaq (Lblack) on quila's Shortform · 2025-01-07T07:29:47.593Z · LW · GW

Are you so sure that there is not a single interesting, a priori deducible fact about the superintelligent economy beyond "a singleton is in charge and everything is utopia"?

End points are easier to infer than trajectories, so sure, I think there's some reasonable guesses you can try to make about how the world might look after aligned superintelligence, should we get it somehow. 

For example, I think it's a decent bet that basically all minds would exist solely as uploads almost all of the time, because living directly in physical reality is astronomically wasteful and incredibly inconvenient. Turning on a physical lamp every time you want things to be brighter means wiggling about vast numbers of particles and wasting an ungodly amount of negentropy just for the sake of the teeny tiny number of bits about these vast numbers of particles that actually make it to your eyeballs, and the even smaller number of bits that actually end up influencing your mind state and making any difference to your perception of the world. All of the particles[1] in the lamp in my bedroom, the air its light shines through, and the walls it bounces off, could be so much more useful arranged in an ordered dance of logic gates where every single movement and spin flip is actually doing something of value. If we're not being so incredibly wasteful about it, maybe we can run whole civilisations for aeons on the energy and negentropy that currently make up my bedroom. What we're doing right now is like building an abacus out of supercomputers. I can't imagine any mature civilisation would stick with this.

It's not that I refuse to speculate about how a world after aligned superintelligence might look. I just didn't think that your guess was very plausible. I don't think pre-existing property rights or state structures would matter very much in such a world, even if we don't get what is effectively a singleton, which I doubt. If a group of superintelligent AGIs is effectively much more powerful and productive than the entire pre-existing economy, your legal share of that pre-existing economy is not a very relevant factor in your ability to steer the future and get what you want. The same goes for pre-existing military or legal power.

  1. ^

    Well, the conserved quantum numbers of my room, really.

Comment by Lucius Bushnaq (Lblack) on Reasons for and against working on technical AI safety at a frontier AI lab · 2025-01-05T22:04:30.277Z · LW · GW

I think safetywashing is a problem but from the perspective of an xrisky researcher it's not a big deal because for the audiences that matter, there are safetywashing things that are just way cheaper per unit of goodwill than xrisk alignment work - xrisk is kind of weird and unrelatable to anyone who doesn't already take it super seriously. I think people who work on non xrisk safety or distribution of benefits stuff should be more worried about this.

Weird it may be, but it is also somewhat influential among people who matter. The extended LW-sphere is not without influence and also contains good ML talent for the recruiting pool. I can easily see the case that places like Anthropic/Deepmind/OpenAI[1] benefit from giving the appearance of caring about xrisk and working on it.

  1. ^

     until recently

Comment by Lucius Bushnaq (Lblack) on Alexander Gietelink Oldenziel's Shortform · 2025-01-05T20:40:27.535Z · LW · GW

In AI alignment, the entropic force pulls toward sampling random minds from the vast space of possible minds, while the energetic force (from training) pulls toward minds that behave as we want. The actual outcome depends on which force is stronger.

The MIRI view, I'm pretty sure, is that the force of training does not pull towards minds that behave as we want, unless we know a lot of things about training design we currently don't.

MIRI is not talking about the randomness as in the spread of the training posterior as a function of random Bayesian sampling/NN initialization/SGD noise. The point isn't that training is inherently random. It can be a completely deterministic process without affecting the MIRI argument basically at all. If everything were a Bayesian sample from the posterior and there was a single basin of minimum local learning coefficient corresponding to equivalent implementations of a single algorithm, then I don't think this would by default make models any more likely to be aligned. The simplest fit to the training signal need not be an optimiser pointed at a terminal goal that maps to the training signal in a neat way humans can intuitively zero-shot without figuring out underlying laws. The issue isn't that the terminal goals are somehow fundamentally random, in the sense that there is no clear one-to-one mapping from the training setup to the terminal goals. It's that we early 21st century humans don't know the mapping from the training setup to the terminal goals. Having the terminal goals be completely determined by the training criteria does not help us if we don't know what training criteria map to terminal goals that we would like. It's a random draw from a vast space from our[1] perspective because we don't know what we're doing yet.

  1. ^

    Probability and randomness are in the mind, not the territory. MIRI is not alleging that neural network training is somehow bound to strongly couple to quantum noise.

Comment by Lucius Bushnaq (Lblack) on quila's Shortform · 2025-01-05T14:28:05.046Z · LW · GW

My guess is that it's just an effect of field growth. A lot of people coming in now weren't around when the consensus formed and don't agree with it or don't even know much about it.

Also, the consensus wasn't exactly uncontroversial on LW even way back in the day. Hanson's Ems inhabit a somewhat more recognisable world and economy that doesn't have superintelligence in it, and lots of skeptics used to be skeptical in the sense of thinking all of this AI stuff was way too speculative and wouldn't happen for hundreds of years if ever, so they made critiques of that form or just didn't engage in AI discussions at all. LW wasn't anywhere near this AI-centric when I started reading it around 2010. 

Comment by Lucius Bushnaq (Lblack) on Lucius Bushnaq's Shortform · 2025-01-03T04:47:06.008Z · LW · GW

That's what I meant by

If the symmetry only holds for a particular solution in some region of the loss landscape rather than being globally baked into the architecture, the $c$ value will still be conserved under gradient descent so long as we're inside that region.

...

One could maybe hold out hope that the conserved quantities/coordinates associated with degrees of freedom in a particular solution are sometimes more interesting, but I doubt it. For e.g. the degrees of freedom we talk about here, those invariants seem similar to the ones in the ReLU rescaling example above.

Dead neurons are a special case of 3.1.1 (low-dimensional activations) in that paper, bypassed neurons are a special case of 3.2 (synchronised non-linearities). Hidden polytopes are a mix of 3.2.2 (Jacobians spanning a low-dimensional subspace) and 3.1.1, I think. I'm a bit unsure which one because I'm not clear on what weight direction you're imagining varying when you talk about "moving the vertex". Since the first derivative of the function you're approximating doesn't actually change at this point, there's multiple ways you could do this.

Comment by Lucius Bushnaq (Lblack) on Lucius Bushnaq's Shortform · 2025-01-02T20:20:51.708Z · LW · GW

Minor detail, but this is false in practice because we are doing gradient descent with a non-zero learning rate, so there will be some diffusion between different hyperbolas in weight space as we take gradient steps of finite size.

See footnote 1.

Comment by Lucius Bushnaq (Lblack) on Lucius Bushnaq's Shortform · 2025-01-02T11:10:15.350Z · LW · GW

PSA: The conserved quantities associated with symmetries of neural network loss landscapes seem mostly boring.

If you’re like me, then after you heard that neural network loss landscapes have continuous symmetries, you thought: “Noether’s theorem says every continuous symmetry of the action corresponds to a conserved quantity, like how energy and momentum conservation are implied by translation symmetry and angular momentum conservation is implied by rotation symmetry. Similarly, if loss functions of neural networks can have continuous symmetries, these ought to be associated with quantities that stay conserved under gradient descent[1]!” 

This is true. But these conserved quantities don’t seem to be insightful the way energy and momentum in physics are. They basically turn out to just be a sort of coordinate basis for the directions along which the loss is flat. 

If our network has a symmetry such that there is an abstract coordinate $c$ along which we can vary the parameters without changing the loss, then the gradient with respect to that coordinate will be zero. So, whatever $c$ value we started with from random initialisation will be the value we stay at. Thus, the $c$ value is a "conserved quantity" under gradient descent associated with the symmetry. If the symmetry only holds for a particular solution in some region of the loss landscape rather than being globally baked into the architecture, the $c$ value will still be conserved under gradient descent so long as we're inside that region.

For example, let's look at a simple global symmetry: In a ReLU network, we can scale all the weights going into a neuron by some positive constant $\alpha$, and scale all the weights going out of the neuron by $1/\alpha$, without changing what the network is doing. So, if we have a neuron with one ingoing weight $w_1$ initialised to $1.0$ and one outgoing weight $w_2$ initialised to $1.0$, then the weight gradient in the direction $(w_1, -w_2) = (1.0, -1.0)$ of those two weights will be zero. Meaning our network will keep having $w_1^2 - w_2^2 = 0$ all throughout training. If we'd started from a different initialisation, like $(w_1, w_2) = (2.0, 0.5)$, we'd instead have zero weight gradient along the direction $(2.0, -0.5)$. So whatever hyperbola defined by $w_1^2 - w_2^2 = \text{const}$ we start on, we'll stay on it throughout training, assuming no fancy add-ons like weight decay.[2]
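Spelling out the conservation (a standard gradient-flow calculation, stated here under the idealisation in footnote 1):

$$\frac{d}{dt}\left(w_1^2 - w_2^2\right) = 2 w_1 \dot{w}_1 - 2 w_2 \dot{w}_2 = -2\left(w_1 \frac{\partial L}{\partial w_1} - w_2 \frac{\partial L}{\partial w_2}\right) = -2\,\nabla L \cdot (w_1, -w_2) = 0,$$

because $(w_1, -w_2)$ is exactly the tangent direction of the rescaling symmetry $(w_1, w_2) \mapsto (\alpha w_1, w_2/\alpha)$ at $\alpha = 1$, along which the loss is flat.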

If this doesn’t seem very insightful, I think that's because it isn’t. It might be useful to keep in mind for bookkeeping purposes if you’re trying to do some big calculation related to learning dynamics, but it doesn’t seem to yield much insight into anything to do with model internals on the conceptual level. One could maybe hold out hope that the conserved quantities/coordinates associated with degrees of freedom in a particular solution are sometimes more interesting, but I doubt it. For e.g. the degrees of freedom we talk about here, those invariants seem similar to the ones in the ReLU rescaling example above.

I’d guess this is because in physics, different starting values of conserved quantities often correspond to systems with very different behaviours, so they contain a lot of relevant information. A ball of gas with high energy and high angular momentum behaves very differently than a ball of gas with low energy and low angular momentum. Whereas adjacent neural network parameter configurations connected by some symmetry that get the same loss correspond precisely to models that behave basically the same way. 

I'm writing this up so next time someone asks me about investigating this kind of thing, I'll have something to link them to.

  1. ^

    Well, idealised gradient descent where learning rates are infinitesimally small, at least.

  2. ^

    See this paper which Micurie helpfully linked me. Also seems like a good resource in general if you find yourself needing to muck around with these invariants for some calculation.

Comment by Lucius Bushnaq (Lblack) on The Plan - 2024 Update · 2025-01-02T08:17:15.487Z · LW · GW

Do you have some concrete example of a technique for which this applies?

Comment by Lucius Bushnaq (Lblack) on The Plan - 2024 Update · 2025-01-02T07:34:19.891Z · LW · GW

An overly-cute but not completely wrong answer: because I care about whether AI kills all the physical humans, not whether something somewhere writes the string "kill all the humans". My terminal-ish values are mostly over the physical stuff.

I don't understand this argument. Interpretability is not currently trying to look at AIs to determine whether they will kill us. That's way too advanced for where we're at. We're more at the stage of asking questions like "Is the learning coefficient of a network composed of $n$ independent superposed circuits equal to the sum of the learning coefficients of the $n$ individual circuits, or greater?"
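In symbols (my notation, just to make the question concrete), writing $\lambda$ for the learning coefficient, the question is whether

$$\lambda\big(C_1, \dots, C_n \text{ in superposition}\big) \;=\; \sum_{i=1}^{n} \lambda(C_i) \quad\text{or}\quad \lambda\big(C_1, \dots, C_n \text{ in superposition}\big) \;>\; \sum_{i=1}^{n} \lambda(C_i).$$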

The laws of why and when neural networks learn to be modular, why and when they learn to do activation error-correction, what the locality of updating algorithms prevents them from learning, how they do inductive inference in-context, how their low-level algorithms correspond to something we would recognise as cognition, etc. are presumably fairly general and look more or less the same whether the network is trained on a domain that very directly relates to physical stuff or not.

Comment by Lucius Bushnaq (Lblack) on The Plan - 2024 Update · 2025-01-01T12:00:00.366Z · LW · GW

But why do we care more about statistical relationships between physical humans and dogs than statistical relationships between the word "human" and the word "dog" as characters on your screen? For most of what interp is currently trying to do, it seems to me that the underlying domain the network learns to model doesn't matter that much. I wouldn't want to make up some completely fake domain from scratch, because the data distribution of my fake domain might qualitatively differ from the sorts of domains a serious AI would need to model. And then maybe the network we get works genuinely differently than networks that model real domains, so our research results don't transfer. But image generation and internet token prediction both seem very entangled with the structure of the universe, so I'd expect them both to have the right sort of high level structure and yield results that mostly transfer. 

On the other hand, if I'm doing interp work on an image generator... I'm forced to start lower-level, so by the time I'm working with things like humans and dogs I've already understood a whole lot of stuff about the lower-level patterns which constitute humans and dogs

For this specific purpose, I agree with you that language models seem less suitable. And if this is what you're trying to tackle directly at the moment I can see why you would want to use domains like image generation and fluid simulation for that, rather than internet text prediction. But I think there's good attack angles on important problems in interp that don't route through investigating this sort of question as one of the first steps. 

Comment by Lucius Bushnaq (Lblack) on The Plan - 2024 Update · 2025-01-01T07:06:53.185Z · LW · GW

Excluding the trivial sense in which tokens are embedded in the physical world and therefore any pattern in tokens is a pattern in the physical world; that's not what we're talking about here.

I suppose my confusion might be related to this part. Why are tokens embedded in the physical world only in a "trivial" sense? I don't really see how the laws and heuristics of predicting internet text are in a different category from the laws of creating images of cars for the purposes we care about when doing interpretability. 

I guess you could tell a story where looking into a network that does internet next-token prediction might show you both the network's high-level concepts created over high-level statistical patterns of tokens, and human high-level concepts as the low-level tokens and words themselves, and an interpretability researcher who is not thinking carefully might risk confusing themselves by mixing the two up. But while that story may sound sort of plausible when described in the abstract like that, it doesn't actually ring true to me. For the kind of work most people are doing in interpretability right now, I can't come up with a concrete instantiation of this abstract failure mode class that I'd actually be concerned about. So, at the moment, I'm not paying this much mind compared to other constraints when picking what models to look at.

Does the above sound like I'm at least arguing with your thesis, or am I guessing wrong on what class of failure modes you are even worried about?

Comment by Lucius Bushnaq (Lblack) on The Plan - 2024 Update · 2024-12-31T20:03:49.063Z · LW · GW

And for that purpose, we want to start as “close to the metal” as possible. We definitely do not want our lowest-level data to be symbolic strings, which are themselves already high-level representations far removed from the environment we’re trying to understand.

I don't understand the intuition here. In what sense are patterns of pixels “closer to the metal” than patterns of tokens? Why can't our low-level data be bigrams, trigrams, questions, and Monty Python references instead of edges, curves, digits and dog fur? What's the difference?

Comment by Lucius Bushnaq (Lblack) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T17:59:35.741Z · LW · GW

(I have not maintained this list in many months, sorry.)

Comment by Lucius Bushnaq (Lblack) on What Have Been Your Most Valuable Casual Conversations At Conferences? · 2024-12-25T07:29:24.542Z · LW · GW

Some casual conversations with strangers that were high instrumental value:

At my first (online) LessWrong Community Weekend in 2020, I happened to chat with Linda Linsefors. That was my first conversation with anyone working in AI Safety. I’d read about the alignment problem for almost a decade at that point and thought it was the most important thing in the world, but I’d never seriously considered working on it. MIRI had made it pretty clear that the field only needed really exceptional theorists, and I didn’t think I was one of those. That conversation with Linda started the process of robbing me of my comfortable delusions on this front. What she said made it seem more like the field was pretty inadequate, and perfectly normal theoretical physicists could maybe help just by applying the standard science playbook for figuring out general laws in a new domain. Horrifying. I didn't really believe it yet, but this conversation was a factor in me trying out AI Safety Camp a bit over a year later.

At my first EAG, I talked to someone who was waiting for the actual event to begin along with me. This turned out to be Vivek Hebbar, who I'd never heard of before. We got to talking about inductive biases of neural networks. We kept chatting about this research area sporadically for a few weeks after the event. Eventually, Vivek called me to talk about the idea that would become this post. Thinking about that idea led to me understanding the connection between basin broadness and representation dimensionality in neural networks, which ultimately resulted in this research. It was probably the most valuable conversation I’ve had at any EAG so far, and it was unplanned.

At my second EAG, someone told me that an idea for comparing NN representations I’d been talking to them about already existed, and was called centred kernel alignment. I don’t quite remember how that conversation started, but I think it might have been at a speed-friending event.

My first morning in the MATS kitchen area in Berkeley, someone asked me if I’d heard about a thing called Singular Learning Theory. I had not. He went through his spiel on the whiteboard. He didn’t have the explanation down nearly as well back then, but it still very recognisably connected to how I’d been thinking about NN generalisation and basin broadness, so I kept an eye on the area.

Comment by Lucius Bushnaq (Lblack) on Habryka's Shortform Feed · 2024-12-21T07:24:32.264Z · LW · GW

I did have a pretty strong expectation of privacy for LW DMs. That was probably dumb of me.

This is not due to any explicit or implicit promise by the mods or the site interface I can recall. I think I was just automatically assuming that strong DM privacy would be a holy principle on a forum with respectable old-school internet culture around anonymity and privacy. This wasn’t really an explicitly considered belief. It just never occurred to me to question this. Just like I assume that doxxing is probably an offence that can result in an instant ban, even though I never actually checked the site guidelines on that.

The site is not responsible for my carelessness on this, but if there was an attention-grabbing box in the DM interface making it clear that mods do look at DMs and DM metadata under some circumstances that fall short of a serious criminal investigation or an apocalypse, I would have appreciated that.

Comment by Lucius Bushnaq (Lblack) on Dress Up For Secular Solstice · 2024-12-20T18:54:39.772Z · LW · GW

Single datapoint, but: I find outside restrictions on my appearance deeply unpleasant. I avoid basically all events and situations with a mandated dress code when this is at all feasible. So if a solstice has a dress code, I will not be attending it.

Comment by Lucius Bushnaq (Lblack) on Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models · 2024-12-16T22:00:11.379Z · LW · GW

Although in contrast to Ramesh et al. (2018) and my work, that paper only considers the Jacobian of a shallow rather than deep slice.

We also tried using the Jacobians between every layer and the final layer, instead of the Jacobians between adjacent layers. This is what we call "global interaction basis" in the paper. It didn't change the results much.
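To make the distinction concrete, here is a minimal sketch in PyTorch (not the actual code from either paper) of the two options on a toy MLP: Jacobians between adjacent layers, versus Jacobians from each layer's input straight to the final output. The model, layer sizes, and helper names are all made up for illustration.

```python
# Minimal sketch: adjacent-layer Jacobians vs. "every layer to the final layer"
# Jacobians for a toy MLP. Purely illustrative; not the paper's construction.
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 4)])
act = nn.ReLU()

def forward_from(layer_idx, x):
    """Run the toy network from the input of layer `layer_idx` to the final output."""
    for layer in layers[layer_idx:]:
        x = act(layer(x))
    return x

x = torch.randn(8)

# Activations at the input of each layer.
inputs = [x]
for layer in layers[:-1]:
    inputs.append(act(layer(inputs[-1])).detach())

# Adjacent-layer Jacobians: d(input of layer l+1) / d(input of layer l).
adjacent_jacobians = [
    jacobian(lambda z, l=l: act(layers[l](z)), inputs[l])
    for l in range(len(layers) - 1)
]

# "Global" Jacobians: d(final output) / d(input of layer l), for every layer l.
global_jacobians = [
    jacobian(lambda z, l=l: forward_from(l, z), inputs[l])
    for l in range(len(layers))
]

print([J.shape for J in adjacent_jacobians])  # [(8, 8), (8, 8)]
print([J.shape for J in global_jacobians])    # [(4, 8), (4, 8), (4, 8)]
```

As I understand the comment, the first list corresponds to the layer-by-layer setup and the second to the "global interaction basis" variant.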

Comment by Lucius Bushnaq (Lblack) on Frontier Models are Capable of In-context Scheming · 2024-12-07T12:22:19.197Z · LW · GW

Seems like some measure of evidence -- maybe large, maybe tiny -- that "We don't know how to give AI values, just to make them imitate values" is false?

I am pessimistic about loss signals getting 1-to-1 internalised as goals or desires in a way that is predictable to us with our current state of knowledge on intelligence and agency, and would indeed tentatively consider this observation a tiny positive update.

Comment by Lucius Bushnaq (Lblack) on leogao's Shortform · 2024-12-03T23:00:25.534Z · LW · GW

I do not find this to be the biggest value-contributor amongst my spontaneous conversations.

I don't have a good hypothesis for why spontaneous-ish conversations can end up being valuable to me so frequently. I have a vague intuition that it might be an expression of the same phenomenon that makes slack and playfulness in research and internet browsing very valuable for me.

Comment by Lucius Bushnaq (Lblack) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-12-01T09:54:42.337Z · LW · GW

The donation site said I should leave a comment here if I donate, so I'm doing that. Gave $200 for now. 

I was in Lighthaven for the ILIAD conference. It was an excellent space. The LessWrong forum feels like what some people in the 90s used to hope the internet would be.

Edit 23.12.2024: $400 more donated by me since the original message.

Comment by Lucius Bushnaq (Lblack) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T18:06:47.874Z · LW · GW

There currently doesn't really exist any good way for people who want to contribute to AI existential risk reduction to give money in a way that meaningfully gives them assistance in figuring out what things are good to fund. This is particularly sad since I think there is now a huge amount of interest from funders and philanthropists who want to somehow help with AI x-risk stuff, as progress in capabilities has made work in the space a lot more urgent, but the ecosystem is currently at a particular low-point in terms of trust and ability to direct that funding towards productive ends.

Really? What's the holdup here exactly? How is it still hard to give funders a decent up-to-date guide to the ecosystem, or a knowledgeable contact person, at this stage? For a workable budget version today, can't people just get a link to this and then contact orgs they're interested in?

Comment by Lucius Bushnaq (Lblack) on Lucius Bushnaq's Shortform · 2024-11-30T12:26:45.960Z · LW · GW

Two shovel-ready theory projects in interpretability.

Most scientific work isn't "shovel-ready." It's difficult to generate well-defined, self-contained projects where the path forward is clear without extensive background context. In my experience, this is extra true of theory work, where most of the labour is often about figuring out what the project should actually be, because the requirements are unclear or confused.

Nevertheless, I currently have two theory projects related to computation in superposition in my backlog that I think are valuable and that maybe have reasonably clear execution paths. Someone just needs to crunch a bunch of math and write up the results. 

Impact story sketch: We now have some very basic theory for how computation in superposition could work[1]. But I think there’s more to do there that could help our understanding. If superposition happens in real models, better theoretical grounding could help us understand what we’re seeing in these models, and how to un-superpose them back into sensible individual circuits and mechanisms we can analyse one at a time. With sufficient understanding, we might even gain some insight into how circuits develop during training.

This post has a framework for compressing lots of small residual MLPs into one big residual MLP. Both projects are about improving this framework.

1) I think the framework can probably be pretty straightforwardly extended to transformers. This would help make the theory more directly applicable to language models. The key thing to show there is how to do superposition in attention. I suspect you can more or less use the same construction the post uses, with individual attention heads now playing the role of neurons. I put maybe two work days into trying this before giving it up in favour of other projects. I didn’t run into any notable barriers; the calculations just proved to be more extensive than I’d hoped they’d be.

2) Improve error terms for circuits in superposition at finite width. The construction in this post is not optimised to be efficient at finite network width. Maybe the lowest-hanging fruit for improving it is changing the hyperparameter p, the probability with which we connect a circuit to a set of neurons in the big network. In the post, we set p as a particular function of the MLP width of the big network and of the minimum neuron count per layer the circuit would need without superposition. That choice was pretty arbitrary; we just picked it because it made the proof easier. Recently, Apollo played around a bit with superposing very basic one-feature circuits into a real network, and IIRC a range of p values seemed to work ok. Getting tighter bounds on the error terms as a function of p that are useful at finite width would be helpful here. Then we could better predict how many circuits networks can superpose in real life as a function of their parameter count. If I were tackling this project, I might start by just trying really hard to get a better error formula directly for a while. Just crunch the combinatorics. If that fails, I’d maybe switch to playing more with various choices of p in small toy networks to develop intuition (a rough sketch of what I mean is below). Maybe plot some scaling laws of performance with p at various network widths in 1-3 very simple settings. Then try to guess a formula from those curves and try to prove it’s correct.
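To gesture at what such toy experiments might look like: below is a crude sketch (my own illustration, not code from the post or from Apollo's experiments) that samples random circuit-to-neuron assignments for a few values of p and a few big-network widths, then reports how many neurons each circuit gets and how much pairs of circuits overlap. The overlap statistic is a stand-in for the kind of interference term the error bounds would need to control, not the post's actual error formula.

```python
# Crude toy experiment: sample random circuit-to-neuron assignments for several
# connection probabilities p and big-network widths D, and track simple
# overlap statistics between circuits. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_circuits = 100            # number of small circuits to superpose
widths = [500, 1000, 2000]  # big-network MLP widths D to compare
ps = [0.01, 0.02, 0.05, 0.1, 0.2]

for D in widths:
    for p in ps:
        # connections[c, j] = True iff circuit c uses neuron j of the big network.
        connections = rng.random((n_circuits, D)) < p
        neurons_per_circuit = connections.sum(axis=1).mean()

        # Average number of neurons shared by a pair of distinct circuits.
        overlap = connections.astype(float) @ connections.astype(float).T
        np.fill_diagonal(overlap, 0.0)
        mean_pairwise_overlap = overlap.sum() / (n_circuits * (n_circuits - 1))

        print(f"D={D:5d} p={p:5.2f}  "
              f"neurons/circuit≈{neurons_per_circuit:7.1f}  "
              f"shared neurons/pair≈{mean_pairwise_overlap:7.1f}")
```

Plotting statistics like these against p and D is the sort of thing I have in mind as a starting point for guessing, and then trying to prove, how the true error terms scale.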

Another very valuable project is of course to try training models to do computation in superposition instead of hard coding it. But Stefan mentioned that one already.
 

  1. ^

    (1) Boolean computations in superposition LW post. (2) Paper version of the LW post, with more worked out but some of the fun stuff removed. (3) Some proofs about information-theoretic limits of comp-sup. (4) General circuits in superposition LW post. If I missed something, a link would be appreciated.

Comment by Lucius Bushnaq (Lblack) on StefanHex's Shortform · 2024-11-20T14:18:25.549Z · LW · GW

Agreed. I do value methods being architecture independent, but mostly just because of this: 

and maybe a sign that a method is principled

At scale, different architectures trained on the same data seem to converge to learning similar algorithms to some extent. I care about decomposing and understanding these algorithms, independent of the architecture they happen to be implemented on. If a mech interp method is formulated in a mostly architecture independent manner, I take that as a weakly promising sign that it's actually finding the structure of the learned algorithm, instead of structure related to the implementation on one particular architecture.