Posts

Charbel-Raphaël and Lucius discuss Interpretability 2023-10-30T05:50:34.589Z
Announcing Apollo Research 2023-05-30T16:17:19.767Z
Basin broadness depends on the size and number of orthogonal features 2022-08-27T17:29:32.508Z
What Is The True Name of Modularity? 2022-07-01T14:55:12.446Z
Ten experiments in modularity, which we'd like you to run! 2022-06-16T09:17:28.955Z
Project Intro: Selection Theorems for Modularity 2022-04-04T12:59:19.321Z
Theories of Modularity in the Biological Literature 2022-04-04T12:48:41.834Z
Welcome to the SSC Dublin Meetup 2020-07-30T18:56:36.627Z

Comments

Comment by Lucius Bushnaq (Lblack) on How to safely use an optimizer · 2024-03-28T23:37:40.645Z · LW · GW

Typo fixed, thanks.

Comment by Lucius Bushnaq (Lblack) on How to safely use an optimizer · 2024-03-28T23:07:35.033Z · LW · GW

Also, one of my points was that even a tiny trace of malicious optimization here can still have large effects because there are so many neutral options which are convergently unsafe, so with such a high base rate, even a bit of optimization can result in a large absolute increase in failure
 

Your example has it be an important bit though. What database to use. Not a random bit. If I'm getting this right, that would correspond to far more than one bit of adversarial optimisation permitted for the oracle in this setup. 

One bit of optimisation doesn't mean the oracle gets to select one bit of its choice in the string to flip; it means it gets to select one of two strings[1].

  1. ^

    Plus the empty string for not answering.

Comment by Lucius Bushnaq (Lblack) on How to safely use an optimizer · 2024-03-28T18:07:19.862Z · LW · GW

First thought: The Oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.

Maybe we can stop that by scrapping every Oracle that doesn't answer and training a new one with presumably new goals? Or would the newly trained Oracles just cooperate with the former dead ones in one long-term plan to break out, take control, and reward all the dead Oracles created on the way with utility?

Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they've been satisficed? Are they the kind of problems where solving them can save the world? 

It feels to me like the answer is 'yes'.  A lot of the core research that would allow e.g. for brain augmentation seems like it would be in that category. But my inner John Wentworth sim is looking kind of sceptical.

 

  1. ^

It also gets to choose the timing of its answer, but I assume we are not being idiots about that and are setting the output channel to always deliver results after a set time, no more and no less.

Comment by Lucius Bushnaq (Lblack) on Some costs of superposition · 2024-03-10T21:27:54.551Z · LW · GW

I think that factor may be in there because JL is putting an upper bound on the interference, rather than describing the typical interference of two features. As you increase the number of features, it becomes more difficult to choose feature embeddings such that no feature has high interference with any other feature.

So it's not really the 'typical' noise between any two given features, but it might be the relevant bound for the noise anyway? Not sure right now which one matters more for practical purposes.
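For concreteness, here is a quick numerical check of the distinction I mean, written by me rather than taken from the post: the typical interference between two random feature directions stays around 1/√d, while the worst-case interference over all pairs (which is what a JL-style bound has to control) creeps up as you add more features.

```python
# Minimal sketch (my own, not from the post): typical vs. worst-case interference
# between random unit feature directions in d dimensions, as the feature count grows.
import numpy as np

rng = np.random.default_rng(0)
d = 256
for n_features in [64, 256, 1024, 4096]:
    F = rng.normal(size=(n_features, d))
    F /= np.linalg.norm(F, axis=1, keepdims=True)   # unit-norm feature directions

    overlaps = np.abs(F @ F.T)
    np.fill_diagonal(overlaps, 0.0)

    typical = overlaps.mean()   # stays around 1/sqrt(d), regardless of feature count
    worst = overlaps.max()      # grows slowly with the number of features
    print(f"{n_features:5d} features: typical interference {typical:.3f}, worst {worst:.3f}")
```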

Comment by Lucius Bushnaq (Lblack) on story-based decision-making · 2024-03-05T13:46:43.543Z · LW · GW

How does that make you feel about the chances of the rebels destroying the Death Star? Do you think that the competent planning being displayed is a good sign? According to movie logic, it's a really bad sign.

Even in the realm of movie logic, I always thought the lack of backup plans was supposed to signal how unlikely the operation is to work, so as to create at least some instinctive tension in the viewer when they know perfectly well that this isn't the kind of movie that realistically ends with the Death Star blowing everyone up. In fact, these scenes usually have characters directly stating how nigh-impossible the mission is.

To the extent that the presence of backup plans makes me worried, it's because so many movies have pulled this cheap trick that my brain now associates the presence of backup plans with the more uncommon kind of story that attempts to work a little like real life, so things won't just magically work out and the Death Star really might blow everyone up.

Comment by Lucius Bushnaq (Lblack) on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-02-28T15:47:17.983Z · LW · GW

I feel like 'LeastWrong' implies a focus on posts judged highly accurate or predictive in hindsight, when in reality the curation process seems to weigh originality, depth and general importance a lot as well, with posts regarded by the community as 'big if true' often being held in high regard.

Comment by Lucius Bushnaq (Lblack) on The Hidden Complexity of Wishes · 2024-02-23T22:10:08.109Z · LW · GW

I figured the probability adjustments the pump was making were modifying Everett branch amplitude ratios. Not probabilities as in reasoning tools to deal with incomplete knowledge of the world and logical uncertainty that tiny human brains use to predict how this situation might go based on looking at past 'base rates'. It's unclear to me how you could make the latter concept of an outcome pump a coherent thing at all. The former, on the other hand, seems like the natural outcome of the time machine setup described. If you turn back time when the branch doesn't have the outcome you like, only branches with the outcome you like will remain.

I can even make up a physically realisable model of an outcome pump that acts roughly like the one described in the story without using time travel at all. You just need a bunch of high quality sensors to take in data, an AI that judges from the observed data whether the condition set is satisfied, a tiny quantum random noise generator to respect the probability orderings desired, and a false vacuum bomb, which triggers immediately if the AI decides that the condition does not seem to be satisfied. The bomb works by causing a local decay of the metastable[1] electroweak vacuum. This is a highly energetic, self-sustaining process once it gets going, and spreads at the speed of light. Effectively destroying the entire future light-cone, probably not even leaving the possibility for atoms and molecules to ever form again in that volume of space.[2]

So when the AI triggers the bomb or turns back time, the amplitude of earth in that branch basically disappears. Leaving the users of the device to experience only the branches in which the improbable thing they want to have happen happens.

And causing a burning building with a gas supply in it to blow up strikes me as something you can maybe do with a lot less random quantum noise than making your mother phase through the building. Firefighter brains are maybe comparatively easy to steer with quantum noise as well, but that only works if there are any physically nearby enough to reach the building in time to save your mother at the moment the pump is activated. 

This is also why the pump has a limit on how improbable an event it can make happen. If the event has an amplitude of roughly the same size as the amplitude for the pump's sensors reporting bad data or otherwise causing the AI to make the wrong call, the pump will start being unreliable. If the event's amplitude is much lower than the amplitude for the pump malfunctioning, it basically can't do the job at all.

  1. ^

    In real life, it was an open question whether our local electroweak vacuum is in a metastable state last I checked, with the latest experimental evidence I'm aware of, from a couple of years ago, tentatively (ca. 3 sigma I think?) pointing to yes, though that calculation is probably assuming Standard Model physics, the applicability of which people can argue to hell and back. But it sure seems like a pretty self-consistent way for the world to be, so we can just declare that the fictional universe works like that. Substitute strangelets or any other conjectured instant-earth-annihilation method of your choice if you like.

  2. ^

    Because the mass terms for the elementary quantum fields would look all different now. Unclear to me that the bound structures of hadronic matter we are familiar with would still be a thing. 

Comment by Lucius Bushnaq (Lblack) on Toward A Mathematical Framework for Computation in Superposition · 2024-02-09T14:16:39.238Z · LW · GW

Thinking the example through a bit further: In a ReLU layer, features are all confined to the positive quadrant. So superposed features computed in a ReLU layer all have positive inner product. So if I send the output of one ReLU layer implementing AND gates in superposition directly to another ReLU layer implementing further ANDs on a subset of the outputs of that previous layer[1], the assumption that input directions are equally likely to have positive and negative inner products is not satisfied.

Maybe you can fix this with bias offsets somehow? Not sure at the moment. But as currently written, it doesn't seem like I can use the outputs of one layer performing a subset of ANDs as the inputs of another layer performing another subset of ANDs.

EDIT: Talked it through with Jake. A bias offset can help, but it currently looks to us like you still end up with AND gates that share a variable systematically having a positive sign in their inner product. Which might make it difficult to implement a valid general recipe for multi-step computation if you try to work out the details.
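A trivial numerical illustration of the sign problem, mine rather than anything from the post: whatever goes into a ReLU layer, the outputs land in the positive orthant, so their pairwise inner products are never negative.

```python
# Sketch (my own): ReLU layer outputs all lie in the positive orthant, so any two
# output directions have a non-negative inner product. Reusing them as inputs to a
# second layer breaks the "random sign" assumption on interference terms.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_features = 128, 128, 512

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
X = rng.normal(size=(n_features, d_in))      # feature directions going into the layer
Y = np.maximum(X @ W.T, 0.0)                 # ReLU outputs, all entries >= 0

G = Y @ Y.T                                  # pairwise inner products of the outputs
print("fraction of negative inner products:", (G < 0).mean())   # prints 0.0
```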

  1. ^

    A very central use case for a superposed boolean general computer. Otherwise you don't actually get to implement any serial computation.

Comment by Lucius Bushnaq (Lblack) on On the Debate Between Jezos and Leahy · 2024-02-06T23:43:43.251Z · LW · GW

Noting out loud that I'm starting to feel a bit worried about the culture-war-like tribal conflict dynamic between AIS/LW/EA and e/acc circles that seems to be slowly setting in on our end as well, centered on Twitter but also present to an extent on other sites and in real life. What concerns me most here is the potential damage, should this intensify, to our own community's sanity and possibly to future AI policy.

People have tried to suck the rationalist diaspora into culture-war-like debates before, and I think the diaspora has done a reasonable enough job of surviving intact by not taking the bait much. But on this topic, many of us actually really care about both the content of the debate itself and what people outside the community think of it, and I fear it is making us more vulnerable to the algorithms' attempts to infect us than we have been in the past.

I think us going out of our way to keep standards high in memetic public spaces might help some in keeping our own sanity from deteriorating. If we engage on Twitter, maybe we don't just refrain from lowering the level of debate and using arguments as soldiers, but also adopt a policy of actively commenting to correct the record when people of any affiliation make locally invalid arguments against our opposition, provided we would counterfactually also correct the record were such a locally invalid argument directed against us or our in-group. The behaviour of high-status and highly visible (on Twitter/YouTube) community members might end up having a particularly high impact on the eventual outcome here.

Comment by Lucius Bushnaq (Lblack) on Toward A Mathematical Framework for Computation in Superposition · 2024-02-06T12:54:07.497Z · LW · GW

Having digested this a bit more, I've got a question regarding the noise terms, particularly for section 1.3 that deals with constructing general programs over sparse superposed variables.

Unfortunately, since the feature vectors are random vectors, their inner product will have a typical size of about 1/√d. So, on an input which has no features connected to a given neuron, the preactivation for that neuron will not be zero: it will be a sum of these interference terms, one for each feature that is connected to the neuron. Since the interference terms are uncorrelated and mean zero, they start to cause neurons to fire incorrectly once enough neurons are connected to each neuron. Since each feature is connected to each neuron with some fixed probability, this means neurons start to misfire beyond a certain connection density[13].

It seems to me that the assumption of uncorrelated errors here is rather load-bearing. If you don't get uncorrelated errors over the inputs you actually care about, you are forced to scale back the number of features you connect to every neuron, correct? And the same holds for the construction right after this one, and probably most of the other constructions shown here?

And with fewer connected features per neuron, you can only compute correspondingly fewer arbitrary AND gates per layer, correct?

Now, the reason these errors are 'uncorrelated' is that the features were embedded as random vectors in our layer space. In other words, the distribution over which they are uncorrelated is the distribution of feature embeddings and of the sets of neurons chosen to connect to particular features. So for any given network, we draw from this distribution only once, when the weights of the network are set, and then we are locked into it.

So this noise will affect particular sets of inputs strongly, systematically, in the same direction every time. If I divide the set of features into two sets, where features in each half are embedded along directions that have a positive inner product with each other[1], I can't connect more than a limited number from the same half to the same neuron without making it misfire, right? So if I want to implement a layer that performs ANDs on exactly those features that happen to be embedded within the same set, I can't really do that. Now, for any given embedding, that's maybe only some particular sets of features which might not have much significance to each other. But then the embedding directions of features in later layers depend on what was computed and how in the earlier layers, and the limitations on what I can wire together apply every time. 

I am a bit worried that this and similar assumptions about stochasticity here might turn out to prevent you from wiring together the features you need to construct arbitrary programs in superposition, with 'noise' from multiple layers turning out to systematically interact in exactly such a way as to prevent you from computing too much general stuff. Not because I see a gears-level way this could happen right now, but because I think rounding off things to 'noise' that are actually systematic is one of these ways an exciting new theory can often go wrong and see a structure that isn't there, because you are not tracking the parts of the system that you have labeled noise and seeing how the systematics of their interactions constrain the rest of the system. 

Like making what seems like a blueprint for a perpetual motion machine because you're neglecting to model some small interactions with the environment that seem like they ought not to affect the energy balance on average, missing how the energy losses/gains in these interactions are correlated with each other such that a gain at one step immediately implies a loss in another.

Aside from looking at error propagation more, maybe a way to resolve this might be to switch over to thinking about one particular set of weights instead of reasoning about the distribution the weights are drawn from? 
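To make the worry concrete, here is a toy calculation of mine (not from the post): with random-sign interference, the total noise on a neuron grows like the square root of the number of terms, while with systematically same-sign interference it grows linearly, so far fewer features can share a neuron before it misfires.

```python
# Toy comparison (my own sketch): accumulation of k interference terms of typical size
# 1/sqrt(d), with random signs vs. systematically positive signs.
import numpy as np

rng = np.random.default_rng(0)
d = 512
eps = 1.0 / np.sqrt(d)      # typical size of a single interference term
trials = 2000

for k in [8, 32, 128, 512]:
    signs = rng.choice([-1.0, 1.0], size=(trials, k))
    random_sign_noise = np.abs((signs * eps).sum(axis=1)).mean()   # ~ eps * sqrt(k)
    same_sign_noise = k * eps                                      # ~ eps * k
    print(f"k={k:4d}: random-sign noise ~ {random_sign_noise:.2f}, "
          f"same-sign noise ~ {same_sign_noise:.2f}")
```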

  1. ^

    E.g. pick some hyperplanes and declare everything on one side of all of them to be the first set.

Comment by Lucius Bushnaq (Lblack) on Welcome to the SSC Dublin Meetup · 2024-02-05T18:20:47.016Z · LW · GW

Update February 2024: I left Ireland over a year ago, and the group is probably dead now, unfortunately. There's still an EA group around, which as of this writing seems quite active.

Comment by Lucius Bushnaq (Lblack) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T15:15:15.433Z · LW · GW

If the SAEs are not full-distribution competitive, I don't really trust that the features they're seeing are actually the variables being computed on, in the sense of reflecting the true mechanistic structure of the learned network algorithm, and that the explanations they offer are correct[1]. If I pick a small enough sub-distribution, I can pretty much always get perfect reconstruction no matter what kind of probe I use, because e.g. measured over a single token the network layers will have representation rank 1, and the entire network can be written as a rank-1 linear transform. So I can declare the activation vector at each layer to be the active "feature", use the single-entry linear maps between SAEs to "explain" how features between layers map to each other, and be done. Those explanations will of course be nonsense and not at all extrapolate out of distribution. I can't use them to make a causal model that accurately reproduces the network's behavior or some aspect of it when dealing with a new prompt.

We don't train SAEs on literally single tokens, but I would be worried about the qualitative problem persisting. The network itself doesn't have a million different algorithms to perform a million different narrow subtasks. It has a finite description length. It's got to be using a smaller set of general algorithms that handle all of these different subtasks, at least to some extent. Likely more so for more powerful and general networks. If our "explanations" of the network then model it in terms of different sets of features and circuits for different narrow subtasks that don't fit together coherently to give a single good reconstruction loss over the whole distribution, that seems like a sign that our SAE layer activations didn't actually capture the general algorithms in the network. Thus, predictions about network behaviour made on the basis of inspecting causal relationships between these SAE activations might not be at all reliable, especially predictions about behaviours like instrumental deception which might be very mechanistically related to how the network does well on cross-domain generalisation.
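As a trivial illustration of the rank-1 point (my own toy example, not anything from the post): restricted to a single input, the activation at a layer is just one vector, so a one-atom "dictionary" already reconstructs it perfectly.

```python
# Sketch (my own): on a single data point, a dictionary with one atom gets perfect
# reconstruction, so reconstruction quality on a narrow sub-distribution says little.
import numpy as np

rng = np.random.default_rng(0)
act = rng.normal(size=768)               # the activation vector for one token

atom = act / np.linalg.norm(act)         # a one-element "dictionary"
feature_activation = atom @ act          # the single "feature" coefficient
reconstruction = feature_activation * atom

print("reconstruction error:", np.linalg.norm(act - reconstruction))  # ~1e-13
```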

  1. ^

    As in, that seems like a minimum requirement for the SAEs to fulfil. Not that this would be enough to make me trust predictions about generalisation based on stories about SAE activations.

Comment by Lucius Bushnaq (Lblack) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T10:16:05.485Z · LW · GW

Our reconstruction scores were pretty good. We found GPT2 small achieves a cross entropy loss of about 3.3, and with reconstructed activations in place of the original activation, the CE Log Loss stays below 3.6. 

Unless my memory is screwing up the scale here, 0.3 CE Loss increase seems quite substantial? A 0.3 CE loss increase on the pile is roughly the difference between Pythia 410M and Pythia 2.8B. And do I see it right that this is the CE increase maximum for adding in one SAE, rather than all of them at the same time? So unless there is some very kind correlation in these errors where every SAE is failing to reconstruct roughly the same variance, and that variance at early layers is not used to compute the variance SAEs at later layers are capturing, the errors would add up? Possibly even worse than linearly? What CE loss do you get then?

Have you tried talking to the patched models a bit and compared to what the original model sounds like? Any discernible systematic differences in where that CE increase is changing the answers?
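To spell out the experiment I'm asking about, here is roughly what I have in mind as a sketch using a TransformerLens-style interface; the DummySAE class and the saes dict are placeholders I made up, not the post's actual code.

```python
# Rough sketch of the "all SAEs at once" number I'm asking about. DummySAE is a
# placeholder identity map; swap in the post's actual trained SAEs to get the real number.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")

class DummySAE:
    """Stand-in for a trained sparse autoencoder with encode/decode methods."""
    def encode(self, x):
        return x
    def decode(self, z):
        return z

saes = {f"blocks.{i}.hook_resid_pre": DummySAE() for i in range(model.cfg.n_layers)}

def splice_in_reconstruction(resid, hook):
    sae = saes[hook.name]
    return sae.decode(sae.encode(resid))   # replace activations with their reconstructions

clean_loss = model(tokens, return_type="loss")
patched_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(name, splice_in_reconstruction) for name in saes],
)
print("CE increase with all layers patched at once:", (patched_loss - clean_loss).item())
```

With the identity placeholder the difference is of course zero; the interesting number is what it becomes with the real SAEs spliced into every layer simultaneously.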

Comment by Lucius Bushnaq (Lblack) on Making every researcher seek grants is a broken model · 2024-01-26T21:10:25.661Z · LW · GW

Can someone destroy my hope early by giving me the Molochian reasons why this change hasn't been made already and never will be?

Comment by Lucius Bushnaq (Lblack) on This might be the last AI Safety Camp · 2024-01-25T11:47:44.568Z · LW · GW

MATS has steadily increased in quality over the past two years, and is now more prestigious than AISC. We also have Astra, and people who go directly to residencies at OpenAI, Anthropic, etc. One should expect that AISC doesn't attract the best talent.

  • If so, AISC might not make efficient use of mentor / PI time, which is a key goal of MATS and one of the reasons it's been successful.

AISC isn't trying to do what MATS does. Anecdotal, but for me, MATS could not have replaced AISC (spring 2022 iteration). It's also, as I understand it, trying to have a structure that works without established mentors, since that's one of the large bottlenecks constraining the training pipeline.

Also, did most of the past camps ever have lots of established mentors? I thought it was just the one in 2022 that had a lot? So whatever factors made all the past AISCs work and have participants sing their praises could just still be there.

Why does the founder, Remmelt Ellen, keep posting things described as "content-free stream of consciousness", "the entire scientific community would probably consider this writing to be crankery", or so obviously flawed it gets -46 karma? This seems like a concern especially given the philosophical/conceptual focus of AISC projects, and the historical difficulty in choosing useful AI alignment directions without empirical grounding.

He was posting cranky technical stuff during my camp iteration too. The program was still fantastic. So whatever they are doing to make this work seems able to function despite his crankery. With a five year track record, I'm not too worried about this factor.

All but 2 of the papers listed on Manifund as coming from AISC projects are from 2021 or earlier.

In the first link at least, there are only eight papers listed in total though.  With the first camp being in 2018, it doesn't really seem like the rate dropped much? So to the extent you believe your colleagues that the camp used to be good, I don't think the publication record is much evidence that it isn't anymore. Paper production apparently just does not track the effectiveness of the program much. Which doesn't surprise me; I don't think the rate of paper production tracks the quality of AIS research orgs much either.

The impact assessment was commissioned by AISC, not independent. They also use the number of AI alignment researchers created as an important metric. But impact is heavy-tailed, so the better metric is value of total research produced. Because there seems to be little direct research, to estimate the impact we should count the research that AISC alums from the last two years go on to produce. Unfortunately I don't have time to do this.

Agreed on the metric being not great, and that an independently commissioned report would be better evidence (though who would have commissioned it?). But ultimately, most of what this report is apparently doing is just asking a bunch of AISC alumni what they thought of the camp and what they're up to these days.  And then noticing that these alumni often really liked it and have apparently gone on to form a significant fraction of the ecosystem. And I don't think they even caught everyone. IIRC our AISC follow-up LTFF grant wasn't part of the spreadsheets until I wrote Remmelt that it wasn't there. 

I am not surprised by this. Like you, my experience is that most of my current colleagues who were part of AISC tell me it was really good. The survey is just asking around and noticing the same. 
 

I was the private donor who gave €5K. My reaction to hearing that AISC was not getting funding was that this seemed insane. The iteration I was in two years ago was fantastic for me, and the research project I got started on there is basically still continuing at Apollo now. Without AISC, I think there's a good chance I would never have become an AI notkilleveryoneism researcher. 

It feels like a very large number of people I meet in AIS today got their start in one AISC iteration or another, and many of them seem to sing its praises. I think 4/6 people currently on our interp team were part of one of the camps. I am not aware of any other current training program that seems to me like it would realistically replace AISC's role, though I admittedly haven't looked into all of them. I haven't paid much attention to the iteration that happened in 2023, but I happen to know a bunch of people who are in the current iteration and think trying to run a training program for them is an obviously good idea. 

I think MATS and co. are still way too tiny to serve all the ecosystem's needs, and under those circumstances, shutting down a training program with an excellent five-year track record seems like an even more terrible idea than usual. On top of that, the research lead structure they've been trying out for this camp and the last one seems to me like it might have some chance of being actually scalable. I haven't spent much time looking at the projects for the current iteration yet, but from very brief surface exposure they didn't seem any worse on average than the ones in my iteration. Which impressed and surprised me, because these projects were not proposed by established mentors like the ones in my iteration were.  A far larger AISC wouldn't be able to replace what a program like MATS does, but it might be able to do what AISC6 did for me, and do it for far more people than anything structured like MATS realistically ever could. 

On a more meta point, I have honestly not been all that impressed with the average competency of the AIS funding ecosystem. I don't think it not funding a project is particularly strong evidence that the project is a bad idea. 

Comment by Lucius Bushnaq (Lblack) on Toward A Mathematical Framework for Computation in Superposition · 2024-01-18T23:30:39.112Z · LW · GW

Well. Damn. 

As a vocal critic of the whole concept of superposition, this post has changed my mind a lot. An actual mathematical definition that doesn't depend on any fuzzy notions of what is 'human interpretable', and a start on actual algorithms for performing general, useful computation on overcomplete bases of variables.

Everything I've read on superposition before this was pretty much only outlining how you could store and access lots of variables from a linear space with sparse encoding, which isn't exactly a revelation. Every direction is a float, so of course a d-dimensional space can store roughly float precision to the d-th power different states, which you can describe as superposed sparse features if you like. But I didn't need to use that lens to talk about the compression. I could just talk about good old non-overcomplete linear algebra bases instead. The d basis vectors in that linear algebra description being the compositional summary variables the sparse inputs got compressed into. If basically all we can do with the 'superposed variables' is make lookup tables of them, there didn't seem to me to be much need for the concept at all to reverse engineer neural networks. Just stick with the summary variables; summarising is what intelligence is all about.
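Spelling out the counting I have in mind here (my own back-of-the-envelope, with p standing for the number of distinguishable values per float):

```latex
% Back-of-the-envelope capacity count (my notation): d dimensions, each resolved to
% p distinguishable values, gives
\[
    \#\text{states} \approx p^{\,d}
    \qquad\Longleftrightarrow\qquad
    \log_2(\#\text{states}) \approx d \log_2 p \ \text{bits},
\]
% and re-describing the same activation space as a larger number of sparsely active
% features does not change this total capacity, only the coordinates used to talk about it.
```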

If we can do actual, general computation with the sparse variables? Computations with internal structure that we can't trivially describe just as well using d floats forming the non-overcomplete linear basis of a vector space? Well, that would change things. 


As you note, there's certainly work left to do here on the error propagation and checking for such algorithms in real networks. But even with this being an early proof of concept, I do now tentatively expect that better-performing implementations of this probably exist. And if such algorithms are possible, they sure do sound potentially extremely useful for an LLM's job. 

On my previous superposition-skeptical models, frameworks like the one described in this post are predicted to be basically impossible. Certainly way more cumbersome than this looks. So unless these ideas fall flat when more research is done on the error tolerance, I guess I was wrong. Oops.

Comment by Lucius Bushnaq (Lblack) on Are we inside a black hole? · 2024-01-07T11:02:53.721Z · LW · GW

I think the idea expressed in the post is for our entire observable universe to be a remnant of such spaghettification in higher dimensions, with basically no thickness remaining along the direction leading to the singularity. So whatever higher-dimensional bound structure the local quantum fields may or may not usually be arranged in is (mostly) gone, and the merely 3+1 dimensional structures of atoms and pelvises we are used to are the result.

I wouldn't know off the top of my head if you can make this story mathematically self-consistent or not. 

Comment by Lucius Bushnaq (Lblack) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-04T09:30:00.939Z · LW · GW

Maybe a⊕b is represented “incidentally” because NN representations are high-dimensional with lots of stuff represented by chance

This would be my first guess, conditioned on the observation being real, except strike “by chance”. The model likely wants to form representations that can serve to solve a very wide class of prediction tasks over the data with very few non-linearities used, ideally none, as in a linear probe. That’s pretty much the hallmark of a good general representation you can use for many tasks.

I thus don't think that comparing to a model with randomized weights is a good falsification. I wouldn’t expect a randomly initialized model to have nice general representations. 

My stated hypothesis here would then predict that the linear probes for XOR features get progressively worse if you apply them to earlier layers. Because the model hasn't had time to make the representation as general that early in the computation. So accuracy should start to drop as you look at layers before fourteen.

I'll also say that if you can figure out a pattern in how particular directions get used as components for many different boolean classification tasks, that seems like the kind of thing that might result in an increased understanding of what these directions encode exactly. What does the layer representation contain, in actual practice, that allows it to do this?
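The layer-by-layer check I'd want to see, sketched out below; this is my own outline with fake stand-in data rather than the paper's code, so the acts_by_layer and xor_labels arrays are placeholders to be swapped for real cached activations and labels.

```python
# Sketch (my own, with fake stand-in data): layer-by-layer linear probes for the a⊕b
# label. Replace the random arrays with real cached activations and real XOR labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts_by_layer = {i: rng.normal(size=(2000, 64)) for i in range(1, 25)}  # placeholder activations
xor_labels = rng.integers(0, 2, size=2000)                              # placeholder labels

for layer, acts in sorted(acts_by_layer.items()):
    X_tr, X_te, y_tr, y_te = train_test_split(acts, xor_labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: XOR probe accuracy {probe.score(X_te, y_te):.3f}")
```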

Comment by Lucius Bushnaq (Lblack) on Critical review of Christiano's disagreements with Yudkowsky · 2023-12-29T12:53:05.507Z · LW · GW

Even architectures-in-the-narrow-sense don't show overarching scaling laws at current scales, right? IIRC the separate curves for MLPs, LSTMs and transformers do not currently match up into one larger curve. See e.g. figure 7 here.

So a sudden capability jump due to a new architecture outperforming transformers the way transformers outperform MLPs at equal compute cost seems to be very much in the cards?

I intuitively agree that current scaling laws seem like they might be related in some way to a deep bound on how much you can do with a given amount of data and compute, since different architectures do show qualitatively similar behavior even if the y-axes don't match up. But I see nothing to suggest that any current architectures are actually operating anywhere close to that bound.

Comment by Lucius Bushnaq (Lblack) on Some biases and selection effects in AI risk discourse · 2023-12-13T08:23:59.116Z · LW · GW

If it only requires a simple hack to existing public SOTA, many others will have already thought of said hack and you won't have any additional edge.

I don't recall assuming the edge to be unique? That seems like an unneeded condition for Tamsin's argument. It's enough to not believe that the field consensus is completely efficient by default, with all relevant actors sure of all currently deducible edges at all times.



Progress in DL is completely smooth.

Right, if you think it's completely smooth and thus basically not meaningfully influenced by the actions of individual researchers whatsoever, I see why you would not buy Tamsin's argument here. But then the reason you don't buy it would seem to me to be that you think meaningful new ideas in ML capability research basically don't exist, not because you think there is some symmetric argument to Tamsin's for people to stay quiet about new alignment research ideas.

Comment by Lucius Bushnaq (Lblack) on Some biases and selection effects in AI risk discourse · 2023-12-13T03:00:15.351Z · LW · GW

I don't see why this would be ridiculous. To me, e.g. "Superintelligence only requires [hacky change to current public SOTA] to achieve with expected 2025 hardware, and OpenAI may or may not have realised that already" seems like a perfectly coherent way the world could be, and is plenty of reason for anyone who suspects such a thing to keep their mouth shut about gears-level models of [the hacky change] that might be relevant for judging how hard and mysterious the remaining obstacles to superintelligence actually are.

Comment by Lucius Bushnaq (Lblack) on Some biases and selection effects in AI risk discourse · 2023-12-13T02:42:27.636Z · LW · GW

It's not that hard to build an AI that saves everyone: you just need to solve [some problems] and combine the solutions. Considering how easy it is compared to what you thought, you should decrease your P(doom) / shorten your timelines.

I'm not sure what you're saying here exactly. It seems to me like you're pointing to a symmetric argument favoring low doom, but if someone had an idea for how to do AI alignment right, why wouldn't they just talk about it? Doesn't seem symmetrical to me.

Comment by Lucius Bushnaq (Lblack) on Speaking to Congressional staffers about AI risk · 2023-12-06T23:07:01.508Z · LW · GW

(I disagree. Indeed, until recently governance people had very few policy asks for government.)

Did that change because people finally finished doing enough basic strategy research to know what policies to ask for? 

It didn't seem like that to me. Instead, my impression was that it was largely triggered by ChatGPT and GPT4 making the topic more salient, and AI safety feeling more inside the Overton window. So there were suddenly a bunch of government people asking for concrete policy suggestions.

Comment by Lucius Bushnaq (Lblack) on Why Yudkowsky is wrong about "covalently bonded equivalents of biology" · 2023-12-06T22:21:27.394Z · LW · GW

"Pandemics" aren't a locally valid substitute step in my own larger argument, because an ASI needs its own manufacturing infrastructure before it makes sense for the ASI to kill the humans currently keeping its computers turned on.

When people are highly skeptical of the nanotech angle yet insist on a concrete example, I've sometimes gone with a pandemic, coupled with limited access to medications that temporarily stave off, but don't cure, that pandemic, as a way to force a small workforce of humans, preselected to cause few problems, to maintain the AI's hardware and build it the seed of a new infrastructure base while the rest of humanity dies. 

I feel like this has so far maybe been more convincing and perceived as "less sci-fi" than Drexler-style nanotech by the people I've tried it on (small sample size, n<10).

Generally, I suspect not basing the central example on a position on one side of yet another fierce debate in technology forecasting trumps making things sound less like a movie where the humans might win. The rate of people understanding that something sounding like a movie does not imply the humans have a realistic chance at winning in real life just because they won in the movie seems, in my experience with these conversations so far, to exceed the rate of people getting on board with scenarios that involve any hint of Drexler-style nanotech.

Comment by Lucius Bushnaq (Lblack) on How useful is mechanistic interpretability? · 2023-12-02T12:42:27.954Z · LW · GW

For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomena?
 

I don't have the time and energy to do this properly right now, but here's a few thought experiments to maybe help communicate part of what I mean:

Say you have a transformer model that draws animals.  As in, you type “draw me a giraffe”,  and then it draws you a giraffe. Unknown to you, the way the model algorithm works is that the first thirty layers of the model perform language processing to figure out what you want drawn, and output a summary of fifty scalar variables that the algorithms in the next thirty layers of the model use to draw the animals. And these fifty variables are things like “furriness”, “size”, “length of tail” and so on.

The latter half of the model does then not, in any real sense, think of the concept “giraffe” while it draws the giraffe. It is just executing purely geometric algorithms that use these fifty variables to figure out what shapes to draw. 

If you then point a sparse autoencoder at the residual stream in the latter half of the model, over a data set of people asking the network to draw lots of different animals (far more animal types than fifty, or than the network width), I'd guess the "sparse features" the SAE finds might be the individual animal types: "giraffe", "elephant", etc.

Or, if you make the encoder dictionary larger, more specific sparse features like “fat giraffe” would start showing up. 

And then, some people may conclude that the model was doing a galaxy-brained thing where it was thinking about all of these animals using very little space, compressing a much larger network in which all these animals are variables. This is kind of true in a certain sense if you squint, but pretty misleading. The model at this point in the computation no longer “knows” what a giraffe is. It just “knows” what the settings of furriness, tail length, etc. are right now. If you manually go into the network and set the fifty variables to something that should correspond to a unicorn, the network will draw you a unicorn, even if there were no unicorns in the training data and the first thirty layers in the network don’t know how to set the fifty variables to draw one. So in a sense, this algorithm is more general than a cleverly compressed lookup table of animals would be. And if you want to learn how the geometric algorithms that do the drawing work, what they do with the fifty scalar summary statistics is what you will need to look at.

Just because we can find a transformation that turns an NNs activations into numbers that correlate with what a human observer would regard as separate features of the data, does not mean the model itself is treating these as elementary variables in its own computations in any meaningful sense. 

The only thing the SAE is showing you is that the information present in the model can be written as a sum of some sparsely activating generators of the data. This does not mean that the model is processing the problem in terms of these variables. Indeed, SAE dictionaries are almost custom-selected not to give you variables that a well-generalizing algorithm would use to think about problems with big, complicated state spaces. Good summary variables are highly compositional, not sparse. They can all be active at the same time in any setting, letting you represent the relevant information from a large state space with just a few variables, because they factorise. Temperature and volume are often good summary variables for thinking about thermodynamic systems because the former tells you nothing about the latter and they can co-occur in any combination of values. Variables with strong sparsity conditions on them instead have high mutual information, making them partially redundant, and ripe for compressing away into summary statistics.

If an NN (artificial or otherwise) is, say, processing images coming in from the world, it is dealing with an exponentially large state space. Every pixel can take one of several values. Luckily, the probability distribution of pixels is extremely peaked. The supermajority of pixel settings are TV static that never occurs, and thermal noise that doesn't matter for the NNs task. One way to talk about this highly peaked pixel distribution may be to describe it as a sum of a very large number of sparse generators. The model then reasons about this distribution by compressing the many sparse generators into a small set of pretty non-sparse, highly compositional variables. For example, many images contain one or a few brown branchy structures of a certain kind, which come in myriad variations. The model summarises the presence or absence of any of these many sparse generators with the state of the variable “tree”, which tracks how much the input is “like a tree”.

If the model has a variable “tree” and a variable “size”, the myriad brown, branchy structures in the data might, for example, show up as sparsely encoded vectors in a two-dimensional (“tree”,“size”) manifold. If you point a SAE at that manifold, you may get out sparse activations like “bush” (mid tree, low size) “house” (low tree, high size), “fir” (high tree, high size). If you increase the dictionary size, you might start getting more fine-grained sparse data generators. E.g. “Checkerberry bush” and “Honeyberry bush” might show up as separate, because they have different sizes.

Humans, I expect, work similarly. So the human-like abstractions the model may or may not be thinking in and that we are searching for will not come in the form of sparse generators of layer activations, because human abstractions are the summary variables you would be using to compress these sparse generators. They are the type-of-thing you use to encode a sparse world, not the type-of-thing being encoded. That our SAE is showing us some activations that correlate with information in the input humans regard as meaningful just tells us that the data contains sparse generators humans have conceptual descriptions for, not that the algorithms of the network themselves are encoding the sparse generators using these same human conceptual descriptions. We know it hasn't thrown away the information needed to compute that there was a bush in the image, but we don't know it is thinking in bush. It probably isn't, else bush would not be sparse with respect to the other summary statistics in the layer, and our SAE wouldn't have found it.
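Here is a toy version of that picture one could actually run, entirely my own construction and not anything from the post: data generated from a few "animal" prototypes living on a two-dimensional (tree, size) manifold, with an overcomplete sparse dictionary fit on top. The thing to check is whether the learned atoms line up with the prototypes rather than with the two compositional axes.

```python
# Toy sketch (my own): sparse dictionary learning on data generated from a few
# prototypes in a 2D (tree-ness, size) space. Prints the learned atoms so you can
# compare them to the prototypes and to the coordinate axes.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

prototypes = np.array([[0.6, 0.2],    # "bush":  mid tree, low size
                       [0.1, 0.9],    # "house": low tree, high size
                       [0.9, 0.8]])   # "fir":   high tree, high size
labels = rng.integers(0, len(prototypes), size=1000)
data = prototypes[labels] + 0.03 * rng.normal(size=(1000, 2))

dict_learner = DictionaryLearning(n_components=6,            # overcomplete for 2D data
                                  transform_algorithm="lasso_lars",
                                  transform_alpha=0.05,
                                  random_state=0)
codes = dict_learner.fit_transform(data)

print("learned dictionary atoms (rows):")
print(np.round(dict_learner.components_, 2))
print("mean number of active atoms per sample:", (np.abs(codes) > 1e-6).sum(axis=1).mean())
```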

 

Comment by Lucius Bushnaq (Lblack) on How useful is mechanistic interpretability? · 2023-12-02T10:44:03.947Z · LW · GW

The causal graph is large in general, but IMO that's just an unavoidable property of models and superposition.

This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept. 

Leaving that aside, the vanilla reading of this claim also seems kind of obviously false for many models; otherwise, optimising them at inference time through e.g. low-rank approximation of weight matrices would never work. You are throwing away at least one floating point number worth of description bits there.
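For reference, the kind of inference-time trick I mean, as a quick sketch of my own with a random stand-in matrix rather than real trained weights:

```python
# Sketch (my own, random stand-in weights): low-rank approximation of a weight matrix
# via truncated SVD, plus a measure of how much a layer's output changes. Trained
# weight matrices, whose spectra typically decay, change far less than this random one.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))              # stand-in for a trained weight matrix

U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 128                                       # keep only the top k singular directions
W_low_rank = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

x = rng.normal(size=512)
rel_change = np.linalg.norm(W @ x - W_low_rank @ x) / np.linalg.norm(W @ x)
print("relative output change after rank truncation:", rel_change)
```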

I'm confused by why you don't consider "only a few neurons being non-zero" to be a "low dimensional summary of the relevant information in the layer"

A low-dimensional summary of a variable vector x of size n is a fixed set of k < n random variables that suffice to summarise the state of x. To summarise the state of x using the activations in an SAE dictionary, I have to describe the state of more than n variables. That these variables are sparse may sometimes let me define an encoding scheme for describing them that takes less than n variables, but that just corresponds to undoing the autoencoding and then performing some other compression.

Comment by Lucius Bushnaq (Lblack) on How useful is mechanistic interpretability? · 2023-12-02T08:40:43.438Z · LW · GW

SAEs are almost the opposite of the principle John is advocating for here. They deliver sparsity in the sense that the dictionary you get only has a few neurons out of the zero state at the same time; they do not deliver sparsity in the sense of a low-dimensional summary of the relevant information in the layer, or whatever other causal cut you deploy them on. Instead, the dimensionality of the representation gets blown up to be even larger.

Comment by Lucius Bushnaq (Lblack) on OpenAI: Facts from a Weekend · 2023-11-20T20:06:00.099Z · LW · GW

If actually enforcing the charter leads to them being immediately disempowered, it's not worth anything in the first place. We were already in the "worst case scenario". Better to be honest about it. Then at least, the rest of the organisation doesn't get to keep pointing to the charter and the board as approving their actions when they don't.

The charter it is the board's duty to enforce doesn't say anything about how the rest of the document doesn't count if investors and employees make dire enough threats, I'm pretty sure.

Comment by Lucius Bushnaq (Lblack) on My Criticism of Singular Learning Theory · 2023-11-20T06:41:00.459Z · LW · GW

IIRC this is probably the case for a broad range of non-NN models. I think the original Double Descent paper showed it for random Fourier features.

My current guess is that NN architectures are just especially affected by this, due to having even more degenerate behavioral manifolds, ranging very widely from tiny to large RLCTs.

Comment by Lucius Bushnaq (Lblack) on The other side of the tidal wave · 2023-11-06T06:18:55.558Z · LW · GW

I am not a fan of the current state of the universe. Mostly the part where people keep dying and hurting all the time. Humans I know, humans I don't know, other animals that might or might not have qualia, possibly aliens in distant places and Everett branches. It's all quite the mood killer for me, to put it mildly. 

So if we pull off not dying and not turning Earth into the nucleus of an expanding zero-utility stable state, superhuman AI seems great to me.

Comment by Lucius Bushnaq (Lblack) on Charbel-Raphaël and Lucius discuss Interpretability · 2023-11-02T23:48:46.551Z · LW · GW

However, there are mostly no such constraints in ANN training (by default), so it doesn't seem destined to me that LLM behaviour should "compress" very much

The point of the Singular Learning Theory digression was to help make legible why I think this is importantly false. NN training has a strong simplicity bias, basically regardless of the optimizer used for training, and even in the absence of any explicit regularisation. This bias towards compression is a result of the particular degenerate structure of NN loss landscapes, which are in turn a result of the NN architectures themselves. Simpler solutions in these loss landscapes have a lower "learning coefficient," which you might conceptualize as an "effective" parameter count, meaning they occupy more (or higher dimensional, in the idealized case) volume in the loss landscape than more complicated solutions with higher learning coefficients.

This bias in the loss landscapes isn't quite about simplicity alone. It might perhaps be thought of as a particular mix of a simplicity prior, and a peculiar kind of speed prior. 

That is why Deep Learning works in the first place. That is why NN training can readily yield solutions that generalize far past the training data, even when you have substantially more parameters than data points to fit on. That is why, with a bit of fiddling around, training a transformer can get you a language model, whereas training a giant polynomial on predicting internet text will not get you a program that can talk.  SGD or no SGD, momentum or no momentum, weight regularisation or no weight regularisation. Because polynomial loss landscapes do not look like NN loss landscapes.
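For readers who want the formula behind the "lower learning coefficient means more volume" claim, this is the standard asymptotic free-energy expansion from singular learning theory (notation mine, not the comment's):

```latex
% Watanabe's asymptotic expansion of the Bayesian free energy of a neighbourhood of a
% solution $w^*$, for $n$ data points:
\[
    F_n \;\approx\; n\, L_n(w^*) \;+\; \lambda \log n \;+\; O(\log\log n),
\]
% where $L_n(w^*)$ is the training loss attained there and $\lambda$ is the learning
% coefficient (RLCT). Regions with smaller $\lambda$ pay a smaller complexity penalty,
% i.e. they carry more posterior weight -- the "more volume" for simpler solutions above.
```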

Comment by Lucius Bushnaq (Lblack) on There are no coherence theorems · 2023-08-31T22:33:10.581Z · LW · GW

I have now seen this post cited in other spaces, so I am taking the time to go back and write out why I do not think it holds water.

I do not find the argument against the applicability of the Complete Class theorem convincing.

See Charlie Steiner's comment: 

You just have to separate "how the agent internally represents its preferences" from "what it looks like the agent is doing." You describe an agent that dodges the money-pump by simply acting consistently with past choices. Internally this agent has an incomplete representation of preferences, plus a memory. But externally it looks like this agent is acting like it assigns equal value to whatever indifferent things it thought of choosing between first.

Decision theory is concerned with external behaviour, not internal representations. All of these theorems are talking about whether the agent's actions can be consistently described as maximising a utility function. They are not concerned whatsoever with how the agent actually mechanically represents and thinks about its preferences and actions on the inside. To decision theory, agents are black boxes. Information goes in, decision comes out. Whatever processes may go on in between are beyond the scope of what the theorems are trying to talk about.

So

Money-pump arguments for Completeness (understood as the claim that sufficiently-advanced artificial agents will have complete preferences) assume that such agents will not act in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ But that assumption is doubtful. Agents with incomplete preferences have good reasons to act in accordance with this kind of policy: (1) it never requires them to change or act against their preferences, and (2) it makes them immune to all possible money-pumps for Completeness. 

As far as decision theory is concerned, this is a complete set of preferences. Whether the agent makes up its mind as it goes along or has everything it wants written up in a database ahead of time matters not a peep to decision theory. The only thing that matters is whether the agent's resulting behaviour can be coherently described as maximising a utility function. If it quacks like a duck, it's a duck.

Comment by Lucius Bushnaq (Lblack) on Visible loss landscape basins don't correspond to distinct algorithms · 2023-07-30T19:28:08.184Z · LW · GW

... No?

I don't see what part of the graphs would lead to that conclusion. As the paper says, there's a memorization, circuit formation and cleanup phase. Everywhere along these lines in the three phases, the network is building up or removing pieces of internal circuitry. Every time an elementary piece of circuitry is added or removed, that corresponds to moving into a different basin (convex subset?). 

Points in the same basin are related by internal symmetries. They correspond to the same algorithm, not just in the sense of having the same input-output behavior on the training data (all points on the zero loss manifold have that in common), but also in sharing common intermediate representations. If one solution has a piece of circuitry another doesn't, they can't be part of the same basin. Because you can't transform them into each other through internal symmetries.

So the network is moving through different basins all along those graphs.

Comment by Lucius Bushnaq (Lblack) on Visible loss landscape basins don't correspond to distinct algorithms · 2023-07-29T07:57:50.060Z · LW · GW

I don't see how the mechanistic interpretability of grokking analysis is evidence against this.

At the start of training, the modular addition network is quickly evolving to get increasingly better training loss by overfitting on the training data. Every time it gets an answer in the training set right that it didn't before, it has to have moved from one behavioural manifold in the loss landscape to another. It's evolved a new tiny piece of circuitry, making it no longer the same algorithm it was a couple of batch updates ago.

Eventually, it reaches the zero loss manifold. This is a mostly fully connected subset of parameter space. I currently like to visualise it like a canyon landscape, though in truth it is much more high dimensional. It is made of many basins, some broad (high dimensional), some narrow (low dimensional), connected by paths, some straight, some winding. 

A path through the loss landscape visible in 3D doesn't correspond to how and what the neural network is actually learning. Almost all of the changes to the loss are due to the increasingly good implementation of Algorithm 1; but apparently, the entire time, the gradient also points towards some faraway implementation of Algorithm 2.

In the broad basin picture, there aren't just two algorithms here, but many. Every time the neural network constructs a new internal elementary piece of circuitry, that corresponds to moving from one basin in this canyon landscape to another. Between the point where the loss flatlines and the point where grokking happens, the network is moving through dozens of different basins or more. Eventually, it arrives at the largest, most high dimensional basin in the landscape, and there it stays.

the entire time the neural network's parameters visibly move down the wider basin

I think this might be the source of confusion here. Until grokking finishes, the network isn't even in that basin yet. You can't be in multiple basins simultaneously.

At the time the network is learning the pieces of what you refer to as algorithm 2, it is not yet in the basin of algorithm 2. Likewise, if you went into the finished network sitting in the basin of algorithm 2 and added some additional internal piece of circuitry into it by changing the parameters, that would take it out of the basin of algorithm 2 and into a different, narrower one. Because it's not the same algorithm any more. It's got a higher effective parameter count now, a bigger Real Log Canonical Threshold.

Points in the same basin correspond to the same algorithm. But it really does have to be the same algorithm. The definition is quite strict here. What you refer to as superpositions of algorithm 1 and algorithm 2 are all various different basins in parameter space. Basins are regions where every point maps to the same algorithm, and all of those superpositions are different algorithms. 

Comment by Lucius Bushnaq (Lblack) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-19T18:32:29.087Z · LW · GW

I also have this impression, except it seems to me that it's been like this for several months at least. 

The Open Philanthropy people I asked at EAG said they think the bottleneck is that they currently don't have enough qualified AI Safety grantmakers to hand out money fast enough. And right now, the bulk of almost everyone's funding seems to ultimately come from Open Philanthropy, directly or indirectly.

Comment by Lucius Bushnaq (Lblack) on When do "brains beat brawn" in Chess? An experiment · 2023-06-29T14:23:09.639Z · LW · GW

You can easily get a draw against any AI in the world at Tic-Tac-Toe. In fact, provided the game actually stays confined to the actions on the board, you can draw AIXI at Tic-Tac-Toe. That's because Tic-Tac-Toe is a very small game with very few states and very few possible actions, and so intelligence, the ability to pick good actions, doesn't grant any further advantage in it past a certain pretty low threshold. 

Chess has more actions and more states, so intelligence matters more. But probably still not all that much compared to the vastness of the state and action space the physical universe has. If there's some intelligence threshold past which minds pretty much always draw against each other in chess even if there is a giant intelligence gap between them, I wouldn't be that surprised. Though I don't have much knowledge of the game.

In the game of Real Life, I very much expect that "human level" is more the equivalent of a four year old kid who is currently playing their third ever game of chess, and still keeps forgetting half the rules every minute. The state and action space is vast, and we get to observe humans navigating it poorly on a daily basis. Though usually only with the benefit of hindsight. In many domains, vast resource mismatches between humans do not outweigh skill gaps between humans. The Chinese government has far more money than OpenAI, but cannot currently beat OpenAI at making powerful language models. All the usual comparisons between humans and other animals also apply. This vast difference in achieved outcomes from small intelligence gaps even in the face of large resource gaps does not seem to me to be indicative of us being anywhere close to the intelligence saturation threshold of the Real Life game.

Comment by Lucius Bushnaq (Lblack) on What is the foundation of me experiencing the present moment being right now and not at some other point in time? · 2023-06-22T19:26:43.635Z · LW · GW

There is no one theory of time in physics.

There are many popular hypotheses with all kinds of different implications related to time in some way, but those aren't part of standard textbook physics. They're proposed extensions of our current models. I'm talking about plain old general relativity + Standard Model QFT here. Spacetime is a four-dimensional manifold, the fields in the SM Lagrangian have support on that manifold, and all of those fields have CPT symmetry. Don't go asking for quantum gravity or other matters related to UV-completion.[1]

All that gives you is an asymmetry, a distinction between the past and future, within a static block universe. It doesn't get you away from stasis to give you a dynamic "moving cursor" kind of present moment.

Combined with locality, the rule that things in spacetime can only affect things immediately adjacent to them, yeah, it does. Computations can only act on bits that are next to them in spacetime. To act on bits that are not adjacent, "channels" in spacetime have to connect those bits to the computation, carrying the information. So processing bits far removed from a given spacetime point at that point is usually hard, due to thermodynamics, and takes place by proxy, using inference on nearby bits that have mutual information with the past or future bits of interest. Thus computations at a point effectively operate primarily on information near that point, with everything else grasped from that local information. From the perspective of such a computation, that's a "moving cursor".

(I'd note though that asymmetry due to thermodynamics on its own could presumably already serve fine for distinguishing a "present", even if there were no locality. In that case, the "cursor" would be a boundary to one side of which the computation loses a lot of its ability to act on bits. From the inside perspective, computations at a given time would be distinguishable from computations at earlier and later times in such a universe, by what algorithms are used to calculate on specific bits, with algorithms that act on bits "after" that time being more expensive at that time. I don't think self-aware algorithms in that world would have quite the same experience of "present" we do, but I'd guess they would have some "cursor-y" concept/sensation.

I'm not sure how hard it would be to construct a universe without even approximate locality, but with thermodynamics-like behaviour and the possibility of Turing-complete computation, though. Not sure if it is actually a coherent set-up. Maybe coupling to non-local points that hard just inevitably makes everything max-entropic everywhere and always.)

  1. ^

    I mean, do ask, by all means, but the answer probably won't be relevant for this discussion, because you can get planet earth and the human brains on it thinking and perceiving a present moment from a plain old SM lattice QFT simulation. Everyone in that simulation quickly dies because the planet has no gravity and spins itself apart, but they sure are experiencing a present until then.[2]

  2. ^

    Except there also might not be a Born rule in the simulation, but let's also ignore that, and just say we read off what's happening in the high amplitude parts of the simulated earth wave-function without caring that the amplitude is pretty much a superfluous pre-factor that doesn't do anything in the computation.

Comment by Lucius Bushnaq (Lblack) on What is the foundation of me experiencing the present moment being right now and not at some other point in time? · 2023-06-18T09:58:32.659Z · LW · GW

Under the premise of spacetime being a static and eternal thing, doesn't any line of thought trying to answer this question necessarily make any intuitive notions of identity and the passing of time illusionary?

Illusory in what sense? The underlying laws of our universe are time symmetric[1], but the second law of thermodynamics means that entropy increases as you move away in time from a set low-entropy point (the big bang). This means that predicting bits at a later time $t_2$ from bits at an earlier time $t_1$ tends to be a much more difficult exercise than predicting bits at $t_1$ from bits at $t_2$. Large amounts of detailed information (entropy) about $t_1$ can, with some tricks, often be read off from $t_2$ with little computational effort. "Memory" is one way to do this.

There are fewer ways to read off lots of information about $t_2$ cheaply from $t_1$. It is only possible in some very specific situations. You could, for example, look at a couch in a windowless room at time $t_1$, and commit to look at that exact couch from that exact position at that exact angle again one week later. This would let you pretty reliably infer a large batch of future visual bits. But such techniques do not tend to generalise well; you can't do this for arbitrary future visual information the way you can use "memory" to do so for a wide class of past visual information. Thermodynamics means the trick only works one way, for bits that are closer in time to the big bang. To do the same for future bits is theoretically possible, but it typically requires different techniques and a far, far larger compute investment.
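(If I had to compress that into a formula, in notation I'm making up on the spot rather than anything standard: write $M_{t_2}$ for the bits an observer has locally available at $t_2$, and $X_t$ for some macroscopic variable of interest at time $t$. Then for $t_1 < t_2 < t_3$, records typically give

```latex
% Rough, non-standard summary of the asymmetry: records about the past make the
% left conditional entropy much smaller, and much cheaper to extract, than the right.
H\left(X_{t_1} \mid M_{t_2}\right) \ll H\left(X_{t_3} \mid M_{t_2}\right)
```

and the left-hand side is also far cheaper to actually compute. The couch trick is the special case where you deliberately engineer $X_{t_3}$ so that the right-hand side is small too.)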

For an observer operating under such physics, it is useful to conceptualise the world as consisting of a "past" that has "already happened" and is thus amenable to inference through techniques like memory, a "future" that has "yet to happen" and is more uncertain, and a kind of "present" where these regimes meet, close to which techniques for inference from both regimes tend to be most efficient, and where information can be processed directly, because physics is local in time as well as space. Thus memory from times close to the present tends to be easier to protect from degradation than memory from the more distant past, computing bit predictions for the near future tends to cost way less than for the far future, and so on.

If you look at an intelligence inside such local physics, you will see that its internals at any point in time tend to be busy computing stuff about that very moment, the "present moment" which the computation can locally operate on and affect, and which can often take in information from the "past", especially the "recent past", fairly easily, but has a harder time taking in information from the future, to the point that doing so usually involves totally different algorithms which feel totally different. So it feels to the computation, at each such moment, that "it exists", "now".

I wouldn't really call this an illusion, except in the sense that "trees" are an illusion. A tree is fundamentally just some quantum field[2] excitations from a particular class of excited states inside a finite 4D volume. But its medium scale, medium-range interactions with baryonic matter are often pretty well described by the human concept of "tree".

Likewise, dividing time into a "future", which has "yet to happen", a "past", which "has happened", and a "present", which "is happening", is a leaky abstraction of the underlying laws about performing inference and decision computations in a physics with locality, the second law of thermodynamics, and a low-entropy state at some time $t_0$ (the big bang). It's not precise, but a good approximation under many circumstances.

Imagine someone in the desert thinks they see an oasis. If it is actually a mirage, I'd say it makes sense to call the oasis an illusion. If it is an actual oasis, I don't think the moniker illusion is apt just because oases are really an imperfect abstraction of particular quantum field configurations.

  1. ^

    Well, actually CPT symmetric, but the distinction doesn't matter for the intuition here.

  2. ^

    Or, if you don't believe in asymptotically safe quantum gravity, it might not really be quantum fields either. Substitute whatever your favoured guess for the true fundamental physical theory is.

Comment by Lucius Bushnaq (Lblack) on Solomonoff’s solipsism · 2023-05-08T08:47:13.086Z · LW · GW

The sequence a hypothesis predicts the inductor will receive is not the same thing as the world model that hypothesis implies.

A hypothesis can consist of very simple laws of physics describing time evolution in an eternal universe, yet predict that the sequence will be cut off soon, because the camera sending the pixel values that make up the inductor's sequence is about to die.

Comment by Lucius Bushnaq (Lblack) on Should we publish mechanistic interpretability research? · 2023-04-21T18:58:25.634Z · LW · GW

So you need a pretty strong argument that interp in particular is good for capabilities, which isn't borne out empirically and also doesn't seem that strong.

I think current interpretability has close to no capabilities externalities because it is not good yet, and delivers close to no insights into NN internals. If you had a good interpretability tool, which let you read off and understand e.g. how AlphaGo plays games to the extent that you could reimplement the algorithm by hand in C, and not need the NN anymore, then I would expect this to yield large capabilities externalities. This is the level of interpretability I aim for, and the level I think we need to make any serious progress on alignment. 

If your interpretability tools cannot do things even remotely like this, I expect they are quite safe. But then I also don't think they help much at all with alignment. There's a roughly proportional relationship between your understanding of the network, and both your ability to align it and make it better, is what I'm saying. I doubt there's many deep insights to be had that further the former without also furthering the latter. Maybe some insights further one a bit more than the other, but I doubt you'd be able to figure out which ones those are in advance. Often, I expect you'd only know years after the insight has been published and the field has figured out all of what can be done with it.

I think it's all one tech tree, is what I'm saying. I don't think neural network theory neatly decomposes into a "make strong AGI architecture" branch and a "aim AGI optimisation at a specific target" branch. Just like quantum mechanics doesn't neatly decompose into a "make a nuclear bomb" branch and a "make a nuclear reactor" branch. In fact, in the case of NNs, I expect aiming strong optimisation is probably just straight up harder than creating strong optimisation. 

By default, I think if anyone succeeds at solving alignment, they probably figured out most of what goes into making strong AGI along the way. Even just by accident. Because it's lower in the tech tree.

Comment by Lucius Bushnaq (Lblack) on The basic reasons I expect AGI ruin · 2023-04-19T17:17:46.626Z · LW · GW

I didn't think Rob was necessarily implying that. I just tried to give some context to Quintin's objection.

Comment by Lucius Bushnaq (Lblack) on The basic reasons I expect AGI ruin · 2023-04-18T22:22:18.838Z · LW · GW

There are more papers and math in this broad vein (e.g. Mingard on SGD, singular learning theory), and I roughly buy the main thrust of their conclusions[1].

However, I think "randomly sample from the space of solutions with low combined complexity&calculation cost" doesn't actually help us that much over a pure "randomly sample" when it comes to alignment. 

It could mean that the relation between your network's learned goals and the loss function is more straightforward than what you get with evolution=>human hardcoded brain stem=>human goals, since the latter likely has a far weaker simplicity bias in the first step than the network training does. But the second step, a human baby training on their brain stem loss signal, seems to remain a useful reference point for the amount of messiness we can expect. And it does not seem to me to be a comforting one. I, for one, don't consider getting excellent visual cortex prediction scores a central terminal goal of mine.

  1. ^

    Though I remain unsure of what to make of the specific one Quintin cites, which advances some more specific claims inside this broad category, and is based on results from a toy model with weird, binary NNs, using weird, non-standard activation functions.

Comment by Lucius Bushnaq (Lblack) on Quintin Pope's Shortform · 2023-04-18T22:05:02.442Z · LW · GW

Whether singular learning theory actually yields you anything useful when your optimiser converges to the largest singularity seems very much architecture-dependent though? If you fit a 175-billion-degree polynomial to the internet to do next token prediction, I think you'll get out nonsense, not GPT-3. Because a broad solution in the polynomial landscape does not seem to equal a Kolmogorov-simple solution to the same degree it does with MLPs or transformers.
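As a toy version of that intuition (my own illustration, obviously nothing like fitting the internet): a high-degree polynomial that perfectly interpolates a few noisy samples of a simple function tends to do something wild as soon as you leave the training range.

```python
# A degree-14 polynomial interpolating 15 noisy samples of a simple function:
# zero training error, but outside the training interval the fit typically
# wanders far from the simple rule that generated the data. Exact numbers
# depend on the noise draw, but the qualitative picture is robust.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 15)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=x_train.shape)

coeffs = np.polynomial.polynomial.polyfit(x_train, y_train, deg=14)

x_test = np.array([-1.5, -1.2, 1.2, 1.5])
print(np.round(np.polynomial.polynomial.polyval(x_test, coeffs), 2))  # wild values
print(np.round(np.sin(3 * x_test), 2))  # the tame rule that generated the data
```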

Likewise, there doesn't seem to be anything saying you can't have an architecture with an even better simplicity and speed bias than the MLP family has.

Comment by Lucius Bushnaq (Lblack) on If interpretability research goes well, it may get dangerous · 2023-04-05T09:19:11.153Z · LW · GW

Best I've got is to go dark once it feels like you're really getting somewhere, and only work with people under NDAs (honour-based or actually legally binding) from there on out. At least a facsimile of proper security: central whitelists of orgs and people considered trustworthy, and central standard communication protocols with security levels, set up to facilitate communication between alignment researchers. Maybe a forum system that isn't on the public net. Live with the decrease in research efficiency this brings, and try to make it to the finish line in time anyway.

If some org or people would make it their job to start developing and trial running these measures right now, I think that'd be great. I think even today, some researchers might be enabled to collaborate more by this.

Very open to alternate solutions that don't cost so much efficiency if anyone can think of any, but I've got squat.

Comment by Lucius Bushnaq (Lblack) on Practical Pitfalls of Causal Scrubbing · 2023-03-28T04:40:25.953Z · LW · GW

Second paragraph is what I meant, thanks.

Comment by Lucius Bushnaq (Lblack) on CAIS-inspired approach towards safer and more interpretable AGIs · 2023-03-27T20:26:48.969Z · LW · GW

Seems like a slight variant on MIRI's visible thoughts project?

Comment by Lucius Bushnaq (Lblack) on Practical Pitfalls of Causal Scrubbing · 2023-03-27T16:16:36.982Z · LW · GW

The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following:

I would take this as an indication that the explanation is inadequate. If I said that the linear combination of nodes $v$ at layer $l$ of a NN implements the function $f$, but in fact it implements $f + g$, where $g$ does some other thing, my hypothesis was incorrect, and I'd want the metric to show that. If I haven't even disentangled the mechanism I claim to have found from all the other surrounding circuits, I don't think I get to say my hypothesis is doing a good job. Otherwise it seems like I have a lot of freedom to make up spurious hypotheses that claim whatever, and hide the inadequacies as "small random fluctuations" in the ablated test loss.

The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don't want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I'll be able to predict whether it will generalize correctly onto a particular new distribution.

I don't see how the dimensionality of the quantity you want to understand the generative mechanism of relates to the dimensionality of the comparison you would want to carry out to evaluate a proposed generative mechanism. 

I want to understand how the model computes its outputs to get loss $L$ on distribution $D_1$, so I can predict what loss it will get on another distribution $D_2$. I make a hypothesis for what the mechanism is. The mechanism implies that doing some intervention on the network, say shifting $v$ to $2v$, should not change behaviour, because the NN only cares about the direction of $v$, not its magnitude. If I then see that the intervention does shift output behaviour, even if it does not change the value of $L$ on net, my hypothesis was wrong. The magnitude of $v$ does play a part in the network's computations on $D_1$. It has an influence on the output.

But that interference was random and understanding it won't help you know if the mechanism that the model was using is going to generalize well to another distribution.

If it had no effect on how outputs for $D_1$ are computed, then destroying it should not change behaviour on $D_1$. So there should be no divergence between the original and ablated models' outputs. If it did affect behaviour on $D_1$, but not in ways that contribute net negatively or net positively to the accuracy on that particular distribution, it seems that you still want to know about it, because once you understand what it does, you might see that it will contribute net negatively or net positively to the model's ability to do well on $D_2$.

A heuristic that fires on some of $D_1$, but doesn't really help much, might turn out to be crucial for doing well on $D_2$. A leftover memorised circuit that didn't get cleaned up might add harmless "noise" on net on $D_1$, but ruin generalisation to $D_2$.

I would expect this to be reasonably common. A very general solution is probably overkill for a narrow sub-dataset: it will contain many circuits that check for possible exception cases but aren't really necessary for that particular class of inputs. If you throw out everything that doesn't do much to the loss on net, your explanations will miss the existence of these circuits, and you might wrongly conclude that the solution you are looking at is narrow and will not generalise.
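To make the kind of intervention test I have in mind above concrete, here is a minimal sketch on a made-up toy network (the setup and names are mine, not anything from CaSc): the hypothesis "only the direction of the hidden activation matters" predicts that rescaling it changes nothing, and a per-input comparison of outputs checks that directly, without routing everything through an aggregate loss against labels.

```python
# Minimal sketch of the intervention test described above: hypothesise that the
# network only uses the *direction* of a hidden activation, then check whether
# rescaling that activation changes the outputs at all, per input.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

def forward(x, scale=1.0):
    h = np.maximum(0.0, W1 @ x)   # hidden activation the hypothesis is about
    return W2 @ (scale * h)       # intervention: rescale the hidden activation

X = rng.normal(size=(100, 3))
base = np.array([forward(x) for x in X])
scaled = np.array([forward(x, scale=2.0) for x in X])

# If the hypothesis were right, this would be ~0. Here it clearly is not:
# the magnitude of the hidden activation does influence the outputs.
print(np.mean((base - scaled) ** 2))
```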

Comment by Lucius Bushnaq (Lblack) on Practical Pitfalls of Causal Scrubbing · 2023-03-27T13:45:04.499Z · LW · GW

Your suggestion of using KL divergence seems a useful improvement compared to most metrics. It's, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating over a metric (e.g., the mean) and less due to the specific metric used (although I could imagine that some metrics could allow for less ambiguity).

It's not about KL divergence vs. some other loss function. It's about using a one-dimensional summary of a high-dimensional comparison, instead of a one-dimensional comparison. There are many ways for two neural networks to both diverge from some training labels $y$ by an average loss $\epsilon$ while spitting out very different outputs. There are tautologically no ways for two neural networks to have different output behaviour without having non-zero divergence in label assignment for at least some data points. Thus, it seems that you would want a metric that aggregates the divergence of the two networks' outputs from each other, not a metric that compares their separate aggregated divergences from some unrelated data labels and so throws away most of the information.

A low-dimensional summary of a high-dimensional comparison between the networks seems fine(ish). A low-dimensional comparison between the networks based on the summaries of their separate comparisons to a third distribution throws away a lot of the relevant information.
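A stripped-down toy version of that point (my own, with hard 0-1 predictions standing in for output distributions, so "disagreement rate" stands in for KL): two models with identical aggregate loss against the labels can still behave very differently, and only a direct comparison between their outputs sees it.

```python
# Toy demonstration of the cancellation problem: two "models" with identical
# average loss against the labels, but different outputs. A metric that compares
# separate aggregate losses sees no difference; a direct comparison between the
# two models' outputs does.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)

# Model A: 90% accurate. Model B: also 90% accurate, but it errs on a
# completely different subset of inputs.
preds_a = labels.copy()
preds_a[:100] = 1 - preds_a[:100]
preds_b = labels.copy()
preds_b[-100:] = 1 - preds_b[-100:]

loss_a = np.mean(preds_a != labels)
loss_b = np.mean(preds_b != labels)
print(loss_a - loss_b)              # 0.0: the aggregate comparison sees nothing
print(np.mean(preds_a != preds_b))  # 0.2: the models disagree on 20% of inputs
```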

Comment by Lucius Bushnaq (Lblack) on Practical Pitfalls of Causal Scrubbing · 2023-03-27T09:27:27.996Z · LW · GW

CaSc can fail to reject a hypothesis if it is too unspecific and is extensionally equivalent to the true hypothesis.

Seems to me like this is easily resolved so long as you don't screw up your bookkeeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?

On the input-output level, we found that CaSc can fail to reject false hypotheses due to cancellation, i.e. because the task has a certain structural distribution that does not allow resampling to differentiate between different hypotheses.

I don't know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate losses on the data in the first place? Stamping behaviour down into a one-dimensional quantity like that is inevitably going to make behavioural comparison difficult.

Wouldn't you want to directly compare the divergence on outputs between the original graph $G$ and ablated graph $G'$ instead? The KL divergence between their output distributions over the data is the first thing that'd come to my mind. Or keeping whatever the original loss function is, but with the outputs of $G$ as the new ground truth labels.

That's still ad hocery of course, but it should at least take care of the failure mode you point out here. Is this really not part of current CaSc?
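For concreteness, the comparison I have in mind is just something like the following sketch, where `p` and `q` are assumed to be arrays of softmax output distributions from the original and ablated graphs over the same inputs:

```python
import numpy as np

def mean_kl(p, q, eps=1e-12):
    """Mean KL(p || q) over inputs; p and q have shape [n_inputs, n_classes]."""
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))
```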

Comment by Lucius Bushnaq (Lblack) on Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Microsoft Research · 2023-03-24T12:47:40.730Z · LW · GW

Thanks, I did not know this. A quick search for his images seems to show that they get colour and perspective right at least as well as this does. Provided this is fully real and there's nobody else in his process choosing colours and such. Tentatively marking this down as a win for natural abstraction.