[Linkpost] Remarks on the Convergence in Distribution of Random Neural Networks to Gaussian Processes in the Infinite Width Limit 2023-11-30T14:01:32.255Z
Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm 2023-06-01T11:31:37.796Z
A Neural Network undergoing Gradient-based Training as a Complex System 2023-02-19T22:08:56.243Z
Notes on the Mathematics of LLM Architectures 2023-02-09T01:45:48.544Z
On Developing a Mathematical Theory of Interpretability 2023-02-09T01:45:01.521Z
Some Notes on the mathematics of Toy Autoencoding Problems 2022-12-22T17:21:25.371Z
Behaviour Manifolds and the Hessian of the Total Loss - Notes and Criticism 2022-09-03T00:15:40.680Z
A brief note on Simplicity Bias 2022-08-14T02:05:01.000Z
Notes on Learning the Prior 2022-07-15T17:28:36.631Z
An observation about Hubinger et al.'s framework for learned optimization 2022-05-13T16:20:33.974Z


Comment by Spencer Becker-Kahn on Can we get an AI to do our alignment homework for us? · 2024-02-29T09:23:33.616Z · LW · GW

I for one would find it helpful if you included a link to at least one place that Eliezer had made this claim just so we can be sure we're on the same page. 

Roughly speaking, what I have in mind is that there are at least two possible claims. One is that 'we can't get AI to do our alignment homework' because by the time we have a very powerful AI that can solve alignment homework, it is already too dangerous to use the fact it can solve the homework as a safety plan. And the other is the claim that there's some sort of 'intrinsic' reason why an AI built by humans could never solve alignment homework.

Comment by Spencer Becker-Kahn on We need a Science of Evals · 2024-01-23T01:51:15.643Z · LW · GW

You refer a couple of times to the fact that evals are often used with the aim of upper bounding capabilities. To my mind this is an essential difficulty that acts as a point of disanalogy with things like aviation. I’m obviously no expert but in the case of aviation, I would have thought that you want to give positive answers to questions like “can this plane safely do X thousand miles?” - ie produce absolutely guaranteed lower bounds on ‘capabilities’. You don’t need to find something like the approximately smallest number Y such that it could never under any circumstances ever fly more than Y million miles.

Comment by Spencer Becker-Kahn on AlphaGeometry: An Olympiad-level AI system for geometry · 2024-01-18T12:36:59.220Z · LW · GW

Hmm it might be questionable to suggest that it is "non-AI" though? It's based on symbolic and algebraic deduction engines and afaict it sounds like it might be the sort of thing that used to be very much mainstream "AI" i.e. symbolic AI + some hard-coded human heuristics?

Comment by Spencer Becker-Kahn on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-01-18T11:26:23.251Z · LW · GW

FWIW I did not interpret Thane as necessarily having "high confidence" in "architecture / internal composition" of AGI. It seemed to me that they were merely (and ~accurately) describing what the canonical views were most worried about. (And I think a discussion about whether or not being able to "model the world" counts as a statement about "internal composition" is sort of beside the point/beyond the scope of what's really being said)

It's fair enough if you would say things differently(!) but in some sense isn't it just pointing out: 'I would emphasize different aspects of the same underlying basic point'. And I'm not sure if that really progresses the discussion? I.e. it's not like Thane Ruthenis actually claims that "scarily powerful artificial agents" currently exist. It is indeed true that they don't exist and may not ever exist. But that's just not really the point they are making so it seems reasonable to me that they are not emphasizing it.


I'd like to see justification of "under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?" and "why do we think 'future architectures' will have property X, or whatever?!". 

I think I would also like to see more thought about this. In some ways, after first getting into the general area of AI risk, I was disappointed that the alignment/safety community was not more focussed on questions like this. Like a lot of people, I'd been originally inspired by Superintelligence - significant parts of which relate to these questions imo - only to be told that the community had 'kinda moved away from that book now'. And so I sort of sympathize with the vibe of Thane's post (and worry that there has been a sort of mission creep)

Comment by Spencer Becker-Kahn on Value systematization: how values become coherent (and misaligned) · 2024-01-11T14:39:16.692Z · LW · GW

Newtonian mechanics was systematized as a special case of general relativity.

One of the things I found confusing early on in this post was that systemization is said to be about representing the previous thing as an example or special case of some other thing that is both simpler and more broadly-scoped. 

In my opinion, it's easy to give examples where the 'other thing' is more broadly-scoped and this is because 'increasing scope' corresponds to the usual way we think of generalisation, i.e. the latter thing applies to more settings or it is 'about a wider class of things' in some sense. But in many cases, the more general thing is not simultaneously 'simpler' or more economical. I don't think anyone would really say that general relativity is actually simpler than Newtonian mechanics. However, to be clear, I do think that there probably are some good examples of this, particularly in mathematics, though I haven't got one to hand.

Comment by Spencer Becker-Kahn on Understanding Subjective Probabilities · 2023-12-12T15:50:42.572Z · LW · GW

OK I think this will be my last message in this exchange but I'm still confused. I'll try one more time to explain what I'm getting at. 

I'm interested in what your precise definition of subjective probability is. 

One relevant thing I saw was the following sentence:

If I say that a coin is 50% likely to come up heads, that's me saying that I don't know the exact initial conditions of the coin well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish between the two options.

It seems to give something like a definition of what it means to say something has a 50% chance. i.e. I interpret your sentence as claiming that a statement like 'The probability of A is 1/2' means or is somehow the same as a statement a bit like

[*]  'I don't know the exact conditions and don't have enough meaningful/relevant knowledge to distinguish between the possible occurrence of (A) and (not A)'

My reaction was: This can't possibly be a good definition. 

The airplane puzzle was supposed to be a situation where there is a clear 'difference' in the outcomes: either the last person is in the 1 seat that matches their ticket number or they're not, i.e. they're in one of the other 99 seats. It's not as if it's a clearly symmetric situation from the point of view of the outcomes. So it was supposed to be an example where statement [*] does not hold, but where the probability is 1/2. It seems you don't accept that; it seems to me like you think that statement [*] does in fact hold in this case.

But tbh it feels sorta like you're saying you can't distinguish between the outcomes because you already know the answer is 1/2! i.e. Even if I accept that the outcomes are somehow indistinguishable, the example is sufficiently complicated on a first reading that there's no way you'd just look at it and go "hmm I guess I can't distinguish so it's 1/2", i.e. if your definition were OK it could be used to justify the answer to the puzzle, but that doesn't seem right to me either.  

Comment by Spencer Becker-Kahn on Understanding Subjective Probabilities · 2023-12-12T10:05:04.580Z · LW · GW

So my point is still: What is that thing? I think yes I actually am trying to push proponents of this view down to the metaphysics - If they say "there's a 40% chance that it will rain tomorrow", I want to know things like what it is that they are attributing 40%-ness to.  And what it means to say that that thing "has probability 40%".  That's why I fixated on that sentence in particular because it's the closest thing I could find to an actual definition of subjective probability in this post.


Comment by Spencer Becker-Kahn on Understanding Subjective Probabilities · 2023-12-12T09:47:48.741Z · LW · GW

I have in mind very simple examples.  Suppose that first I roll a die. If it doesn't land on a 6, I then flip a biased coin that lands on heads 3/5 of the time.  If it does land on a 6 I just record the result as 'tails'. What is the probability that I get heads? 

This is contrived so that the probability of heads is 

5/6 x 3/5 = 1/2.
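To spell out the arithmetic, here is a minimal sketch that computes the same number exactly with the law of total probability. The function and variable names are illustrative only; the procedure is the one described above (roll a die, record 'tails' on a 6, otherwise flip a coin that lands heads 3/5 of the time).

```python
from fractions import Fraction


def heads_probability():
    """Exact probability of recording 'heads' in the die-then-coin procedure."""
    p_six = Fraction(1, 6)               # die lands on 6: result recorded as 'tails'
    p_not_six = 1 - p_six                # die lands on 1-5: the biased coin is flipped
    p_heads_given_flip = Fraction(3, 5)  # bias of the coin
    p_heads_given_six = Fraction(0)      # a 6 can never produce 'heads'
    # Law of total probability over the two die cases.
    return p_not_six * p_heads_given_flip + p_six * p_heads_given_six


print(heads_probability())  # 1/2
```

Every input to this calculation is specific and asymmetric (a 1-in-6 event, a 3/5 bias), yet the result is exactly 1/2.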

But do you think that in saying this I mean something like "I don't know the exact initial conditions... well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish between the two options." ?

Another example: Have you heard of the puzzle about the people randomly taking seats on the airplane? It's a well-known probability brainteaser to which the answer is 1/2 but I don't think many people would agree that saying the answer is 1/2 actually means something like "I don't know the exact initial conditions... well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish between the two options." 
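For readers who haven't seen the puzzle, here is a rough sketch of how the 1/2 arises by calculation. It assumes the standard statement, which is not spelled out in this thread: n passengers board in ticket order, the first one sits in a uniformly random seat, and each later passenger takes their own seat if it is free and a uniformly random free seat otherwise. The function name and the recurrence are my own illustration, not anything from the post.

```python
from fractions import Fraction


def last_seat_probability(n):
    """Exact chance that the last of n passengers ends up in their own seat.

    Assumed rules (the standard puzzle): passenger 1 picks a uniformly random
    seat; every later passenger sits in their own seat if it is free and in a
    uniformly random free seat otherwise.
    """
    if n < 2:
        raise ValueError("need at least two passengers")
    # If the randomly seated passenger takes seat 1, the last passenger is
    # fine; if they take seat n, the last passenger is not; if they take seat
    # k in between, passenger k becomes the random one in a smaller copy of
    # the same problem. This gives p(n) = (1 + p(2) + ... + p(n-1)) / n.
    running_sum = Fraction(0)  # p(2) + ... + p(size - 1)
    p = Fraction(0)
    for size in range(2, n + 1):
        p = (1 + running_sum) / size
        running_sum += p
    return p


print(last_seat_probability(100))  # 1/2
```

The two outcomes here are clearly distinguishable (1 matching seat versus 99 others), and the 1/2 only appears after working through the recurrence.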

There needn't be any 'indistinguishability of outcomes' or 'lack of information' for something to have probability 0.5, it can just..well... be the actual result of calculating two distinguishable complementary outcomes.

Comment by Spencer Becker-Kahn on Understanding Subjective Probabilities · 2023-12-11T17:22:15.662Z · LW · GW

We might be using "meaning" differently then!

I'm fine with something being subjective, but what I'm getting at is more like: Is there something we can agree on about which we are expressing a subjective view? 

Comment by Spencer Becker-Kahn on Understanding Subjective Probabilities · 2023-12-11T17:19:12.039Z · LW · GW

I'm kind of confused what you're asking me - like which bit is "accurate" etc. Sorry, I'll try to re-state my question:

- Do you think that when someone says something has "a 50% probability" then they are saying that they do not have any meaningful knowledge that allows them to distinguish between two options?

I'm suggesting that you can't possibly think that, because there are obviously other ways things can end up 50/50. e.g. maybe it's just a very specific calculation, using lots of specific information, that ends up with the value 0.5 at the end. This is a different situation from having 'symmetry' and no distinguishing information.

Then I'm saying OK, assuming you indeed don't mean the above thing, then what exactly does one mean in general when saying something is 50% likely?


Comment by Spencer Becker-Kahn on Understanding Subjective Probabilities · 2023-12-11T12:10:44.995Z · LW · GW

Presumably you are not claiming that saying

...I don't know the exact initial conditions of the coin well enough to have any meaningful knowledge of how it's going to land, and I can't distinguish between the two options...

is actually necessarily what it means whenever someone says something has a 50% probability? Because there are obviously myriad ways something can have a 50% probability and this kind of 'exact symmetry between two outcomes' + no other information is only one very special way that it can happen. 

So what does it mean exactly when you say something is 50% likely?

Comment by Spencer Becker-Kahn on Understanding Subjective Probabilities · 2023-12-11T11:59:25.917Z · LW · GW

The traditional interpretation of probability is known as frequentist probability. Under this interpretation, items have some intrinsic "quality" of being some % likely to do one thing vs. another. For example, a coin has a fundamental probabilistic essence of being 50% likely to come up heads when flipped.

Is this right? I would have said that what you describe is more like the classical, logical view of probability, which isn't the same as the frequentist view. Even the wiki page you've linked seems to disagree with what you've written, i.e. it describes the frequentist view in the standard way of being about relative frequencies in the long-run. So it isn't a coin having intrinsic "50%-ness"; you actually need the construction of the repeated experiment in order to define the probability.

Comment by Spencer Becker-Kahn on At 87, Pearl is still able to change his mind · 2023-10-19T13:41:40.991Z · LW · GW

My rejoinder to this is that, analogously to how a causal model can be re-implemented as a more complex non-causal model[2], a learning algorithm that looks at data that in some ways is saying something about causality, be it because the data contains information-decision-action-outcome units generated by agents, because the learning thing can execute actions itself and reflectively process the information of having done such actions, or because the data contains an abstract description of causality, can surely learn causality.

Short comment/feedback just to say: This sentence is making one of your main points but is very tricky! - perhaps too long/too many subclauses?

Comment by Spencer Becker-Kahn on A Defense of Work on Mathematical AI Safety · 2023-09-14T14:19:27.700Z · LW · GW

Ah OK, I think I've worked out where some of my confusion is coming from:  I don't really see any argument for why mathematical work may be useful, relative to other kinds of foundational conceptual work. e.g. you write (with my emphasis): "Current mathematical research could play a similar role in the coming years..." But why might it? Isn't that where you need to be arguing? 

The examples seem to be of cases where people have done some kind of conceptual foundational work which has later gone on to influence/inspire ML work. But early work on deception or goodhart was not mathematical work, that's why I don't understand how these are examples. 


Comment by Spencer Becker-Kahn on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-09-05T09:58:34.651Z · LW · GW

Thanks for the comment Rohin, that's interesting (though I haven't looked at the paper you linked).

I'll just record some confusion I had after reading your comment that stopped me replying initially: I was confused by the distinction between modular and non-modular because I kept thinking: If I add a bunch of numbers and don't do any modding, then it is equivalent to doing modular addition modulo some large number (i.e. one strictly larger than the largest sum you get). And otoh if I tell you I'm doing 'addition modulo 113', but I only ever use inputs that add up to 112 or less, then you never see the fact that I was secretly intending to do modular addition. And these thoughts sort of stopped me having anything more interesting to add.
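A small sketch of both halves of that thought, with hypothetical helper names (nothing here is taken from the Nanda--Lieberum code): on a fixed set of inputs, plain addition and addition modulo n give the same outputs exactly when n is strictly larger than every sum that occurs.

```python
MODULUS = 113


def plain_add(a, b):
    """Ordinary integer addition, no reduction."""
    return a + b


def mod_add(a, b, n=MODULUS):
    """Addition modulo n."""
    return (a + b) % n


def indistinguishable_on(pairs, n):
    """True if plain and mod-n addition agree on every input pair."""
    return all(plain_add(a, b) == mod_add(a, b, n) for a, b in pairs)


# Inputs whose sums never exceed 112: the 'mod 113' is invisible.
small_inputs = [(a, b) for a in range(57) for b in range(57) if a + b <= 112]
print(indistinguishable_on(small_inputs, MODULUS))  # True

# All residues 0..112: the largest sum is 224, so any modulus >= 225 is
# invisible, while 224 itself is not (112 + 112 wraps around to 0).
all_inputs = [(a, b) for a in range(113) for b in range(113)]
print(indistinguishable_on(all_inputs, 225))  # True
print(indistinguishable_on(all_inputs, 224))  # False
```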


Comment by Spencer Becker-Kahn on A Defense of Work on Mathematical AI Safety · 2023-09-05T09:49:02.812Z · LW · GW

I'm still not sure I buy the examples. In the early parts of the post you seem to contrast 'machine learning research agendas' with 'foundational and mathematical'/'agent foundations' type stuff. Mechanistic interpretability can be quite mathematical but surely it falls into the former category? i.e. it is essentially ML work as opposed to constituting an example of people doing "mathematical and foundational" work. 

I can't say much about the Goodhart's Law comment but it seems at best unclear that its link to goal misgeneralization is an example of the kind you are looking for (i.e. in the absence of much more concrete examples, I have approximately no reason to believe it has anything to do with what one would call mathematical work). 

Comment by Spencer Becker-Kahn on A Defense of Work on Mathematical AI Safety · 2023-07-07T12:30:47.242Z · LW · GW

Strongly upvoted.

I roughly think that a few examples showing that this statement is true will 100% make OP's case. And that without such examples, it's very easy to remain skeptical.

Comment by Spencer Becker-Kahn on Brief summary of · 2023-06-30T13:14:45.398Z · LW · GW

Currently, it takes a very long time to get an understanding of who is doing what in the field of AI Alignment and how good each plan is, what the problems are, etc.

Is this not ~normal for a field that is maturing? And by normal I also mean approximately unavoidable or 'essential'. Like I could say 'it sure takes a long time to get an understanding of who is doing what in the field of... computer science', but I have no reason to believe that I can substantially 'fix' this situation in the space of a few months. It just really is because there is lots of complicated research going on by lots of different people, right? And 'understanding' what another researcher is doing is sometimes a really, really hard thing to do.

Comment by Spencer Becker-Kahn on ARC is hiring theoretical researchers · 2023-06-14T10:42:04.780Z · LW · GW

I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven't been very motivated to engage much with ARC's recent work).  But I decided maybe it's best to comment in a way that gives a better signal than silence. 

I've generally been pretty confused about Formalizing the presumption of Independence and, as the post sort of implies, this is sort of the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff about that. 

Disclaimer: a) I have not spent a lot of time trying to understand everything in the paper, and b) as is often the case, this comment may come across as overly critical, but it seems highest leverage to discuss my biggest criticisms, i.e. the things that, if they were addressed, may cause me to update to the point I would more strongly recommend people applying etc.

I suppose the tldr is that the main contribution of the paper claims to be the framing of a set of open problems, but I did not find the paper able to convince me that the problems are useful ones or that they would be interesting to answer.

I can try to explain a little more: It seemed odd that the "potential" applications to ML were mentioned very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on this. As a reader, it might seem natural to me that the authors would have already asked and answered - before writing the paper - questions like "OK so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?" Some of what was said in the paper was fairly vague stuff like: 

If successful, it may also help improve our ability to verify reasoning about complex questions, like those emerging in modern machine learning, for which we expect formal proof to be impossible. 

In my opinion, it's also important to bear in mind that the criterion of a problem being 'open' is a poor proxy for things like usefulness/interestingness. (Obviously those famous number theory problems are open, but so are loads of random mathematical statements.) The usefulness/interestingness of course comes because people recognize various other valuable things too, like: that the solution would seem to require new insights into X and therefore a proof would 'have to be' deeply interesting in its own right; or that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion, etc. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC's stating of these open problems about heuristic estimators is an interesting contribution in itself?

To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I'm saying:

Neither of these applications [to avoiding catastrophic failures or to ELK] is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal.

But practically it means that when I ask myself something like: 'Why would I drop whatever else I'm working on and work on this stuff?' I find it quite hard to answer in a way that's not basically just all deference to some 'vision' that is currently undeclared (or as the paper says "mostly defer[red]" to "future articles").

Having said all this I'll reiterate that there are lots of clear pros to a job like this and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the presumption of Independence and in this post.


Comment by Spencer Becker-Kahn on [deleted post] 2023-06-07T12:26:48.518Z

How exactly can an org like this help solve (what many people see as one of the main bottlenecks:) the issue of mentorship? How would Catalyze actually tip the scales when it comes to 'mentor matching'?

(e.g. see Richard Ngo's first high-level point in this career advice post)

Comment by Spencer Becker-Kahn on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-06-02T10:42:50.782Z · LW · GW

Hi Garrett, 

OK so just being completely honest, I don't know if it's just me but I'm getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good faith underlying point being made to which I can reply. So just to be clear:

  • I did not use any words such as "trivial", "obvious" or "simple". Stories like the one you recount are obviously making fun of mathematicians, some of whom do think it's cool to say things are trivial/simple/obvious after they understand them. I often strongly disagree and generally dislike this behaviour and think there are many normal mathematicians who don't engage in this sort of thing. In particular sometimes the most succinct insights are the hardest ones to come by (this isn't a reference to my post; just a general point). And just because such insights are easily expressible once you have the right framing and the right abstractions, they should by no means be trivialized.
  • I deliberately emphasized the subjectivity of making the sorts of judgements that I am making. Again this kinda forms part of the joke of the story.
  • I have indeed been aware of the work since when it was first posted 10 months ago or so and have given it some thought on and off for a while (in the first sentence of the post I was just saying that I didn't spend long writing the post, not that these thoughts were easily arrived-at).
  • I do not claim to have explained the entire algorithm, only to shed some light on why it might actually be a more natural thing to do than some people seem to have appreciated.
  • I think the original work is of a high quality and one might reasonably say 'groundbreaking'.

In another one of my posts I discuss at more length the kind of thing you bring up in the last sentence of your comment, e.g.

it can feel like the role that serious mathematics has to play in interpretability is primarily reactive, i.e. consists mostly of activities like 'adding' rigour after the fact or building narrow models to explain specific already-observed phenomena. 

....[but]... one of the most lauded aspects of mathematics is a certain inevitability with which our abstractions take on a life of their own and reward us later with insight, generalization, and the provision of predictions. Moreover - remarkably - often those abstractions are found in relatively mysterious, intuitive ways: i.e. not as the result of us just directly asking "What kind of thing seems most useful for understanding this object and making predictions?" but, at least in part, as a result of aesthetic judgement and a sense of mathematical taste.

And e.g. I talk about how this sort of thing has been the case in areas like mathematical physics for a long time. Part of the point is that (in my opinion, at least) there isn't any neat shortcut to the kind of abstract thinking that lets you make the sort of predictions you are making reference to. It is very typical that you have to begin by reacting to existing empirical phenomena and using them as scaffolding. But, to me, it has come across as though you are being somewhat dismissive of this fact? As if, when B might well follow from A and someone actually starts to do A, you say "I would be far more impressed if B" instead of "maybe that's progress towards B"?

(Also FWIW,  Neel claims here that regarding the algorithm itself, another researcher he knows "roughly predicted this".)

Comment by Spencer Becker-Kahn on 'Fundamental' vs 'applied' mechanistic interpretability research · 2023-05-26T08:40:28.664Z · LW · GW

Interesting thoughts!

It reminds me not only of my own writing on a similar theme but also of another one of these viewpoints/axes along which to carve interpretability work that is mentioned in this post by jylin04:

...a dream for interpretability research would be if we could reverse-engineer our future AI systems into human-understandable code. If we take this dream seriously, it may be helpful to split it into two parts: first understanding what "programming language" an architecture + learning algorithm will end up using at the end of training, and then what "program" a particular training regimen will lead to in that language  [7]. It seems to me that by focusing on specific trained models, most interpretability research discussed here is of the second type. But by constructing an effective theory for an entire class of architecture that's agnostic to the choice of dataset, PDLT is a rare example of the first type.

I don't necessarily totally agree with her phrasing but it does feel a bit like we are all gesturing at something vaguely similar (and I do agree with her that PDLT-esque work may have more insights in this direction than some people on our side of the community have appreciated).

FWIW, in a recent comment reply to Joseph Bloom, I also ended up saying a bit more about why I don't actually see myself working much more in this direction, despite it seeming very interesting, but I'm still on the fence about that. (And one last point that didn't make it into that comment is the difficulties posed by a world in which increasingly the plucky bands of interpretability researchers on the fringes literally don't even know what the cutting edge architectures and training processes in the biggest labs even are.)


Comment by Spencer Becker-Kahn on How MATS addresses “mass movement building” concerns · 2023-05-04T09:55:26.870Z · LW · GW

At the start you write

3. Unnecessarily diluting the field’s epistemics by introducing too many naive or overly deferent viewpoints.

And later Claim 3 is:

Scholars might defer to their mentors and fail to critically analyze important assumptions, decreasing the average epistemic integrity of the field

It seems to me there might be two things being pointed to?

A) Unnecessary dilution: Via too many naive viewpoints;
B) Excessive deference: Perhaps resulting in too few viewpoints or at least no new ones;

And arguably these two things are in tension, in the following sense: I think that to a significant extent, one of the sources of unnecessary dilution is the issue of less experienced people not learning directly from more experienced people and instead relying too heavily on other inexperienced peers to develop their research skills and tastes. i.e. you might say that A) is partly caused by insufficient deference.

I roughly think that the downsides of de-emphasizing deference and the accumulation of factual knowledge from more experienced people are worse than keeping it as sort of the zeroth order/default thing to aim for. It seems to me that to the extent that one believes that the field is making any progress at all, one should think that increasingly there will be experienced people whom less experienced people should expect - at least initially - to learn from/defer to.

Looking at it from the flipside, one of my feelings right now is that we need mentors who don't buy too heavily into this idea that deference is somehow bad;  I would love to see more mentors who can and want to actually teach people. (cf. The first main point - one that I agree with - that Richard Ngo made in his recent piece on advice: The area is mentorship constrained. )


Comment by Spencer Becker-Kahn on On Developing a Mathematical Theory of Interpretability · 2023-04-27T13:56:19.552Z · LW · GW

Hey Joseph, thanks for the substantial reply and the questions!



Why call this a theory of interpretability as opposed to a theory of neural networks? 

Yeah this is something I am unsure about myself (I wrote: "something that I'm clumsily thinking of as 'the mathematics of (the interpretability of) deep learning-based AI'"). But I think I was imagining that a 'theory of neural networks' would be definitely broader than what I have in mind as being useful for not-kill-everyoneism. I suppose I imagine it including lots of things that are interesting about NNs mathematically or scientifically but which aren't really contributing to our ability to understand and manage the intelligences that NNs give rise to. So I wanted to try to shift the emphasis away from 'understanding NNs' and towards 'interpreting AI'. 

But maybe the distinction is more minor than I was originally worried about; I'm not sure. 


have you made any progress on this topic or do you know anyone who would describe this explicitly as their research agenda? If so what areas are they working in.

No, I haven't really. It was - and maybe still is - a sort of plan B of mine. I don't know anyone who I would say has this as their research agenda. I think the closest/next-best things are well known, e.g. the more theoretical parts of Anthropic/Neel's work and more recently the interest in singular learning theory from Jesse Hoogland, Alexander GO, Lucius Bushnaq and maybe others. (afaict there is a belief that it's more than just 'theory of NNs' but can actually tell us something about safety of the AIs)


One thing I struggle to understand, and might bet against is that this won't involve studying toy models. To my mind, Neel's grokking work, Toy Models of Superposition, Bhillal's "A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations" all seems to be contributing towards important factors that no comprehensive theory of Neural Networks could ignore.... 

I think maybe I didn't express myself clearly or the analogy I tried to make didn't work as intended, because I think maybe we actually agree here(!). I think one reason I made it confusing is because my default position is more skeptical about MI than a lot of readers, with regards to the part where I said: "it is reasonable that the early stages of rigorous development don't naively 'look like' the kinds of things we ultimately want to be talking about. This is very relevant to bear in mind when considering things like the mechanistic interpretability of toy models." What I was trying to get at is that to me proving e.g. some mathematical fact about superposition in a toy model doesn't look like the kind of 'interpretability of AI' that you really ultimately want, it looks too low-level. It's a 'toy model' in the NN sense, but it's not a toy version of the hard part of the problem. But I was trying to say that you would indeed have to let people like mathematicians actually ask these questions - i.e. ask the questions about e.g. superposition that they would most want to know the answers to, rather than forcing them to only do work that obviously showed some connection to the bigger theme of the actual cognition of intelligent agents or whatever.

Thanks for the suggestions about next steps and for writing about what you're most interested in seeing. I think your second suggestion in particular is close to the sort of thing I'd be most interested in doing. But I think in practice, a number of factors have held me back from going down this route myself:

  • Main thing holding me back is probably something like: There just currently aren't enough people doing it - no critical mass. Obviously there's that classic game theoretic element here in that plausibly lots of people's minds would be simultaneously changed by there being a critical mass and so if we all dived in at once, it just works out. But it doesn't mean I can solve the issue myself. I would want way more people seriously signed up to doing this stuff including people with more experience than myself (and hopefully the possibility that I would have at least some 'access' to those people/opportunity to learn from them etc.) which seems quite unlikely.
  • It's really slow and difficult. I have had the impression talking to some people in the field that they like the sound of this sort of thing but I often feel that they are probably underestimating how slow and incremental it is. 
  • And a related issue is sort of the existence of jobs/job security/funding to seriously pursue it for a while without worrying too much in the short term about getting concrete results out.

Comment by Spencer Becker-Kahn on Misgeneralization as a misnomer · 2023-04-07T15:01:53.222Z · LW · GW

I spent some time trying to formulate a good response to this that analyzed the distinction between (1) and (2) (in particular how it may map onto types of pseudo alignment described in RFLO here) but (and hopefully this doesn't sound too glib) it started to seem like it genuinely mattered whether humans in separate individual heavily-defended cells being pumped full of opiates have in fact been made to be 'happy' or not?

I think because if so, it is at least some evidence that the pseudo-alignment during training is for instrumental reasons (i.e. maybe it was actually trying to do something that caused happiness). If not, then the pseudo-alignment might be more like (what RFLO calls) suboptimality in some sense i.e. it just looks aligned because it's not capable enough to imprison the humans in cells etc.

The type of pseudo-alignment in (2) otoh seems more clearly like "side-effect" alignment since you've been explicit that secretly it was pursuing other things that just happened to cash out into happiness in training.

AFAIK Richard Ngo was explicit about wanting to clean up the zoo of inner alignment failure types in RFLO and maybe there just was some cost in doing this - some distinctions had to be lost perhaps?

Comment by Spencer Becker-Kahn on Beren's "Deconfusing Direct vs Amortised Optimisation" · 2023-04-07T09:46:37.910Z · LW · GW

This is a very strong endorsement but I'm finding it hard to separate the general picture from RFLO:

mesa-optimization occurs when a base optimizer...finds a model that is itself an optimizer,


a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

i.e. a mesa-optimizer is a learned model that 'performs inference' (i.e. evaluates inputs) by internally searching and choosing an output based on some objective function.

Apparently a "direct optimizer" is something that "perform[s] inference by directly choosing actions[1] to optimise some objective function". This sounds almost exactly like a mesa-optimizer?

Comment by Spencer Becker-Kahn on LW Team is adjusting moderation policy · 2023-04-05T14:21:30.759Z · LW · GW

I've always found it a bit odd that Alignment Forum submissions are automatically posted to LW. 

If you apply some of these norms, then imo there are questionable implications, i.e. it seems weird to say that one should have read the sequences in order to post about mechanistic interpretability on the Alignment Forum.

Comment by Spencer Becker-Kahn on Looking back on my alignment PhD · 2023-03-22T16:11:43.944Z · LW · GW

I really like this post and found it very interesting, particularly because I'm generally interested in the relationship between the rationality side of the AI Alignment community and academia, and I wanted to register some thoughts. Sorry for the long comment on an old post and I hope this doesn't come across as pernickety. If anything I sort of feel like TurnTrout is being hard on himself. 

I think the tl;dr for my comment is sort of that to me the social dynamics "mistakes" don't really seem like mistakes - or at least not ones that were actually made by the author. 

Broadly speaking, these "mistakes" seem to me like mostly normal ways of learning and doing a PhD that happen for mostly good reasons and my reaction to the fact that these "mistakes" were "figured out" towards the end of the PhD is that this is a predictable part of the transition from being primarily a student to primarily an independent researcher (the fast-tracking of which would be more difficult than a lot of rationalists would like to believe). 

I also worry that emphasizing these things as "mistakes" might actually lead people to infer that they should 'do the opposite' from the start, which to me would sound like weird/bad advice: e.g. Don't try to catch up with people who are more knowledgeable than you; don't try to seem smart and defensible; don't defer, you can do just as well by thinking everything through for yourself. 

I broadly agree that

rationality is not about the bag of facts you know.

but AI alignment/safety/x-risk isn't synonymous with rationality (Or is it? I realise TurnTrout does not directly claim that it is, which is why I'm maybe more cautioning against a misreading than disagreeing with him head on, but maybe he or others think there is a much closer relationship between rationality and alignment work than I do?). 

Is there not, by this point, something at least a little bit like "a bag of facts" that one should know in AI Alignment? People have been thinking about AI alignment for at least a little while now.  And so like, what have they achieved? Do we or do we not actually have some knowledge about the alignment problem? It seems to me that it would be weird if we didn't have any knowledge - like if there was basically nothing that we should count as established and useful enough to be codified and recorded as part of the foundations of the subject. It's worth wondering whether this has perhaps changed significantly in the last 5-10 years though, i.e. during TurnTrout's PhD. That is, perhaps - during that time - the subject has grown a lot and at least some things have been sufficiently 'deconfused' to have become more established concepts etc.  But generally, if there are now indeed such things, then these are probably things that people entering the field should learn about.  And it would seem likely that a lot of the more established 'big names'/productive people actually know a lot of these things and that "catching up with them" is a pretty good instrumental/proxy way to get relevant knowledge that will help you do alignment work. (I almost want to say: I know it's not fashionable in rationality to think this, but wanting to impress the teacher really does work pretty well in practice when starting out!)

Focussing on seeming smart and defensible probably can ultimately lead to a bad mistake. But when framed more as "It's important to come across as credible" or "It's not enough to be smart or even right; you actually do need to think about how others view you and interact with you", it's not at all clear that it's a bad thing; and certainly it more clearly touches on a regular topic of discussion in EA/rationality about how much to focus on how one is seen or how 'we' are viewed by outsiders. Fwiw I don't see any real "mistake" being actually described in this part of the post. In my opinion, when starting out, probably it is kinda important to build up your credibility more carefully. Then when Quintin came to TurnTrout, he writes that it took "a few days" to realize that Quintin's ideas could be important and worth pursuing.  Maybe the expectation in hindsight would be that he should have had the 'few days' old reaction immediately?? But my gut reaction is that that would be way too critical of oneself and actually my thought is more like 'woah he realised that after thinking about it for only a few days; that's great'. Can the whole episode not be read as a straightforward win: "Early on, it is important to build your own credibility by being careful about your arguments and being able to back up claims that you make in formal, public ways. Then as you gain respect for the right reasons, you can choose when and where to 'spend' your credibility... here's a great example of that..."

And then re: deference, certainly it was true for me that when I was starting out in my PhD, if I got confused reading a paper or listening to a talk, I was likely to be the one who was wrong.  Later on or after my PhD, then, yeah, when I got confused by someone else's presentation, I was less likely to be wrong and it was more likely I was spotting an error in someone else's thinking. To me this seems like a completely normal product of the education and sort of the correct thing to be happening. i.e. Maybe the correct thing to do is to defer more when you have less experience and to gradually defer less as you gain knowledge and experience? I'm thinking that under the simple model that when one is confused about something, either you're misunderstanding or the other person is wrong, one starts out in the regime where your confusion is much more often better explained by the fact you have misunderstood and you end up in the regime where you actually just have way more experience thinking about these things and so are now more reliably spotting other people's errors. The rational response to the feeling of confusion changes once you have fully accounted for the fact that you just know way more stuff and are a way more experienced thinker about alignment. (One also naturally gains a huge boost to confidence as it becomes clear you will get your PhD and have good postdoc prospects etc... so it becomes easier to question 'authority' for that reason too, but it's not a fake confidence boost; this is mostly a good/useful effect because you really do now have experience of doing research yourself, so you actually are more likely to be better at spotting these things).


Comment by Spencer Becker-Kahn on Natural Abstractions: Key claims, Theorems, and Critiques · 2023-03-17T11:54:22.481Z · LW · GW

I've only skimmed this, but my main confusions with the whole thing are still on a fairly fundamental level. 

You spend some time saying what abstractions are, but when I see the hypothesis written down, most of my confusion is on what "cognitive systems" are and what one means by "most". Afaict it really is a kind of empirical question to do with "most cognitive systems". Do we have in mind something like 'animal brains and artificial neural networks'? If so then surely let's just say that and make the whole thing more concrete; so I suspect not....but in that case....what does it include? And how will we know if 'most' of them have some property? (At the moment, whenever I find evidence that two systems don't share an abstraction that they 'ought to' I can go "well the hypothesis is only most"...)

Comment by Spencer Becker-Kahn on Shutting Down the Lightcone Offices · 2023-03-15T10:33:07.851Z · LW · GW

Something ~ like 'make it legit' has been and possibly will continue to be a personal interest of mine.

I'm posting this after Rohin entered this discussion - so Rohin, I hope you don't mind me quoting you like this, but fwiw I was significantly influenced by this comment on Buck's old talk transcript 'My personal cruxes for working on AI safety'. (Rohin's comment repeated here in full and please bear in mind this is 3 years old; his views I'm sure have developed and potentially moved a lot since then:)

I enjoyed this post, it was good to see this all laid out in a single essay, rather than floating around as a bunch of separate ideas.

That said, my personal cruxes and story of impact are actually fairly different: in particular, while this post sees the impact of research as coming from solving the technical alignment problem, I care about other sources of impact as well, including:

1. Field building: Research done now can help train people who will be able to analyze problems and find solutions in the future, when we have more evidence about what powerful AI systems will look like.

2. Credibility building: It does you no good to know how to align AI systems if the people who build AI systems don't use your solutions. Research done now helps establish the AI safety field as the people to talk to in order to keep advanced AI systems safe.

3. Influencing AI strategy: This is a catch all category meant to include the ways that technical research influences the probability that we deploy unsafe AI systems in the future. For example, if technical research provides more clarity on exactly which systems are risky and which ones are fine, it becomes less likely that people build the risky systems (nobody _wants_ an unsafe AI system), even though this research doesn't solve the alignment problem.

As a result, cruxes 3-5 in this post would not actually be cruxes for me (though 1 and 2 would be).

Comment by Spencer Becker-Kahn on EIS V: Blind Spots In AI Safety Interpretability Research · 2023-02-18T01:57:54.089Z · LW · GW

Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here.  And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.

Ok it sounds to me like maybe there's at least two things being talked about here. One situation is

 A) Where a community includes different groups working on the same topic, and where those groups might use different terminology and have different ways of thinking about the same phenomena etc. This seems completely normal to me. The other situation is 

B) Where a group is isolated from the community at large and is using different terminology/thinking about things differently just as a result of their isolation and lack of communication. And where that behaviour then causes confusion and/or wasting of resources.

The latter doesn't sound good, but I guess it looks to me like some or many of your points are consistent with the former being the case. So when you write e.g. that it's not "necessarily a good thing either" or ask for my steelmanned case, this doesn't seem to quite make sense to me. I feel like if something is not necessarily good or bad, and you want to raise it as a criticism, then the onus would be on you to bring the case against TAISIC with arguments that are not general ones that could easily apply to both A) and B) above. e.g.  It'd be more of an emphatic case if you were able to go into the details and be like "X did this work here and claimed it was new but actually it exists in Y's paper here" or give a real example of needless confusion that was created and could have been avoided. Focussing just on what they did or didn't 'engage with' on the level of general concepts and citations/acknowledgements doesn't bring this case convincingly, in my opinion. Some more vague thoughts on why that is:

  • Bodies of literature like this are usually very complicated and messy and people genuinely can't be expected to engage with everything. 
  • It's often hard or impossible to track dependencies of ideas because of all the communication you cannot see and not being able to see 'how' people are thinking of things, only what they wrote.
  • Someone publishing on the same idea or concept or topic as you is nowhere near the same as someone actually doing the exact same technical thing that you are doing.  ime the former is happening all the time; and the latter is much rarer than people often think. 
  • Reinvention, re-presentation and even outright renaming or 'starting from scratch' are all valuable elements of scholarship that help a field move along.

Idk maybe I'm just repeating myself at this point.

On the other point: It may turn out that MI's analogy with software reverse engineering does not produce methods and is just used as a high-level analogy, but it seems too early to say from my perspective - the two posts I linked are from last year. TAISIC is still pretty small and experienced researchers in TAISIC are fewer and this is potentially a large and difficult research agenda.

Comment by Spencer Becker-Kahn on EIS V: Blind Spots In AI Safety Interpretability Research · 2023-02-17T19:20:47.644Z · LW · GW

Re: e.g. superposition/entanglement: 

I think people should try to understand the wider context into which they are writing, but I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names. In fact I'd say this happens all the time and generally people can just hold in their minds that another group has another name for it.  Naturally, the two groups will have slightly different perspectives and this a) Is often good, i.e. the interference can be constructive and b) Can be a reason in favour of different terminology, i.e. even if something is "the same" when boiled down to a formal level, the different names can actually help delineate different interpretations.

In fact it's almost like a running joke in academia that there's always someone grumbling that you didn't cite the right things (their favourite work on this topic, their fellow countryman, them etc.) and because of the way academic literature works, some of the things that you are doing here can be done with almost any piece of work in the literature, i.e. you can comb over it with the benefit of hindsight and say 'hang on this isn't as original as it looked; basically the same idea that was written about here X years before' etc.  Honestly, I don't usually think of this as a valuable exercise, but I may be missing something about your wider point or be more convinced once I've looked at more of your series.

Another point when it comes to 'originality' and 'progress' is that it's often unimportant if some idea was generally discussed, labelled, named, or thought about before if what matters is actual results and the lower-level content of these works. i.e. I may be wrong, but looking at what you are saying, I don't think you are literally pulling up an older paper on 'entanglement' that made the exact same points that the Anthropic papers were making and did very similar experiments (Or are you?) And even having said that, reproducing experiments exactly is of course very valuable.

Re: MI and program synthesis:

I understand that your take is that it is closer to program synthesis or program induction and that these aren't all the same thing but in the first subsection of the "TAISIC has reinvented..." section, I'm a little confused why there's no mention of reverse engineering programs from compiled binary? The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about and trying to understand (see e.g. Olah, and Nanda, in which he consults an expert). 

Comment by Spencer Becker-Kahn on On Developing a Mathematical Theory of Interpretability · 2023-02-11T23:45:13.950Z · LW · GW

Thanks very much for the comments. I think you've asked a bunch of very good questions. I'll try to give some thoughts:

Deep learning as a field isn't exactly known for its rigor. I don't know of any rigorous theory that isn't as you say purely 'reactive', with none of it leading to any significant 'real world' results. As far as I can tell this isn't for a lack of trying either. This has made me doubt its mathematical tractability, whether it's because our current mathematical understanding is lacking or something else (DL not being as 'reductionist' as other fields?). How do you lean in this regard? You mentioned that you're not sure when it comes to how amenable interpretability itself is, but would you guess that it's more or less amenable than deep learning as a whole?

I think I kind of share your general concern here and I’m uncertain about it. I kind of agree that it seems like people had been trying for a while to figure out the right way to think about deep learning mathematically and that for a while it seemed like there wasn’t much progress. But I mean it when I say these things can be slow. And I think that the situation is developing and has changed - perhaps significantly - in the last ~5 years or so, with things like the neural tangent kernel, the Principles of Deep Learning Theory results and increasingly high-quality work on toy models. (And even when work looks promising, it may still take a while longer for the cycle to complete and for us to get ‘real world’ results back out of these mathematical points of view, but I have more hope than I did a few years ago). My current opinion is that certain aspects of interpretability will be more amenable to mathematics than understanding DNN-based AI as a whole.



How would success of this relate to capabilities research? It's a general criticism of interpretability research that it also leads to heightened capabilities, would this fare better/worse in that regard? I would have assumed that a developed rigorous theory of interpretability would probably also entail significant development of a rigorous theory of deep learning.

I think basically your worries are sound. If what one is doing is something like ‘technical work aimed at understanding how NNs work’ then I don’t see there as being much distinction between capabilities and alignment; you are really generating insights that can be applied in many ways, some good some bad (and part of my point is you have to be allowed to follow your instincts as a scientist/mathematician in order to find the right questions). But I do think that given how slow and academic the kind of work I’m talking about is, it’s neglected by both a) short timelines-focussed alignment people and b) capabilities people.



How likely is it that the direction one may proceed in would be correct? You mention an example in mathematical physics, but note that it's perhaps relatively unimportant that this work was done for 'pure' reasons. This is surprising to me, as I thought that a major motivation for pure math research, like other blue sky research, is that it's often not apparent whether something will be useful until it's well developed. I think this is similar to you mentioning that the small scale problems will not look like the larger problem. You mention that this involves following one's nose mathematically, do you think this is possible in general or only for this case? If it's the latter, why do you think interpretability is specifically amenable to it?

Hmm, that's interesting. I'm not sure I can say how likely it is one would go in the correct direction. But in my experience the idea that 'possible future applications' is one of the motivations for mathematicians to do 'blue sky' research is basically not quite right. I think the key point is that the things mathematicians end up chasing for 'pure' math/aesthetic reasons seem to be oddly and remarkably relevant when we try to describe natural phenomena (iirc this is basically a key point in Wigner's famous 'Unreasonable Effectiveness' essay.)  So I think my answer to your question is that this seems to be something that happens "in general" or at least does happen in various different places across science/applied math.

Comment by Spencer Becker-Kahn on Notes on the Mathematics of LLM Architectures · 2023-02-09T19:50:22.128Z · LW · GW

Ah thanks very much Daniel. Yes now that you mention it I remember being worried about this a few days ago but then either forgot or (perhaps mistakenly) decided it wasn't worth expanding on. But yeah I guess you don't get a well-defined map until you actually fix how the tokenization happens with another separate algorithm. I will add it to the list of things to fix/expand on in an edit.

Comment by Spencer Becker-Kahn on On Developing a Mathematical Theory of Interpretability · 2023-02-09T19:47:51.605Z · LW · GW

>There is no difference between natural phenomena and DNNs (LLMs, whatever). DNNs are 100% natural

I mean "natural" as opposed to "man made". i.e. something like "occurs in nature without being built by something or someone else". So in that sense, DNNs are obviously not natural in the way that the laws of physics are.

I don't see information and computation as only mathematical; in fact in my analogies I write about the mathematical abstractions we build as being separate from the things that one wants to describe or make predictions about.  And this applies to the computations in NNs too. 

I don't want to study AI as mathematics or believe that AI is mathematics. I write that the practice of doing mathematics will only seek out the parts of the problem that are actually amenable to it; and my focus is on interpretability and not other places in AI that one might use mathematics (like, say, decision theory). 

You write "As an example, take "A mathematical framework for transformer circuits": it doesn't develop new mathematics. It just uses existing mathematics: tensor algebra." I don't think we are using 'new mathematics' in the same way and I don't think the way you are using it is commonplace. Yes I am discussing the prospect of developing new mathematics, but this doesn't only mean something like 'making new definitions' or 'coming up with new objects that haven't been studied before'.  If I write a proof of a theorem that "just" uses "existing" mathematical objects, say like...matrices, or finite sets, then that seems to have little bearing on how 'new' the mathematics is. It may well be a new proof, of a new theorem, containing new ideas etc. etc. And it may well need to have been developed carefully over a long period of time.

Comment by Spencer Becker-Kahn on "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability · 2022-11-01T10:29:43.503Z · LW · GW

I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.

Comment by Spencer Becker-Kahn on Prize idea: Transmit MIRI and Eliezer's worldviews · 2022-09-21T21:20:26.716Z · LW · GW

Interesting idea. I think it’s possible that a prize is the wrong thing for getting the best final result (but also possible that getting a half decent result is more important than a high variance attempt at optimising for the best result). My thinking is: To do what you’re suggesting to a high standard could take months of serious effort. The idea of someone really competent doing so just for the chance at some prize money doesn’t quite seem right to me… I think there could be people out there who in principle could do it excellently but who would want to know that they’d ‘got the job’ as it were before spending serious effort on it.

Comment by Spencer Becker-Kahn on An Update on Academia vs. Industry (one year into my faculty job) · 2022-09-08T20:57:02.554Z · LW · GW

I think I would support Joe's view here that clarity and rigour are significantly different... but maybe - David - your comments are supposed to be specific to alignment work? e.g. I can think of plenty of times I have read books or articles in other areas and fields that contain zero formal definitions, proofs, or experiments but are obviously "clear", well-explained, well-argued etc. So by your definitions is that not a useful and widespread form of rigour-less clarity? (One that we would want to 'allow' in alignment work?) Or would you instead maintain that such writing can't ever really be clear without proofs or experiments?

I tend to think one issue is more that it's really hard to do well (clear, useful, conceptual writing that is) and that many of the people trying to do it in alignment are inexperienced in doing it (and often have backgrounds in fields where things like proofs or experiments are the norm).

Comment by Spencer Becker-Kahn on Behaviour Manifolds and the Hessian of the Total Loss - Notes and Criticism · 2022-09-04T17:02:29.482Z · LW · GW

I agree that the space  may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space  may well be a more natural one. (It's of course the space of functions , and so a space in which 'model space' naturally sits in some sense. )

Comment by Spencer Becker-Kahn on Basin broadness depends on the size and number of orthogonal features · 2022-09-02T00:20:58.652Z · LW · GW

You're right about the loss thing; it isn't as important as I first thought it might be. 

Comment by Spencer Becker-Kahn on Basin broadness depends on the size and number of orthogonal features · 2022-08-31T16:35:13.093Z · LW · GW

It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.

I'm sorry but the fact that it is scalar output isn't explained and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say "The Hessian matrix for this network would be...", you don't get a factorization like that; you can't pull out the Hessian of the loss as a scalar, it instead acts in the way that I have written - like a bilinear form for the multiplication between the rows and columns of .
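To make the vector-output point concrete, here is a small numerical sketch. Everything in it is an assumption made up for illustration and is not taken from the original post: a toy two-parameter model `f` with two outputs, a quadratic per-example loss whose Hessian `A` is not a multiple of the identity, and labels chosen so that the fit is exact at `THETA_STAR` (so the first derivative of the per-example loss vanishes there). Under those assumptions, a finite-difference Hessian of the total loss should match the sum of `J^T A J` over the data, and should not match the plain sum of `J^T J`.

```python
import math

# All names and numbers here are hypothetical, chosen only for illustration.
A = [[2.0, 0.5], [0.5, 1.0]]          # Hessian of the per-example loss w.r.t. the outputs
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
XS = [0.5, -1.0, 2.0]                 # toy training inputs
THETA_STAR = [0.3, -0.7]              # parameters at which the fit is exact

def f(theta, x):
    """Toy model with two parameters and a 2-dimensional output."""
    a, b = theta
    return [math.tanh(a * x), a * b * x]

def jacobian(theta, x):
    """Analytic Jacobian d f_k / d theta_i, stored as J[k][i]."""
    a, b = theta
    s = 1.0 - math.tanh(a * x) ** 2
    return [[x * s, 0.0], [b * x, a * x]]

# Labels are the model outputs at THETA_STAR, so every residual is zero there.
LABELS = [f(THETA_STAR, x) for x in XS]

def total_loss(theta):
    """Sum over the data of 0.5 * d^T A d, where d = f(theta, x) - y."""
    total = 0.0
    for x, y in zip(XS, LABELS):
        d = [u - v for u, v in zip(f(theta, x), y)]
        total += 0.5 * sum(d[k] * A[k][l] * d[l] for k in range(2) for l in range(2))
    return total

def fd_hessian(fn, theta, h=1e-4):
    """Central finite-difference approximation to the Hessian of fn at theta."""
    n = len(theta)
    def shifted(i, si, j, sj):
        t = list(theta)
        t[i] += si * h
        t[j] += sj * h
        return fn(t)
    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
             for j in range(n)] for i in range(n)]

def sandwich(theta, M):
    """Sum over the data of J^T M J, with M acting as a bilinear form on the outputs."""
    n = len(theta)
    out = [[0.0] * n for _ in range(n)]
    for x in XS:
        J = jacobian(theta, x)
        for i in range(n):
            for j in range(n):
                out[i][j] += sum(J[k][i] * M[k][l] * J[l][j]
                                 for k in range(2) for l in range(2))
    return out

def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

H = fd_hessian(total_loss, THETA_STAR)
print("gap to sum of J^T A J:", max_abs_diff(H, sandwich(THETA_STAR, A)))
print("gap to sum of J^T J:  ", max_abs_diff(H, sandwich(THETA_STAR, IDENTITY)))
```

The first gap should be at the level of finite-difference error, while the second should be visibly non-zero. That is the sense in which the Hessian of the loss cannot be pulled out as a scalar unless it happens to be a multiple of the identity.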

A feature to me is the same kind of thing it is to e.g. Chris Olah. It's the function mapping network input to the activations of some neurons, or linear combination of neurons, in the network.

I'm not assuming that the function is linear in \Theta. If it was, this whole thing wouldn't just be an approximation within second order Taylor expansion distance, it'd hold everywhere. 

OK maybe I'll try to avoid a debate about exactly what 'feature' means or means to different people, but in the example, you are clearly using . This is a linear function of the  variables. (I said "Is the example not too particular it is linear in " - I'm not sure how we have misunderstood each other, perhaps you didn't realise I meant this example as opposed to the whole post in general). But what it means is that in the next line when you write down the derivative with respect to , it is an unusually clean expression because it now doesn't depend on  So again, in the crucial equation right after you say "The Hessian matrix for this network would be...", you in general get  variables appearing in the matrix. It is just not as clean as this expression suggests in general.

Comment by Spencer Becker-Kahn on Taking the parameters which seem to matter and rotating them until they don't · 2022-08-30T20:33:25.965Z · LW · GW

I'm not at liberty to share it directly but I am aware that Anthropic have a draft of small toy models with hand-coded synthetic data showing superposition very cleanly. They go as far as saying that searching for an interpretable basis may essentially be mistaken.

Comment by Spencer Becker-Kahn on Basin broadness depends on the size and number of orthogonal features · 2022-08-29T12:36:34.321Z · LW · GW

I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version and I could be wrong but I also think that there are some issues here. (My notation is slightly different because  for me the sum over  was included in the function I called "", but it doesn't affect my main point).

I think the most concrete thing is that the function  - i.e. the `input-output' function of a neural network - should in general have a vector output, but you write things like 

without any further explanation or indices. In your main computation it seems like it's being treated as a scalar. 

Since we never change the labels or the dataset, one can drop the explicit dependence on  from our notation for . Then if the network has  neurons in the final layer, the codomain of the function  is  (unless I've misunderstood what you are doing?). So to my mind we have:

Going through the computation in full using the chain rule (and a local minimum of the loss function ) one would get something like: 

Vivek wanted to suppose that  were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss. But without such an assumption, I don't think that the term 

appears (this is the matrix you describe as containing "the  inner products of the features over the training data set.")

Another (probably more important but higher-level) issue is basically: What is your definition of 'feature'? I could say: Have you not essentially just defined `feature' to be something like `an entry of '? Is the example not too contrived, in the sense that it clearly supposes that  has a very special form (in particular it is linear in the  variables so that the derivatives are not functions of )?

Comment by Spencer Becker-Kahn on Notes on Learning the Prior · 2022-07-26T16:13:12.859Z · LW · GW

Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.

Comment by Spencer Becker-Kahn on The inordinately slow spread of good AGI conversations in ML · 2022-06-22T11:29:13.437Z · LW · GW

Thanks for the nice reply. 

I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilibria), and I think they're sufficient to ~fully make sense of what's going on. So I don't feel confused about the situation anymore. By "shocking" I meant something more like "calls for an explanation", not "calls for an explanation, and I don't have an explanation that feels adequate". (With added overtones of "horrifying".)

Yeah, OK, I think that helps clarify things for me.

As someone who was working at MIRI in 2014 and watched events unfolding, I think the Hawking article had a negligible impact and the Musk stuff had a huge impact. Eliezer might be wrong about why Hawking had so little impact, but I do think it didn't do much.

Maybe we're misunderstanding each other here. I don't really doubt what you're saying there^ i.e. I am fully willing to believe that the Hawking thing had negligible impact and the Musk tweet had a lot. I'm more pointing to why Musk had a lot rather than why Hawking had little: I'm trying to point out that since Musk was reacting to Superintelligence, one might ask whether he could have had a similar impact without Superintelligence. And so maybe the anecdote could be used as evidence that Superintelligence was really the thing that helped 'break the silence'. However, Superintelligence feels way less like "being blunt" and "throwing a brick" and - at least from the outside - looks way more like the "scripts, customs, and established protocols" of "normal science" (i.e. Oxford philosophy professor writes book with somewhat tricky ideas in it, published by OUP, reviewed by the NYT etc. etc.) and clearly is an attempt to make unusual ideas sound "sober and serious". So I'm kind of saying that maybe the story doesn't necessarily argue against the possibility of doing further work like that, i.e. writing books that manage to stay respectable and manage to "speak accurately and concretely about the future of AI without sounding like a sci-fi weirdo"(?)

Comment by Spencer Becker-Kahn on The inordinately slow spread of good AGI conversations in ML · 2022-06-22T09:53:42.053Z · LW · GW

I'm a little sheepish about trying to make a useful contribution to this discussion without spending a lot of time thinking things through but I'll give it a go anyway. There's a fair amount that I agree with here, including that there are by now a lot of introductory resources. But regarding the following:

(I do think it's possible to create a much better intro resource than any that exist today, but 'we can do much better' is compatible with 'it's shocking that the existing material hasn't already finished the job'.)

I feel like I want to ask: Do you really find it "shocking"? My experience with explaining things to more general audiences leaves me very much of the opinion that it is by default an incredibly slow and difficult process to get unusual, philosophical, mathematical, or especially technical ideas to permeate. I include 'average ML engineer' as something like a "more general audience" member relative to MIRI style AGI Alignment theory. I guess I haven't thought about it much but presumably there exist ideas/arguments that are way more mainstream, also very important, and with way more written about them that people still somehow, broadly speaking, don't engage with or understand?

I also don't really understand how the point that is being made in the quote from Inadequate Equilibria is supposed to work. Perhaps in the book more evidence is provided for when "the silence broke", but the Hawking article was before the release of Superintelligence and then the Musk tweet was after it and was reacting to it(!). So I guess I'm sticking up for AGI x-risk respectability politics a bit here because surely I might also use essentially this same anecdote to support the idea that boring old long-form academic writing that clearly lays things out in as rigorous a way as possible is actually more the root cause that moved the needle here? Even if it ultimately took the engagement of Musk's off the cuff tweets, Gates, or journalists etc., they wouldn't have had something respectable enough to bounce off had Bostrom not given them the book.

Comment by Spencer Becker-Kahn on Information Loss --> Basin flatness · 2022-05-27T08:29:59.645Z · LW · GW

Thanks again for the reply.

In my notation, something like   or  are functions in and of themselves. The function  evaluates to zero at local minima of 

In my notation, there isn't any such thing as .

But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathematical details again, checking every step at the lowest level of detail that you can and using the notation that makes most sense to you. 

Comment by Spencer Becker-Kahn on Information Loss --> Basin flatness · 2022-05-26T12:58:37.393Z · LW · GW

Thanks for the substantive reply.

First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out.  The loss of a model with parameters  can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function  we have: 

i.e. it's  that might be something like MSE, but the function '' is of course more mysterious because it includes the way that parameters are actually mapped to a working model. Anyway, to perform some computations with this, we are looking at an expression like 

We want to differentiate this twice with respect to  essentially. Firstly, we have 

where - just to keep track of this - we've got: 

Or, using 'coordinates' to make it explicit: 

for . Then for  we differentiate again:


This is now at the level of  matrices. Avoiding getting into any depth about tensors and indices, the  term is basically a  tensor-type object and it's paired with  which is a  vector to give something that is .
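In case it helps to see the shape of this computation written out in one place, here is a sketch in notation that I am assuming purely for illustration (it need not match the symbols used in the post): parameters $\theta \in \mathbb{R}^n$, a model map $f : \mathbb{R}^n \to \mathbb{R}^k$, an outer loss function $\ell : \mathbb{R}^k \to \mathbb{R}$, and $J$ for the $k \times n$ Jacobian of $f$.

```latex
% Assumed notation: \theta \in \mathbb{R}^n,  f : \mathbb{R}^n \to \mathbb{R}^k,
% \ell : \mathbb{R}^k \to \mathbb{R},  y = f(\theta),  J = Df(\theta) \in \mathbb{R}^{k \times n}.
L(\theta) = \ell\big(f(\theta)\big)

% First derivative (chain rule), for i = 1, \dots, n:
\frac{\partial L}{\partial \theta_i}
  = \sum_{a=1}^{k} \frac{\partial \ell}{\partial y_a}\big(f(\theta)\big)\,
    \frac{\partial f_a}{\partial \theta_i}(\theta)

% Second derivative (product rule, then chain rule again), for i, j = 1, \dots, n:
\frac{\partial^2 L}{\partial \theta_i \, \partial \theta_j}
  = \sum_{a,b=1}^{k} \frac{\partial^2 \ell}{\partial y_a \, \partial y_b}\,
      \frac{\partial f_a}{\partial \theta_i}\,\frac{\partial f_b}{\partial \theta_j}
  + \sum_{a=1}^{k} \frac{\partial \ell}{\partial y_a}\,
      \frac{\partial^2 f_a}{\partial \theta_i \, \partial \theta_j}

% The same identity at the level of n x n matrices:
\nabla^2 L = J^{\top} \big(\nabla^2 \ell\big) J
  + \sum_{a=1}^{k} \frac{\partial \ell}{\partial y_a}\, \nabla^2 f_a
```

The first term only reduces to a constant times $J^{\top} J$ when $\nabla^2 \ell$ is a multiple of the identity, and the second term carries the first derivatives of $\ell$.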

So what I think you are saying now is that if we are at a local minimum for , then the second term on the right-hand side vanishes (because the term includes the first derivatives of , which are zero at a minimum). You can see however that if the Hessian of  is not a multiple of the identity (like it would be for MSE), then the claimed relationship does not hold, i.e. it is not the case that in general, at a minimum of , the Hessian of the loss is equal to a constant times . So maybe you really do want to explicitly assume something like MSE.

I agree that assuming MSE, and looking at a local minimum, you have  . 
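For anyone who wants to check this numerically, here is a small sketch under assumptions of my own: a toy model `f(theta) = tanh(W @ theta)` with made-up weights `W`, labels chosen so that `theta0` is an exact minimiser, and a quadratic outer loss `r @ A @ r` in which `A` is the identity for the MSE-like case and a non-identity diagonal matrix otherwise. None of these names come from the post. The Hessian of the total loss is estimated by central finite differences and compared against the Jacobian expressions.

```python
import numpy as np

# Hypothetical toy model: n = 3 parameters, k = 2 outputs.
W = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])
theta0 = np.array([0.1, -0.2, 0.3])

def f(theta):
    return np.tanh(W @ theta)

def jacobian(theta):
    # d tanh(W theta) / d theta = diag(1 - tanh^2(W theta)) @ W
    return (1.0 - np.tanh(W @ theta) ** 2)[:, None] * W

target = f(theta0)  # labels chosen so that theta0 is an exact minimiser

def make_loss(A):
    # Outer loss r^T A r with r = f(theta) - target; its Hessian in the outputs is 2A.
    def loss(theta):
        r = f(theta) - target
        return r @ A @ r
    return loss

def fd_hessian(fun, theta, h=1e-4):
    # Central finite-difference estimate of the Hessian of a scalar function.
    n = theta.size
    E = np.eye(n) * h
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (fun(theta + E[i] + E[j]) - fun(theta + E[i] - E[j])
                       - fun(theta - E[i] + E[j]) + fun(theta - E[i] - E[j])) / (4 * h * h)
    return H

J = jacobian(theta0)
G = J.T @ J

A_other = np.diag([1.0, 5.0])  # outer-loss Hessian that is not a multiple of the identity
H_mse = fd_hessian(make_loss(np.eye(2)), theta0)
H_other = fd_hessian(make_loss(A_other), theta0)

# MSE-like case: the Hessian at the minimum is 2 * J^T J.
mse_matches = np.allclose(H_mse, 2.0 * G, atol=1e-5)
# General case: it is J^T (2A) J instead.
other_matches = np.allclose(H_other, J.T @ (2.0 * A_other) @ J, atol=1e-5)
# Distance from H_other to the best-fitting multiple of J^T J (Frobenius norm).
c = np.sum(G * H_other) / np.sum(G * G)
residual = np.linalg.norm(H_other - c * G)
print(mse_matches, other_matches, residual)
```

In this toy case the residual is far from zero, so no constant multiple of `J.T @ J` reproduces the Hessian once the outer loss stops being a multiple of the identity.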

(In case it's of interest to anyone, googling turned up this recent paper which studies pretty much exactly the problem of bounding the rank of the Hessian of the loss. They say: "Flatness: A growing number of works [59–61] correlate the choice of regularizers, optimizers, or hyperparameters, with the additional flatness brought about by them at the minimum. However, the significant rank degeneracy of the Hessian, which we have provably established, also points to another source of flatness — that exists as a virtue of the compositional model structure —from the initialization itself. Thus, a prospective avenue of future work would be to compare different architectures based on this inherent kind of flatness.")

Some broader remarks: I think these are nice observations but unfortunately I think generally I'm a bit confused/unclear about what else you might get out of going along these lines. I don't want to sound harsh but just trying to be plain: This is mostly because, as we can see, the mathematical part of what you have said is all very simple, well-established facts about smooth functions and so it would be surprising (to me at least) if some non-trivial observation about deep learning came out from it. In a similar vein, regarding the "cause" of low-rank G, I do think that one could try to bring in a notion of "information loss" in neural networks, but for it to be substantive one needs to be careful that it's not simply a rephrasing of what it means for the Jacobian to have less-than-full rank. Being a bit loose/informal now: To illustrate, just imagine for a moment a real-valued function on an interval. I could say it 'loses information' where its values cannot distinguish between a subset of points. But this is almost the same as just saying: It is constant on some subset...which is of course very close to just saying the derivative vanishes on some subset. Here, if you describe the phenomenon of information loss as concretely as being the situation where some inputs can't be distinguished, then (particularly given that you have to assume these spaces are actually some kind of smooth/differentiable spaces to do the theoretical analysis), you've more or less just built into your description of information loss something that looks a lot like the function being constant along some directions, which means there is a vector in the kernel of the Jacobian. I don't think it's somehow incorrect to point to this but it becomes more like just saying 'perhaps one useful definition of information loss is low rank G' as opposed to linking one phenomenon to the other.

Sorry for the very long remarks. Of course this is actually because I found it well worth engaging with. And I have a longer-standing personal interest in zero sets of smooth functions!  

Comment by Spencer Becker-Kahn on Information Loss --> Basin flatness · 2022-05-25T12:36:21.183Z · LW · GW

This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

I think that your setup is essentially that there is an -dimensional parameter space, let's call it  say, and then for each element  of the training set, we can consider the function  which takes in a set of parameters (i.e. a model) and outputs whatever the model does on training data point . We are thinking of both  and  as smooth (or at least sufficiently differentiable) spaces (I take it). 

A contour plane is a level set of one of the , i.e. a set of the form

for some  and . A behavior manifold is a set of the form 

for some .

A more concise way of viewing this is to define a single function  and then a behavior manifold is simply a level set of this function. The map  is a submersion at  if the Jacobian matrix at  is a surjective linear map. The Jacobian matrix is what you call  I think (because the Jacobian is formed with each row equal to a gradient vector with respect to one of the output coordinates). It doesn't matter much because what matters to check the surjectivity is the rank. Then the standard result implies that given , if  is a submersion in a neighbourhood of a point , then  is a smooth -dimensional submanifold in a neighbourhood of  .

Essentially, in a neighbourhood of a point at which the Jacobian of  has full rank, the level set through that point is an -dimensional smooth submanifold.  
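As a small illustration of the standard result, here is a sketch with a toy map of my own choosing, `F(theta) = (|theta|^2, theta_0 * theta_1)` from 3 parameters to 2 outputs. Nothing here is taken from the post. Where the Jacobian has full rank, its kernel has dimension equal to the dimension of the domain minus that of the codomain, and the kernel directions are tangent to the level set; at a point where the rank drops, the theorem no longer applies.

```python
import numpy as np

def F(theta):
    # Hypothetical toy map from R^3 to R^2 (an assumption for illustration).
    return np.array([theta @ theta, theta[0] * theta[1]])

def jac(theta):
    # Jacobian of F: each row is the gradient of one output coordinate.
    return np.array([[2 * theta[0], 2 * theta[1], 2 * theta[2]],
                     [theta[1], theta[0], 0.0]])

regular = np.array([1.0, 2.0, 3.0])
J = jac(regular)
rank_regular = np.linalg.matrix_rank(J)   # full rank, so F is a submersion here

# Rows of Vt beyond the rank span the kernel of J, i.e. the tangent space of the level set.
_, _, Vt = np.linalg.svd(J)
kernel = Vt[rank_regular:]
v = kernel[0]

eps = 1e-3
# Moving along a kernel direction changes F only at second order in eps.
tangent_change = np.linalg.norm(F(regular + eps * v) - F(regular))
# Moving along a gradient direction changes F at first order in eps.
u = J[0] / np.linalg.norm(J[0])
normal_change = np.linalg.norm(F(regular + eps * u) - F(regular))

# A point where the two gradient rows are parallel: the rank drops.
degenerate = np.array([1.0, 1.0, 0.0])
rank_degenerate = np.linalg.matrix_rank(jac(degenerate))
print(rank_regular, kernel.shape, tangent_change, normal_change, rank_degenerate)
```

At the regular point the kernel is one-dimensional (3 minus 2), matching the dimension of the level set through it.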

Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank. But I think you would need to be careful when you get to claim 3. I think the connection between loss and behavior is not spelled out in enough detail: Behaviour can change while loss could remain constant, right? And more generally, in exactly which directions do the implications go? Depending on exactly what you are trying to establish, this could actually be a bit of a 'tip of the iceberg' situation though. (The study of this sort of thing goes rather deep; Vladimir Arnold et al. wrote in their 1998 book: "The theory of singularities of smooth maps is an apparatus for the study of abrupt, jump-like phenomena - bifurcations, perestroikas (restructurings), catastrophes, metamorphoses - which occur in systems depending on parameters when the parameters vary in a smooth manner".)

Similarly when you say things like "Low rank  indicates information loss", I think some care is needed because the paragraphs that follow seem to be getting at something more like: If there is a certain kind of information loss in the early layers of the network, then this leads to low rank . It doesn't seem clear that low rank  is necessarily indicative of information loss?

Comment by Spencer Becker-Kahn on An observation about Hubinger et al.'s framework for learned optimization · 2022-05-24T09:47:05.289Z · LW · GW

Thanks for the comments and pointers!