Posts

Stagewise Development in Neural Networks 2024-03-20T19:54:06.181Z
Simple versus Short: Higher-order degeneracy and error-correction 2024-03-11T07:52:46.307Z
Timaeus's First Four Months 2024-02-28T17:01:53.437Z
Announcing Timaeus 2023-10-22T11:59:03.938Z
Open Call for Research Assistants in Developmental Interpretability 2023-08-30T09:02:59.781Z
Apply for the 2023 Developmental Interpretability Conference! 2023-08-25T07:12:36.097Z
Towards Developmental Interpretability 2023-07-12T19:33:44.788Z
Singularities against the Singularity: Announcing Workshop on Singular Learning Theory and Alignment 2023-04-01T09:58:22.764Z

Comments

Comment by Daniel Murfet (dmurfet) on Some Notes on the mathematics of Toy Autoencoding Problems · 2024-04-24T03:55:38.631Z · LW · GW

Indeed the integrals in the sparse case aren't so bad: https://arxiv.org/abs/2310.06301. I don't think the analogy to the Thomson problem is correct; it's similar but qualitatively different (there is a large literature on tight frames that is arguably more relevant).

Comment by Daniel Murfet (dmurfet) on Nature is an infinite sphere whose center is everywhere and circumference is nowhere · 2024-04-03T08:24:27.754Z · LW · GW

Haha this is so intensely on-brand.

Comment by Daniel Murfet (dmurfet) on Are extreme probabilities for P(doom) epistemically justifed? · 2024-03-20T23:44:10.780Z · LW · GW

The kind of superficial linear extrapolation of trendlines can be powerful, perhaps more powerful than usually accepted in many political/social/futurist discussions. In many cases, successful forecasters, by betting on some high-level trend lines, often outpredict 'experts'.

But it's a very non-gears-level model. I think one should be very careful about using this kind of reasoning for tail events.
e.g. this kind of reasoning could lead one to reject the possibility of developing nuclear weapons.

 

Agree. In some sense you have to invent all the technology before the stochastic process of technological development looks predictable to you, almost by definition. I'm not sure it is reasonable to ask general "forecasters" about questions that hinge on specific technological change. They're not oracles.

Comment by Daniel Murfet (dmurfet) on More people getting into AI safety should do a PhD · 2024-03-16T19:21:15.720Z · LW · GW

Do you mean the industry labs will take people with MSc and PhD qualifications in CS, math or physics etc and retrain them to be alignment researchers, or do you mean the labs will hire people with undergraduate degrees (or no degree) and train them internally to be alignment researchers?

I don't know how OpenAI or Anthropic look internally, but I know a little about Google and DeepMind through friends, and I have to say the internal incentives and org structure don't strike me as really a very natural environment for producing researchers from scratch.

Comment by Daniel Murfet (dmurfet) on More people getting into AI safety should do a PhD · 2024-03-16T08:44:57.513Z · LW · GW

I think many early-career researchers in AI safety are undervaluing PhDs.

 

I agree with this. To be blunt, it is my impression from reading LW for the last year that a few people in this community seem to have a bit of a chip on their shoulder Re: academia. It certainly has its problems, and academics love nothing more than pointing them out to each other, but you face your problems with the tools you have, and academia is the only system for producing high quality researchers that is going to exist at scale over the next few years (MATS is great, I'm impressed by what Ryan and co are doing, but it's tiny).

I would like to see many more academics in CS, math, physics and adjacent areas start supervising students in AI safety, and more young people go into those PhDs. Also, more people with PhDs in math and physics transitioning to AI safety work.

One problem is that many of the academics who are willing to supervise PhD students in AI safety or related topics are evaporating into industry positions (subliming?). There are also long run trends that make academia relatively less attractive than it was in the past (e.g. rising corporatisation) even putting aside salary comparisons, and access to compute. So I do worry somewhat about how many PhD students in AI safety adjacent fields can actually be produced per year this decade.
 

Comment by Daniel Murfet (dmurfet) on Simple versus Short: Higher-order degeneracy and error-correction · 2024-03-12T00:45:54.468Z · LW · GW

This comment of mine is a bit cheeky, since there are plenty of theoretical computer scientists who think about characterising terms as fixed points, and logic programming is a whole discipline that is about characterising the problem rather than constructing a solution, but broadly speaking I think it is true among less theoretically-minded folks that "program" means "thing constructed step by step from atomic pieces".

Comment by Daniel Murfet (dmurfet) on Simple versus Short: Higher-order degeneracy and error-correction · 2024-03-11T18:58:23.602Z · LW · GW

Maybe I can clarify a few points here:

  • A statistical model is regular if it is identifiable and the Fisher information matrix is everywhere nondegenerate. Statistical models where the prediction involves feeding samples from the input distribution through neural networks are not regular. (A minimal numerical sketch of this kind of degeneracy follows at the end of this list.)
  • Regular models are the ones for which there is a link between low description length and low free energy (i.e. the class of models which the Bayesian posterior tends to prefer are those that are assigned lower description length, at the same level of accuracy).
  • It's not really accurate to describe regular models as "typical", especially not on LW where we are generally speaking about neural networks when we think of machine learning.
  • It's true that the example presented in this post is, potentially, not typical (it's not a neural network nor is it a standard kind of statistical model). So it's unclear to what extent this observation generalises. However, it does illustrate the general point that it is a mistake to presume that intuitions based on regular models hold for general statistical models.
  • A pervasive failure mode in modern ML is to take intuitions developed for regular models, and assume they hold "with some caveats" for neural networks. We have at this point many examples where this leads one badly astray, and in my opinion the intuition I see widely shared here on LW about neural network inductive biases and description length falls into this bucket.
  • I don't claim to know the content of those inductive biases, but my guess is that it is much more interesting and complex than "something like description length".
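
To make the first point above concrete, here is a minimal numerical sketch (a toy I wrote for illustration, not anything from the post): the one-node tanh model $f_w(x) = w_1 \tanh(w_2 x)$, in which every parameter on the axes $w_1 = 0$ or $w_2 = 0$ implements the zero function and the Fisher information matrix degenerates there.

```python
# A minimal sketch (my own toy illustration): the one-node tanh model
# f_w(x) = w1 * tanh(w2 * x) with unit Gaussian noise. The Fisher information
# is degenerate on the axes w1 = 0 or w2 = 0, so the model is singular.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)

def fisher_information(w1, w2):
    # For a Gaussian model y = f_w(x) + N(0, 1), the Fisher information is
    # E_x[ grad_w f_w(x) grad_w f_w(x)^T ], estimated here by a sample mean.
    g1 = np.tanh(w2 * X)                     # df/dw1
    g2 = w1 * X / np.cosh(w2 * X) ** 2       # df/dw2
    grads = np.stack([g1, g2])               # shape (2, N)
    return grads @ grads.T / X.shape[0]

for w in [(1.0, 0.5), (0.0, 0.7), (0.3, 0.0)]:
    eigvals = np.linalg.eigvalsh(fisher_information(*w))
    print(f"w = {w}: Fisher eigenvalues ~ {np.round(eigvals, 4)}")
# Both eigenvalues are positive in the first case, and at least one is
# (numerically) zero in the other two, i.e. the model is not regular.
```
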
Comment by Daniel Murfet (dmurfet) on A short 'derivation' of Watanabe's Free Energy Formula · 2024-01-30T00:35:14.968Z · LW · GW

Yes, good point, but if the prior is positive it drops out of the asymptotic as it doesn't contribute to the order of vanishing, so you can just ignore it from the start.
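
Spelling that out slightly (my notation): if $\varphi$ is continuous and strictly positive on the relevant compact neighbourhood, we can bound it above and below by constants $c_1, c_2 > 0$, so

$$c_1 \int e^{-nK(w)}\,dw \;\le\; \int \varphi(w)\, e^{-nK(w)}\,dw \;\le\; c_2 \int e^{-nK(w)}\,dw,$$

and taking $-\log$ only shifts the free energy by an $O(1)$ constant; the $\lambda \log n$ and $(m-1)\log\log n$ terms are untouched.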

Comment by Daniel Murfet (dmurfet) on A short 'derivation' of Watanabe's Free Energy Formula · 2024-01-29T23:49:23.673Z · LW · GW

There was a sign error somewhere, you should be getting + lambda and - (m-1). Regarding the integral from 0 to 1, since the powers involved are even you can do that and double it rather than -1 to 1 (sorry if this doesn't map exactly onto your calculation, I didn't read all the details).
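
To make both points concrete, here is a worked one-dimensional example in my notation, with $K(w) = w^{2k}$ on $[-1,1]$ and a flat prior. Substituting $t = n w^{2k}$,

$$Z_n = \int_{-1}^{1} e^{-n w^{2k}}\,dw = 2\int_{0}^{1} e^{-n w^{2k}}\,dw = \frac{1}{k}\, n^{-\frac{1}{2k}} \int_{0}^{n} t^{\frac{1}{2k}-1} e^{-t}\,dt = \frac{1}{k}\, n^{-\frac{1}{2k}}\left(\Gamma\!\left(\tfrac{1}{2k}\right) + o(1)\right),$$

so $F_n = -\log Z_n = \frac{1}{2k}\log n + O(1)$: the log term enters with $+\lambda$ where $\lambda = \frac{1}{2k}$, the multiplicity is $m = 1$ so there is no $\log\log n$ term, and the even power is exactly what lets you integrate over $[0,1]$ and double.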

Comment by Daniel Murfet (dmurfet) on What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? · 2024-01-16T09:47:29.604Z · LW · GW

There is some preliminary evidence in favour of the view that transformers approximate a kind of Bayesian inference in-context (by which I mean something like, they look at in-context examples and process them to represent in their activations something like a Bayesian posterior for some "inner" model based on those examples as samples, and then predict using the predictive distribution for that Bayesian posterior). I'll call the hypothesis that this is taking place "virtual Bayesianism".

I'm not saying you should necessarily believe that, for current generation transformers. But fwiw I put some probability on it, and if I had to predict one significant capability advance in the next generation of LLMs it would be to predict that virtual Bayesianism becomes much stronger (in-context learning being a kind of primitive pre-cursor).
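
To be concrete about the kind of computation I have in mind, here is a toy sketch (mine, with a Beta-Bernoulli "inner" model; an illustration of the definition, not a claim about transformer internals):

```python
# Toy illustration of "virtual Bayesianism": treat the in-context examples as
# samples, maintain a Bayesian posterior over a simple inner model (here a
# Bernoulli parameter with a Beta prior), and predict the next token with the
# posterior predictive. A transformer doing this implicitly would represent
# something playing the role of (alpha, beta) in its activations.
def predictive_next_token(context, alpha0=1.0, beta0=1.0):
    """Posterior predictive P(next token = 1 | context) under Beta-Bernoulli."""
    ones = sum(context)
    zeros = len(context) - ones
    alpha, beta = alpha0 + ones, beta0 + zeros   # Beta posterior parameters
    return alpha / (alpha + beta)                # posterior predictive mean

context = [1, 0, 1, 1, 1, 0, 1]                  # in-context "samples"
print(predictive_next_token(context))            # ~0.67, rather than the MLE 5/7
```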

Re: the points in your strategic upshots. Given the above, the following question seems quite important to me: putting aside transformers or neural networks, and just working in some abstract context where we consider Bayesian inference on a data distribution that includes sequences of various lengths (i.e. the kinds of distribution that elicits in-context learning), is there a general principle of Bayesian statistics according to which general-purpose search algorithms tend to dominate the Bayesian posterior?

Comment by Daniel Murfet (dmurfet) on Three Types of Constraints in the Space of Agents · 2024-01-16T01:56:30.020Z · LW · GW

In mathematical terms, what separates agents that could arise from natural selection from a generic agent?

To ask a more concrete question, suppose we consider the framework of DeepMind's Population Based Training (PBT), chosen just because I happen to be familiar with it (it's old at this point, not sure what the current thing is in that direction). This method will tend to produce a certain distribution over parametrised agents, different from the distribution you might get by training a single agent in traditional deep RL style. What are the qualitative differences in these inductive biases?
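
For reference, the shape of the PBT loop I have in mind is roughly the following (a schematic toy version I wrote for this comment; the objective and hyperparameters are made up):

```python
# Schematic population based training (PBT) on a toy objective: a population
# of "agents" (here just parameter vectors) each trains with its own
# hyperparameters; periodically the worst copy the weights and hyperparameters
# of the best and perturb the hyperparameters. The end result is a
# distribution over trained agents, shaped by selection as well as by SGD.
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    return np.sum((w - 1.0) ** 2)   # toy objective with optimum at w = 1

pop = [{"w": rng.normal(size=5), "lr": 10 ** rng.uniform(-3, -1)} for _ in range(8)]

for step in range(200):
    for agent in pop:                        # each agent takes a gradient step
        grad = 2 * (agent["w"] - 1.0)
        agent["w"] -= agent["lr"] * grad
    if step % 20 == 19:                      # exploit / explore
        pop.sort(key=lambda a: loss(a["w"]))
        for bad, good in zip(pop[-2:], pop[:2]):
            bad["w"] = good["w"].copy()                      # exploit: copy weights
            bad["lr"] = good["lr"] * rng.choice([0.8, 1.2])  # explore: perturb lr

print(sorted(round(loss(a["w"]), 4) for a in pop))
```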

Comment by Daniel Murfet (dmurfet) on You’re Measuring Model Complexity Wrong · 2024-01-11T03:29:16.519Z · LW · GW

This is an open question. In practice it seems to work fine even at saddles that are not strict (i.e. where there are no negative eigenvalues of the Hessian but there are still descent directions, which only show up at higher than second order in the Taylor series), in the sense that you can get sensible estimates and they indicate something about the way structure is developing, but the theory hasn't caught up yet.
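
For readers following along, here is a rough sketch of the kind of local learning coefficient estimator under discussion (my own toy version with made-up hyperparameters; see the post and the associated paper for the actual method):

```python
# A rough sketch of SGLD-based local learning coefficient (LLC) estimation,
# in the spirit of the estimator discussed in the post. The toy model and all
# hyperparameters are mine and would need tuning; this shows the shape of the
# computation rather than a definitive implementation.
import numpy as np

rng = np.random.default_rng(0)

# Toy singular model: f_w(x) = w1 * w2 * x, with data generated by f = 0.
n = 1000
X = rng.normal(size=n)
Y = rng.normal(scale=0.1, size=n)

def loss(w):
    # Empirical mean squared error L_n(w).
    return np.mean((Y - w[0] * w[1] * X) ** 2)

def grad_loss(w):
    r = w[0] * w[1] * X - Y
    return np.array([2 * np.mean(r * w[1] * X),
                     2 * np.mean(r * w[0] * X)])

def estimate_llc(w_star, steps=20_000, eps=1e-5, gamma=100.0):
    """Sample a localized, tempered posterior around w_star with SGLD and
    return n * beta * (E[L_n(w)] - L_n(w_star)), with beta = 1 / log n."""
    beta = 1.0 / np.log(n)
    w, L_star, running = w_star.copy(), loss(w_star), []
    for t in range(steps):
        drift = -(eps / 2) * (n * beta * grad_loss(w) + gamma * (w - w_star))
        w = w + drift + rng.normal(scale=np.sqrt(eps), size=w.shape)
        if t > steps // 2:                   # discard burn-in
            running.append(loss(w))
    return n * beta * (np.mean(running) - L_star)

# w* = (0, 0) is a degenerate critical point: the Hessian of the population
# loss vanishes there, so the degeneracy only appears at higher order, which
# is the situation asked about above.
print("estimated LLC:", estimate_llc(np.array([0.0, 0.0])))
```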

Comment by Daniel Murfet (dmurfet) on Alexander Gietelink Oldenziel's Shortform · 2023-12-17T22:58:59.895Z · LW · GW

I think there's no such thing as parameters, just processes that produce better and better approximations to parameters, and the only "real" measures of complexity have to do with the invariants that determine the costs of those processes, which in statistical learning theory are primarily geometric (somewhat tautologically, since the process of approximation is essentially a process of probing the geometry of the governing potential near the parameter).

From that point of view trying to conflate parameters $w_1, w_2$ such that $f_{w_1} = f_{w_2}$ is naive, because $w_1, w_2$ aren't real, only processes that produce better approximations to them are real, and so the higher derivatives of the potential near $w_1, w_2$ which control such processes are deeply important, and those could be quite different despite $f_{w_1}, f_{w_2}$ being quite similar.

So I view "local geometry matters" and "the real thing are processes approximating parameters, not parameters" as basically synonymous.

Comment by Daniel Murfet (dmurfet) on Alexander Gietelink Oldenziel's Shortform · 2023-11-27T18:32:48.120Z · LW · GW

You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, which incorporates derivatives (jets on the geometric/statistical side and more of the algorithm behind the model on the logical side).

Comment by Daniel Murfet (dmurfet) on Alexander Gietelink Oldenziel's Shortform · 2023-11-27T18:26:18.500Z · LW · GW

Except nobody wants to hear about it at parties.

 

You seem to do OK... 

If they only would take the time to explain things simply you would understand. 

This is an interesting one. I field this comment quite often from undergraduates, and it's hard to carve out enough quiet space in a conversation to explain what they're doing wrong. In a way the proliferation of math on YouTube might be exacerbating this hard step from tourist to troubadour.

Comment by Daniel Murfet (dmurfet) on Alexander Gietelink Oldenziel's Shortform · 2023-11-27T18:21:15.336Z · LW · GW

As a supervisor of numerous MSc and PhD students in mathematics, I see that when someone finishes a math degree and considers a job, the tradeoffs are usually between meaning, income, freedom, evil, etc., with some of the obvious choices being high/low along (relatively?) obvious axes. It's extremely striking to see young talented people with math or physics (or CS) backgrounds going into technical AI alignment roles in big labs, apparently maximising along many (or all) of these axes!

Especially in light of recent events I suspect that this phenomenon, which appears too good to be true, actually is.

Comment by Daniel Murfet (dmurfet) on Alexander Gietelink Oldenziel's Shortform · 2023-11-27T18:16:15.298Z · LW · GW

Please develop this question as a documentary special, for lapsed-Starcraft player homeschooling dads everywhere.

Comment by Daniel Murfet (dmurfet) on Public Call for Interest in Mathematical Alignment · 2023-11-22T17:53:30.226Z · LW · GW

Thanks for setting this up!

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-21T09:02:54.826Z · LW · GW

I don't understand the strong link between Kolmogorov complexity and generalisation you're suggesting here. I think by "generalisation" you must mean something more than "low test error". Do you mean something like "out of distribution" generalisation (whatever that means)?

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-21T09:00:29.987Z · LW · GW

Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
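
For readers who want the formula being referred to (my notation, sweeping the regularity and technical conditions under the rug): for a regular model with $d$ parameters the Bayesian free energy satisfies the BIC-like asymptotic

$$F_n = n L_n(\hat w) + \frac{d}{2}\log n + O(1),$$

while the free energy formula for singular models, which is what applies to neural networks, replaces $d/2$ by the learning coefficient $\lambda$ and adds a multiplicity term:

$$F_n = n L_n(w_0) + \lambda \log n - (m-1)\log\log n + O(1).$$

The Occam-style tradeoff between fit and complexity has the same shape; what changes is the complexity measure.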

I think that expression of Jesse's is also correct, in context.

However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) from my other post, and the use of terms like "generalisation" in broad terms in the more expository parts (as opposed to the technical parts) arguably doesn't make enough effort to prevent them from drawing these inferences.

I have noticed people at the Berkeley meeting and elsewhere believing (ii) was somehow resolved by SLT, or just in a vague sense thinking SLT says something more than it does. While there are hard tradeoffs to make in writing expository work, I think your criticism of this aspect of the messaging around SLT on LW is fair and to the extent it misleads people it is doing a disservice to the ongoing scientific work on this important subject. 

I'm often critical of the folklore-driven nature of the ML literature and what I view as its low scientific standards, and especially in the context of technical AI safety I think we need to aim higher, in both our technical and more public-facing work. So I'm grateful for the chance to have this conversation (and to anybody reading this who sees other areas where they think we're falling short, read this as an invitation to let me know, either privately or in posts like this).

I'll discuss the generalisation topic further with the authors of those posts. I don't want to pre-empt their point of view, but it seems likely we may go back and add some context on (i) vs (ii) in those posts or in comments, or we may just refer people to this post for additional context. Does that sound reasonable?

At least right now, the value proposition I see of SLT lies not in explaining the "generalisation puzzle" but in understanding phase transitions and emergent structure; that might end up circling back to say something about generalisation, eventually.

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-21T08:24:38.262Z · LW · GW

However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map)


Seems reasonable to me!

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-20T19:05:35.286Z · LW · GW

Re: the articles you link to. I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well" which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion. 

I agree that Jesse's post has a title "Neural networks generalize because of this one weird trick" which is clickbaity, since SLT does not in fact yet explain why neural networks appear to generalise well on many natural datasets. However the actual article is more nuanced, saying things like "SLT seems like a promising route to develop a better understanding of generalization and the limiting dynamics of training". Jesse gives a long list of obstacles to walking this route. I can't find anything in the post itself to object to. Maybe you think its optimism is misplaced, and fair enough.

So I don't really understand what claims about inductive bias or generalisation behaviour in these posts you think is invalid?

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-20T18:52:38.768Z · LW · GW

I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?


That seems probable. Maybe it's useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we're on the same page. When people refer to the "generalisation puzzle" in deep learning I think they mean two related but distinct things: 

(i) the general question about how it is possible for overparametrised models to have good generalisation error, despite classical interpretations of Occam's razor like the BIC 
(ii) the specific question of why neural networks, among all possible overparametrised models, actually have good generalisation error in practice (saying this is possible is much weaker than actually explaining why it happens).

In my mind SLT comes close to resolving (i), modulo a bunch of questions which include: whether the asymptotic limit taking the dataset size to infinity is appropriate in practice, the relationship between Bayesian generalisation error and test error in the ML sense (comes down largely to Bayesian posterior vs SGD), and whether hypotheses like relative finite variance are appropriate in the settings we care about. If all those points were treated in a mathematically satisfactory way, I would feel that the general question is completely resolved by SLT.

Informally, knowing SLT just dispels the mystery of (i) sufficiently that I don't feel personally motivated to resolve all these points, although I hope people work on them. One technical note on this: there are some brief notes in SLT6 arguing that "test error" as a model selection principle in ML, presuming some relation between the Bayesian posterior and SGD, is similar to selecting models based on what Watanabe calls the Gibbs generalisation error, which is computed by both the RLCT and singular fluctuation. Since I don't think it's crucial to our discussion I'll just elide the difference between Gibbs generalisation error in the Bayesian framework and test error in ML, but we can return to that if it actually contains important disagreement.

Anyway I'm guessing you're probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).

Any theoretical resolution to (ii) has to involve some nontrivial ingredient that actually talks about neural networks, as opposed to general singular statistical models. The only specific results about neural networks and generalisation in SLT are the old results about RLCTs of tanh networks, more recent bounds on shallow ReLU networks, and Aoyagi's upcoming results on RLCTs of deep linear networks (particularly that the RLCT is bounded above even when you take the depth to infinity). 

As I currently understand them, these results are far from resolving (ii). In its current form SLT doesn't supply any deep reason for why neural networks in particular are often observed to generalise well when you train them on a range of what we consider "natural" datasets. We don't understand what distinguishes neural networks from generic singular models, nor what we mean by "natural". These seem like hard problems, and at present it looks like one has to tackle them in some form to really answer (ii).

Maybe that has significant overlap with the critique of SLT you're making?

Nonetheless I think SLT reduces the problem in a way that seems nontrivial. If we boil the "ML in-practice model selection" story to "choose the model with the best test error given fixed training steps" and allow some hand-waving in the connection between training steps and number of samples, Gibbs generalisation error and test error etc, and use Watanabe's theorems (see Appendix B.1 of the quantifying degeneracy paper for a local formulation) to write the Gibbs generalisation error as

$$\mathbb{E}[G_n] \approx L_0 + \frac{\lambda + \nu}{n},$$

where $\lambda$ is the learning coefficient, $\nu$ is the singular fluctuation and $L_0$ is roughly the loss (the quantity that we can estimate from samples is actually slightly different, I'll elide this), then (ii), which asks why neural networks on natural datasets have low generalisation error, is at least reduced to the question of why neural networks on natural datasets have low $\lambda + \nu$.

I don't know much about this question, and agree it is important and outstanding.

Again, I think this reduction is not trivial since the link between $\lambda + \nu$ and generalisation error is nontrivial. Maybe at the end of the day this is the main thing we in fact disagree on :)

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-20T08:16:09.867Z · LW · GW

The easiest way to explain why this is the case will probably be to provide an example. Suppose we have a Bayesian learning machine with 15 parameters, whose parameter-function map is given by

and whose loss function is the KL divergence. This learning machine will learn 4-degree polynomials. Moreover, it is overparameterised, and its loss function is analytic in its parameters, etc, so SLT will apply to it.


In your example there are many values of the parameters that encode the zero function (e.g. with some of the parameters set to zero and all other parameters free) in addition to there being many parameter values that encode the other function in your example. Without thinking about it more I'm not sure which one actually has the lower local learning coefficient (RLCT) and therefore counts as "more simple" from an SLT perspective.

However, if I understand correctly it's not this specific example that you care about. We can agree that there is some way of coming up with a simple model which (a) can represent two functions, call them $f_1$ and $f_2$, and (b) has parameters $w_1$ and $w_2$ respectively representing these functions with local learning coefficients $\lambda(w_1) > \lambda(w_2)$. That is, according to the local learning coefficient as a measure of model complexity, the neighbourhood of the parameter $w_1$ is more complex than that of $w_2$. I believe your observation is that this contradicts an a priori notion of complexity that you hold about these functions.

Is that a fair characterisation of the argument you want to make?

Assuming it is, my response is as follows. I'm guessing you think $f_1$ is simpler than $f_2$ because the former function can be encoded by a shorter code on a UTM than the latter. But this isn't the kind of complexity that SLT talks about: the local learning coefficient $\lambda$ that appears in the main theorems represents the complexity of representing a given probability distribution using parameters from the model, and is not some intrinsic model-free complexity of the distribution itself.

One way of saying it is that Kolmogorov complexity is the entropy cost of specifying a machine on the description tape of a UTM (a kind of absolute measure) whereas the local learning coefficient is the entropy cost per sample of incrementally refining an almost true parameter in the neural network parameter space (a kind of relative measure). I believe they're related but not the same notion, as the latter refers fundamentally to a search process that is missing in the former.

We can certainly imagine a learning machine set up in such a way that it is prohibitively expensive to refine an almost true parameter near a solution that looks like $f_1$ and very cheap to refine an almost true parameter near a solution like $f_2$, despite that being against our natural inclination to think of the former as simpler. It's about the nature of the refinement / search process, not directly about the intrinsic complexity of the functions.

So we agree that Kolmogorov complexity and the local learning coefficient are potentially measuring different things. I want to dig deeper into where our disagreement lies, but I think I'll just post this as-is and make sure I'm not confused about your views up to this point.

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-19T16:52:26.001Z · LW · GW

First of all, SLT is largely based on examining the behaviour of learning machines in the limit of infinite data


I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being among one of the main question marks I currently see (I think I probably also see the gap between Bayesian learning and SGD as bigger than you do).

I've discussed this a bit with my colleague Liam Hodgkinson, whose recent papers https://arxiv.org/abs/2307.07785 and https://arxiv.org/abs/2311.07013 might be more up your alley than SLT.

My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard. So far we have been pleasantly surprised at how well the free energy formula works at relatively low $n$ (in e.g. https://arxiv.org/abs/2310.06301). It remains an open question whether this asymptotic continues to provide useful insight into larger models with the kind of dataset size we're using in LLMs for example.

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-19T16:46:47.572Z · LW · GW

I think that the significance of SLT is somewhat over-hyped at the moment


Haha, on LW that is either already true or at current growth rates will soon be true, but it is clearly also the case that SLT remains basically unknown in the broader deep learning theory community.

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-19T16:44:45.767Z · LW · GW

I claim that this is fairly uninteresting, because classical statistical learning theory already gives us a fully adequate account of generalisation in this setting which applies to all learning machines, including neural networks

 

I'm a bit familiar with the PAC-Bayes literature and I think this might be an exaggeration. The linked post merely says that the traditional PAC-Bayes setup must be relaxed, and sketches some ways of doing so. Could you please cite the precise theorem you have in mind?

Comment by Daniel Murfet (dmurfet) on My Criticism of Singular Learning Theory · 2023-11-19T16:32:18.090Z · LW · GW

Very loosely speaking, regions with a low RLCT have a larger "volume" than regions with high RLCT, and the impact of this fact eventually dominates other relevant factors.

 

I'm going to make a few comments as I read through this, but first I'd like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn't have done otherwise.

Regarding the point about volume. It is true that the RLCT can be written as (Theorem 7.1 of Watanabe's book "Algebraic Geometry and Statistical Learning Theory")

$$\lambda = \lim_{\epsilon \to 0} \frac{\log\big( V(a\epsilon) / V(\epsilon) \big)}{\log a}, \qquad 0 < a < 1,$$

where $V(\epsilon) = \int_{K(w) < \epsilon} \varphi(w)\,dw$ is the volume (according to the measure associated to the prior) of the set of parameters $w$ with KL divergence $K(w)$ between the model and truth less than $\epsilon$. For small $\epsilon$ we have $V(\epsilon) \approx c\,\epsilon^{\lambda}(-\log\epsilon)^{m-1}$ where $m$ is the multiplicity. Thus near critical points $w^*$ with lower RLCT, small changes in the cutoff $\epsilon$ near $0$ tend to change the volume of the set of almost true parameters more than near critical points with higher RLCTs.

My impression is that you tend to see this as a statement about flatness, holding over macroscopic regions of parameter space, and so you read the asymptotic formula for the free energy (where $W_\alpha$ is a region of parameter space containing a critical point $w^*_\alpha$)

$$F_n(W_\alpha) \approx n L_n(w^*_\alpha) + \lambda_\alpha \log n$$

as having a $\lambda_\alpha \log n$ term that does little more than prefer critical points $w^*_\alpha$ that tend to dominate large regions of parameter space according to the prior. If that were true, I would agree this would be underwhelming (or at least, precisely as "whelming" as the BIC, and therefore not adding much beyond the classical story).

However this isn't what the free energy formula says. Indeed the volume of the region $W_\alpha$ (according to the prior) is a term that contributes only to the constant order term (this is sketched in Chen et al). 

I claim it's better to think of the learning coefficient $\lambda$ as being a measure of how many bits it takes to specify an almost true parameter with $K(w) < \epsilon$ once you know a parameter with $K(w) < 2\epsilon$, which is a "microscopic" rather than "macroscopic" statement. That is, lower $\lambda$ means that a fixed decrease $\Delta K$ is "cheaper" in terms of entropy generated.
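
In terms of the volume function above (my gloss, ignoring the multiplicity factor): since $V(\epsilon) \approx c\,\epsilon^{\lambda}$, halving the tolerance costs about

$$\log_2 \frac{V(2\epsilon)}{V(\epsilon)} \approx \lambda \log_2 2 = \lambda$$

bits, so $\lambda$ is literally the price in bits of each additional halving of $K$, which is why lower $\lambda$ makes refinement cheaper.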

So the free energy formula isn't saying "critical points $w^*_\alpha$ dominating large regions tend to dominate the posterior at large $n$" but rather "critical points $w^*_\alpha$ which require fewer bits / less entropy to achieve a fixed $\Delta K$ dominate the posterior for large $n$". The former statement is both false and uninteresting, the second statement is true and interesting (or I think so anyway).

Comment by Daniel Murfet (dmurfet) on Growth and Form in a Toy Model of Superposition · 2023-11-13T21:48:19.940Z · LW · GW

Good question. What counts as a "$k$-gon" is spelled out in the paper, but it's only outlined here heuristically. The "5-like" thing it seems to go near on the way down is not actually a critical point.

Comment by Daniel Murfet (dmurfet) on Growth and Form in a Toy Model of Superposition · 2023-11-13T21:47:03.592Z · LW · GW

The change in the matrix W and the bias b happen at the same time; it's not a lagging indicator.

Comment by Daniel Murfet (dmurfet) on Singular learning theory and bridging from ML to brain emulations · 2023-11-01T22:27:07.929Z · LW · GW

SLT predicts when this will happen!

Maybe. This is potentially part of the explanation for "data double descent" although I haven't thought about it beyond the 5min I spent writing that page and the 30min I spent talking about it with you at the June conference. I'd be very interested to see someone explore this more systematically (e.g. in the setting of Anthropic's "other" TMS paper https://www.anthropic.com/index/superposition-memorization-and-double-descent which contains data double descent in a setting where the theory of our recent TMS paper might allow you to do something).

Comment by Daniel Murfet (dmurfet) on Singular learning theory and bridging from ML to brain emulations · 2023-11-01T22:16:53.954Z · LW · GW

There is quite a large literature on "stage-wise development" in neuroscience and psychology, going back to people like Piaget but quite extensively developed in both theoretical and experimental directions. One concrete place to start on the agenda you're outlining here might be to systematically survey that literature from an SLT-informed perspective. 

Comment by Daniel Murfet (dmurfet) on Singular learning theory and bridging from ML to brain emulations · 2023-11-01T22:13:10.412Z · LW · GW

we can copy the relevant parts of the human brain which does the things our analysis of our models said they would do wrong, either empirically (informed by theory of course), or purely theoretically if we just need a little bit of inspiration for what the relevant formats need to look like.

I struggle to follow you guys in this part of the dialogue, could you unpack this a bit for me please?

Comment by Daniel Murfet (dmurfet) on Singular learning theory and bridging from ML to brain emulations · 2023-11-01T22:08:06.630Z · LW · GW

Though I'm not fully confident that is indeed what they did

The k-gons are critical points of the loss, and as $n$ varies the free energy is determined by integrals restricted to neighbourhoods of these critical points in weight space.

Comment by Daniel Murfet (dmurfet) on Singular learning theory and bridging from ML to brain emulations · 2023-11-01T22:05:54.413Z · LW · GW

nonsingular

singular

Comment by Daniel Murfet (dmurfet) on Singular learning theory and bridging from ML to brain emulations · 2023-11-01T22:05:29.153Z · LW · GW

because a physicist made these notes

Grumble :)

Comment by Daniel Murfet (dmurfet) on Announcing Timaeus · 2023-10-24T21:58:48.733Z · LW · GW

Thanks Akash. Speaking for myself, I have plenty of experience supervising MSc and PhD students and running an academic research group, but scientific institution building is a next-level problem. I have spent time reading and thinking about it, but it would be great to be connected to people with first-hand experience or who have thought more deeply about it, e.g.

  • People with advice on running distributed scientific research groups
  • People who have thought about scientific institution building in general (e.g. those with experience starting FROs in biosciences or elsewhere)
  • People with experience balancing fundamental research with product development within an institution

I am seeking advice from people within my institution (University of Melbourne) but Timaeus is not a purely academic org and their experience does not cover all the hard parts.

Comment by Daniel Murfet (dmurfet) on Announcing Timaeus · 2023-10-24T21:49:51.856Z · LW · GW

That's what we're thinking, yeah.

Comment by Daniel Murfet (dmurfet) on Announcing Timaeus · 2023-10-23T01:36:03.082Z · LW · GW

Yes I think that's right. I haven't closely read the post you link to (but it's interesting and I'm glad to have it brought to my attention, thanks) but it seems related to the kind of dynamical transitions we talk briefly about in the Related Works section of Chen et al.

Comment by Daniel Murfet (dmurfet) on Announcing Timaeus · 2023-10-22T21:20:06.154Z · LW · GW

I think it is too early to know how many phase transitions there are in e.g. the training of a large language model. If there are many, it seems likely to me that they fall along a spectrum of "scale" and that it will be easier to find the more significant ones than the less significant ones (e.g. we discover transitions like the onset of in-context learning first, because they dramatically change how the whole network computes).

As evidence for that view, I would put forward the fact that putting features into superposition is known to be a phase transition in toy models (based on the original post by Elhage et al and also our work in Chen et al) and therefore seems likely to be a phase transition in larger models as well. That gives an example of phase transitions at the "small" end of the scale.  At the "big" end of the scale, the evidence in Olsson et al that induction heads and in-context learning appears in a phase transition seems convincing to me.

On general principles, understanding "small" phase transitions (where the scale is judged relative to the overall size of the system, e.g. number of parameters) is like probing a physical system at small length scales / high energy, and will require more sophisticated tools. So I expect that we'll start by gaining a good understanding of "big" phase transitions and then as the experimental methodology and theory improves, move down the spectrum towards smaller transitions.

On these grounds I don't expect us to be swamped by the smaller transitions, because they're just hard to see in the first place; the major open problem in my mind is how far we can get down the scale with reasonable amounts of compute. Maybe one way that SLT & developmental interpretability fails to be useful for alignment is if there is a large "gap" in the spectrum, where beyond the "big" phase transitions that are easy to see (and for which you may not need fancy new ideas) there is just a desert / lack of transitions, and all the transitions that matter for alignment are "small" enough that a lot of compute and/or very sophisticated ideas are necessary to study them. We'll see!

Comment by Daniel Murfet (dmurfet) on Announcing Timaeus · 2023-10-22T21:03:25.102Z · LW · GW

Great question, thanks. tldr it depends what you mean by established, probably the obstacle to establishing such a thing is lower than you think.

To clarify the two types of phase transitions involved here, in the terminology of Chen et al:

  • Bayesian phase transition in number of samples: as discussed in the post you link to in Liam's sequence, where the concentration of the Bayesian posterior shifts suddenly from one region of parameter space to another, as the number of samples increases past some critical sample size. There are also Bayesian phase transitions with respect to hyperparameters (such as variations in the true distribution) but those are not what we're talking about here.
  • Dynamical phase transitions: the "backwards S-shaped loss curve". I don't believe there is an agreed-upon formal definition of what people mean by this kind of phase transition in the deep learning literature, but what we mean by it is that the SGD trajectory is for some time strongly influenced by (e.g. in the neighbourhood of) a critical point $w^*_{\alpha_1}$ and then strongly influenced by another critical point $w^*_{\alpha_2}$. In the clearest case there are two plateaus, the one with higher loss corresponding to the label $\alpha_1$ and the one with the lower loss corresponding to $\alpha_2$. In larger systems there may not be a clear plateau (e.g. in the case of induction heads that you mention) but it may still be reasonable to think of the trajectory as dominated by the critical points.

The former kind of phase transition is a first-order phase transition in the sense of statistical physics, once you relate the posterior to a Boltzmann distribution. The latter is a notion that belongs more to the theory of dynamical systems or potentially catastrophe theory. The link between these two notions is, as you say, not obvious.

However Singular Learning Theory (SLT) does provide a link, which we explore in Chen et al. SLT says that the phases of Bayesian learning are also dominated by critical points of the loss, and so you can ask whether a given dynamical phase transition $w^*_{\alpha_1} \to w^*_{\alpha_2}$ has "standing behind it" a Bayesian phase transition where at some critical sample size the posterior shifts from being concentrated near $w^*_{\alpha_1}$ to being concentrated near $w^*_{\alpha_2}$.

It turns out that, at least for sufficiently large $n$, the only real obstruction to this Bayesian phase transition existing is that the local learning coefficient near $w^*_{\alpha_2}$ should be higher than near $w^*_{\alpha_1}$. This will be hard to prove theoretically in non-toy systems, but we can estimate the local learning coefficients, compare them, and thereby provide evidence that a Bayesian phase transition exists.
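
Concretely, at the level of the leading-order local free energies $F_n(W_{\alpha_i}) \approx n L_{\alpha_i} + \lambda_{\alpha_i} \log n$ (my notation and hand-waving, writing $L_{\alpha_i}$ for the loss near $w^*_{\alpha_i}$ and $\lambda_{\alpha_i}$ for the local learning coefficient), the region around $w^*_{\alpha_1}$ is preferred exactly when

$$n L_{\alpha_1} + \lambda_{\alpha_1}\log n \;<\; n L_{\alpha_2} + \lambda_{\alpha_2}\log n \quad\Longleftrightarrow\quad n\,(L_{\alpha_1} - L_{\alpha_2}) \;<\; (\lambda_{\alpha_2} - \lambda_{\alpha_1})\log n,$$

and since $L_{\alpha_1} > L_{\alpha_2}$ the left side is positive and grows linearly, this can only hold over a range of $n$ if $\lambda_{\alpha_2} > \lambda_{\alpha_1}$; past the crossover the $nL$ term wins and the posterior shifts to $w^*_{\alpha_2}$.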

This has been done in the Toy Model of Superposition in Chen et al, and we're in the process of looking at a range of larger systems including induction heads. We're not ready to share those results yet, but I would point you to Nina Rimsky and Dmitry Vaintrob's nice post on modular addition which I would say provides evidence for a Bayesian phase transition in that setting.

There are some caveats and details that I can go into if you're interested. I would say the existence of Bayesian phase transitions in non-toy neural networks is not established yet, but at this point I think we can be reasonably confident they exist.

Comment by Daniel Murfet (dmurfet) on Investigating the learning coefficient of modular addition: hackathon project · 2023-10-17T21:45:38.655Z · LW · GW

Oh that makes a lot of sense, yes.

Comment by Daniel Murfet (dmurfet) on Investigating the learning coefficient of modular addition: hackathon project · 2023-10-17T21:23:17.221Z · LW · GW

To see this, we use a slight refinement of the dynamical estimator, where we restrict sampling to lie within the normal hyperplane of the gradient vector at initialization, which seems to make this behavior more robust.

 

Could you explain the intuition behind using the gradient vector at initialization? Is this based on some understanding of the global training dynamics of this particular network on this dataset?

Comment by Daniel Murfet (dmurfet) on You’re Measuring Model Complexity Wrong · 2023-10-12T21:06:49.235Z · LW · GW

Not easily detected.  As in, there might be a sudden (in SGD steps) change in the internal structure of the network over training that is not easily visible in the loss or other metrics that you would normally track. If you think of the loss as an average over performance on many thousands of subtasks, a change in internal structure (e.g. a circuit appearing in a phase transition) relevant to one task may not change the loss much.
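
To put made-up numbers on that: if the loss is an average over 10,000 subtasks and a phase transition takes one subtask's loss from 1.0 to 0.0 while leaving the rest untouched, the aggregate loss moves by only 0.0001, which is easy to miss against the ordinary noise in a training curve.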

Comment by Daniel Murfet (dmurfet) on A list of core AI safety problems and how I hope to solve them · 2023-08-28T12:58:59.961Z · LW · GW

Thanks, that makes a lot of sense to me. I have some technical questions about the post with Owen Lynch, but I'll follow up elsewhere.

Comment by Daniel Murfet (dmurfet) on A list of core AI safety problems and how I hope to solve them · 2023-08-27T20:54:29.124Z · LW · GW

4. Goals misgeneralize out of distribution.

See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning

OAA Solution: (4.1) Use formal methods with verifiable proof certificates[2]. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to "prove" that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.

 

Based on the Bold Plan post and this one my main point of concern is that I don't believe in the feasibility of the model checking, even in principle. The state space S and action space A of the world model will be too large for techniques along the lines of COOL-MC which (if I understand correctly) have to first assemble a discrete-time Markov chain by querying the NN and then try to apply formal verification methods to that. I imagine that actually you are thinking of learned coarse-graining of both S and A, to which one applies something like formal verification.

Assuming that's correct, then there's an inevitable lack of precision on the inputs to the formal verification step. You have to either run the COOL-MC-like process until you hit your time and compute budget and then accept that you're missing state-action pairs, or you coarse-grain to some degree within your budget and accept a dependence on the quality of your coarse-graining. If you're doing an end-run around this tradeoff somehow, could you direct me to where I can read more about the solution?

I know there's literature on learned coarse-grainings of S and A in the deep RL setting, but I haven't seen it combined with formal verification. Is there a literature? It seems important.

I'm guessing that this passage in the Bold Plan post contains your answer:

> Defining a sufficiently expressive formal meta-ontology for world-models with multiple scientific explanations at different levels of abstraction (and spatial and temporal granularity) having overlapping domains of validity, with all combinations of {Discrete, Continuous} and {time, state, space}, and using an infra-bayesian notion of epistemic state (specifically, convex compact down-closed subsets of subprobability space) in place of a Bayesian state

In which case I see where you're going, but this seems like the hard part?

Comment by Daniel Murfet (dmurfet) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T09:20:26.993Z · LW · GW

Induction heads? Ok, we are maybe on track to retro engineer the mechanism of regex in LLMs. Cool.

 

This dramatically undersells the potential impact of Olsson et al. You can't dismiss modus ponens as "just regex". That's the heart of logic!

For many the argument for AI safety being an urgent concern involves a belief that current systems are, in some rough sense, reasoning, and that this capability will increase with scale, leading to beyond human-level intelligence within a timespan of decades. Many smart outsiders remain sceptical, because they are not convinced that anything like reasoning is taking place.

I view Olsson et al as nontrivial evidence for the emergence of internal computations resembling reasoning, with increasing scale. That's profound. If that case is made stronger over time by interpretability (as I expect it to be) the scientific, philosophical and societal impact will be immense.

Comment by Daniel Murfet (dmurfet) on Towards Developmental Interpretability · 2023-07-15T05:29:37.324Z · LW · GW

That intuition sounds reasonable to me, but I don't have strong opinions about it.

One thing to note is that training and test performance are lagging indicators of phase transitions. In our limited experience so far, measures such as the RLCT do seem to indicate that a transition is underway earlier (e.g. in Toy Models of Superposition), but in the scenario you describe I don't know if it's early enough to detect structure formation "when it starts". 

For what it's worth my guess is that the information you need to understand the structure is present at the transition itself, and you don't need to "rewind" SGD to examine the structure forming one step at a time.

Comment by Daniel Murfet (dmurfet) on DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks · 2023-07-12T05:26:31.685Z · LW · GW

If the cost is a problem for you, send a postal address to daniel.murfet@gmail.com and I'll mail you my physical copy. 

Comment by Daniel Murfet (dmurfet) on A Defense of Work on Mathematical AI Safety · 2023-07-06T20:58:25.553Z · LW · GW

Thanks for the article. For what it's worth, here's the defence I give of Agent Foundations and associated research, when I am asked about it (for background, I'm a mathematician, now working on mathematical aspects of AI safety different from Agent Foundations). I'd be interested if you disagree with this framing.

We can imagine the alignment problem coming in waves. Success in each wave merely buys you the chance to solve the next. The first wave is the problem we see in front of us right now, of getting LLMs to Not Say Naughty Things, and we can glimpse a couple of waves after that. We don't know how many waves there are, but it is reasonable to expect that beyond the early waves our intuitions probably aren't worth much. 

That's not a surprise! As physics probed smaller scales, at some point our intuitions stopped being worth anything, and we switched to relying heavily on abstract mathematics (which became a source of different, more hard-won intuitions). Similarly, we can expect that as we scale up our learning machines, we will enter a regime where current intuitions fail to be useful. At the same time, the systems may be approaching more optimal agents, and theories like Agent Foundations start to provide a very useful framework for reasoning about the nature of the alignment problem.

So in short I think of Agent Foundations as like quantum mechanics: a bit strange perhaps, but when push comes to shove, one of the few sources of intuition we have about waves 4, 5, 6, ... of the alignment problem. It would be foolish to bet everything on solving waves 1, 2, 3 and then be empty handed when wave 4 arrives.