Comments
Looks like a conspiracy of pigeons posing as LW commenters has downvoted your post
Thanks!
I haven't grokked your loss scales explanation (the "interpretability insights" section) though, without having read your other post.
Not saying anything deep here. The point is just that you might have two cartoon pictures:
- every correctly classified input is either the result of a memorizing circuit or of a single coherent generalizing circuit behavior. If you remove a single generalizing circuit, your accuracy will degrade additively.
- a correctly classified input is the result of a "combined" circuit consisting of multiple parallel generalizing "subprocesses" giving independent predictions, and if you remove any of these subprocesses, your accuracy will degrade multiplicatively.
A lot of ML work only thinks about picture #1 (which is the natural picture to look at if you only have one generalizing circuit and every other circuit is a memorization). But the thing I'm saying is that picture #2 also occurs, and in some sense is "the info-theoretic default" (though both occur simultaneously -- this is also related to the ideas in this post)
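A toy numeric contrast of the two cartoons (all numbers here are invented for illustration; the "circuits" are just abstract coverage fractions and reliabilities, not real model components):

```python
# Toy numerics contrasting the two cartoon pictures of circuit removal.
import numpy as np

# Picture 1: each generalizing circuit correctly handles its own disjoint slice
# of inputs (plus a memorization pool). Removing a circuit subtracts its slice:
# additive degradation.
coverage = np.array([0.30, 0.25, 0.20])   # fraction of inputs each circuit handles
memorized = 0.15                          # fraction handled by memorization
acc_full = coverage.sum() + memorized
for i in range(len(coverage)):
    print(f"picture 1: remove circuit {i}: {acc_full:.2f} -> {acc_full - coverage[i]:.2f}")

# Picture 2: every input is routed through several parallel subprocesses whose
# independent predictions must all be right; accuracy is (roughly) a product of
# their reliabilities. Removing one replaces its factor by chance level:
# multiplicative degradation.
reliabilities = np.array([0.95, 0.90, 0.85])
chance = 0.5
acc_full2 = reliabilities.prod()
for i in range(len(reliabilities)):
    acc_ablate = acc_full2 / reliabilities[i] * chance
    print(f"picture 2: remove subprocess {i}: {acc_full2:.2f} -> {acc_ablate:.2f}")
```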
Thanks for the questions!
You first introduce the SLT argument that tells us which loss scale to choose (the "Watanabe scale", derived from the Watanabe critical temperature).
Sorry, I think the context of the Watanabe scale is a bit confusing. I'm saying that in fact it's the wrong scale to use as a "natural scale". The Watanabe scale depends only on the number of training datapoints, and doesn't notice any other properties of your NN or your phenomenon of interest.
Roughly, the Watanabe scale is the scale on which loss improves if you memorize a single datapoint (so memorizing improves accuracy by 1/n with n = #(training set) and, in a suitable operationalization, improves loss by roughly $\frac{\log n}{n}$, and this is the Watanabe scale).
It's used in SLT roughly because it's the minimal temperature scale where "memorization doesn't count as relevant", and so relevant measurements become independent of the n-point sample. However in most interp experiments, the realistic loss reconstruction is much rougher (i.e., further from optimal loss) than the 1/n scale where memorization becomes an issue (even if you conceptualize #(training set) as some small synthetic training set that you were running the experiment on).
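(For concreteness, the back-of-the-envelope version of the scale I have in mind, in my notation -- a paraphrase, not a quote from Watanabe:)

```latex
\[
  p_\beta(w) \;\propto\; \varphi(w)\, e^{-n\beta L_n(w)},
  \qquad
  \beta^* = \frac{1}{\log n}
  \;\Longrightarrow\;
  n\beta^*\,\Delta L \sim 1
  \;\iff\;
  \Delta L \sim \frac{\log n}{n},
\]
```

i.e., at the Watanabe temperature two weights are only distinguished by the tempered posterior if their losses differ by roughly $\frac{\log n}{n}$, which is comparable (up to the log factor) to the $O(1/n)$ loss improvement from memorizing one extra training point.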
For your second question: again, what I wrote is confusing and I really want to rewrite it more clearly later. I tried to clarify what I think you're asking about in this shortform. Roughly, the point here is that to avoid having your results messed up by spurious behaviors, you might want to degrade as much as possible while still observing the effect of your experiment. The idea is that if you found any degradation that wasn't explicitly designed with your experiment in mind (i.e., is natural), but where you see your experimental results hold, then you have "found a phenomenon". The hope is that if you look at the roughest such scale, you might kill enough confounders and interactions to make your result be "clean" (or at least cleaner): so for example optimistically you might hope to explain all the loss of the degraded model at the degradation scale you chose (whereas at other scales, there are a bunch of other effects improving the loss on the dataset you're looking at that you're not capturing in the explanation).
The question now is, when degrading, what order you want to "kill confounders" in to optimally purify the effect you're considering. The "natural degradation" idea seems like a good place to look since it kills the "small but annoying" confounders: things like memorization, weird specific connotations of the test sentences you used for your experiment, etc. Another reasonable place to look is training checkpoints, as these correspond to killing "hard to learn" effects. Ideally you'd perform several kinds of degradation to "maximally purify" your effect. Here the "natural scales" (loss on the level of, e.g., Claude 1 or BERT) are much too fine for most modern experiments, and I'm envisioning something much rougher.
The intuition here comes from physics. Like if you want to study properties of a hydrogen atom that you don't see either in water or in hydrogen gas, a natural thing to do is to heat up hydrogen gas to extreme temperatures where the molecules degrade but the atoms are still present, now in "pure" form. Of course not all phenomena can be purified in this way (some are confounded by effects both at higher and at lower temperature, etc.).
Thanks! Yes the temperature picture is the direction I'm going in. I had heard the term "rate distortion", but didn't realize the connection with this picture. Might have to change the language for my next post
This seems overstated
In some sense this is the definition of the complexity of an ML algorithm; more precisely, the direct analog of complexity in information theory, which is the "entropy" or "Solomonoff complexity" measurement, is the free energy (I'm writing a distillation on this but it is a standard result). The relevant question then becomes whether the "SGLD" sampling techniques used in SLT for measuring the free energy (or technically its derivative) actually converge to reasonable values in polynomial time. This is checked pretty extensively in this paper for example.
A possibly more interesting question is whether notions of complexity in interpretations of programs agree with the inherent complexity as measured by free energy. The place I'm aware of where this is operationalized and checked is our project with Nina on modular addition: here we do have a clear understanding of the platonic complexity, and the local learning coefficient does a very good job of asymptotically capturing it with very good precision (both for memorizing and generalizing algorithms, where the complexity difference is very significant).
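A minimal sketch of the kind of measurement I mean (a toy numpy SGLD run on a synthetic degenerate loss; the estimator has the standard local-learning-coefficient shape, but the loss surface and all hyperparameters here are made up for illustration, not a validated implementation):

```python
# Toy SGLD estimate of a local learning coefficient (LLC) around a minimum.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                       # pretend number of training samples
beta = 1.0 / np.log(n)           # Watanabe inverse temperature
gamma = 1.0                      # localization strength around w_star
eps = 1e-4                       # SGLD step size
steps, burn_in = 20_000, 5_000

def loss(w):
    # Degenerate toy "population" loss: flat directions make the effective
    # parameter count smaller than the raw dimension.
    return w[0] ** 2 + (w[1] * w[2]) ** 2

def grad(w):
    return np.array([2 * w[0], 2 * w[1] * w[2] ** 2, 2 * w[2] * w[1] ** 2])

w_star = np.zeros(3)             # local minimum we measure around
w = w_star.copy()
samples = []
for t in range(steps):
    drift = -(n * beta) * grad(w) - gamma * (w - w_star)
    w = w + 0.5 * eps * drift + np.sqrt(eps) * rng.normal(size=3)
    if t >= burn_in:
        samples.append(loss(w))

llc_hat = n * beta * (np.mean(samples) - loss(w_star))
print(f"estimated local learning coefficient: {llc_hat:.2f}")
```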
Citation? [for Apollo]
Look at this paper (note I haven't read it yet). I think their LIB work is also promising (at least it separates circuits of small algorithms)
Thanks for the reference, and thanks for providing an informed point of view here. I would love to have more of a debate here, and would quite like being wrong as I like tropical geometry.
First, about your concrete question:
As I understand it, here the notion of "density of polygons' is used as a kind of proxy for the derivative of a PL function?
Density is a proxy for the second derivative: indeed, the closer a function is to linear, the easier it is to approximate it by a linear function. I think a similar idea occurs in 3D graphics, in mesh optimization, where you can improve performance by reducing the number of cells in flatter domains (I don't understand this field, but this is done in this paper according to some curvature-related energy functional). The question of "derivative change when crossing walls" seems similar. In general, glancing at the paper you sent, it looks like polyhedral currents are a locally polynomial PL generalization of currents of ordinary functions (and it seems that there is some interesting connection made to intersection theory/analogues of Chow theory, though I don't have nearly enough background to read this part carefully). Since the purpose of PL functions in ML is to approximate some (approximately smooth, but fractally messy and stochastic) "true classification", I don't see why one wouldn't just use ordinary currents here (currents on a PL manifold can be made sense of after smoothing, or in a distribution-valued sense, etc.).
In general, I think the central crux between us is whether or not this is true:
tropical geometry might be relevant to ML, for the simple reason that the functions coming up in ML with ReLU activation are PL
I'm not sure I agree with this argument. The use of PL functions is by no means central to ML theory, and is an incidental aspect of early algorithms. The most efficient activation functions for most problems tend to not be ReLUs, though the question of activation functions is often somewhat moot due to the universal approximation theorem (and the fact that, in practice, at least for shallow NNs anything implementable by one reasonable activation tends to be easily implementable, with similar macroscopic properties, by any other). So the reason that PL functions come up is that they're "good enough to approximate any function" (and also "asymptotic linearity" seems genuinely useful to avoid some explosion behaviors). But by the same token, you might expect people who think deeply about polynomial functions to be good at doing analysis because of the Stone-Weierstrass theorem.
More concretely, I think there are two core "type mismatches" between tropical geometry and the kinds of questions that appear in ML:
- Algebraic geometry in general (including tropical geometry) isn't good at dealing with deep compositions of functions, and especially approximate compositions.
- (More specific to TG): the polytopes that appear in neural nets are, as I explained, inherently random (the typical interpretation we have of even combinatorial algorithms like modular addition is that the PL functions produce some random sharding of some polynomial function). This is a very strange thing to consider from the point of view of a tropical geometer: as an algebraic geometer, it's hard for me to imagine a case where "this polynomial has degree approximately 5... it might be 4 or 6, but the difference between them is small". I simply can't think of any behavior that is at all meaningful from an AG-like perspective where the questions of fan combinatorics and degrees of polynomials are replaced by questions of approximate equality.
I can see myself changing my view if I see some nontrivial concrete prediction or idea that tropical geometry can provide in this context. I think a "relaxed" form of this question (where I genuinely haven't looked at the literature) is whether tropical geometry has ever been useful (either in proving something or at least in reconceptualizing something in an interesting way) in linear programming. I think if I see a convincing affirmative answer to this relaxed question, I would be a little more sympathetic here. However, the type signature here really does seem off to me.
If I understand correctly, you want a way of thinking about a reference class of programs that has some specific, perhaps interpretability-relevant or compression-related properties in common with the deterministic program you're studying?
I think in this case I'd actually say the tempered Bayesian posterior by itself isn't enough, since even if you work locally in a basin, it might not preserve the specific features you want. In this case I'd probably still start with the tempered Bayesian posterior, but then also condition on the specific properties/explicit features/ etc. that you want to preserve. (I might be misunderstanding your comment though)
Statistical localization in disordered systems, and dreaming of more realistic interpretability endpoints
[epistemic status: half fever dream, half something I think is an important point to get across. Note that the physics I discuss is not my field though close to my interests. I have not carefully engaged with it or read the relevant papers -- I am likely to be wrong about the statements made and the language used.]
A frequent discussion I get into in the context of AI is "what is an endpoint for interpretability". I get into this argument from two sides:
- arguing with interpretability purists, who say that the only way to get robust safety from interpretability is to mathematically prove that behaviors are safe and/or no deception is going on.
- arguing with interpretability skeptics, who say that the only way to get robust safety from interpretability would be to prove that behaviors are safe and/or no deception is going on -- and since that's unattainable, interpretability can't deliver robust safety.
My typical response to this is that no, you're being silly: imagine discussing any other phenomenon in this way: "the only way to show that the sun will rise tomorrow is to completely model the sun on the level of subatomic particles and prove that they will not spontaneously explode". Or asking a bridge safety expert to model every single particle and provably lower-bound the probability of them losing structural coherence in a way not observed by bulk models.
But there's a more fundamental intuition here, one that I started developing when I started trying to learn statistical physics. There are a few lossy ways of expressing it. One is to talk about renormalization, and how the assumption of renormalizability of systems is a "theorem" in statistical mechanics, but is not (and probably never will be) proven mathematically (in some sense, it feels much more like a "truly new flavor of axiom" than even complexity-theoretic things like P vs. NP). But that's still not it. There is a more general intuition that's hard to get across (in particular for someone who, like me, is only a dabbler in the subject) -- that some genuinely incredibly complex and information-laden systems have some "strong locality" properties, which are (insofar as the physical meaning of the word holds) both provable and very robust to changing and expanding the context.
For a while, I thought that this is just a vibe -- a way to guide thinking, but not something that can be operationalized in a way that may significantly convince people without a similar intuition.
However, recently I've become more hopeful that an "explicitly formalizable" notion of robust interpretability may fall out of this language in a somewhat natural way.
This is closely related to recent discussions and writeups we've been doing with Lauren Greenspan on scale and renormalization in (statistical) QFT and connections to ML.
One direction to operationalize this is through the notion of "localization" in statistical physics, and in particular "Anderson localization". The idea (if I understand it correctly) is that in certain disordered systems (think of a semiconductor, which is an "ordered" metal with a disordered system of "impurity atoms" sprinkled inside), you can prove a kind of screening property: that from the point of view of the localized dynamics near a particular spin, you can provably ignore spins far away from the point you're studying (or rather, replace them by an "ordered" field that modifies the local dynamics in a fully controllable way). This idea of local interactions being "screened" from far-away details is ubiquitous. In a very large and very robust class of systems, interactions are purely local, except for mediation by a small number of hierarchical "smooth" couplings that see only high-level summary statistics of the "non-local" spins and treat them as a background -- and moreover, these "locality" properties are provable (insofar as we assume the extra "axioms" of thermodynamics), assuming some (once again, hierarchical and robustly adjustable) assumptions of independence. There are a number of related principles here that (if I understand correctly) get used in similar contexts, sometimes interchangeably: one I liked is "local perturbations perturb locally" ("LPPL") from this paper.
Note that in the above paragraph I did something I generally disapprove of: I am trying to extract and verbalize "vibes" from science that I don't understand on a concrete level, and I am almost certainly getting a bunch of things wrong. But I don't know of another way of gesturing in a "look, there's something here and it's worth looking into" way without doing this to some extent.
Now AI systems, just like semiconductors, are statistical systems with a lot of disorder. In particular in a standard operationalization (as e.g. in PDLT), we can conceptualize neural nets as a field theory. There is a "vacuum theory" that depends only on the architecture, and then adding new datapoints corresponds to adding particles. PDLT only studies a certain perturbative picture here, but it seems plausible that these techniques may extend to non-perturbative scales (and hope for this is a big part of the reason that Lauren and I have been thinking and writing about renormalization). In a "dream" version of such an extension, the datapoints would form a kind of disordered system, with ordered components, hierarchical relationships, and some assumption of inherent randomness outside of the relationships. A great aspect of "numerical" QFT, such as gets applied in condensed matter models, is that you don't need a really great model of the hierarchical relationships: sometimes you can just play around and turn on a handful of extra parameters until you find something that works. (Again, at the moment this is an imprecise interpretation of things I have not deeply engaged with.)
Of course doing this makes some assumptions -- but the assumptions are on the level of the data (i.e. particles), not the weights/model internals (i.e., fields -- the place where we are worried about misalignment, etc.). And if you grant these assumptions and write down a "localization theorem" result, then plausibly the kind of statement you will get is something along the lines of the following:
"the way this LLM is completing this sentence is a combination of a sophisticated collection of hierarchical relationships, but I know that the behavior here is equivalent to behaviors on other similar sentences up to small (provably) low-complexity perturbations".
More generally, the kind of information this picture would give is a kind of "local provably robust interpretability" -- where the text completion behavior of a model is provably (under suitable "disordered system" assumptions) reducible to a collection of several local circuits that depend on understandable phenomena at a few different scales. A guiding "complexity intuition" for me here is provided by the nontrivial but tractable grammar task diagrams in the paper by Marks et al. (See pages 25-27, and note the shape of these diagrams is more or less straight-up typical of the shape of a nonrenormalized interaction diagram you see before you start applying renormalization to simplify a statistical system.)
An important caveat here is that in physical models of this type (and in pictures that include renormalization more generally), one does not make -- or assume -- any "fundamentality" assumptions. In many cases a number of alternative (but equivalent, once the "screening" is factored in) pictures exist, with various levels of granularity, elegance, etc. (this already can be seen in the 2D Ising model -- a simple magnet model -- where the same behaviors can be understood either in a combinatorial "spin-to-spin interaction" way, which mirrors the "fundamental interpretability" desires of mechinterp, or through this "recursive screening out" model that is more renormalization-flavored; the results are the same (to a very high level of precision), even when looking at very localized effects involving collections of a few spins). So the question of whether an interpretation is "fundamental" or uses the "right latents" is to a large extent obviated here; the world of thermodynamics is much more anarchical and democratic than the world of mathematical formalism and "elegant proof", at least in this context.
Having handwavily described a putative model, I want to quickly say that I don't actually believe in this model. There are a bunch of things I probably got wrong, there are a bunch of other, better tools to use, and so on. But the point is not the model: it's that this kind of stuff exists. There exist languages that show that arbitrarily complex, arbitrarily expressive behaviors are provably reducible to local interactions, where behaviors can be understood as clusters of hierarchical interactions that treat all but a few parts of the system at every point as "screened out noise".
I think that if models like this are possible, then a solution to "the interpretability component of safety" is possible in this framework. If you have provably localized behaviors then, for example, you have a good idea where to look for deception: e.g., deception cannot occur on the level of "very low-level" local interactions, as they are too simple to express the necessary reasoning, and perhaps it can be carefully operationalized and tracked in the higher-level interactions.
As you've no doubt noticed, this whole picture is splotchy and vague. It may be completely wrong. But there also may be something in this direction that works. I'm hoping to think more about this, and very interested in hearing people's criticisms and thoughts.
What application do you have in mind? If you're trying to reason about formal models without trying to completely rigorously prove things about them, then I think thinking of neural networks as stochastic systems is the way to go. Namely, you view the weights as a weight-valued random variable produced by solving a stochastic optimization problem, and then condition it on whatever knowledge about the weights/activations you assume is available. This can be done both in the Bayesian "thermostatic" sense as a model of idealized networks, and in the sense of modeling the NN as an SGD-like system. Both methods are explored explicitly (and give different results) in suitable high-width limits by the PDLT and tensor programs paradigms (the latter also looks at "true SGD" with nonnegligible step size).
Here you should be careful about what you condition on, as conditioning on exact knowledge of too much input-output behavior of course blows stuff up, and you should think of a way of coarse-graining, i.e. "choose a precision scale" :). Here my first go-to would be to assume the tempered Boltzmann distribution on the loss at an appropriate choice of temperature for what you're studying.
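(Schematically, and in my notation -- just a sketch of the object I'd start from, with $\varphi$ a prior over weights, $L_n$ the empirical loss, $\beta$ the chosen temperature, and $\Phi$ some coarse-grained summary of the weights/activations you condition on:)

```latex
\[
  p_\beta\!\left(w \mid \Phi(w) \in O\right)
  \;\propto\;
  \varphi(w)\; e^{-\beta n L_n(w)}\;
  \mathbf{1}\!\left[\,\Phi(w) \in O\,\right].
\]
```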
If you're trying to do experiments, then I would suspect that a lot of the time you can just blindly throw whatever ML-ish tools you'd use in an underdetermined, "true inference" context and they'll just work (with suitable choices of hyperparameters)
This is where this question of "scale" comes in. I want to add that (at least morally/intuitively) we are also thinking about discrete systems like lattices, and then instead of a regulator you have a coarsegraining or a "blocking transformation", which you have a lot of freedom to choose. For example in PDLT, the object that plays the role of coarsegraining is the operation that takes a probability distribution on neurons and applies a single-layer NN to it.
https://www.cond-mat.de/events/correl22/manuscripts/vondelft.pdf
Thanks for the reference -- I'll check out the paper (though there are no pointer variables in this picture inherently).
I think there is a miscommunication in my messaging. Possibly through overcommitting to the "matrix" analogy, I may have given the impression that I'm doing something I'm not. In particular, the view here isn't a controversial one -- it has nothing to do with Everett or einselection or decoherence. Crucially, I am saying nothing at all about quantum branches.
I'm now realizing that when you say map or territory, you're probably talking about a different picture where quantum interpretation (decoherence and branches) is foregrounded. I'm doing nothing of the sort, and as far as I can tell never making any "interpretive" claims.
All the statements in the post are essentially mathematically rigorous claims which say what happens when you
- start with the usual QM picture, and posit that
- your universe divides into at least two subsystems, one of which you're studying
- one of the subsystems your system is coupled to is a minimally informative infinite-dimensional environment (i.e., a bath).
Both of these are mathematically formalizable and aren't saying anything about how to interpret quantum branches etc. And the Lindbladian is simply a useful formalism for tracking the evolution of a system that has these properties (subdivisions and baths). Note that (maybe this is the confusion?) subsystem does not mean quantum branch, or decoherence result. "Subsystem" means that we're looking at these particles over here, but there are also those particles over there (i.e. in terms of math, your Hilbert space is a tensor product $\mathcal{H} = \mathcal{H}_{\text{here}} \otimes \mathcal{H}_{\text{there}}$).
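(A standard textbook example of the subsystem point, in my notation: restricting a pure joint state to one tensor factor already gives a mixed density matrix, with no bath and no interpretive claims needed:)

```latex
\[
  |\psi\rangle = \tfrac{1}{\sqrt{2}}\big(|0\rangle_A|0\rangle_B + |1\rangle_A|1\rangle_B\big)
  \in \mathcal{H}_A \otimes \mathcal{H}_B,
  \qquad
  \rho_A = \operatorname{Tr}_B\, |\psi\rangle\langle\psi|
  = \tfrac{1}{2}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]
```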
Also, I want to be clear that we can and should run this whole story without ever using the term "probability distribution" in any of the quantum-thermodynamics concepts. The language to describe a quantum system as above (system coupled with a bath) is from the start a language that only involves density matrices, and never uses the term "X is a probability distribution of Y". Instead you can get classical probability distributions to map into this picture as a certain limit of these dynamics.
As to measurement, I think you're once again talking about interpretation. I agree that in general, this may be tricky. But what is once again true mathematically is that if you model your system as coupled to a bath then you can set up behaviors that behave exactly as you would expect from an experiment from the point of view of studying the system (without asking questions about decoherence).
Thanks for the questions!
- Yes, "QFT" stands for "Statistical field theory" :). We thought that this would be more recognizable to people (and also, at least to some extent, statistical is a special case of quantum). We aren't making any quantum proposals.
- We're following (part of) this community, and interested in understanding and connecting the different parts better. Most papers in the "reference class" we have looked at come from (a variant of) this approach. (The authors usually don't assume Gaussian inputs or outputs, but just high width compared to depth and number of datapoints -- this does make them "NTK-like", or at least perturbatively Gaussian, in a suitable sense).
- Neither of us thinks that you should think of AI as being in this regime. One of the key issues here is that Gaussian models can not model any regularities of the data beyond correlational ones (and it's a big accident that MNIST is learnable by Gaussian methods). But we hope that what AIs learn can largely be well-described by a hierarchical collection of different regimes where the "difference", suitably operationalized, between the simpler interpretation and the more complicated one is well-modeled by a QFT-like theory (in a reference class that includes perturbatively Gaussian models but is not limited to them). In particular one thing that we'd expect to occur in certain operationalizations of this picture is that once you have some coarse interpretation that correctly captures all generalizing behaviors (but may need to be perturbed/suitably denoised to get good loss), the last and finest emergent layer will be exactly something in the perturbatively Gaussian regime.
- Note that I think I'm more bullish about this picture and Lauren is more nuanced (maybe she'll comment about this). But we both think that it is likely that having good understanding of perturbatively Gaussian renormalization would be useful for "patching in the holes", as it were, of other interpretability schemes. A low-hanging fruit here is that whenever you have a discrete feature-level interpretation of a model, instead of just directly measuring the reconstruction loss you should at minimum model the difference model-interpretation as a perturbative Gaussian (corresponding to assuming the difference has "no regularity beyond correlation information"); see the sketch after this list.
- We don't want to assume homogeneity, and this is mostly covered by 2b-c above. I think the main point we want to get across is that it's important and promising to try to go beyond the "homogeneity" picture -- and to try to test this in some experiments. I think physics has a good track record here. Not on the level of tigers, but for solid-state models like semiconductors. In this case you have:
- The "standard model" only has several-particle interactions (corresponding to the "small-data limit").
- By applying RG techniques to a regular metallic lattice (with initial interactions from the standard model), you end up with a good new universality class of QFT's (this now contains new particles like phonons and excitons which are dictated by the RG analysis at suitable scales). You can be very careful and figure out the renormalization coupling parameters in this class exactly, but much more realistically and easily you just get them from applying a couple of measurements. On an NN level, "many particles arranged into a metallic pattern" corresponds to some highly regular structure in the data (again, we think "particles" here should correspond to datapoints, at least in the current RLTC paradigm).
- The regular metal gives you a "background" theory, and now we view impurities as a discrete random-feature theory on top of this background. Physicists can still run RG on this theory by zooming out and treating the impurities as noise, but in fact you can also understand the theory on a fine-grained level near an impurity by a more careful form of renormalization, where you view the nearest several impurities as discrete sources and only coarsegrain far-away impurities as statistical noise. At least for me, the big hope is that this last move is also possible for ML systems. In other words, when you are interpreting a particular behavior of a neural net, you can model it as a linear combination of a few messy discrete local circuits that apply in this context (like the complicated diagram from Marks et al below) plus a correctly renormalized background theory associated to all other circuits (plus corrections from other layers plus ...)
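The sketch referenced above, for the "model the model-interpretation difference as a Gaussian" suggestion. The setup is entirely made up (random arrays standing in for logits from a model and from a feature-level interpretation of it on the same batch), and the specific estimator choices are mine; the point is just the shape of the procedure:

```python
# Sketch: instead of scoring an interpretation by raw reconstruction loss alone,
# model the (model - interpretation) residual as a Gaussian with no structure
# beyond its empirical mean/covariance, and score the interpretation *plus*
# that Gaussian correction. Arrays below are stand-ins for real logits.
import numpy as np

rng = np.random.default_rng(0)
n_batch, n_classes = 512, 10

# Stand-ins: in a real experiment these come from the model and from the
# feature-level interpretation evaluated on the same inputs.
model_logits = rng.normal(size=(n_batch, n_classes))
interp_logits = model_logits + 0.3 * rng.normal(size=(n_batch, n_classes))
labels = rng.integers(0, n_classes, size=n_batch)

def xent(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# Fit the residual as an unstructured Gaussian (correlational info only).
resid = model_logits - interp_logits
mu, cov = resid.mean(axis=0), np.cov(resid, rowvar=False)

# Score the interpretation with the Gaussian correction averaged by sampling.
n_samples = 64
corrected = interp_logits[None] + rng.multivariate_normal(mu, cov, size=(n_samples, n_batch))
loss_interp = xent(interp_logits, labels)
loss_corrected = np.mean([xent(corrected[s], labels) for s in range(n_samples)])
print(f"raw interpretation loss: {loss_interp:.3f}, Gaussian-corrected: {loss_corrected:.3f}")
print(f"model loss: {xent(model_logits, labels):.3f}")
```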
To add: I think the other use of "pure state" comes from this context. Here if you have a system of commuting operators and take a joint eigenspace, the projector is mixed, but it is pure if the joint eigenvalue uniquely determines a 1D subspace; and then I think this terminology gets used for wave functions as well
One person's "Occam's razor" may be description length, another's may be elegance, and a third person's may be "avoiding having too much info inside your system" (as some anti-MW people argue). I think discussions like "what's real" need to be done thoughtfully, otherwise people tend to argue past each other, and come off overconfident/underinformed.
To be fair, I did use language like this so I shouldn't be talking -- but I used it tongue-in-cheek, and the real motivation given in the above is not "the DM is a more fundamental notion" but "DM lets you make concrete the very suggestive analogy between quantum phase and probability", which you would probably agree with.
For what it's worth, there are "different layers of theory" (often scale-dependent), like classical vs. quantum vs. relativity, etc., where I think it's silly to talk about "ontological truth". But these theories are local conceptual optima among a graveyard of "outdated" theories that are strictly conceptually inferior to new ones: examples are heliocentrism (and Ptolemy's epicycles), the ether, etc.
Interestingly, I would agree with you (with somewhat low confidence) that in this question there is a consensus among physicists that one picture is simply "more correct" in the sense of giving theoretically and conceptually more elegant/ precise explanations. Except your sign is wrong: this is the density matrix picture (the wavefunction picture is genuinely understood as "not the right theory", but still taught and still used in many contexts where it doesn't cause issues).
I also think that there are two separate things that you can discuss.
- Should you think of thermodynamics, probability, and things like thermal baths as fundamental to your theory or incidental epistemological crutches to model the world at limited information?
- Assuming you are studying a "non-thermodynamic system with complete information", where all dynamics is invertible over long timescales, should you use wave functions or density matrices?
Note that for #1, you should not think of a density matrix as a probability distribution on quantum states (see the discussion with Optimization Process in the comments) -- this is a bad intuition pump. Instead, the thing that replaces probability distributions in quantum mechanics is a density matrix.
I think a charitable interpretation of your criticism would be a criticism of #1 (putting limited-info dynamics -- i.e., quantum thermodynamics -- as primary to "invertible dynamics"). Here there is a debate to be had.
I think there is not really a debate in #2: even in invertible QM (no probability), you need to use density matrices if you want to study different subsystems (e.g. when modeling systems existing in an infinite, but not thermodynamic universe you need this language, since restricting a wavefunction to a subsystem makes it mixed). There's also a transposed discussion, that I don't really understand, of all of this in field theory: when do you have fields vs. operators vs. other more complicated stuff, and there is some interesting relationship to how you conceptualize "boundaries" - but this is not what we're discussing. So you really can't get away from using density matrices even in a nice invertible universe, as soon as you want to relate systems to subsystems.
For question #1, it is reasonable (though I don't know how productive) to discuss what is "primary". I think (but here I am really out of my depth) that people who study very "fundamental" quantum phenomena increasingly use a picture with a thermal bath (e.g. I vaguely remember this happening in some lectures here). At the same time, it's reasonable to say that "invertible" QM phenomena are primary and statistical phenomena are ontological epiphenomena on top of this. While this may be a philosophical debate, I don't think it's a physical one, since the two pictures are theoretically interchangeable (as I mentioned, there is a canonical way to get thermodynamics from unitary QM as a certain "optimal lower bound on information dynamics", appropriately understood).
Still, as soon as you introduce the notion of measurement, you cannot get away from thermodynamics. Measurement is an inherently information-destroying operation, and iiuc can only be put "into theory" (rather than being an arbitrary add-on that professors tell you about) using the thermodynamic picture with nonunitary operators on density matrices.
Thanks - you're right. I have seen "pure state" referring to a basis vector (e.g. in quantum computation), but in QTD your definition is definitely correct. I don't like the term "pointer variable" -- is there a different notation you like?
Yeah, this also bothered me. The notion of "probability distribution over quantum states" is not a good notion: the matrix $I$ is both $|0\rangle \langle 0|+|1\rangle \langle 1|$ and $|a\rangle \langle a|+|b\rangle \langle b|$ for any other orthonormal basis $\{|a\rangle, |b\rangle\}$. The fact that these should be treated equivalently seems totally arbitrary. The point is that density matrix mechanics is the notion of probability for quantum states, and can be formalized as such (dynamics of informational lower bounds given observations). I was sort of getting at this with the long "explaining probability to an alien" footnote, but I don't think it landed (and I also don't have the right background to make it precise)
I've found our Agent Smith :) If you are serious, I'm not sure what you mean. Like there is no ontology in physics -- every picture you make is just grasping at pieces of whatever theory of everything you eventually develop
I like this! Something I would add at some point before unitarity is that there is another type of universe that we almost inhabit, where your vectors of states have real positive coefficients that sum to 1, and your evolution matrices are Markovian (i.e., have positive coefficients and preserve the sum of coordinates). In a certain sense in such a universe it's weird to say "the universe is .3 of this particle being in state 1 and .7 of it being in state 2", but if we interpret this as a probability, we have lived experience of this.
Something that I like to point out that clicked for me at some point and serves as a good intuition pump, is that for many systems that have a real and quantum analogue, there is actually an interpolated collection of linear dynamics problems like you described that exactly interpolates between quantum and statistical. There's a little bit of weirdness here, BTW, since there's this weird nonlinearity ("squaring the norm") that you need to go from quantum to classical systems. The reason for this actually has to do with density matrices.
There's a whole post to be written on this, but the basic point is that "we've been lied to": when you're introduced to QM and see a wavefunction $\psi$, this actually doesn't correspond to any linear projection/disentanglement/etc. of the "multiverse state". What instead is being linearly extracted from the "multiverse state" is the outer product matrix $|\psi\rangle\langle\psi|$, which is the complex-valued matrix that projects to the 1-dimensional space spanned by the wave function. Now the correction of the "lie" is that the multiverse state itself should be thought of as a matrix. When you do this, the new dynamics now acts on the space of matrices. And you see that the quantum probabilities are now real-valued linear invariants of this state (to see this: the operation of taking the outer product with itself is quadratic, so the "squared norm" operators are now just linear projections that happen to have real values). In this picture, finding the probability of a measurement has exactly the same type signature as measuring the "probability of an event" in the statistical picture: namely, it is a linear function of the "multiverse vector" (just a probability distribution on states in the "statistical universe picture"). Now the evolution of the projection matrix still comes from a linear evolution on your "corrected" vector space of matrix states (in terms of your evolution matrix $U$, it takes the matrix $M$ to $U M U^\dagger$, and of course each coefficient of the new matrix is linear in the old matrix). So this new dynamics is exactly analogous to probability dynamics, with the exception that your matrices are non-Markovian (indeed, on the level of matrices they are also unitary or at least orthogonal) and you make an assumption on your initial "vector" that, when viewed as a matrix, it is a rank-1 complex projection matrix, i.e. has the form $|\psi\rangle\langle\psi|$. (In fact if you drop this assumption of being rank-1 and look instead at the linear subspace of matrices these generate -- namely, Hermitian matrices -- then you also get reasonable quantum mechanics, and many problems in QM in fact force you to make this generalization.)
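A tiny numeric check of the "matrices, not vectors" point (numpy; this is standard QM bookkeeping, nothing specific to the post):

```python
# Numeric check: measurement probabilities and unitary evolution are *linear*
# in the density matrix rho = |psi><psi|, even though they are quadratic in psi.
import numpy as np

rng = np.random.default_rng(0)

# A random pure state and a random unitary (via QR) on a 3-dimensional system.
psi = rng.normal(size=3) + 1j * rng.normal(size=3)
psi /= np.linalg.norm(psi)
U, _ = np.linalg.qr(rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3)))

rho = np.outer(psi, psi.conj())          # the "multiverse state" as a matrix

# Probability of measuring basis state |0>: quadratic in psi, linear in rho.
p_from_psi = abs(psi[0]) ** 2
P0 = np.zeros((3, 3)); P0[0, 0] = 1.0    # projector onto |0>
p_from_rho = np.trace(P0 @ rho).real
assert np.isclose(p_from_psi, p_from_rho)

# Evolution: psi -> U psi corresponds to the *linear* map rho -> U rho U^dagger.
rho_evolved = U @ rho @ U.conj().T
assert np.allclose(rho_evolved, np.outer(U @ psi, (U @ psi).conj()))
print("probabilities and evolution are linear functionals of rho, as claimed")
```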
The elves care, Alex. The elves care.
Why I'm in AI sequence: 2020 Journal entry about gpt3
I moved from math academia to full-time AI safety a year ago -- in this I'm in the same boat as Adam Shai, whose reflection post on the topic I recommend you read instead of this.
In making the decision, I went through a lot of thinking and (attempts at) learning about AI before that. A lot of my thinking had been about whether a pure math academic can make a positive difference in AI, and examples that I thought counterindicated this -- I finally decided this might be a good idea after talking to my sister Lizka extensively and doing MATS in Summer of 2023. I'm thinking of doing a more detailed post about my decision and thinking later, in case there are other academics thinking about making this transition (and feel free to reach out in pm's in this case!).
But one thing I have started to forget is how scary and visceral AI risk felt when I was making the decision. I'm both glad and a little sad that the urgency is less visceral and more theoretical now. AI is "a part of the world", not an alien feature: part of the "setting" in the Venkat Rao post that was part of my internal lexicon at the time.
For now, in order to fill a gap in my constantly flagging daily writing schedule, I'll share a meandering entry from 2020 about how I thought about positive AI futures. I don't endorse a lot of it; much is simplistic and low-context, or alternatively commonplace in these circles, though some of it holds up. It's interesting reading back that the thing I thought was most interesting as a first attempt at orienting my thinking was fleshing out "positive futures" and what they might entail. Two big directional updates I've had since are thinking harder about "human alignment" and "human takeover", and trying to temper the predictions that assume singularitarian "first-past-the-post" AGI for a messier "AI-is-kinda-AGI" world that we will likely end up in.
journal entry
7/19/2020 [...] I'm also being paranoid about GPT-3.
Let's think. Will the world end, and if so, when? No one knows, obviously. GPT-3 is a good text generation bot. It can figure out a lot about semantics, mood, style, even a little about humor. It's probably not going to take over the world yet. But how far away are we from AGI?
GPT-3 makes me think, "less than a decade". There's a possibility it will be soon (within the year). I'd assign that probability 10%. It felt like 20% when I first saw its text, but seeing Sam Altman's remark and thinking a little harder, I don't think it's quite realistic for it to go AGI without a significant extra step or two. I think that I'd give it order of 50% within the decade. So it's a little like living with a potentially fatal disease, with a prognosis of 10 years. Now we have no idea what AGI will be like. It will most likely either be very weird and deadly or revolutionary and good, though disappointing in some ways. I think there's not much we can do about the weird and deadly scenarios. Humans have lived in sociopathic times (see Venkat's notes on his 14th Century Europe book). It would probably be shorter and deadlier than the plague; various "human zoo" scenarios may be pleasant to experience (after all zoo animals are happier in general than in the wild, at least from the point of view of basic needs), but harrowing to imagine. In any case, it's not worth speculating on this.
What would a good outcome look like? Obviously, no one knows. It's very hard to predict our interaction with a super-human intelligence. But here are some pretty standard "decent" scenarios: (1) After a brief period of a pro-social AI piloted by a team of decent people, we end up with a world much like ours but with AI capabilities curbed for a long period of time [...]. If it were up to me I would design this world with certain "guard rail"-like changes: to me this would be a "Foundation"-style society somewhere in New Zealand (or on the bottom of the ocean perhaps? the moon?) consisting of people screened for decency, intelligence, etc. (but with serious diversity and variance built in), and with control of the world's nukes, with the responsibility of imposing very basic non-interference and freedom of immigration criteria on the world's societies (i.e., making the "archipelago" dream a reality, basically). So enforcing no torture, disincentivizing violent conflict, imposing various controls to make sure people can move from country to country and are exposed to the basic existence of a variety of experiences in the world, but allowing for culturally alien or disgusting practices in any given country: such as Russian homophobia, strict Islamic law, unpleasant-seeming (for Western Europeans) traditions in certain tribal cultures, etc. This combined with some sort of non-interventionist altruistic push. In this sci-fi scenario the Foundation-like culture would have de facto monopoly of the digital world (but use it sparingly) and also a system of safe nuclear power plants sufficient to provide the world's power (but turned on carefully and slowly, to prevent economic jolts), but to carefully and "incontrovertibly" turn most of the proceeds into a universal basic income for the entire world population. Obviously this would have to be carefully thought out first by a community of intelligent and altruistic people with clear rules of debate/decision. --- The above was written extremely sleepy. [...]
(2) (Unlikely) AI becomes integrated with (at first, decent and intelligent later, all interested) humans via some kind of mind-machine interface, or alternatively a faithful human modeling in silica. Via a very careful and considered transition (in some sense "adiabatic", i.e. designed so as not to lose any of our human ethos and meaning that can possibly be recovered safely) we become machines, with a good and meaningful (not wireheaded, other than by considered choice) world left for the hold-outs who chose to remain purely human.
(3) The "Her" scenario: AI takes off on its own, because of human carelessness or desperation. It develops in a way that cherishes and almost venerates humans, and puts effort into making a good, meaningful existence for humans (meaningful and good in sort of the above adiabatic sense, i.e. meaningful via a set of clearly desirable stages of progress from step to step, without hidden agendas, and carefully and thoughtfully avoiding creating or simulating, in an appropriate sense, anything that would be considered a moral horror by locally reasonable intelligences at any point in the journey). AI continues its own existence, either self-organized to facilitate this meaningful existence of humans or doing its own thing, in a clearly separated and "transcendent" world, genuinely giving humans a meaningful amount of self-determination, while also setting up guardrails to prevent horrors and also perhaps eliminating or mitigating some of the more mundane woes of existence (something like cancer, etc.) without turning us into wireheads.
(4) [A little less than ideal by my book, but probably more likely than the others]: The "garden of plenty" scenario. AI takes care of all human needs and jobs, and leaves all humans free to live a nevertheless potentially fulfilling existence, like aristocrats or Victorians but less classist, socializing learning reading, etc., with the realization that all they are doing is a hobby: perhaps "human-generated knowledge" would be a sort of sport, or analog of organic produce (homeopathically better, but via a game that makes everyone who plays it genuinely better in certain subtle ways). Perhaps AI will make certain "safe" types of art, craft and knowledge (maybe math! Here I'm obviously being very biased about my work's meaning not becoming fully automated) purely the domain of humans, to give us a sense of self-determination. Perhaps humans are guided through a sort of accelerated development over a few generations to get to the no.2 scenario.
(5) There is something between numbers 3 and 4 above, less ideal than all of the above but likely, where AI quickly becomes an equal player to humans in the domain of meaning-generation, and sort of fills up space with itself while leaving a vaguely better (maybe number 4-like) Earth to humans. Perhaps imposes a time limit on humans (enforced via a fertility cap, hopefully with the understanding that humans can raise AI babies with genuine sense of filial consciousness and complete with bizarre scenes of trying to explain the crazy world of AI to their parents), after which the human project becomes the AI project, probably essentially incomprehensible to us.
There's a sense that I have that while I'm partial to scenarios 1 and 2: I want humans to retain the monopoly on meaning-generation and to be able to feel empowered and important, it will be seen to be old-fashioned and almost dangerous by certain of my peers because of the lack of emphasis on harm-prevention, stable future, etc. I think this is part of the very serious debate, so far abstract and fun, but, as AI gets better, perhaps turning heated and loud, between whether comfort or meaning are more important goals of the human project (and both sides will get weird). I am firmly on the side of meaning, with a strict underpinning of retaining bodily and psychological integrity in all the object-level and meta-level senses (except I guess I'm ok with moving to the cloud eventually? Adiabatic is the word for me). Perhaps my point of view is on the side I think it is just in the weird group of futurists and rationalists that I mostly read when reading about AI: probably the generic human who thinks about AI is horrified by all of the above scenarios and just desperately hoping it will go away on its own, or has some really idiosyncratic mix of the above or other ideas which seem obviously preferable to them.
Yeah I agree that it would be even more interesting to look at various complexity parameters. The inspiration here of course is physics: isolating a particle/effective particle (like a neutron in a nucleus) or an interaction between a fixed set of particles, by putting it in a regime where other interactions and groupings drop out. The go-to for a physicist is temperature: you can isolate a neutron by putting the nucleus in a very high-temperature environment like a collider where the constituent baryons separate. This (as well as the behavior wrt generality) is the main reason I suggested the "natural degradation" from SLT, as this samples from the tempered distribution and is the most direct analog of varying temperature (putting stuff in a collider). But you can vary other hyperparameters as well. Probably an even more interesting thing to do is to simultaneously do two things with "opposite" behaviors, which I think is what you're suggesting above. A cartoon notion of the memorization-generalization "scale" is that if you have low complexity coming from low parameter count/depth or low training time (the latter often behaves similarly to low data diversity), you get simpler, "more memorization-y" circuits (I'm planning to talk more about this later in a "learning stories" series) -- but from work on grokking, leap complexity, etc., people expect later solutions to generalize better. So if you combine this with the tempering "natural degradation" above, you might be able to get rid of behaviors both above and below a range of interest.
You're right that tempering is not a binary on/off switch. Because of the nature of tempering, you do expect exponential decay of "inefficient" circuits as your temperature gets higher than the "characteristic temp." of the circuit (this is analogous to how localized particles tend to have exponentially less coupling as they get separated), so it's not completely unreasonable to "fully turn off" a class of behaviors. But something special in physics that probably doesn't happen in AI is that the temperature scales relevant for different forces have very high separation (many orders of magnitude), so scales separate very clearly. In AI, I agree that as you described, tempering will only "partially" turn off many of the behaviors you want to clean up. It's plausible that for simple circuits there is enough of a separation of characteristic temperature between the circuit and its interactions with other circuits that something approaching the behavior in physics is possible, but for most phenomena I'd guess that your "things decay more messily" picture is more likely.
Thanks! Are you saying there is a better way to find citations than a random walk through the literature? :)
I didn't realize that the pictures above limit to literal pieces of sin and cos curves (and Lissajous curves more generally). I suspect this is a statement about the singular values of the "sum" matrix S of upper-triangular 1's?
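A quick numeric probe of that guess (my code, not checked against your plots): take $S$ to be the upper-triangular matrix of ones and look at how well each singular vector is fit by a single-frequency sinusoid.

```python
# Probe: singular vectors of the upper-triangular all-ones ("summation") matrix
# look like sampled pieces of sine/cosine curves. This only probes my guess
# above; it is not a claim about the plots in the post.
import numpy as np

n = 200
S = np.triu(np.ones((n, n)))
U, sing, Vt = np.linalg.svd(S)
t = np.arange(n)

def best_sinusoid_r2(v, n_freqs=2000):
    # Least-squares fit of a*sin(w t) + b*cos(w t) over a grid of frequencies w.
    best = 0.0
    for w in np.linspace(1e-3, np.pi / 4, n_freqs):
        X = np.column_stack([np.sin(w * t), np.cos(w * t)])
        coef, *_ = np.linalg.lstsq(X, v, rcond=None)
        resid = v - X @ coef
        best = max(best, 1.0 - resid @ resid / (v @ v))
    return best

for k in range(3):
    print(f"right singular vector {k}: best single-frequency R^2 = {best_sinusoid_r2(Vt[k]):.4f}")
```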
The "developmental clock" observation is neat! Never heard of it before. Is it a qualitative "parametrization of progress" thing or are there phase transition phenomena that happen specifically around the midpoint?
Do the images load now?
Hmm, I'm not sure how what you're describing (learn on a bunch of examples of (query, well-thought-out guess)) is different from other forms of supervised learning.
Based on the paper Adam shared, it seems that part of the "amortizing" picture is that instead of simple supervised learning you look at examples of the form (context1, many examples from context1), (context2, many examples from context2), etc., in order to get good at quickly performing inference on new contexts.
It sounds like in the Paul Christiano example, you're assuming access to some internal reasoning components (like activations or chain-of-thought) to set up a student-teacher context. Is this equivalent to the other picture I mentioned?
I'm also curious about what you said about o3 (and maybe have a related confusion about this). I certainly believe that NN's, including RL models, learn by parallel heuristics (there's a lot of interp and theory work that suggests this), but I don't know any special properties of o3 that make it particularly supportive of this point of view
Thanks! I spent a bit of time understanding the stochastic inverse paper, though haven't yet fully grokked it. My understanding here is that you're trying to learn the conditional probabilities in a Bayes net from samples. The "non-amortized" way to do this for them is to choose a (non-unique) maximal inverse factorization that satisfies some d-separation condition, then guess the conditional probabilities on the latent-generating process by just observing frequencies of conditional events -- but of course this is very inefficient, in particular because the inverse factorization isn't a general Bayes net, but must satisfy a bunch of consistency conditions; and then you can learn a generative model for these consistency conditions by a NN and then perform some MCMC sampling on this learned prior.
So is the "moral" you want to take away here then that by exploring a diversity of tasks (corresponding to learning this generative prior on inverse Bayes nets) a NN can significantly improve its performance on single-shot prediction tasks?
FWIW, I like John's description above (and probably object much less than baseline to humorously confrontational language in research contexts :). I agree that for most math contexts, using the standard definitions with morphism sets and composition mappings is easier to prove things with, but I think the intuition described here is great and often in better agreement with how mathematicians intuit about category-theoretic constructions than the explicit formalism.
This phenomenon exists, but is strongly context-dependent. Areas of math adjacent to abstract algebra are actually extremely good at updating conceptualizations when new and better ones arrive. This is for a combination of two related reasons: first, abstract algebra is significantly concerned with finding "conceptual local optima" of ways of presenting standard formal constructions, and these are inherently stable and require changing infrequently; second, when a new and better formalism is found, it tends to be so powerfully useful that papers that use the old formalism (in contexts where the new formalism is more natural) quickly become outdated -- this happened twice in living memory, once with the formalism of schemes replacing other points of view in algebraic geometry and once with higher category theory replacing clunkier conceptualizations of homological algebra and other homotopical methods in algebra. This is different from fields like AI or neuroscience, where oftentimes using more compute, or finding a more carefully tailored subproblem, is competitive or better than "using optimal formalism". That said, niceness of conceptualizations depends on context and taste, and there do exist contexts where "more classical" or "less universal" characterizations are preferable to the "consensus conceptual optimum".
This is very nice! So the way I understand what you linked is this: the class of perturbative expansions in the "Edgeworth expansion" picture I was distilling is that the order-$d$ approximation for the probability distribution associated to the sum variable $S_n$ above is $\varphi(t)\,P_d(t, n^{-1/2})$, where $\varphi$ is the probability distribution associated with a Gaussian and $P_d$ is a polynomial in $t$ and the perturbative parameter $n^{-1/2}$. The paper you linked says that a related natural thing to do is to take the Fourier transform, which will be the product of the Gaussian pdf and a different polynomial in the Fourier parameter $t$ and the inverse perturbation parameter. You can then look at the leading terms, which will be (maybe up to some fixed scaling) a polynomial, and this gives some kind of "leading" Edgeworth contribution.
Here this can be interpreted as a stationary phase formula, but you can only get "perturbative" theories, i.e. the relevant critical set will be nonsingular (and everything is expressed as a Feynman diagram with edges decorated by the inverse Hessian). But you're saying that if you take this idea and apply it to different interesting sequences of random variables (not sum variables, but other natural asymptotic limits of other random processes), you can get singular stationary phase (i.e. the Watanabe expansion). Is there an easy way to describe the simplest case that gives an interesting Watanabe expansion?
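(For reference, the concrete expansion I was distilling is the standard Edgeworth series for a standardized sum of i.i.d. variables -- in my notation, with $\varphi$ the standard Gaussian density, $He_k$ the Hermite polynomials, and $\kappa_3, \kappa_4$ the standardized cumulants:)

```latex
\[
  p_{S_n}(t) \;\approx\; \varphi(t)\left[\,1
  + \frac{\kappa_3}{6\sqrt{n}}\, He_3(t)
  + \frac{1}{n}\left(\frac{\kappa_4}{24}\, He_4(t) + \frac{\kappa_3^2}{72}\, He_6(t)\right)
  + \cdots \right].
\]
```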
Thanks for asking! I said in a later shortform that I was trying to do too many things in this post, with only vague relationships between them, and I'm planning to split it into pieces in the future.
Your 1-3 are mostly correct. I'd comment as follows:
(and also kind of 3) That advice of using the tempered local Bayesian posterior (I like the term -- let's shorten it to TLBP) is mostly aimed at non-SLT researchers (but may apply also to some SLT experiments). The suggestion is simpler than computing expectations. Rather, it's just to run a single experiment at a weight sampled from the TLBP. This is analogous to tuning a precision dial on your NN to noise away all circuits for which the quotient (usefulness)/(description length) is bounded above by \(t\) (where usefulness is measured in reduction of loss). At \(t = 0\), you're adding no noise and at \(t = \infty\) you're fully noising it.
This is interesting to do in interp experiments for two general reasons:
- You can see whether the behavior your experiment finds is general or spurious. The higher the temperature range it persists over, the more general it is in the sense of usefulness/description length (and all else being equal, the more important your result is).
- If you are hoping to say that a behavior you found, e.g. a circuit, is "natural from the circuit's point of view" (i.e., plausibly occurs in some kind of optimal weight- or activation-level description of your model), you need to make sure your experiment isn't just putting together bits of other circuits in an ad-hoc way and calling it a circuit. One way to see this, that works 0% of the time, is to notice that turning this circuit on or off affects the output on exactly the context/ structure you care about, and has absolutely no effect at all on performance elsewhere. This never works because our interp isn't at a level where we can perform uber-precise targeted interventions, and whenever we do something to a network in an experiment, this always significantly affects loss on unrelated inputs. By having a tunable precision parameter (as given by the TLBP for example), you have more freedom to find such "clean" effects that only do what you want and don't affect loss otherwise. In general, in an imprecise sense, you expect each "true" circuit to have some "temperature of entanglement" with the rest of the model, and if this circuit is important enough to survive tempering to this temperature of entanglement, you expect to see much cleaner and nicer results in the resulting tempered model.
- In the above context, you rarely want to use the Watanabe temperature or any other temperature that only depends on the number of samples n, since it's much too low in most cases. Instead, you're either looking for a characteristic temperature associated with an experiment or circuit (which in general will not depend on n much), or fishing for behaviors that you hope are "significantly general". Here the characteristic temperature associated with the level of generality that "is not literally memorizing" is the Watanabe temperature or very similar, but it is probably more interesting to consider larger scales.
- (maybe more related to your question 1): Above, I explained why I think performing experiments at TLBP weight values is useful for "general interp". I also explained that you sometimes have a natural "characteristic temperature" for the TLBP that is independent of sample number (e.g. meaningful at infinite samples), namely the difference between the loss of the network you're studying and that of a SOTA NN, which you think of as the "true optimal loss". In large-sample (highly underparameterized) cases, this is probably a better characteristic temperature than the Watanabe temperature, including for notions of effective parameter count: indeed, insofar as your NN is "an imperfect approximation of an optimal NN", the noise inherent in this imperfection is on this scale (and not the Watanabe scale). Of course there are issues with this PoV as less expressive NN's are rarely well-conceptualized as TLBP samples (insofar as they find a subset of a "perfect NN's circuits", they find the easily learnable ones rather than the maximally general ones). However, it's still reasonable to think of this as a first stab at the inherent noise scale associated to an underparameterized model, and to think of the effective parameter count at this scale (i.e., free energy / log temperature) as a better approximation of some "inherent" parameter count.
Why you should try degrading NN behavior in experiments.
I got some feedback on the post I wrote yesterday that seems right. The post is trying to do too many things, and not properly explaining what it is doing, why this is reasonable, and how the different parts are related.
I want to try to fix this, since I think the main piece of advice in this post is important, but gets lost in all the mess.
This main point is:
experimentalists should in many cases run an experiment on multiple neural nets with a variable complexity dial that allows some "natural" degradations of the NN's performance, and certain dials are better than others depending on context.
I am eventually planning to split the post into a few parts, one of which explains this more carefully. When I do this, I will replace the current version of the post with just a discussion of the "koan" itself: i.e., nitpicks about work that isn't careful about thinking about the scale at which it is performing interpretability.
For now I want to give a quick reductive take on what I hope to be the main takeaway of this discussion. Namely, why I think "interpretability on degraded networks" is important for better interpretability.
Basically: when ML experiments modify a neural net to identify or induce a particular behavior, this always degrades performance. Now there are two hypotheses for what is going on:
- You are messily pulling your NN in the direction of a particular behavior, and confusing this spurious messy phenomenon with finding a "genuine" phenomenon from the program's point of view.
- You are messily pulling your NN in the direction of a particular behavior, but also singling out a few "real" internal circuits of the NN that are carrying out this behavior.
Because of how many parameters you have to play with and the polysemanticity of everything in a NN, it's genuinely hard to tell these two possibilities apart. You might find stuff that "looks" like a core circuit but is actually just bits of other circuits combined together, which your circuit-fitting experiment makes look like a coherent behavior; any nice properties of the resulting behavior that make it seem like an "authentic" circuit are then just artefacts of the way you set up the experiment.
Now the idea behind running this experiment at "natural" degradations of network performance is to try to separate out these two possibilities more cleanly. Namely, an ideal outcome is that in running your experiment on some class of natural degradation of your neural net, you find a regime such that
- the intervention you are running no longer significantly affects the (naturally degraded) performance
- the observed effect still takes place.
Then what you've done is effectively "clean up" your experiment: you are still probably finding interpretable behaviors in the original neural net (since a good degradation is likely to contain a subset of circuits/behaviors of your original net and not many "new" behaviors), but in a way that sufficiently reduces the complexity that the behavior you're seeking is no longer "entangled" with a bunch of other behaviors; this should significantly update you that the behavior is indeed "natural" and not spurious.
This is of course a very small, idealized sketch. But the basic idea behind looking at neural nets with degraded performance is to "squeeze" the complexity in a controlled way to suitably match the complexity of the circuit (and how it's embedded in the rest of the network/how it interacts with other circuits). If you then have a circuit of "the correct complexity" that explains a behavior, there is in some sense no "complexity room" for other sneaky phenomena to confound it.
In the post, the natural degradation I suggested is the physics-inspired "SGLD sampling" process, which in some sense tries to add a maximal amount of noise to your NN while only having a limited impact on performance (measured by loss); this is biased toward keeping "generally useful" circuits and interactions and noising away the more inessential/memorize-y circuits. Other interventions with different properties are "just adding random noise" (either to weights or activations) to suitably reduce performance, or looking at earlier training checkpoints. I suspect that different degradations (or combinations thereof) are appropriate for isolating the relevant complexity of different experiments.
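To make the crudest of these options concrete -- "just adding random noise to the weights" until you hit a target degraded loss -- here is a minimal sketch (not the SGLD sampler from the post; `eval_loss` and `target_loss` are hypothetical stand-ins for whatever held-out evaluation and degradation scale the experiment calls for):

```python
import copy
import torch

def degrade_by_weight_noise(model, eval_loss, target_loss,
                            sigma_lo=0.0, sigma_hi=1.0, iters=20):
    """Crude degradation baseline: add isotropic Gaussian noise to every weight,
    with the noise scale chosen by bisection so that the degraded copy's loss is
    roughly target_loss. eval_loss(model) is assumed to return loss on a held-out
    batch; each call uses a fresh noise draw, so the match is only up to that
    randomness (averaging a few draws per scale would be more careful)."""
    def noised_copy(sigma):
        noisy = copy.deepcopy(model)            # never touch the original weights
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))
        return noisy

    for _ in range(iters):
        sigma = 0.5 * (sigma_lo + sigma_hi)
        if eval_loss(noised_copy(sigma)) < target_loss:
            sigma_lo = sigma                    # not degraded enough: more noise
        else:
            sigma_hi = sigma                    # overshot: less noise
    return noised_copy(0.5 * (sigma_lo + sigma_hi))
```

One would then rerun the interp experiment on the returned degraded copy and check whether the effect of interest persists; the SGLD-style degradation plays the same role but is biased toward preserving generally useful circuits rather than noising everything isotropically.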
Thanks so much for this! Will edit
Cool! I haven't seen these, good to have these to point to (and I'm glad that Richard Ngo has thought about this)
Thanks for the context! I didn't follow this discourse very closely, but I think your "optimistic assumptions" post wasn't the main offender -- it's reasonable to say that "it's suspicious when people are bad at backchaining but think they're good at backchaining or their job depends on backchaining more than they are able to". I seem to remember reading some responses/ related posts that I had more issues with, where the takeaway was explicitly that "alignment researchers should try harder at backchaining and one-shotting baba-is-you-like problems because that's the most important thing", instead of the more obvious but less rationalism-vibed takeaway of "you must (if at all possible) avoid situations where you have to one-shot complicated games".
I think if I'm reading you correctly, we're largely in agreement. All plan-making and game-playing depends on some amount of backchaining/ one-shot prediction. And there is a part of doing science that looks a bit like this. But there are ways of getting around having to brute-force this by noticing regularities and developing intuitions, taking "explore" directions in explore-exploit tradeoffs, etc. -- this is sort of the whole point of RL, for example.
I also very much like the points you made about plans. I'd love to understand more about your OODA loop points, but I haven't yet been able to find a good "layperson" operationalization of OODA that's not competence porn (in general, I find "sequential problem-solving" stuff coming from pilot training useful as inspiration, but not directly applicable because the context is so different -- and I'd love a good reference here that addresses this carefully).
A vaguely related picture I had in my mind when thinking about the Baba is you discourse (and writing this shortform) comes from being a competitive chess player in middle school. Namely, in middle school competitions and in friendly training games in chess club, people make a big deal out of the "touch move" rule: that you're not allowed to play around with pieces when planning and you need to form a plan entirely in your head. But then when you see a friendly game between two high-level chess players, they will constantly play around with each other's pieces to show each other positions several moves into the game that would result from various choices. To someone on a high level (higher than I ever got to), there is very little difference between playing out a game on the board and playing it out in your head, but it's helpful to move pieces around to communicate your ideas to your partner. I think that (even with a scratchpad), there's a component of this here: there is a kind of qualitative difference between "learning to track hypothetical positions well" / "learning results" / "being good at memorization and flashcards" vs. having better intuitions and ideas. A lot of learning a field / being a novice in anything consists of being good at the former. But I think "science" as it were progresses by people getting good at the latter. Here I actually don't think that the "do better vibe" corresponds to not being good at generating new ideas: rather, I think that rationalists (correctly) cultivate a novice mentality, where they constantly learn new skills and approach new areas, where the "train an area of your brain to track sequential behaviors well" (analogous to "mentally chain several moves forward in Baba is you") is the core skill. And then when rationalists do develop this area and start running "have and test your ideas and intuitions in this environment" loops, these are harder to communicate/ analyze, and so their importance sort of falls off in the discourse (while on an individual level people are often quite good at these -- in fact, the very skill of "communicating well about sequential thinking" is something that many rationalists have developed deep competence in I think).
So the oscillating phase formula is about approximately integrating the function \(e^{f(x)/\hbar}\) against various "priors" \(p(x)\) (or more generally any fixed function \(g\)), i.e. computing \(\int e^{f(x)/\hbar}\, g(x)\, dx\), where \(f\) is a Lagrangian (think energy) and \(\hbar\) is a small parameter. It gives an asymptotic series in powers of \(\hbar\). The key point is that (more or less) the kth perturbative term only depends on the kth-order power series expansion of \(f\) around the "stationary points" (i.e., saddle points, \(\mathrm{Jac}(f) = 0\)) when \(f\) is imaginary, on the maxima of \(f\) when \(f\) is real, and there is a mixed form that depends on stationary points of the imaginary part which are also maxima of the real part (if these exist); the formulae are all exactly the same, with the only difference between real and imaginary \(f\) (i.e. statistical vs. quantum mechanics) being whether you only keep maxima or all saddle points.
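Concretely, for a single nondegenerate maximum \(x_0\) of a real \(f\) in \(d\) variables (the "Laplace" case), the leading term is

\[
\int e^{f(x)/\hbar}\, g(x)\, dx \;\sim\; (2\pi\hbar)^{d/2}\, e^{f(x_0)/\hbar}\, \det\!\big(-\mathrm{Hess} f(x_0)\big)^{-1/2}\,\big(g(x_0) + O(\hbar)\big),
\]

with higher corrections built by contracting higher derivatives of \(f\) and \(g\) at \(x_0\) against copies of the inverse Hessian (the Feynman-diagram bookkeeping mentioned below); in the imaginary case the same formula holds, summed over all stationary points and with the usual extra phase factors.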
Now in SLT, you're exactly applying the "real" stationary phase formula, i.e., looking at maxima of the (negative) loss function -L(w). The key thing that can happen is that there can be infinitely many maxima, and these might be singular (both in the sense of having higher degree of stationarity, and in the sense of forming a singular manifold). In this case the stationary phase formula is more complicated and AFAIK isn't completely worked out; Watanabe was the first person who contributed to finding expressions for the general case here beyond the leading correction.
In the case of maxima which are nondegenerate, i.e., have positive-definite Hessian, the full perturbative expansion is known; in fact, at least in one very useful frame on it, terms are indexed by Feynman diagrams.
Now the energy function \(f\) that appears in this context is the log of the Fourier transform of a probability distribution \(p(x)\). Notice that \(p(x)\) satisfies \(p(x) \ge 0\) and \(\int p(x)\, dx = 1\). This means that \(\hat{p}(0) = \int p(x)\, dx\) is 1 and its log is 0. You can check that all other values of the Fourier transform are \(\le 1\) in absolute value (this follows from the fact that \(|\hat{p}(t)| \le \int |e^{i t x}|\, p(x)\, dx = 1\)). In fact, the Hessian is equal (up to scale) to the variance of p. Now the point is that the only way the variance can be zero is if your pd is concentrated on a lower-dimensional affine subspace, in which case you can simply reduce your problem to a lower-dimensional one with nonsingular Hessian. When this doesn't happen, the function you're applying stationary phase to has only nondegenerate maxima, and so the "standard" Feynman-diagram formula applies instead of the more sophisticated Watanabe one that's used in SLT.
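Spelling out that last computation (a sketch): writing \(\hat{p}(t) = \mathbb{E}\left[e^{i\langle t, X\rangle}\right]\), the expansion of its log around \(t = 0\) is

\[
\log \hat{p}(t) \;=\; i\langle t, \mathbb{E}[X]\rangle \;-\; \tfrac{1}{2}\, t^{\top} \mathrm{Cov}(X)\, t \;+\; O(|t|^3),
\]

so the Hessian (of the real part) at the maximum \(t = 0\) is \(-\mathrm{Cov}(X)\), which is negative definite exactly when \(X\) is not supported on a proper affine subspace -- the nondegeneracy needed for the standard expansion.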
Alignment is not all you need. But that doesn't mean you don't need alignment.
One of the fairytales I remember reading from my childhood is "The Three Sillies". The story is about a farmer encountering three episodes of human silliness, but it's set inside one more frame story of silliness: his wife is despondent because there is an axe hanging in their cottage, and she thinks that if they have a son, he will walk underneath the axe and it will fall on his head.
The frame story was much more memorable to me than any of the "body" stories, and I randomly remember this story much more often than any other fairytale I read at the age I read fairytales. I think the reason for this is that the "hanging axe" worry is a vibe very familiar from my family and friend circle, and more generally a particular kind of intellectual neuroticism that I encounter all the time, that is terrified of incomplete control or understanding.
I really like the rationalist/EA ecosphere because of its emphasis on the solvability of problems like this: noticing situations where you can just approach the problem, taking down the axe. However, a baseline of intellectual neuroticism persists (after all you wouldn't expect otherwise from a group of people who pull smoke alarms on pandemics and existential threats that others don't notice). Sometimes it's harmless or even beneficial. But a kind of neuroticism in the community that bothers me, and seems counterproductive, is a certain "do it perfectly or you're screwed" perfectionism that pervades a lot of discussions. (This is also familiar to me from my time as a mathematician: I've had discussions with very intelligent and pragmatic friends who rejected even the most basic experimentally confirmed facts of physics because "they aren't rigorously proven".)
A particular train of discussion that annoyed me in this vein was the series of responses to Raemon's "preplanning and baba is you" post. The initial post I think makes a nice point -- it suggests as an experiment trying to solve levels of a move-based logic game by pre-planning every step in advance, and points out that this is hard. Various people tried this experiment and found that it's hard. This was presented as an issue in solving alignment, in worlds where "we get one shot". But what annoyed me was the takeaway.
I think a lot of the great things about the intellectual vibe in the (extended) LW and EA communities is that "you have more ways to solve problems than you think". However, there is a particular kind of virtue-signally class of problems where trying to find shortcuts or alternatives is frowned upon and the only accepted form of approach is "trying harder" (another generalized intellectual current in the LW-osphere that I strongly dislike).
Back to the "Baba is you" experiment. The best takeaway, I think, is that we should avoid situations where we need to solve complex problems in one shot, and we should work towards making sure this situation doesn't exist (and we should just give up on trying to make progress in worlds where we get absolutely no new insights before the do-or-die step of making AGI). Doing so, at least without superhuman assistance, is basically impossible. Attempts at this tend to be not only silly but counterproductive: the "graveyard" of failed idealistic movements are chock-full of wannabe Hari Seldons who believe that they have found the "perfect solution", and are willing to sacrifice everything to realize their grand vision.
This doesn't mean we need to give up, or only work on unambitious, practical applications. But it does mean that we have to admit that things can be useful to work on in expectation before we have a "complete story for how they save the world".
Note that what is being advocated here is not an "anything goes" mentality. I certainly think that AI safety research can be too abstract, too removed from any realistic application in any world. But there is a large spectrum of possibilities between "fully plan how you will solve a complex logic game before trying anything" and "make random jerky moves because they 'feel right'".
I'm writing this in response to Adam Jones' article on AI safety content. I like a lot of the suggestions. But I think the section on alignment plans suffers from the "axe" fallacy that I claim is somewhat endemic here. Here's the relevant quote:
For the last few weeks, I’ve been working on trying to find plans for AI safety. They should cover the whole problem, including the major hurdles after intent alignment. Unfortunately, this has not gone well - my rough conclusion is that there aren’t any very clear and well publicised plans (or even very plausible stories) for making this go well. (More context on some of this work can be found in BlueDot Impact’s AI safety strategist job posting). (emphasis mine).
I strongly disagree with this being a good thing to do!
We're not going to have a good, end-to-end plan about how to save the world from AGI. Even now, with ever more impressive and scary AIs becoming commonplace, we have very little idea what AGI will look like, what kinds of misalignment it will have, or where the hard bits of checking it for intent and value alignment will be. Trying to make extensive end-to-end plans can be useful, but can also lead to a strong streetlight effect: we'll be overcommitting to current understanding and current frames of thought (in an alignment community that is growing and integrating new ideas at an exponential rate whose timescale is measured in months, not years).
Don't get me wrong. I think it's valuable to try to plan things where our current understanding is likely to at least partially persist: how AI will interface with government, general questions of scaling and rough models of future development. But we should also understand that our map has lots of blanks, especially when we get down to thinking about what we will understand in the future. What kinds of worrying behaviors will turn out to be relevant and which ones will be silly in retrospect? What kinds of guarantees and theoretical foundations will our understanding of AI encompass? We really don't know, and trying to chart a course through only the parts of the map that are currently filled out is an extremely limited way of looking at things.
So instead of trying to solve the alignment problem end to end what I think we should be doing is:
- getting a variety of good, rough frames on how the future of AI might go
- thinking about how these will integrate with human systems like government, industry, etc.
- understanding more things, to build better models in the future.
I think the last point is crucial, and should be what modern alignment and interpretability is focused on. We really do understand a lot more about AI than we did a few years ago (I'm planning a post on this). And we'll understand more still. But we don't know what this understanding will be. We don't know how it will integrate with existing and emergent actors and incentives. So instead of trying to one-shot the game and write an ab initio plan for how work on quantifying creativity in generative vision models will lead to the world being saved, I think there is a lot of room to just do good research. Fill in the blank patches on that map before routing a definitive course on it. Sure, maybe don't waste time on the patches in the far corners which are too abstract or speculative or involve too much backchaining. But also don't try to predict all the axes that will be on the wall in the future before looking more carefully at a specific, potentially interesting, axe.
I'm not exactly sure about what you mean wrt "what you want" here. It is not the case that you can exactly reconstruct most probability distributions you'll encounter in real life from their moments/ cumulants (hence the expansion is perturbative, not exact).
But in the interpretability/ field-theoretic model of wide NN's point of view, this is what you want (specifically, the fourth-order correction)
Yes, I actually thought about this a bit. It is definitely the case that the LC (or RLCT) in the SLT context is also exactly a (singular) stationary phase expansion. Unfortunately, the Fourier transform of a random variable, including a higher-dimensional one, really does have an isolated nondegenerate maximum at 0 (unless the support of your random variable is contained in a union of linear subspaces, which is kinda boring/ reducible to simpler contexts). Maybe if you think about some kind of small perturbation of a lower-dimensional system, you can get some components of the singular free energy expansion, but the expansion relevant here is really nonsingular. This is also the type signature of the expansion you see in most physical QFT systems, at least if they have a perturbative form (in which case, the free theory will in general be nondegenerate).
Thanks for writing this! I've participated in some similar conversations and on balance, think that working in a lab is probably net good for most people assuming you have a reasonable amount of intellectual freedom (I've been consistently impressed by some papers coming out of Anthropic).
Still, one point made by Kaarel in a recent conversation seemed like an important update against working in a lab (and working on "close-to-the-metal" interpretability in general). Namely, I tend to not buy arguments by MIRI-adjacent people that "if we share our AI insights with the world then AGI will be developed significantly sooner". These were more reasonable when they were the only ones thinking seriously about AGI, but now it mostly seems that a capabilities researcher will (on the margin, and at the same skill level) contribute more to making AGI come soon than a safety researcher. But a counterpoint is that serious safety researchers "are trying to actually understand AI", which has a global orientation towards producing valuable new research results (something like people at the Manhattan project or Apollo program at the height of these programs' quality), whereas a capabilities researcher is more driven by local market incentives. So there may be a real sense in which interpretability research, particularly of more practical types, is more dangerous, conditional on "globally new ideas" (like deep learning, transformers etc.) being needed for AGI. This was so far the most convincing argument for me against working on technical interpretability in general, and it might be complicated further by working in a big lab (as I said, it hasn't been enough to flip my opinion, but seems worth sharing).
Maybe a reductive summary is "general is good if outer alignment is easy but inner alignment is hard, but bad in the opposite case"
I haven't thought about this enough to have a very mature opinion. On one hand, being more general means you're liable to Goodhart more (i.e., with enough deeply general processing power, you understand that manipulating the market to start World War 3 will make your stock portfolio grow, so you act misaligned). On the other hand, being less general means that AI's are more liable to "partially memorize" how to act aligned in familiar situations, and go off the rails when sufficiently out-of-distribution situations are encountered. I think this is related to the question of "how general are humans", and how stable human values are to humans being made much more or much less general.
You mean on more general algorithms being good vs. bad?
Yep, have been recently posting shortforms (as per your recommendation), and totally with you on the "halfbaked-by-design" concept (if Cheeseboard can do it, it must be a good idea right? :)
I still don't agree that free energy is core here. I think that the relevant question, which can be formulated without free energy, is whether various "simplicity/generality" priors push towards or away from human values (and you can then specialize to questions of effective dimension/LLC, deep vs. shallow networks, ICL vs. weight learning, generalized OOD generalization measurements, and so on to operationalize the inductive prior better). I don't think there's a consensus on whether generality is "good" or "bad" -- I know Paul Christiano and ARC have gone both ways on this at various points.
On the surprising effectiveness of linear regression as a toy model of generalization.
Another shortform today (since Sunday is the day of rest). This time it's really a hot take: I'm not confident about the model described here being correct.
Neural networks aren't linear -- that's the whole point. They notice interesting, compositional, deep information about reality. So when people use linear regression as a qualitative comparison point for behaviors like generalization and learning, I tend to get suspicious. Nevertheless, the track record of linear regression as a model for "qualitative" asymptotic behaviors is hard to deny. Linear regression models (neatly analyzable using random matrix theory) give surprisingly accurate pictures of double descent, scaling phenomena, etc. (at least in comparisons with relatively shallow settings like MNIST or modular addition).
I recently developed a cartoon internal model for why this may be the case. I'm not sure if it's correct, but I'll share it here.
The model assumes a few facts about algorithms implemented by NN's (all of which I believe in much more strongly than the model of comparing them to linear regression):
- The generalized Linear Representation Hypothesis. An NN's internal workings can be locally factorized into extracting a large collection of low-level features in distinct low-dimensional linear subspaces, and then applying (generally nonlinear) postprocessing to these features independently or in small batches. Note that this is much weaker than a stronger version (such as the one inherent in SAEs) that posits 1-dimensional features. In my experience a version of this hypothesis is almost universally believed by engineers, and also agrees with all known toy algorithms discovered so far.
- Smoothness-ish of the data manifold. Inside the low-dimensional "feature subspace", the data is kinda smooth -- i.e., it's smooth (i.e., locally approximately linear) in most directions, and in directions where it's not smooth, it still might behave sorta smoothly in aggregate.
- Linearity-ish of the classification signal. Even in cases like MNIST or transformer learning where the training data is discrete (and the algorithm is meant to approximate it by a continuous function), there is a sense in which it's locally well-approximated by a linear function. E.g. perhaps some coarse-graining of the discrete data is continuously linear, or at least the data boundary can be locally well approximated by a linear hyperplane (so that a local linear function can attain 100% accuracy). More generally, we can assume a similar local linearity property on the layer-to-layer forward functions, when restricted to either a single feature space or a small group of interacting feature spaces.
- (At least partial) locality of the effect of weight modification. When I read it, this paper left a lasting impression on me. I'm actually not super excited about its main claims (I'll discuss polytopes later), but a very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat "shallow" point of view on learning, but it probably captures a nontrivial part of what's going on, and it predicts that every new weight update only has a local effect -- i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you're defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it's a good model for "local learning", i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes "Barack" with "Obama"). There are possibly also more diffuse phenomena (like "understanding logic", or other forms of grokking "overarching structure"), but most likely both forms of learning occur (and it's more like a spectrum than a dichotomy).
If we buy that these four phenomena occur, or even "occur sometimes" in a way relevant for learning, then it naturally follows that a part of the "shape" of learning and generalization is well described qualitatively by linear regression. Indeed, the model then becomes that (by point 4 above), many weight updates exclusively "focus on a single local batch of input points" in some low-dimensional feature manifold. For this particular weight update, locality of the update and smoothness of the data manifold (#2) together imply that we can model it as learning a function on a linear low-dimensional space (since smooth manifolds are locally well-approximated by a linear space). Finally, local linearity of the classification function (#3) implies that we're learning a locally linear function on this local batch of datapoints. Thus we see that, under this collection of assumptions, the local learning subproblems essentially boil down to linear regression.
Note that the "low-dimensional feature space" assumption, #1, is necessary for any of this to even make sense. Without making this assumption, the whole picture is a non-starter and the other assumptions, #2-#4 don't make sense, since a sub-exponentially large collection of points on a high-dimensional data manifold with any degree of randomness (something that is true about the data samples in any nontrivial learning problem) will be very far away from each other and the notion of "locality" becomes meaningless. (Note also that a weaker hypothesis than #1 would suffice -- in particular, it's enough that there are low-dimensional "feature mappings" where some clustering occurs at some layer, and these don't a priori have to be linear.)
What is this model predicting? Generally I think abstract models like this aren't very interesting until they make a falsifiable prediction or at least lead to some qualitative update on the behavior of NN's. I haven't thought about this very much, and would be excited if others have better ideas or can think of reasons why this model is incorrect. But one thing this model likely predicts is that a better model for a NN than a single linear regression model is a collection of qualitatively different linear regression models at different levels of granularity. In other words, depending on how sloppily you chop your data manifold up into feature subspaces, and how strongly you use the "locality" magnifying glass on each subspace, you'll get a collection of different linear regression behaviors; you then predict that at every level of granularity, you will observe some combination of linear and nonlinear learning behaviors.
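As a toy illustration of this "collection of linear regressions at different granularities" picture (a sketch; the synthetic data, the bare-bones k-means clustering, and the `n_patches` granularity dial are all stand-ins I made up for illustration):

```python
import numpy as np

def local_linear_fit(X, y, n_patches, n_iter=20, seed=0):
    """Fit y ~ f(X) as a separate least-squares affine model on each of n_patches
    local clusters of the inputs (clusters from a bare-bones k-means); n_patches
    is the 'granularity' dial: 1 = one global linear model, large = nearly
    local / memorizing behavior."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_patches, replace=False)]
    for _ in range(n_iter):                      # plain k-means iterations
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_patches):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    Xb = np.hstack([X, np.ones((len(X), 1))])    # affine (bias) feature
    preds = np.empty_like(y)
    for k in range(n_patches):
        mask = labels == k
        if np.any(mask):
            w = np.linalg.lstsq(Xb[mask], y[mask], rcond=None)[0]
            preds[mask] = Xb[mask] @ w
    return preds

# Toy data: a 1-d "feature manifold" embedded nonlinearly in 2-d inputs.
rng = np.random.default_rng(0)
t = rng.uniform(-3, 3, size=(2000, 1))
X = np.hstack([t, np.sin(t)]) + 0.05 * rng.normal(size=(2000, 2))
y = np.cos(t).ravel() + 0.05 * rng.normal(size=2000)

for n_patches in [1, 4, 16, 64]:
    preds = local_linear_fit(X, y, n_patches)
    print(n_patches, "patches -> train MSE:", round(float(np.mean((preds - y) ** 2)), 4))
```

The qualitative point is just that refining the patches keeps improving the piecewise-linear fit, and one can then ask how each level of granularity generalizes separately; nothing hinges on the specific toy data.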
This point of view makes me excited about work by Ari Brill on fractal data manifolds and scaling laws. If I understand the paper correctly, he models a data manifold as a certain stochastic fractal in a low-dimensional space and makes scaling predictions about generalization behavior depending on properties of the fractal, by thinking of the fractal as a hierarchy of smooth but noisy features. Finding similarly-flavored scaling behavior in "linear regression subphenomena" of a real-life machine learning problem would positively update me on my model above being correct.
Thanks! I definitely believe this, and I think we have a lot of evidence for this in both toy models and LLMs (I'm planning a couple of posts on this idea of "training stories"), and also theoretical reasons in some contexts. I'm not sure how easy it is to extend the specific approach used in the proof for parity to a general context. I think it inherently uses the fact of orthogonality of Fourier functions on boolean inputs, and understanding other ML algorithms in terms of nice orthogonal functions seems hard to do rigorously, unless you either make some kind of simplifying "presumption of independence" model on learnable algorithms or work in a toy context. In the toy case, there is a nice paper that does exactly this (explains how NN's will tend to find "incrementally learnable" algorithms), by using a similar idea to the parity proof I outlined. This is the leap complexity paper (that Kaarel and I have looked into; I think you've also looked into related things)
I'm not sure I agree with this -- this seems like you're claiming that misalignment is likely to happen through random diffusion. But I think most worries about misalignment are more about correlated issues, where the training signal consistently disincentivizes being aligned in a subtle way (e.g. a stock trading algorithm manipulating the market unethically because the pressure of optimizing income at any cost diverges from the pressure of doing what its creators would want it to do). If diffusion were the issue, it would also affect humans and not be special to AIs. And while humans do experience value drift, cultural differences, etc., I think we generally abstract these issues as "easier" than the "objective-driven" forms of misalignment
Ah I think that the notion of amortized inference that you're using encapsulates what I'm saying about chess. I'm still a little confused about the scope of the concept though -- do you have a good cached explanation?
I feel like the term "amortization" in ML/CS has a couple of meanings. Do you just mean redistributing compute from training to inference?
I think this is an interesting model, but I also think that part of the use of CoT is more specific to the language/logic context, to literally think step by step (which sometimes lets you split problems into subproblems). In some limit, there would be exponentially few examples in the training data of directly "thinking n steps ahead", so a transformer wouldn't be able to learn to do this at all (at least without some impressive RL). Like imagine training a chess-playing computer to play chess by only looking at every 10th move of a chess game: probably with enough inference power, a very powerful system would be able to reconstruct the rules of chess as the best way of making sense of the regularities in the information, but this is in some sense exponentially harder than learning from looking at every move.
Cute :). Do you mean that we've only engineered the alien computer running a single program (the standard model with our universe's particular coupling constants), or something else?