Comments
The "History of Philosophy Without any Gaps" podcast (https://historyofphilosophy.net/) has for a while been alternating between weeks of Western and non-Western philosophy (which it does in a bit less detail, but still pretty in-depth). It's so far finished a series each on Indian and Africana philosophy and is currently starting on Ancient Chinese philosophy.
Insofar as you're thinking of evolution as analogous to gradient flow, it only makes sense if it's local and individual-level I think -- it is a category error to say that a species that has more members is a winner. The first shark that started eating its siblings in utero improved its genetic fitness (defined as the expected number of offspring in the specific environment it existed in) but might have harmed the survivability of the species as a whole.
It's neat to remember stories like this, but I want to note that this shouldn't necessarily update scientists to criticize novel work less. If an immune system doesn't sometimes overreact, it's not doing its job right, and for every story like this there are multiple other stories of genuinely false exciting-sounding ideas that got shut down by experts (for instance I learned about Shechtman from the Constant podcast, where his story was juxtaposed with that of genuine quacks). Looking back at my experience of excited claims that were generally dismissed by more skeptical experts in fields I was following, the majority of them (for instance the superluminal neutrino, the room-temperature superconductor, various hype about potentially proving the Riemann hypothesis by well-established mathematicians) have been false.
I think there is a separate phenomenon (which was the explanation for the study about funerals), that older high-status scientists in funding-hungry fields will often continue to get funding and set priorities after they have stopped working on genuinely exciting stuff -- whether because of age, because of age-related conservatism bias, or simply because their area of expertise has become too well-developed to generate new ideas. In my experience in math and physics, from inside the field, this phenomenon generally does not look like a consensus that only the established people know what's going on (as in most of the stories here), but either conversely a quiet consensus that so-and-so famous person is starting to go crazy, or alternatively the normal disagreement between more conservative and more innovation-minded people about the value of a new idea. For example the most exciting development in my professional life as a mathematician was Jacob Lurie's development of "higher category theory", a revolution that allowed algebraists to seamlessly use tools from topology. There were many haters of this theory (many very young), but there was enough of a diffuse understanding that this is exciting and potentially revolutionary that his ideas did percolate and end up converting many of the haters (similarly with Grothendieck and schemes). Note that here I think math avoids the worst aspects of these dynamics because it doesn't require funding and is less competitive.
The upshot here is that I think it's valuable to try to resolve the issue of good ideas being shot down by traditionalists, but the solution might not be to "adopt lower standards for criticizing new / surprising ideas" but rather something more like pulling the rope sideways and looking for better standards that do better at separating promising innovation from hype.
Yes - I generally agree with this. I also realized that "interp score" is ambiguous (and the true end-to-end interp score is negligible, I agree), but what's more clearly true is that SAE features tend to be more interpretable. This might be largely explained by "people tend to think of interpretable features as branches of a decision tree, which are sparsely activating". But it was also surprising to me that the top SAE features are significantly more interpretable than the top PCA features.
So to elaborate: we get significantly more interpretable features if we enforce sparsity than if we just do more standard clustering procedures. This is nontrivial! Of course this might be saying more about our notions of "interpretable feature" and how we parse semantics; but I can certainly imagine a world where PCA gives much better results, and would have in fact by default expected this to be true for the "most important" features even if I believed in superposition.
So I'm somewhat comfortable saying that the fact that imposing sparsity works so well is telling us something. I don't expect this to give "truly atomic" features from the network's PoV (any more than understanding Newtonian physics tells us about the standard model), but this seems like nontrivial progress to me.
I basically agree with you. But I think we have some nontrivial information, given enough caveats.
I think there are four hypotheses:
- 1a. Neurons do >1 thing (neuron polysemanticity)
- 1b. Insofar as we can find interesting atomic semantic features, they have >1 neuron (feature polysemanticity)
- 2a. Features are sparse, i.e., (insofar as there exist interesting atomic semantic features), most have significantly <1/2 probability of being "on" for any input
- 2b. Features are superpositional, i.e., (insofar as there exist interesting atomic semantic features), there are significantly more than dimension-many of them at any layer.
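A quick numpy sketch of why 2a makes 2b viable (illustrative numbers of my own, not from any experiment): random directions in R^d are nearly orthogonal, so far more than d features can coexist with small interference, which is tolerable precisely when features rarely co-occur.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 1000  # ambient dimension, number of candidate feature directions

# n >> d random unit vectors standing in for feature directions.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Interference between distinct features: off-diagonal inner products.
G = V @ V.T
off_diag = np.abs(G[~np.eye(n, dtype=bool)])

print(f"packed {n} features into {d} dims")
print(f"max interference:  {off_diag.max():.3f}")
print(f"mean interference: {off_diag.mean():.3f}")
```

With sparse activations this small cross-talk only rarely corrupts a readout, which is the usual story for why superposition becomes efficient in the heavy-training limit mentioned above.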
I think 1a/b and 2a/b are different in subtle ways, but most people would agree that in a rough directional yes/no sense, 1a<=>1b and 2a<=>2b (note that 2a=>2b requires some caveats -- but at the limit of lots of training in a complex problem with #training examples >= #parameters, if you have sparsity of features, it simply is inefficient to not have some form of superposition). I also agree that 1a/1b are a natural thing to posit a priori, and in fact it's much more surprising to me as someone coming from math that the "1 neuron:1 feature" hypothesis has any directional validity at all (i.e., sometimes interesting features are quite sparse in the neuron basis), rather than that anything that looks like a linear feature is polysemantic.
Now to caveat my statement: I don't think that neural nets are fully explained by a bunch of linear features, much less by a bunch of linear features in superposition. In fact, I'm not even close to 100% on superposition existing at all in any truly "atomic" decomposition of computation. But at the same time we can clearly find semantic features which have explanatory power (in the same way that we can find pathways in biology, even if they don't correspond to any structures on the fundamental, in this case cellular, level).
And when I say that "interesting semantic features exist in superposition", what I really mean is that we have evidence for hypothesis 2a [edited, originally said 2b, which is a typo]. Namely, when we're looking for unsupervised ways to get such features, it turns out that enforcing sparsity (and doing an SAE) gives better interp scores than doing PCA. I think this is pretty strong evidence!
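To illustrate the PCA half of this claim with a synthetic toy (my own construction, not the actual experiments): when data is generated by sparsely-firing ground-truth directions, the top principal components generically mix those directions rather than recover them, since the covariance is nearly isotropic on the feature subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k, p = 5000, 50, 20, 0.05  # samples, dims, true features, firing rate

# Ground-truth "semantic" feature directions (unit rows).
A = rng.standard_normal((k, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)

# Sparse nonnegative activations: each feature fires with probability p.
S = (rng.random((n, k)) < p) * rng.exponential(1.0, (n, k))
X = S @ A

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Vt[:k]  # top-k principal directions (unit rows)

# Best cosine similarity of each true feature with any single PC.
align = np.abs(A @ pcs.T).max(axis=1)
print("mean best-PC alignment with true features:", round(float(align.mean()), 3))
```

An SAE trained on the same X with a sparsity penalty would be scored by the same alignment metric; the point here is only that variance-maximizing directions have no particular reason to coincide with sparse generative features.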
Thank you for the great response, and the (undeserved) praise of my criticism. I think it's really good that you're embracing the slightly unorthodox positions of sticking to ambitious convictions and acknowledging that this is unorthodox. I also really like your (a)-(d) (and agree that many of the adherents of the fields you list would benefit from similar lines of thinking).
I think we largely agree, and much of our disagreement probably boils down to where we draw the boundary between “mechanistic interpretability” and “other”. In particular, I fully agree with the first zoom level in your post, and with the causal structure of much of the rest of the diagram -- in particular, I like your notions of alignment robustness and mechanism distinction (the latter of which I think is original to ARC) and I think they may be central in a good alignment scenario. I also think that some notion of LPE should be present. I have some reservations about ELK as ARC envisions it (also of the “too much backchaining” variety), but think that the first-order insights there are valuable.
I think the core cruxes we have are:
- You write "Mechanistic interpretability has so far yielded very little in the way of beating baselines at downstream tasks". If I understand this correctly, you're saying it hasn't yet led to engineering improvements, either in capabilities or in "prosaic alignment" (at least compared to baselines like RLHF or "more compute").
While I agree with this, I think that this isn't the right metric to apply. Indeed if you applied this metric, most science would not count as progress. Darwin wouldn’t get credit until his ideas got used to breed better crops and Einstein’s relativity would count as unproductive until the A-bomb (and the theory-application gap is much longer if you look at early advances in math and physics). Rather, I think that the question to ask is whether mechinterp (writ large, and in particular including a lot of people working in deep learning with no contact with safety) has made progress in understanding the internal functioning of AI or made nontrivially principled and falsifiable predictions about how it works. Here we would probably agree that the answer is pretty unambiguous. We have strong evidence that interesting semantic features exist in superposition (whether or not this is the way that the internal mechanisms use them). We understand the rough shape of some low-level circuits that do arithmetic and copying, and have rough ideas of the shapes of some high-level mechanisms (e.g. “function vectors”). To my eyes, this should count as progress in a very new science, and if I correctly understood your claim to be that you need to “beat black-box methods at useful tasks” to count as progress, I think this is too demanding.
- I think that I’m onboard with you on your desideratum #1 that theories should be “primarily mathematical” – in the sense that I think our tastes for rigor and principled theoretical science are largely aligned (and we both agree that we need good and somewhat fundamental theoretical principles to avoid misalignment). But math isn’t magic. In order to get a good mathematical tool for a real-world context, you need to make sure that you have correctly specified the context where it is to be applied, and more generally that you’ve found the “right formal context” for math. This makes me want to be careful about context before moving on to your insight #2 of trying to guess a specific information-theoretic criterion for how to formalize "an interpretation". Math is a dance, not a hammer: if a particular application of math isn’t working, it’s more likely that your context is wrong and you need to retarget and work outwards from simple examples, rather than try harder and route around contradictions. If you look at even a very mathy area of science, I would claim that most progress did not come from trying to make a very ambitious theoretical picture work and introducing epicycles in a “builder-breaker” fashion to get around roadblocks. For example, the most mathematically heavy field that has applications in real life is QFT and SFT (which uses deep algebraic and topological insights and today is unquestionably useful in computer chips and the like). Its origin comes from physicists observing the idea of “universality” in some physical systems, which led Landau and others to work out that a special (though quite large and perturbation-invariant) class of statistical systems can be coarse-grained in a way that leads to these observed behaviors; this in turn led to ideas of renormalization, modern QFT and the like.
If Landau’s generation had instead tried to work really hard on mathematically analyzing general magnet-like systems, without working up from applications and real-world systems, they’d have ended up in roughly the same place as Stephen Wolfram, making overly ambitious claims about automata. The importance of looking for good theory-context fit is the main reason I would like to see more back-and-forth between more “boots-on-the-ground” interpretability theorists and more theoretical agendas like ARC and Agent Foundations. I’m optimistic that ARC’s mathematical agenda will eventually start iterating on carefully thinking about context and theory-context fit, but I think that some of the agenda I saw had the suboptimal “use math as a hammer” shape. I might be misunderstanding here, and would welcome corrections.
- More specifically about “stories”, I agree with you that we are unlikely to be able to tell an easy-to-understand story about the internal working of AI’s (and in particular, I am very onboard with your first-level zoom of scalable alignment). I agree that the ultimate form of the thing we’re both gesturing at in the guise of “interpretability” will be some complicated, fractally recursive formalism using a language we probably don’t currently possess. But I think this is sort of true in a lot of other science. Better understanding leads to formulas, ideas and tools with a recursive complexity that humanity wouldn’t have guessed at before discovering them (again, QFT/SFT is an example). I’m not saying that this means “understanding AI will have the same type signature as QFT/ as another science”. But I am saying that the thing it will look like will be some complicated novel shape that isn’t either modern interp or any currently-accessible guess at its final form. And indeed, if it does turn out to take the shape of something that we can guess today – for example if heuristic arguments or SAEs turn out to be a shot in the right direction – I would guess that the best route towards discovering this is to build up a pluralistic collection of ideas that both iterate on creating more elegant/more principled mathematical ideas and iterate on understanding iteratively more interesting pieces of iteratively more general ML models in some class that expands from toy or real-world models. The history of math also does include examples of more "hammer"-like people: e.g. Wiles and Perelman, so making this bet isn't necessarily bad, and my criticism here should not be taken too prescriptively.
In particular, I think your (a)-(d) are once again excellent guardrails against dangerous rabbitholes or communication gaps, and the only thing I can recommend somewhat confidently is to keep the ability to get interesting results about toy systems as a desideratum when building up the ambitious ideas.
Going a bit meta, I should flag an important intuition that we likely diverge on. I think that when some people defend using relatively formal math or philosophy to do alignment, they are going off of the following intuition:
- if we restrict to real-world systems, we will be incorporating assumptions about the model class
- if we assume these continue to hold for future systems by default, we are assuming some restrictive property remains true in complicated systems despite possible pressure to train against it to avoid detection, or more neutral pressures to learn new and more complex behaviors which break this property.
- alternatively, if we try to impose this assumption externally, we will be restricting ourselves to a weaker, “understandable” class of algorithms that will be quickly outcompeted by more generic AI.
The thing I want to point out about this picture is that this models the assumption as closed. I.e., that it makes some exact requirement, like that some parameter is equal to zero. However, many of the most interesting assumptions in physics (including the one that made QFT go brrr, i.e., renormalizability) are open. I.e., they are some somewhat subtle assumptions that are perturbation-invariant and can’t be trained out (though they can be destroyed – in a clearly noticeable way – through new architectures or significant changes in complexity). In fact, there’s a core idea in physical theory, that I learned from some lecture notes of Ludvig Faddeev here, that you can trace through the development of physics as increasingly incorporating systems with more freedoms and introducing perturbations to a physical system starting with (essentially) classical fluid mechanics and tracing out through quantum mechanics -> QFT, but always making sure you’re considering a class of systems that are “not too far” from more classical limits. The insight here is that just including more and more freedom and shifting in the directions of this freedom doesn’t get you into the maximal-complexity picture: rather, it gets you into an interesting picture that provably (for sufficiently small perturbations) allows for an interesting amount of complexity with excellent simplifications and coarse-grainings, and deep math.
Phrased less poetically, I’m making a distinction between something being robust and making no assumptions. When thinking mathematically about alignment, what we need is the former. In particular, I predict that if we study systems in the vicinity of realistic (or possibly even toy) systems, even counting on some amount of misalignment pressure, alien complexity, and so on, the pure math we get will be very different – and indeed, I think much more elegant – than if we impose no assumptions at all. I think that someone with this intuition can still be quite pessimistic, can ask for very high levels of mathematical formalism, but will still expect a very high amount of insight and progress from interacting with real-world systems.
I think this is a really good and well-thought-out explanation of the agenda.
I do still think that it's missing a big piece: namely in your diagram, the lowest-tier dot (heuristic explanations) is carrying a lot of weight, and needs more support and better messaging. Specifically, my understanding having read this and interacted with ARC's agenda is that "heuristic arguments" as a direction is highly useful. But while it seems to me that the placement of heuristic arguments at the root of this ambitious diagram is core to the agenda, I haven't been convinced that this placement is supported by any results beyond somewhat vague associative arguments.
As an extreme example of this, Stephen Wolfram believes he has a collection of ideas building on some thinking about cellular automata that will describe all of physics. He can write down all kinds of causal diagrams with this node in the root, leading to great strides in our understanding of science and the cosmos and so on. But ultimately, such a diagram would be making the statement that "there exists a productive way to build a theory of everything which is based on cellular automata in a particular way similar to how he thinks about this theory". Note that this is different from saying that cellular automata are interesting, or even that a better theory of cellular automata would be useful for physics, and requires a lot more motivation and scientific falsification to motivate.
The idea of heuristic arguments is, at its core, a way of generalizing the notion of independence in statistical systems and models of statistical systems. It's discussing a way to point at a part of the system and say "we are treating this as noise" or "we are treating these two parts as statistically independent", or "we are treating these components of the system as independently as we can, given the following set of observations about our system" (with a lot of the theory of HA asking how to make the last of these statements explicit/computable). I think this is a productive class of questions to think about, both theoretically and empirically. It's related to a lot of other research in the field (on causality, independence and so on). I conceptually vibe with ARC's approach from what I've seen of the org. (Modulo the corrigible fact that I think there should be a lot more empirical work on what kinds of heuristic arguments work in practice. For example what's the right independence assumption on components of an image classifier/ generator NN that notices/generates the kind of textural randomness seen in a cat's fur? So far there is no HA guess about this question, and I think there should be at least some ideas on this level for the field to have a healthy amount of empiricism.)
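To make the "treating these two parts as statistically independent" move concrete (toy numbers of my own, not an ARC example): the independence assumption replaces a joint distribution with the product of its marginals, and the KL divergence between the two -- i.e., the mutual information -- prices exactly what the assumption discards.

```python
import numpy as np

# Joint distribution over two binary components of a system;
# the components are correlated (x == y is more likely than not).
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])

# The independence assumption: replace p(x, y) by p(x) * p(y).
p_x = p_joint.sum(axis=1)
p_y = p_joint.sum(axis=0)
p_indep = np.outer(p_x, p_y)

# KL(joint || product of marginals) = mutual information between the parts.
kl = float(np.sum(p_joint * np.log(p_joint / p_indep)))
print(f"cost of assuming independence: {kl:.4f} nats")
```

The more refined statements ("as independently as we can, given these observations") would constrain `p_indep` to a larger family than pure products, but the accounting is the same shape.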
I think that what ARC is doing is useful and productive. However, I don't see strong evidence that this particular kind of analysis is a principled thing to put at the root of a diagram of this shape. The statement that we should think about and understand independence is a priori not the same as the idea that we should have a more principled way of deciding when one interpretation of a neural net is more correct than another, which is also separate from (though plausibly related to) the (I think also good) idea in MAD/ELK that it might be useful to flag NN's that are behaving "unusually" without having a complete story of the unusual behavior.
I think there's an issue with building such a big structure on top of an undefended assumption, which is that it creates some immiscibility (i.e., difficulty of mixing) with other ideas in interpretability, which are "story-centric". The phenomena that happen in neural nets (same as phenomena in brains, same as phenomena in realistic physical systems) are probably special: they depend on some particular aspects of the world/ of reasoning/ of learning that have some sophisticated moving parts that aren't yet understood (some standard guesses are shallow and hierarchical dependence graphs, abundance of rough symmetries, separation of scale-specific behaviors, and so on). Our understanding will grow by capturing these ideas in terms of a suitably natural language and level of sophistication for each phenomenon.
[added in edit] In particular (to point at a particular formalization of the general critique), I don't think that there currently exists a defendable link between heuristic arguments and proof verification as in Jason Gross's excellent paper -- i.e., a case that this specific weakening of the notion of proof verification yields more general interpretability. Your post on surprise accounting is also excellent, but it doesn't explain how heuristic arguments would lead to understanding systems better -- rather, it shows that if we had ways of making better independence assumptions about systems with an existing interpretation, we would get a useful way of measuring surprise and explanatory robustness (with proof as the maximally robust limit). But drawing the line from seeking explanations with some nice properties/measurements to the statement that a formal theory of such properties would lead to an immediate generalization of proof/interpretability which is strictly better than the existing "story-centric" methods is currently undefended (similar to the story in some early work on causality in interp, that a good attempt to formalize and validate causal interpretations would lead to better foundations for interp -- the techniques are currently used productively e.g. here, but as an ingredient of an interpretation analysis rather than the core of the story). I think similar critiques hold for other sufficiently strong interpretations of the other arrows in this post. Note that while I would support a weaker meaning of the arrows here (as you suggest in a footnote), there is nevertheless a core implicit assumption that the diagram exists as a part of a coherent agenda that deduces ambitious conclusions from a quite specific approach to interpretability. I could see any of the nodes here as being a part of a reasonable agenda that integrates with mechanistic interpretability more generally, but this is not the approach that ARC has followed.
I think that the issue of the approach sketched here is that it overindexes on a particular shape of explanation -- namely, that the most natural way to describe the relevant details inherent in principled interpretability work will most naturally factorize through a language that grows out of better-understanding independence assumptions in statistical modeling. I don't see much evidence for this being the case, any more than I see evidence that the best theory of physics should grow out of a particular way of seeing cellular automata (and I'd in fact bet with some confidence that this is not true in either case). At the same time I think that ARC ideas are good, and that trying to relate them to other work in interp is productive (I'm excited about the VAE draft in particular). I just would like to see a less ambitious, more collaboratively motivated version of this, which is working on improving and better validating the assumptions one could make as part of mechanistic/statistical analysis of a model (with new interpretability/MAD ideas as a plausible side-effect) rather than orienting towards a world where this particular direction is in some sense foundational for a "universal theory of interpretability".
I don't think this is the whole story, but part of it is surely that a person motivating their actions by "wanting to be happy" is evidence for them being less satisfied/happy than baseline.
In particular, it's not hard to produce a computable function that isn't given by a polynomial-sized circuit (parity doesn't work as it's polynomial, but you can write one down using diagonalization -- it would be very long to compute, but computable in some suitably exponentially bounded time). But P vs. NP is not about this: it's a statement that exists fully in the world of polynomially computable functions.
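To spell out the parenthetical: parity is about as easy as a function gets -- one XOR per input bit, hence a linear-size circuit -- so it can't witness a function outside polynomial-size circuits.

```python
from functools import reduce
from operator import xor

def parity(bits):
    """Parity of a bit string: folds one XOR per bit, mirroring a linear-size circuit."""
    return reduce(xor, bits, 0)

print(parity([1, 0, 1, 1]))  # three ones -> parity 1
```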
Looking at this again, I'm not sure I understand the two confusions. P vs. NP isn't about functions that are hard to compute (they're all polynomially computable), but rather about functions that are hard to invert, or pairs of easily computable functions that are hard to prove equal or not equal to each other. The main difference between circuits and Turing machines is that circuits are finite and bounded to compute, whereas the halting time of general Turing machines is famously impossible to determine. There's nothing special about Boolean circuits: they're an essentially complete model of what can be computed in polynomial time (modulo technicalities).
looks like you referenced the same paper before me while I was making my comment :)
Yeah, I think this is a good place to probe assumptions, and it's probably useful to form world models where your probability of P = NP is nonzero (I also like doing this for inconsistency of logic). I don't have an inside view, but side with Scott Aaronson on this: https://www.scottaaronson.com/papers/pnp.pdf
Kinda silly to do this with an idea you actually care about, especially if political (which would just increase the heat:light ratio in politics along the grain for Russian troll factories etc.). But carefully trying to make NN traps with some benign and silly misinformation -- e.g. "whales are fish" or something -- could be a great test to see if weird troll-generated examples on the internet can affect the behavior.
Maybe I'll add two addenda:
- It's easy to confuse entropy with free energy. Since energy is conserved, globally the two measure the same thing. But locally, the two decouple, and free energy is the more relevant parameter here. Living processes often need to use extra free energy to prevent the work they are interested in doing from getting converted into heat (e.g. when moving we're constantly fighting friction); in this way we're in some sense locally increasing free energy.
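For reference, the textbook relation this leans on (Helmholtz convention; standard thermodynamics, not specific to this thread): free energy is the part of the energy available to do work, and locally it is changes in it that matter.

```latex
F = U - TS, \qquad \Delta F = \Delta U - T\,\Delta S \quad (\text{fixed } T),
\qquad W_{\text{extractable}} \le -\Delta F .
```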
- I think a reasonable (though imperfect) analogy here is with potential energy. Systems tend to reduce their potential energy, and thus you can make a story that, in order to avoid just melting into a puddle on the ground, life needs to constantly fight the tendency of gravitational potential energy to be converted to kinetic energy (and ultimately heat). And indeed, when we walk upright, fly, build skyscrapers, use hydro power, we're slowing down or modifying the tendency of potential energy to become kinetic. But this is in no sense the fundamental or defining property of life, whether we're looking globally at all matter or locally at living beings. We sometimes burrow into the earth, flatten mountains, etc. While life both (a) can use the potential energy of other stuff to power its engines and (b) needs to at least somewhat fight gravity's tendency to turn it into a puddle of matter without any internal structure, this is just one of many physical stories about life and isn't "the whole story".
I think one shouldn't think of entropy as fundamentally preferred or fundamentally associated with a particular process. Note that it isn't even a well-defined parameter unless you posit some macrostate information and define entropy as a property of a system + the information we have about it.
In particular, life can either increase or decrease appropriate local measurements of entropy. We can burn the hydrocarbons or decay the uranium to increase entropy or we can locally decrease entropy by changing reflectivity properties of earth's atmosphere, etc.
The more fundamental statement, as jessicata explains, is that life uses engines. Engines try to locally produce energy that does work rather than just heat, i.e., that has lower entropy compared to what one would expect from a black body. This means that they have to use free energy, which corresponds to tapping into aspects of the surrounding environment where entropy has not yet been maximized (i.e., which are fundamentally thermodynamic rather than thermostatic), and they also have to generate work which is not just heat (i.e., they can't just locally maximize the entropy). Life on earth mostly does this by using the fact that solar radiation is much higher-frequency than the black-body radiation associated to temperatures on Earth, and thus contains free energy (which can be released by breaking it down).
This is awesome!
I also wouldn't give this result (if I'm understanding which result you mean) as an example where the assumptions are technicalities / inessential for the "spirit" of the result. Assuming monotonicity or commutativity (either one is sufficient) is crucial here, otherwise you could have some random (commutative) group with the same cardinality as the reals.
Generally, I think math is the wrong comparison here. To be fair, there are other examples of results in math where the assumptions are "inessential for the core idea", which I think is what you're gesturing at. But I think math is different in this dimension from other fields, where often you don't lose much by being fuzzy about technicalities (in fact, the question of how much to fuss over technicalities, like playing fast and loose with infinities or being careful about what kinds of functions are allowed in your formulas, is the main divider between math and theoretical physics).
In my experience in pure math, when you notice that the "boilerplate" assumptions on your result seem inessential, this is usually for one of the following reasons:
- In fact, a more general result is true and the proof works with fewer/weaker assumptions, but either for historical reasons or for reasons of some results used (lemmas, etc.) being harder in more generality, it's stated in this form
- The result is true in more generality, but proving the more general result is genuinely harder or requires a different technique, and this can sometimes lead to new and useful insights
- The result is false (or unknown) in more generality, and the "boilerplate" assumptions are actually essential, and understanding why will give more insight into the proof (despite things seeming inessential at first)
- The "boilerplate" assumptions the result uses are weaker than what the theorem is stated with, but it's messy to explain the "minimal" assumptions, and it's easier to compress the result by using a more restrictive but more standard class of objects (in this way a lot of results that are true for some messy class of functions are easier to remember and use for a more restrictive class: most results that use "Schwartz spaces" are of this form; often results that are true for distributions are stated for simplicity for functions, etc.).
- Some assumptions are needed for things to "work right," but are kind of "small": i.e., trivial to check or mostly just controlling for degenerate edge cases, and can be safely compressed away in your understanding of the proof if you know what you're doing (a standard example is checking for the identity in group laws: it's usually trivial to check if true, and the "meaty" part of the axiom is generally associativity; another example is assuming rings don't have 0 = 1, i.e., aren't the degenerate ring with one element).
- There's some dependence on logical technicalities, or what axioms you assume (especially relevant in physics- or CS/cryptography- adjacent areas, where different additional axioms like P != NP are used, and can have different flavors which interface with proofs in different ways, but often don't change the essentials).
I think you're mostly talking about the last of these (logical technicalities) here, though I'm not sure (and I'm not sure math is the best source of examples for this). I think there's a sort of "opposite" phenomenon also, where a result is proved in one context but in fact generalizes well to other contexts. Often the way to generalize is standard, and thus understanding the "essential parts" of the proof in any one context is sufficient to then be able to recreate it in other contexts, with suitably modified constructions/axioms. For example, many results about sets generalize to topoi, many results about finite-dimensional vector spaces generalize to infinite-dimensional vector spaces, etc. This might also be related to what you're talking about. But generally, I think the way you conceptualize "essential vs. boilerplate" is genuinely different in math vs. theoretical physics/CS/etc.
Nitpick, but I don't think the theorem you mention is correct unless you mean something other than what I understand. For the statement I think you want to be true, the function also needs to be a group law, which requires associativity. (In fact, if it's monotonic on the reals, you don't need to enforce commutativity, since all continuous group laws on R are isomorphic.)
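As a toy illustration of that last point (my example, not anything from the original post): x ∘ y = ∛(x³ + y³) is a continuous, monotonic, commutative group law on the reals, isomorphic to ordinary addition via the bijection x ↦ x³, and the axioms can be checked numerically:

```python
import math

def cbrt(x):
    # real cube root, valid for negative inputs too
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def op(x, y):
    # group law on R: conjugate of addition by the bijection x -> x**3
    return cbrt(x**3 + y**3)

# group axioms hold (up to floating-point error):
a, b, c = 0.7, -2.5, 3.1
print(math.isclose(op(op(a, b), c), op(a, op(b, c))))   # associativity: True
print(math.isclose(op(a, 0.0), a))                      # identity 0:    True
print(math.isclose(op(a, -a), 0.0, abs_tol=1e-12))      # inverse -x:    True
```

Since this law is just addition seen through a continuous monotone change of coordinates, it's consistent with the claim that all continuous group laws on R are isomorphic.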
This is very cool!
Right - looking at the energy change of the exhaust explains the initial question in the post: why energy is conserved when a rocket accelerates, despite apparently expending the same amount of fuel for every unit of acceleration (assuming small fuel mass compared to the rocket). Note that this doesn't depend on a gravity well - the question is well-posed, and well-answered (by looking at the rocket + exhaust system), in classical physics without gravity. The Oberth phenomenon is related, but different, I think.
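A sketch of that bookkeeping, in the small-fuel-mass approximation (rocket mass M roughly constant, exhaust expelled at speed u relative to the rocket):

```latex
\begin{align*}
M\,dv &= u\,dm && \text{(momentum conservation)}\\
d\!\left(\tfrac12 M v^2\right) &= M v\,dv = u v\,dm && \text{(rocket's KE gain)}\\
\tfrac12\,dm\left[(v-u)^2 - v^2\right] &= \tfrac12 u^2\,dm - u v\,dm && \text{(exhaust's KE change)}\\
\Delta E_{\text{total}} &= u v\,dm + \tfrac12 u^2\,dm - u v\,dm = \tfrac12 u^2\,dm
\end{align*}
```

The v-dependence cancels: each bit of fuel supplies the same energy ½u² dm, with the rocket taking a larger share (and the exhaust's KE change becoming negative) at higher v.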
Hi! As I commented on your other post: I think this is a question for https://mathoverflow.net/ or https://math.stackexchange.com/ . This question is too technical, and does not explain a connection to alignment. If you think this topic is relevant to alignment and would be interesting to technical people on LW, I would recommend making a non-technical post that explains how you think results in this particular area of analysis are related to alignment.
Hi! I think this is a question for https://mathoverflow.net/ or https://math.stackexchange.com/ . While LessWrong has become a forum for relatively technical alignment articles, this question is too math-heavy, and it has not been made clear how it is relevant to alignment. The forum would get too crowded if very technical math questions became part of the standard content.
I think it's very cool to play with token embeddings in this way! Note that some of what you observe is, I think, a consequence of geometry in high dimensions and can be understood by just modeling token embeddings as random. I recommend generating a bunch of tokens as a Gaussian random variable in a high-dimensional space and playing around with their norms and their norms after taking a random offset.
Some things to keep in mind, that can be fun to check for some random vectors:
- radii of distributions in high-dimensional space tend to cluster around some fixed value. For a multivariate Gaussian in n-dimensional space, this is because the squared radius is a sum of squares of Gaussians (one for each coordinate): a random variable with mean O(n) and standard deviation O(√n). In your case you're also taking a square root (norm vs. squared norm) and the normalization is different, but the general pattern of this variable concentrating in a narrow band (with width about 1/√n compared to the radius) will hold.
- a random offset vector will not change the overall behavior (though it will change the radius).
- Two random vectors in high-dimensional space will be nearly orthogonal.
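All three points are quick to check numerically (a minimal sketch; the vocabulary size and embedding dimension below are arbitrary choices, not taken from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 768  # pretend vocabulary of 1000 tokens, 768-dim embeddings
emb = rng.normal(size=(n, d))

# 1) norms concentrate: mean ~ sqrt(d), relative spread ~ 1/sqrt(d)
norms = np.linalg.norm(emb, axis=1)
print(norms.mean(), norms.std() / norms.mean())

# 2) a random offset shifts the typical radius, but the band stays narrow
offset = rng.normal(size=d)
shifted = np.linalg.norm(emb - offset, axis=1)
print(shifted.mean(), shifted.std() / shifted.mean())

# 3) two random vectors are nearly orthogonal: |cos| ~ 1/sqrt(d)
cos = emb[0] @ emb[1] / (norms[0] * norms[1])
print(abs(cos))
```

Comparing these baseline numbers against the real token embeddings should make it clearer which observations are generic high-dimensional geometry and which are learned structure.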
On the other hand it's unexpected that the mean is so large (normally you would expect the mean of a bunch of random vectors to be much smaller than the vectors themselves). If this is not an artifact of the training, it may indicate that words learn to be biased in some direction (maybe a direction indicating something like "a concept exists here"). The behavior of tokens near the center-of-mass also seems really interesting.
I think there is some misunderstanding of what SLT says here, and you are identifying two distinct notions of complexity as the same, when in fact they are not. In particular, you have a line
"The generalisation bound that SLT proves is a kind of Bayesian sleight of hand, which says that the learning machine will have a good expected generalisation relative to the Bayesian prior that is implicit in the learning machine itself."
I think this is precisely what SLT is saying, and it is nontrivial! One can say that a photon will follow a locally fastest route through a medium, even if this is different from saying that it will always follow the "simplest" route. SLT arguments always work relative to a loss landscape, and interpreting their meaning should (ideally) be done relative to the loss landscape. The resulting predictions are nevertheless nontrivial, and are sometimes confirmed; for example, we have some work on this with Nina Rimsky.
You point at a different notion of complexity, associated to the parameter-function map. This also seems interesting, but it is distinct from the complexity phenomena in SLT (at least from the more basic concepts like the RLCT), and is not considered in the basic SLT paradigm. Saying that this is another interesting avenue of study, or a potentially useful measure of complexity, is valid, but it is a priori independent of criticism of SLT (and ideally, of course, the two points of view could be combined).
Note that loss landscape considerations are more important than parameter-function considerations in the context of learning. For example it's not clear in your example why f(x) = 0 is likely to be learned (unless you have weight regularization). Learning bias in a NN should most fundamentally be understood relative to the weights, not higher-order concepts like Kolmogorov complexity (though as you point out, there might be a relationship between the two).
Also I wanted to point out that in some ways, your "actual solution" is very close to the definition of RLCT from SLT. The definition of the RLCT is how much entropy you have to pay (in your language, the change in negative log probability of a random sample) to gain an exponential improvement of loss precision; i.e., "bits of specification per bit of loss". See e.g. this article.
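In symbols (a sketch of the standard statements following Watanabe, not the precise theorems): the volume of near-optimal weights scales like a power of the loss tolerance, and the same exponent λ, the RLCT, controls the asymptotic Bayesian free energy:

```latex
\begin{align*}
V(\epsilon) &= \operatorname{vol}\{\,w : L(w) - L(w_0) < \epsilon\,\} \sim c\,\epsilon^{\lambda}
  \quad (\epsilon \to 0),\\
F_n &= -\log \int e^{-n L_n(w)}\,\varphi(w)\,dw
  \;=\; n L_n(w_0) + \lambda \log n + o(\log n).
\end{align*}
```

Demanding one more bit of loss precision (halving ε) costs λ bits of log-volume, which is the "bits of specification per bit of loss" reading.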
The thing is, the "complexity of f" (your K(f)) is not a very meaningful concept from the point of view of a neural net's learning (you can try to make sense of it by looking at something like the entropy of the weight-to-function mapping, but then it won't interact that much with learning dynamics). I think if you follow your intuitions carefully, you're likely to precisely end up arriving at something like the RLCT (or maybe a finite-order approximation of the RLCT, associated to the free energy).
I have some criticisms of how SLT is understood and communicated, but I don't think that the ones you mention seem that important to me. In particular, my intuition is that for purposes of empirical measurement of SLT parameters, the large-sample limit of realistic networks is quite large enough to see approximate singularities in the learning landscape, and that the SGD-sampling distinction is much more important than many people realize (indeed, there is no way to explain why generalizable networks like modular addition still sometimes memorize without understanding that the two are very distinct).
My main update in this field is that people should be more guided by empiricism and experiments, and less by competing paradigms of learning, which tend to be oversimplified and to fail to account for messy behaviors of even very simple toy networks. I've been pleasantly surprised by SLT making the same update in recent months.
Interesting - what SLT prediction do you think is relevant here?
Noticed that I didn't answer Kaarel's question there in a satisfactory way. Yeah - "basin" here is meant very informally, as a local piece of the loss landscape with lower loss than the rest of the landscape, surrounding a subspace of weight space corresponding to a circuit being on. Nina and I actually call this a "valley" in our "low-hanging fruit" post.
By "smaller" vs. "larger" basins I roughly mean the same thing as the notion of "efficiency" that we discuss later.
In particular, in most unregularized models we see that generalize (and I think also the ones in omnigrok), grokking happens early, usually before full memorization (so it's "grokking" in the redefinition I gave above).
Oh, I can see how this could be confusing. We're sampling, at every step, in the orthogonal complement to the gradient at initialization ("initialization" here refers to the beginning of sampling, i.e., we don't update the normal vector during sampling). The reason to do this is that we're hoping to prevent the sampler from quickly leaving the unstable point and jumping into a lower-loss basin (by restricting in this way, we guarantee that the unstable point is a critical point).
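Concretely, the constraint is just an orthogonal projection (a minimal sketch, not our actual sampler; the dimensions are arbitrary, and g stands for the gradient frozen at the start of sampling):

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(size=10)      # gradient at the start of sampling (frozen)
step = rng.normal(size=10)   # a proposed sampler step

# project the proposal onto the orthogonal complement of g, so the
# sampler never moves along the frozen gradient direction
constrained = step - (step @ g) / (g @ g) * g
print(abs(constrained @ g))  # ~0 up to floating-point error
```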
Sorry, I misread this. I read your question as O outputting some function T that is most likely to answer some set of questions you want to know the answer to (which would be self-referential as these questions depend on the output of T). I think I understand your question now.
What kind of ability do you have to know the "true value" of your sequence B?
If the paperclip maximizer P is able to control the value of your Turing machine, and if you are a one-boxing AI (and this is known to P), then of course you can make deals/communicate with P. In particular, if the sequence B is generated by some known but slow program, you can try to set up an Arthur-Merlin zero-knowledge proof protocol in exchange for promising to make a few paperclips, which you can then use to keep P honest (after making the paperclips as promised).
To be clear though, this is a strategy for an agent A that somehow has as its goals only the desire to compute B together with some kind of commitment to following through on agreements. If A is genuinely aligned with humans, the rule "don't communicate/make deals with malicious superintelligent entities, at least until you have satisfactorily solved the AI in a box and similar underlying problems" should be a no-brainer.
Looks like you're making a logical error. Creating a machine that solves the halting problem is prohibited by logic. For many applications, assuming a sufficiently powerful and logically consistent oracle is good enough, but precisely these kinds of games, where you ask a machine to predict its own output/the output of a system involving itself, are where you get logical inconsistency. Indeed, imagine asking the oracle to simulate an equivalent version of itself and to output the opposite of what its simulation outputs. This may seem like a contrived question, but most "interesting" self-referential questions boil down to an instance of it. I think once you fix the logical inconsistency, you're left with a problem equivalent to AI in a box: boxed AI P is stronger than friendly AI A but has an agenda.
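The diagonalization is short enough to write down (a sketch; `halts` stands in for any claimed oracle, here a toy one that always answers "runs forever"):

```python
def make_diagonal(halts):
    # build a program that does the opposite of whatever `halts` predicts
    def d():
        if halts(d):
            while True:   # loop forever iff the oracle said we halt
                pass
    return d

# toy "oracle" that claims every program runs forever
oracle = lambda f: False
d = make_diagonal(oracle)
prediction = oracle(d)   # oracle says: d does not halt
d()                      # ...but d returns immediately, so the prediction is wrong
```

The symmetric case (an oracle answering True) makes d loop forever, again contradicting the prediction; since every candidate oracle is wrong on its own diagonal program, no such machine can exist.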
Alternatively, if you're assuming A is itself unaligned (rather than friendly) and has the goal of getting the right answer at any cost, then it looks like you need some more assumptions on A's structure. For example, if A is sufficiently sophisticated and knows it has access to a much more powerful but untrustworthy oracle, it might know to implement a Merlin-Arthur protocol.
Not sure, but I doubt it: IIRC, copper kills by catalysing intra-cellular reactions, which are slow (compared to salt, which should have a near-instantaneous mechanism of action, since it can blow up membranes). Also, I would be worried about the safety of breathing copper. But I might be wrong about this!
I've looked at a small amount of data on this question. I think it's a really important one (see a related question of mine), but I am extremely not an expert. The most actionable item is this study, which found that essentially "salting" a surgical mask might make it significantly more protective against flu viruses. The study's in vivo section with mice strikes me as a bit sketchy (small n, and it's unclear how representative of mask filtration their mouse procedure actually is), but their in vitro section seems legit, and the study is in Scientific Reports (part of the Nature publishing group). If you're making a DIY mask/filter and it's not too thick already, it can't hurt to include a salted layer. Their proposed mechanism of action is the salt killing virus particles, not filtering them, so it should stack well with particulate filters. The recipe in the paper is to coat a hydrophobic filter in a solution of salt and surfactant (they used polysorbate 20, which is approved for use as a food additive), then let it dry.
What makes you say England did not have looting during WW2? England had more cohesion, but that is just one factor impacting people's behavior. Someone who is desperate or immoral enough to loot in wartime is unlikely to be seriously swayed by the need for patriotic unity. Other factors, which I think are bigger, are severity of need and enforcement. I don't know about enforcement, but it is very hard for me to envision a scenario where meeting basic needs is harder than in WW2 Britain.
I've done a little research about the food supply chain specifically. Presumably certain supply chains will be similar, certain ones will be different. Also note I am very much not an expert. The basic fact is that there is "enough food", but prices may rise and getting food may become harder. I think there are three key parameters, which could go either way:
(1) Hoarding/instability. Worst case scenario: people panic. People stockpile giant supplies of food. Food goes bad. People buy more food. Food gets prohibitively expensive. Best case scenario: supermarket situation stabilizes, panicky people feel like they have enough non-perishables stockpiled, most last-mile (grocery store) product shortages stop.
(2) Protectionism. This will be less dangerous in the US, which exports more food than it imports. But certain countries, especially poorer countries that rely significantly on imports, will suffer if a global panic causes protectionist policies around food (e.g., wheat exporter Kazakhstan apparently stopped exporting grain because of coronavirus fears; see this article). This is understandable, but probably bad. Here the best case, according to this article, is if big markets actively work to stabilize the market and punish protectionism (but the economics here is above my pay grade).
(3) Worker/driver issues. This mostly depends on how freaked out blue-collar workers get. Currently most truck drivers, clerks, etc., are risking infection in exchange for a steady job. If things get bad (for example, if there are widespread hospital bed shortages and fatality goes through the roof) *and younger people become afraid* (a big if), a big proportion of supply-chain workers will take losing their job over getting infected. This would probably raise prices.
It's important to stress that it's *very unlikely* that anything catastrophic happens in developed countries like the US, and the worst-case scenario is government rationing. The example to keep in mind is WW2 Britain (I originally linked the wrong article here, which is also an interesting read). Nevertheless, with rationing, people survived basically healthy through several years of war.
A question I always have about these studies is at what level symptoms are defined and self-reported. E.g. presumably "you have an itchy throat or a mild headache in the morning/mildly increased fever over your baseline" is pre-symptomatic. Self-isolating with mild symptoms is probably hard to measure but can be at least socially enforced.
The DP cruise didn't have any fatalities under age 70, so I'm not sure where you're getting the under-29 number. Also, since the population was older, the case fatality rate was over-estimated. This study https://cmmid.github.io/topics/covid19/severity/diamond_cruise_cfr_estimates.html estimates the adjusted CFR from DP cruise ship data (assuming treatment!) to be 0.5%, largely in agreement with other numbers I'd heard. Though the sample size is ridiculously small, so the error bounds are terrible.
Advice: drink a mouthful of water every 15 minutes. This is speculative (facebook post from a friend of a friend). The rationale is that if you have virus particles in your mouth, rinsing them into your stomach (where the stomach acid kills them) will prevent them from getting into your respiratory system. [edit: retracted, seems to be downstream from a fake news article. Drinking water is still good, but looks like this pathway is not realistic]
Advice: now may be a good time to learn to meditate. Deaths from coronavirus are due mostly to breathing problems from pneumonia, which is the main explanation for why older people are more likely to die. There is evidence that meditation is good for pneumonia specifically http://www.annfammed.org/content/10/4/337.full and lowers oxygen consumption generally https://journals.sagepub.com/doi/full/10.1177/2156587213492770. I didn't read the studies carefully to see how trustworthy they are, but this conforms well with my understanding and limited experience of meditation. Meditation is also known to be good for mitigating stress, which will obviously be beneficial in the coming months.