Steelmanning heuristic arguments

post by Dmitry Vaintrob (dmitry-vaintrob) · 2025-04-13T01:09:33.392Z · LW · GW

Contents

  Introduction
  Background and motte/bailey criticism
  The missing piece: connecting in the “no-coincidence principle” 
    From the "no coincidence principle" to statistical explanations
    Gödel and the thorny deeps
    "Ignoring the monsters" and the heuristic arguments agenda
    Upshots
    Summary
    Renormalization as a cousin of heuristic arguments

Introduction

This is a nuanced “I was wrong” post.

Something I really like about AI safety and EA/rationalist circles is the ease and positivity in people’s approach to being criticised [EA · GW].[1] For all the blowups and stories of representative people in the communities not living up to the stated values, my experience so far has been that the desire to be truth-seeking and to stress-test your cherished beliefs is a real, deeply respected and communally cultivated value. This in particular explains my ability to keep getting jobs and coming to conferences in this community, despite being very eager to criticise [LW · GW] and call bullshit on people’s theoretical agendas. 

One such agenda that I’ve been a somewhat vocal critic of (and which received my criticism amazingly well) is the “heuristic arguments [LW · GW]” picture and the ARC research agenda more generally. Last spring I spent about 3 months on a work trial/internship at ARC Research. I really liked the people I worked with (this is not a polite pro forma compliment – I really think they’re talented, competent, and good to work with to an unusual degree) and much of their object-level research. But I came out really not buying the underlying agenda on a very basic level. I went in with a bit of skepticism, but a desire to see if I could get the “core idea” and steelman what they’re doing. I came out with a strong sense that what they’re doing is subject to a motte-and-bailey fallacy: essentially, their stated goal was to develop a language that combines statistical arguments (the part of their work dealing with “presumption of independence” and maximal-entropy inversions of marginalization) with logical chains in a seamless whole: something they putatively call the “heuristic argument” formalism. My sense was that this was simply wrong: the two are entirely separate, and the examples they kept rolling out as “inspirations” (such as the “twin primes” argument, a baby version of which I’ll explain below) were malformed: accidental couplings which are adjacent to both statistics and logic, but not at all in a way that would be evidence for a combined system like this existing (as an analogy, you could say that DNA is interesting both from a chemical and from an information-theoretic perspective, but this doesn’t imply that there is some overarching theory that combines chemistry and information theory of which DNA is a cornerstone). 

I also had a number of cryptography- and logic-based issues with various other bits of their agenda. While I still endorse some of my criticisms, I was wrong in my blanket denial that they have (gesturally, but interestingly) identified a class of questions that combines logic and statistics in a new way, and that points at an interesting (and potentially useful) missing theory. I’ve been convinced that the “motte” and the “bailey” are actually parts of the same castle. The thing that brought about this change of mind was a conversation with Scott Aaronson last week, where he explained some of the basic components to me in a new way, and some thinking I did afterwards. I’m not sure whether the reason talking to Scott helped was that we share a common academic-mathematical language that dissolved some of my core confusions, or the fact that Scott is someone I’ve always respected, and wouldn’t expect to hold silly views (if it is the latter, this is particularly embarrassing – frankly, it’s probably a combination of both). 

Background and motte/bailey criticism

First, let me briefly lay out the heuristic arguments idea as I understand it, and the disagreement that I changed my mind on.

In math we have, on the one hand, logic. This is a formalization of the idea that some sentences are true or false: “there are infinitely many primes” is true, and so on. On the other hand, we have statistics. Often logical statements can be converted to statistical statements. For a very baby statement of this type, one can say that 

“a random number is even with probability 50%”. 

This is actually not a rigorous statement yet, but there is a standard way of converting probabilistic statements like this to rigorous ones. In the rest of the post, I’ll assume this conversion implicitly, but in the even numbers example I’ll give the rigorous version as well, just to fix assumptions. So the statement above translates to:

“the probability of a random number between 1 and n being even approaches 1/2 as n approaches infinity.”

Often, these probabilities behave like independent variables and multiply: so the probability that a number is both even and divisible by 3 is 1/2 × 1/3 = 1/6 (to prove this, just note that a number is divisible by 2 and 3 iff it is divisible by 6). But this doesn’t always hold. So for instance, the probability that a number is divisible by both 6 and by 4 is not 1/4 × 1/6 = 1/24, but rather 1/12. Here we can give a “reason” for this failure of independence, since if a number is divisible by either 4 or 6, it must be even. Conditionally on being even, it has a probability of 1/2 of being divisible by 4 and 1/3 of being divisible by 6, and these conditional probabilities do multiply (an even number has a probability of 1/6 of being divisible by both 4 and 6), so in this case we have conditional independence after observing the “both groups of numbers are even” confounder. Not all mathematical statements follow these heuristic arguments: for instance, when counting rational points on certain elliptic curves, you would be consistently surprised by how many there are compared to what you’d guess from the baby “heuristic intuitions” above of (e.g.) looking modulo primes, before you understand some quite sophisticated additional structure of the specific problem (in this case, a group structure). More problematically, these statistical arguments never yield proofs: proving the conditional independence of divisibility-by-4 and divisibility-by-6 is equivalent to proving that the joint condition of the two statements being true is equivalent to divisibility-by-12. Still, this kind of “statistical reasoning” is often used, and used fruitfully, to come up with hypotheses in number theory, and these hypotheses are often either proven or empirically seen to hold for numbers up to however high a bound modern computers can check. 
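
To make the arithmetic above concrete, here is a minimal Python sketch (my own illustration, not from the original argument) that estimates these densities empirically over 1..N; the helper `density` and the cutoff N = 1,000,000 are my choices, and the printed values should approach the fractions above as N grows.

```python
# Empirical check of the divisibility "heuristics" above, over 1..N.
N = 1_000_000
nums = range(1, N + 1)

def density(pred):
    """Fraction of numbers in 1..N satisfying the predicate."""
    return sum(1 for k in nums if pred(k)) / N

p4 = density(lambda k: k % 4 == 0)                       # ~1/4
p6 = density(lambda k: k % 6 == 0)                       # ~1/6
p_both = density(lambda k: k % 4 == 0 and k % 6 == 0)    # ~1/12, not 1/24
print("naive product:", p4 * p6, " actual:", p_both)

# Conditioning on the shared "even" confounder restores independence:
evens = [k for k in nums if k % 2 == 0]
q4 = sum(1 for k in evens if k % 4 == 0) / len(evens)       # ~1/2
q6 = sum(1 for k in evens if k % 6 == 0) / len(evens)       # ~1/3
q_both = sum(1 for k in evens if k % 12 == 0) / len(evens)  # ~1/6 = (1/2)*(1/3)
print("conditional product:", q4 * q6, " actual:", q_both)
```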

On the other hand we have statistical models. In a statistical model, you have some assumptions – such as conditional independence – baked into a model (think “pandemic growth model”). You then assume that there are some latent parameters you don’t know in the data, and choosing these picks out a specific data distribution from a large class of probability distributions “in the model”. For instance, even numbers are just one of (exponentially many) classes of data drawn from the model “a subset of 500,000 numbers between 1 and 1,000,000” – the latter enormous “bag of distributions” is a statistical model. (Typically, the choice of “model” is operationalized as randomly drawing some “latent” data-generation variable.) Now typically a model bakes in a lot more than conditional independence. In some sense, a representative way of getting statistical models is to fill in the “blank” weights of a neural net, and to assume that the data is drawn in some way associated with the weights. It’s very hard to make very precise logical statements about enormously complex models like this (and this is in some sense a core part of why they are expressive enough to capture interesting aspects of our messy reality). But in another sense, it seems to be true that if we massively simplify the model and marginalize out (i.e. “ignore”) almost all variables in a suitably generic way, we’ll often get models that are well-approximated by much simpler and more uniform laws. For instance, randomly selected pairs of variables are often independent; randomly selected continuous variables are often Gaussian (in fact, the central limit theorem gives a rigorous reason for this to be the case in a certain operationalization). Thus the kind of simple “heuristic” ideas of “well, these variables are basically independent, once you account for this one obvious confounder” seems to work surprisingly well in statistical models of the world, in a way that is somewhat analogous to how they often work “in practice” in mathematics. 
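
As a toy numerical illustration of this “generic marginals look simple” claim (my own example, not ARC’s: an arbitrary random linear map stands in for “marginalizing a big messy system”, and all the dimensions and sample counts are made up):

```python
import numpy as np

# Push non-Gaussian inputs through a random wide linear "model" and inspect
# two randomly selected output coordinates: by CLT-style reasoning each
# marginal is close to Gaussian, and a random pair is close to uncorrelated.
rng = np.random.default_rng(0)
d_in, d_out, n_samples = 500, 200, 20_000

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)   # random "weights"
x = rng.uniform(-1.0, 1.0, size=(n_samples, d_in))   # non-Gaussian inputs
y = x @ W.T                                          # outputs of the random model

i, j = rng.choice(d_out, size=2, replace=False)
yi, yj = y[:, i], y[:, j]

# Both diagnostics should come out near 0: "independent-ish and Gaussian-ish".
print("corr(y_i, y_j)         ~", float(np.corrcoef(yi, yj)[0, 1]))
print("excess kurtosis of y_i ~", float(((yi - yi.mean()) ** 4).mean() / yi.var() ** 2 - 3))
```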

While this is a cool and useful observation, the thing I would naturally take away from this is simply that “simple models are often preferred” – whether this be in nature, in random subsets of multivariable systems, or in math. It’s productive to assume this in practice and to use it to generate mathematical hypotheses, but it’s not much more formalizable than “pi is often a good answer on multiple-choice exams in trigonometry”. The fact that math is one of the instances where such nice distributions often occur didn’t seem too deep to me, or indicative of a deeper connection. After all, the predictable randomness in cat coloration doesn’t mean that there exists an overarching theory of statistics and cats (to be fair, I think this is still a pretty strong criticism that isn’t sufficiently addressed by ARC’s messaging).

And then I saw that the ARC team wanted to take these two observations and run with them a seemingly ridiculous distance. “What if instead of logic and statistics being two contexts where simple distributions tend to occur often, one could instead formalize a language that has both of these instances as special cases, and moreover fully includes the formalism of proofs?” To me this seemed crazy, and despite a decent amount of groping for an explanation and asking people to explain it, by the end of my tenure at ARC I still didn’t have an answer beyond “these very smart people are high on their own supply”. 

The missing piece: connecting in the “no-coincidence principle” 

However, I’ve realized I was missing a big piece of the puzzle heuristic-arguments-wise. While working at ARC, various people told me about the “no-coincidence principle”, also known colloquially as the “everything happens for a reason” principle, as a potentially important component of making heuristic arguments rigorous. I spent some time trying to understand the no-coincidence principle, and failed to get a coherent picture. I think Paul formulated for me the formal setting described in Eric’s great recent post on the subject [LW · GW], but at the time I didn’t grok the intuition around it. (And more recently, while I think the post and the paper it links to do a great job explaining it, the preliminary skim I subjected them to still didn’t give me the key intuition for why this is crucial for a heuristic arguments agenda.) So I want to put front and center the point that I now see as central (and I hope Paul and other ARC theory people can explain the inevitable ways I am wrong here). The point as I see it is that there legitimately may be a context where proofs and heuristic biases are on the same footing – and that even being very vague about such a context and allowing it as an “ethereal intuition pump” still leads to a class of very heuristic-arguments-shaped intuitions that do combine proofs and statistical modelling assumptions. Like many contexts of this shape, it is likely to run into gnarly logical issues with incompleteness (or at a minimum, cryptographic hardness) for proofs that go beyond some complexity threshold, something that I’ll explain below. But I still think the directional connection between the principle and the ideology of heuristic arguments is key, and was, at least to me, a surprisingly slippery concept to grok. I’ll try to impressionistically explain the connection below, using more technical language, with the goal of explaining the missing connection to someone like the past version of me that started at ARC. 

The basic idea is to work within a specific context of random boolean formulas. I don’t want to commit to a specific “architecture” for these formulas, but one can imagine them to be machine code (with a finite memory and clock) or boolean circuits. I’ll use the common terminology of “circuit” (i.e., a sequence of AND, OR and NOT gates on a large boolean data vector), with the understanding that there may be mild architectural restrictions to address issues with encoding randomness and related degeneracies of random circuits. Now a circuit C can be understood as encoded by some code, but it also implements a function C(x) of boolean inputs; we’ll sometimes write f for this function when distinguishing the “code” C from the function it implements. We assume we have some property P(f) of the function that this formula implements: for instance one possibility is that P is “always returns the zero vector”; we can also allow statistical properties, like “f outputs the 0 vector for 50% of inputs”. Now (for people who haven’t studied this subject), most functions can’t be implemented by a (polynomial-sized) boolean formula, by a simple counting argument: there are at least 2^(2^n) possible functions (even if the function has single-bit outputs, there is a choice of 0 or 1 for each input word). Since there are 2^n possible input words (for n the length), this is an exp-exp expression, and is larger by a ridiculous margin than the expression for counting the total number of circuits (which has a single exp). At the same time, if we’ve carefully chosen the architecture, we can kinda model a random boolean formula as a random function in some local sense (this is particularly neat for a class of so-called reversible circuits, where a version of this was recently rigorously proven[2] – an unusual state of affairs for complexity-theoretic results). Thus we can make a first guess that the probability of (the function implemented by) a random circuit having some property P is approximately the probability of a random function having property P. The “no-coincidence principle” (and a starting point for heuristic arguments ideas) comes in precisely in contexts where this approximation is phenomenally off-base. This is particularly true once we start considering properties P that include universal quantifiers, like “f(x) never outputs 00000” (here we’re assuming 5-bit output, and a much larger input), where it’s easy to engineer circuits that have this property but where the probability of a random function having this property is astronomically low even by complexity-theory standards (i.e., includes an exp-exp in its denominator). 
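
Here is a rough, self-contained sketch of that comparison (entirely my own toy operationalization: a small made-up “architecture” of random two-input gates, arbitrary sizes, and the property P = “never outputs 00000”; the exact numbers depend heavily on these choices, and no claim is made that this matches ARC’s setups):

```python
import itertools, random

N_IN, N_GATES, N_OUT = 8, 40, 5
OPS = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "XOR":  lambda a, b: a ^ b,
}

def random_circuit(rng):
    """A circuit is a list of (op, i, j): a random gate applied to two earlier wires."""
    gates = []
    for g in range(N_GATES):
        n_wires = N_IN + g
        gates.append((rng.choice(list(OPS)), rng.randrange(n_wires), rng.randrange(n_wires)))
    return gates

def run(gates, x):
    wires = list(x)
    for op, i, j in gates:
        wires.append(OPS[op](wires[i], wires[j]))
    return tuple(wires[-N_OUT:])              # outputs = the last N_OUT wires

def never_all_zero(gates):                    # the property P(f)
    return all(run(gates, x) != (0,) * N_OUT
               for x in itertools.product((0, 1), repeat=N_IN))

rng = random.Random(0)
trials = 500
freq_circuits = sum(never_all_zero(random_circuit(rng)) for _ in range(trials)) / trials

# For comparison: a uniformly random function {0,1}^8 -> {0,1}^5 has the
# property with probability (1 - 2^-5)^(2^8), roughly 0.03%.
freq_functions = (1 - 2.0 ** -N_OUT) ** (2 ** N_IN)
print(f"fraction of sampled random circuits with P: {freq_circuits:.3f}")
print(f"probability a random function has P:        {freq_functions:.5f}")
```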

For such “mismatched” properties P, we can lower-bound the probability that a circuit has this property by coming up with “reasons” for it to happen: for instance, the “never outputting 00000” property would be true if part of the formula is “output 1 as the last bit” (expressed in whatever circuit language we’ve chosen). More generally, we can have models (think statistical models) of circuits that have property P with some probability. For instance we might observe that the last bit of our circuit output actually only depends on 3 of the input bits, so (assuming this component of the circuit is suitably conceptualized as random) the probability that it always outputs zero is a manageable 2^(-8) (here 8 = 2^3 is the number of possible 3-bit configurations), and we might have similar arguments for the other bits. 
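
Spelling out the count in that last example (a minimal worked version, under the same “this component is random” modeling assumption; the specific numbers are my reconstruction):

```latex
% If the last output bit is modeled as a uniformly random function of its
% 3 relevant input bits, it is determined by its values on the 2^3 = 8
% possible 3-bit configurations, each of which is 0 with probability 1/2:
\Pr\bigl[\text{last bit} \equiv 0\bigr]
  \;=\; \Bigl(\tfrac{1}{2}\Bigr)^{2^{3}}
  \;=\; 2^{-8}
  \;\approx\; 0.4\%,
% as opposed to 2^{-2^n} for a bit that genuinely depends on all n input bits.
```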

From the "no coincidence principle" to statistical explanations

So far, this isn’t suggesting any theory beyond statistics: essentially, I’m making the relatively straightforward point that we can approximate relevant properties of circuits by statistical models that depend on the circuit specifics, and given such a model we can then use usual statistics (coupled with circuit analysis) to deduce properties of the resulting function. However, the interesting new intuition comes in if we ask for a complete (or more precisely, complete-up-to-small-error) reason why property P holds more often than expected. Basically, a very optimistic hope for understanding “why circuits behave differently from functions” is to make statements that, conditional on some “surprising” property P holding for a circuit (here “surprising” means that it occurs much more often for circuits than for functions), we can deduce that, with high probability, the circuit internals have some property A(C). So roughly, we claim that (with high probability) there is an implication P(f) ⇒ A(C), where f is the function implemented by C and A(C) is a “simple” property of the code C (like “the last line is ‘output zero’ ”). More generally, instead of A being a property, we can interpret it as a statistical model: so if we assume that C is drawn from some restricted distribution A of circuits, like “each output bit is a random circuit applied to three input bits”, we can now model the probability of P being true conditional on the circuit C being drawn from A. The “no-coincidence principle” now conjectures that we can explain a large part of the “surprisingness” of a property P(f) holding by modeling the circuit C as drawn from some restricted distribution of circuits, with the “restriction” given by the distribution A (this can more or less be identified with what ARC calls a heuristic argument, if I’m not mistaken). 
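
To fix the shape of this claim (my own schematic paraphrase of the above, not ARC’s official formulation, with the error tolerance ε left unspecified):

```latex
% Schematic form of the "no-coincidence"/explanation claim as read above,
% where f_C denotes the function implemented by the code C:
\Pr_{C \sim \text{random circuits}}\bigl[\, A(C) \;\big|\; P(f_C) \,\bigr] \;\ge\; 1 - \varepsilon,
\qquad \text{with } P \text{ ``surprising'':}\quad
\Pr_{f \sim \text{random functions}}\bigl[ P(f) \bigr] \;\ll\; \Pr_{C}\bigl[ P(f_C) \bigr].
```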

Now making this (very strong!) assumption, we can hope to develop a theory of such properties A of code that “explain coincidences” (i.e. properties P) in input-output circuit behavior. Here from the very start, in appropriate regimes this will necessarily include proofs: for instance if a circuit always outputs 0, one possible reason for this is that a component of the circuit checks instances of the Riemann hypothesis, and the Riemann hypothesis is true. Since the Riemann hypothesis being false would invalidate this particular argument, we must know the answer to the Riemann hypothesis to fully explain possible reasons even for the “always outputs zero” property.

Gödel and the thorny deeps

This gets at an obvious complication in the picture: logic is hard. In fact it’s not just hard in the sense that NP-hard problems are hard, but it’s hard in the deeper, thornier sense of Gödel and of implying its own incompleteness. So whether or not the Riemann hypothesis is provable, we know that there exist statements in the same class as the Riemann hypothesis which are true but cannot be proven. This means that some fraction of the “reasons” for surprising properties of circuits will always be mysterious. Note that this type of analysis always runs into some operationalizing issues that I’m eliding: we can’t simply assume that our circuits are of fixed size, since then “check all inputs” is a foolproof proof strategy that runs in “constant” (just probably very long) time, so we should take some asymptotic limit; but in this case, setting up the correct “problem class” and giving the quantifiers in the right order is a headache. There are some accepted ways of making this kind of thing rigorous that do end up running into incompleteness issues. So in some sense, a maximalist understanding of the “no coincidences” property for a sufficiently general class of circuit architectures is dead in the water, gone the same way as Hilbert’s second problem. More minor, but probably more consequential (for real-life problems), are cryptographic issues: there can be good reasons for “surprising phenomena” which are hard to guess without knowing a “secret code”. For instance, the success of “homomorphic encryption” essentially implies that for appropriately chosen “architectures”, surprising properties of the input-output function f can be true for a “hidden” reason, such that checking or even guessing the reason requires knowing a secret key. 

"Ignoring the monsters" and the heuristic arguments agenda

Thus almost certainly, any sufficiently ambitious operationalization of “everything happens for a reason” will in fact break (i.e., be false or unprovable) due to cryptography or Gödelian logical shenanigans. But people still get far with logic despite incompleteness, and many problem classes which are NP-hard or undecidable in general are tractable for “realistic” instances. So we can reason under some assumption of “small/shallow proofs”, or under a “builder/breaker” assumption, where we assume that the “true” reason does in fact exist and is chosen by Alice, and Bob is trying to understand as much as he can of Alice’s reason, perhaps given God-like computational power or access to a souped-up logical system that can settle e.g. the decidability problem for Alice’s logical system (having such hierarchical completeness is allowed by Gödel!). One then asks what kinds of tools Bob would use to “decrypt”, as it were, Alice’s problem, and this leads to a collection of tools that should in principle include proofs (“the last operation in the program is to output 0 if the Riemann hypothesis is true on some finite region, and since the Riemann hypothesis is true, this must always output 0”), but also statistical arguments (“if we conceptualize this region of code as implementing a random hash with 100-bit output, then the probability of it outputting all 0’s is 2^(-100)”). Thus if one were to posit that some limited or big-brained operationalization of the “everything happens for a reason” property were to hold, then the “reasons” would in fact have to include both making proofs and making statistical models of “generic behaviors” of circuits. Moreover this would (in tractable contexts) pose an open problem that is distinct from basic statistics: namely,

Give “all possible reasons” (within a given complexity limit, problem-specific circuit operationalization and asymptotic regime, a suitably tuned logical and computational hierarchy that appropriately privileges the decoder over the encoder, etc.) that make a surprising input-output behavior occur. 

This would include both logical reasons (like the “Riemann hypothesis” example above, assuming this is allowed by the complexity class under consideration) but also the kind of statistical analysis from the original paper on heuristic arguments: for instance, a plausible explanation for a circuit to have a bias towards outputs with more 1’s than 0’s is if the final layer of gates that generates the output bits has more OR gates (which a priori have a 3/4 probability of outputting 1) than AND gates, or an analogous bias for reversible circuits; one can then iterate, allowing for the NOT of an AND gate, and deeper circuits with a bias towards outputting 1 (by itself, understanding such shallow circuits can be viewed as a logic-flavored problem, though in sufficiently “coarse” or “shallow” regimes it can be tackled by brute force). Under a suitable wideness and full-connectedness assumption, and in an appropriate asymptotic regime, it’s not hard to show that aggregating these kinds of predictions independently (ending in many shallow circuits that have a bias towards 1’s) fully explains most circuits with the relevant bias. More generally, ARC is interested in studying versions of this game where Alice’s moves to generate surprise are limited: for instance, she can only control the relative number of different gates in each layer, but is required to choose a circuit randomly from such a class; or, conversely, where Bob’s explanations can be relaxed, for instance allowing “arguments” A that impose correlations between memory locations without having to specifically explain how they are produced by the circuit. If I understand correctly, David Matolcsi has tested some cases where this kind of check can be performed by “brute force” sampling of a bunch of circuits, and has found that empirically, simple “surprising” properties of generic architectures do tend to admit explanations of this kind. 
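
A toy numerical version of the final-layer story (my own sketch: it bakes in the blunt assumption that the wires feeding the final layer are independent fair bits, which is exactly the kind of “presumption of independence” at issue, and the layer sizes are arbitrary):

```python
from itertools import product

# An OR gate on two independent fair bits outputs 1 with probability 3/4,
# an AND gate with probability 1/4. Compare a balanced final layer with an
# OR-heavy one on the probability of seeing more 1s than 0s in the output.

def p_more_ones(gate_probs):
    """P(output vector has strictly more 1s than 0s), output bits independent."""
    m = len(gate_probs)
    total = 0.0
    for bits in product((0, 1), repeat=m):
        if sum(bits) > m / 2:
            p = 1.0
            for b, q in zip(bits, gate_probs):
                p *= q if b == 1 else (1 - q)
            total += p
    return total

balanced = [0.75] * 3 + [0.25] * 3   # 3 OR gates, 3 AND gates
or_heavy = [0.75] * 5 + [0.25] * 1   # 5 OR gates, 1 AND gate
print("balanced final layer:", p_more_ones(balanced))
print("OR-heavy final layer:", p_more_ones(or_heavy))
```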

In an uncharitable sense, one could posit that this still bottoms out in “usual logic”. There is no “preferred” presumption of independence: saying that “when modeling these parts of the code as random, we can deduce that this set of bits is statistically independent from this set of bits” is a mathematical statement, and must be proven on a case-by-case basis depending on the specific architecture of the circuit and specific properties of Alice’s code. In fact, at some point I sketched out a proof to David that it is simply not possible to have a version of logic “plus heuristic arguments” that is distinct from ordinary logic, perhaps with additional axioms (at least under some assumptions). But again, this is a common situation in mathematics and complexity theory, and people still make progress: the field of Diophantine equations is logically complete, i.e. can encode any logical sentence. This means that, in a certain reductive sense, the field is indistinguishable from logic. But despite this there are proof techniques and theorems that are unquestionably useful for solving (“generic”/“interesting”/etc.) Diophantine equations, and that are very clearly in the “Diophantine equations field”. Similarly, there are obviously some presumptions of independence that are “more often true” than others in the context of analysing the randomness in circuits and neural nets with “random components”.

Upshots

So where does this leave us? First of all, I was wrong. I thought that the whole idea of “combining logic with statistical modeling” in a heuristic-argument-flavored format was a non-starter. I no longer think so, and understand that there are potentially interesting statements to be made in developing and testing a particular language of statistical models that include small proofs about circuits, and that some form of this is likely to be useful for understanding AI and statistics more generally. Second, I want to reiterate my point that taking this too far is unrealistic: heuristic arguments will not magically plug holes in logic and cryptography, and are unlikely to be of much use in tackling the “cryptographically hard” bit of deception attacks like steganography.[3] I think this means that I should be more careful to check that I’m not missing some basic intuition before criticising an agenda (but also, I definitely feel like groping at the parts that bothered me and didn’t make sense has been productive). Third, I was wrong in a particularly spectacular way – I spent 3 months of my life specifically giving myself the task of trying to understand and steelman ARC research ideas while doing a work trial with them, and while this wasn’t the main focus of my work, I did spend significant time on this; and yet I failed to understand how they think about a quite basic part of their agenda. This is a cautionary tale for me. It might also be an update towards being more careful about messaging and distillation if you have a somewhat subtle agenda that’s easy to misinterpret – in particular, I think that the “twin primes” example and similar “statistical number theory” thinking in the original heuristic arguments paper and elsewhere in the analysis is to a large degree a red herring, and isn’t great as a lead-in for the link to combining logic and circuits/neural nets, especially when communicating with mathematicians (since the logic of number theory and complex mathematical results about twin primes, like the famous proof of bounded gaps between primes, do not fit into these pictures in the way that heuristic arguments are expected to combine simple statistical and logical “gadgets”, and making an explicit connection that’s not a motte-and-bailey kind of sleight of hand requires a bunch of extra epicycles on top of the already somewhat sophisticated basic ideas). Finally, I am now more bullish about ARC research ideas – though I still don’t think they have, at the moment, sufficiently exciting results to defend heuristic arguments as a very productive field. This may be fixable – but something I would invest in, if I were working at ARC and were to try to find good ways of developing this ideology, is a de-prioritization of universalising questions in logic and cryptography in favor of more model-specific questions of “what interesting logico-statistical takeaways can we get from looking more carefully at this particular system, with these specific assumptions”. My understanding is that ARC is also moving in this direction (in particular, with analyzing certain linear programming-type models). 

Summary

So in summary: if you accept the “no-coincidence principle” that, even in some very vague sense, “surprising properties of functions are due to tractable properties of code”, then it becomes an interesting problem to try to operationalize the “tractable properties of code” involved. Doing this (i.e., getting “nontrivially complete explanations for the surprising phenomena”), even in very simple cases and under very permissive assumptions about what the “no-coincidence property” means, leads to an interesting class of questions, which I’ve roughly identified as questions about how to operationalize “heuristic arguments”. These must combine both simple logical steps and interesting architecture-specific statistical models, in ways which I expect (and, it seems, Paul Christiano and the team at ARC expect) to lead to interesting new mathematical and statistical objects. At the same time, pushing this too far or expecting too much from this picture almost inevitably leads to thorny issues with cryptography and logic that are probably not solvable; still, this “hardness in general” property doesn’t preclude this class of ideas from being useful in realistic contexts for interpretability and safety. Putting this all together makes me much more bullish about the agenda, though in the research I’ve seen there is more emphasis than I’d like on logic and dancing with cryptographic monsters, and not enough emphasis on getting interesting statistical bits of the picture and running experiments with them. Finally, an important personal update for me is that I’ve changed my mind about this, and I hope to be better calibrated in the future about the possibility of dismissing useful ideas too early.

Renormalization as a cousin of heuristic arguments

I want to end with a plug for my current research agenda with PIBBSS (in the same breath as admitting that I’m a doofus who can’t be trusted to understand basic concepts after thinking about them for months). I think a good class of systems that are correctly shaped and exhibit a lot of the interesting “fuzzy independence and atomicity” properties expected from heuristic arguments is given by models from statistical physics. Here physicists have a mix of tools, positioned in the exact sweet spot between rigor and heuristic experimental experience, for extracting formal independence properties of such systems, and this goes under the generalized name of “renormalization”. The physics ideas also fit nicely into deep heuristic-argument-like results about conditionalizing random neural nets via the NN-QFT correspondence and related thinking, which studies a limit of neural nets (“continuous circuits”) in which Gaussianity, independence and randomness emerge in a particularly natural and elegant way. This is the class of tools we’re hoping to make use of in the “renormalization agenda [LW · GW]” that we’ve kicked off in the last few weeks.

  1. ^

    A disclaimer that Lizka, who ran the linked contest, is my sister, so I’m not unbiased.

  2. ^

    Thanks to Scott Aaronson for the link.

  3. ^

    At the same time, this and other statistics/interp techniques may be useful to deal with precursors of these during training or in weak models.
