Why does generalization work?

post by Martín Soto (martinsq) · 2024-02-20T17:51:10.424Z · LW · GW · 16 comments

  I. Physics
  II. Anthropics
  III. Dust
16 comments

Just an interesting philosophical argument

I. Physics

Why can an ML model learn from part of a distribution or data set, and generalize to the rest of it? Why can I learn some useful heuristics or principles in a particular context, and later apply them in other areas of my life?

The answer is obvious: because there are some underlying regularities between the parts I train on and the ones I test on. In the ML example, generalization won't work when approximating a function which is a completely random jumble of points.
Also, quantitatively, the more regular the function is, the better generalization will work. For example, polynomials of lower degree require fewer data points to pin down. The same goes for periodic functions. And a function with a lower Lipschitz constant allows tighter bounds on its values at unobserved points.
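
The Lipschitz claim can be made concrete with a minimal sketch. It assumes we know an upper bound on the Lipschitz constant and have a few exact observations; the function name `lipschitz_bounds` and the sample points are hypothetical, chosen only for illustration. Each observation confines the unknown value to a cone around it, and the intersection of those cones is the best guarantee available.

```python
def lipschitz_bounds(samples, x, lipschitz_constant):
    """Bound f(x) for any function passing through `samples` whose
    Lipschitz constant is at most `lipschitz_constant`."""
    # Each observation (x_i, y_i) implies |f(x) - y_i| <= L * |x - x_i|.
    lower = max(y_i - lipschitz_constant * abs(x - x_i) for x_i, y_i in samples)
    upper = min(y_i + lipschitz_constant * abs(x - x_i) for x_i, y_i in samples)
    return lower, upper


# Hypothetical observations, consistent with every constant tried below.
samples = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0)]

for constant in (0.5, 2.0, 10.0):
    lower, upper = lipschitz_bounds(samples, 1.5, constant)
    print(f"L = {constant}: f(1.5) lies in [{lower}, {upper}]")
```

With the smallest constant the three observations pin the unobserved value down completely; as the allowed constant grows, the guaranteed interval widens, which is the quantitative sense in which less regular functions generalize worse.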

So it must be that the variables we track (the ones we try to predict or control, either with data science or our actions) are given by disproportionately regular functions (relative to random ones). In this paper coauthored by Tegmark, the authors argue exactly that most macroscopic variables of interest have Hamiltonians of low polynomial degree, and that this happens because of some underlying principles of low-level physics, like locality, symmetry, or the hierarchical composition of physical processes.

But then, why is low-level physics like that?

II. Anthropics

If our low-level physics weren't conducive to creating macroscopic patterns and regularities, then complex systems capable of asking that question (like ourselves) wouldn't exist. Indeed, we ourselves are nothing more than a specific kind of macroscopic pattern. So anthropics explains why we should expect such patterns to exist, similarly to how it explains why the gravitational constant, or the ratio between sound and light speed, are the right ones to allow for complex life.

III. Dust

But there's yet one more step.

Let's try to imagine a universe which is not conducive to such macroscopic patterns. Say you show me its generating code (its laws of physics), and run it. To me, it looks like a completely random mess. I am not able to discern any structural regularities akin to the ideal gas law, or the construction of molecules or cells. By contrast, if you showed me the running code of this reality, I'd be able (certainly after much effort) to discern these conserved quantities and recurring structures.

What are, exactly, these macroscopic variables I'm able to track, like "pressure in a room", or "chemical energy in a cell"? Intuitively, they are a way to classify all possible physical arrangements into more coarse-grained buckets. In the language of statistical physics, we'd say they are a way to classify all possible microstates into a macrostate partition. For example, every possible numerical value for pressure is a different macrostate (a different bucket), that could be instantiated by many different microstates (exact positions of particles).
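
As a toy illustration of this bucketing (a sketch, not a physical model), suppose a microstate is a tuple recording, for each of four particles, whether it sits in the left half of a box. The names `bucket_sizes`, `particles_on_left`, and `parity` are hypothetical. Two different macroscopic variables then carve the same sixteen microstates into different partitions.

```python
from collections import Counter
from itertools import product


def bucket_sizes(microstates, macro_variable):
    """Count how many microstates fall into each macrostate (bucket)."""
    return Counter(macro_variable(m) for m in microstates)


# Every microstate: for each of 4 particles, 1 if it is in the left half.
microstates = list(product((0, 1), repeat=4))


def particles_on_left(microstate):
    # A pressure-like coarse variable: how many particles are on the left.
    return sum(microstate)


def parity(microstate):
    # A different coarse variable over the very same microstates.
    return sum(microstate) % 2


print(dict(bucket_sizes(microstates, particles_on_left)))
print(dict(bucket_sizes(microstates, parity)))
```

Nothing in the list of microstates itself says which of the two partitions is the natural one to track.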

But there's a circularity problem. When we say a certain macroscopic variable (like pressure) is easily derived from others (like temperature), or that it is a useful way to track another variable we care about (like "whether a human can survive in this room"), we're being circular. Given I already have access to a certain macrostate partition (temperature), or that I already care about tracking a certain macrostate partition (aliveness of human), then I can say it is natural or privileged to track another partition (pressure). But I cannot motivate the importance of pressure as a macroscopic variable from just looking at the microstates.

Thus, "which parts of physics I consider interesting macroscopic variables to track" is observer-dependent in the first place. In fact, this point is already argued in a different context (the renormalization group) in the Tegmark paper.

So when I observe a universe running and fail to find any relevant macroscopic regularities, maybe I'm just failing to find the kinds of patterns I'm used to processing, but a different observer would find a fountain of informative patterns there.

It becomes tautological (even without the above application of anthropics) that a macroscopic pattern like me will find certain other macroscopic patterns to track: those that are expressed in the same "partitional language over the substrate of low-level physics" as myself. Of course my impression will be that regularity exists in the world: I am more prone to notice or care about exactly those patterns which I find regular.

For example, "pressure" is something I can easily track, and also something that can easily kill me, due to the same underlying reason: because the partition in which I, as a time-persistent pattern, am written, and that in which pressure is written, share a lot of information.
Similarly, the fact that "the exact position of an electron" is hard for me to track doesn't worry me too much, nor does it make me deem this world irregular: for exactly the same reason that I cannot easily track that partition, that partition is also very unlikely to hurt me or alter the variables I care about (it is extremely unlikely for the position of an electron to mark the difference between a human being alive and dead).

Given this circularity, might not different partitional languages coexist over the same substrate of low-level physics? Is it not possible that, inside that randomly-seeming universe, an infinitude of macroscopic patterns exist (analogous to us), complex enough to track the similar macroscopic patterns they care about (analogous to our pressure), but written in a partitional language we don't discern? Might it not be that all random-looking universes are as full of complex macroscopic patterns as ours, but different observers are able to decode different regularities?

This sounds a lot like dust theory, although it's an angle on it I hadn't seen sketched:
You, as a specific self-preserving macrostate partition, pick out similar macrostate partitions (that share information with you). And this is always possible, no matter which macrostate partition you are (even if the resulting partitions look, to our partition-laden eyes, more chaotic).

An equivalent way to put this, going back to looking at the computational trail of a random universe, is that depending on how you represent it I will or will not find legible patterns. Indeed, even for our universe, you can obfuscate its code in such a way (for example, shuffle spatial positions around) that makes it extremely difficult (if not impossible) for me to find regularities like molecules, or even notice locality. While, with the right de-shuffle, I'd be able to notice the code is running our universe. Of course, this is just the old problem of taking functionalism to its logical conclusion. And that's indeed what dust theory is all about.
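
The shuffling point can be sketched in a few lines. This is only a toy under stated assumptions: the "universe" is a one-dimensional list whose neighbouring cells hold similar values, and `roughness` is a hypothetical stand-in for how much local structure an observer reading cells in order would see.

```python
import random


def roughness(values):
    """Mean absolute difference between neighbouring cells (low = locally regular)."""
    return sum(abs(a - b) for a, b in zip(values, values[1:])) / (len(values) - 1)


def apply_permutation(values, permutation):
    """Cell j of the result holds the value found at position permutation[j]."""
    return [values[i] for i in permutation]


def invert_permutation(permutation):
    """Return the permutation that undoes `permutation`."""
    inverse = [0] * len(permutation)
    for new_position, old_position in enumerate(permutation):
        inverse[old_position] = new_position
    return inverse


state = list(range(100))          # a smooth ramp: neighbours differ by 1
permutation = list(range(100))
random.Random(0).shuffle(permutation)  # fixed seed, arbitrary choice

shuffled = apply_permutation(state, permutation)
restored = apply_permutation(shuffled, invert_permutation(permutation))

print(roughness(state), roughness(shuffled), roughness(restored))
```

The shuffled list contains exactly the same information as the original, yet an observer who reads it in positional order sees no smoothness; only someone holding the inverse permutation recovers it.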

16 comments

comment by aysja · 2024-02-21T00:26:24.179Z · LW(p) · GW(p)

I think dust theory is wrong in the most permissive sense: there are physical constraints on what computations (and abstractions) can be like. The most obvious one is "things that are not in each other's light cone can't interact", and interaction is necessary for computation (setting aside acausal trades and stuff, which I think are still causal in a relevant sense, but don't want to get into rn). But there are also things like: information degrades over distance (roughly the number of interactions, i.e., the telephone theorem), and so you'd expect "large" computations to take a certain shape, i.e., to have a structure which supports this long-range communication, such as wires.

More than that, though, I think if you disrespect the natural ordering of the environment you end up paying thermodynamic costs. Like, if you take the spectrum of visible light, ordered from ~400 to 800 nm and you just randomly pick wavelengths and assign them to colors arbitrarily (e.g., "red" is wavelengths 505, 780, 402, etc.), then you have to pay more cost to encode the color. Because, imo, the whole point of abstractions is that they're strategically imprecise. I don't have to model the exact wavelengths of the color red, it's whatever is in the range ~600-800, and I can rely on averages to encode that well enough. But if red is wavelengths 505, 780, 402, etc., now averages won't help, and I need to make more precise measurements. Precision is costly: it uses more bits, and bits have physical cost (e.g., Landauer's limit).
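
A crude way to put numbers on the encoding cost is sketched below. It only counts the bits needed to specify which wavelengths belong to a category, it is not a thermodynamic calculation, and the function names and the 1 nm resolution are assumptions made for illustration.

```python
from math import comb, log2


def bits_for_interval(n_wavelengths):
    """Bits to name a contiguous range: two endpoints, each one of n values."""
    return 2 * log2(n_wavelengths)


def bits_for_arbitrary_subset(n_wavelengths, subset_size):
    """Bits to name one arbitrary subset of known size among all such subsets."""
    return log2(comb(n_wavelengths, subset_size))


n_wavelengths = 401   # assumed resolution: integer nanometres from 400 to 800
red_size = 201        # "red" as the range 600-800 nm at that resolution

print(f"contiguous red: {bits_for_interval(n_wavelengths):.1f} bits")
print(f"arbitrary red:  {bits_for_arbitrary_subset(n_wavelengths, red_size):.1f} bits")
```

Under these assumptions, a category that respects the ordering is specified with a couple of numbers, while a scattered category of the same size needs roughly one bit per wavelength.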

I guess you could argue that someone else might go and see the light spectrum differently, i.e., what looks like wavelengths 505 vs 780 to us looks like wavelengths 505 vs 506 to them? But without a particular reason to think so it seems like a general purpose counterargument to me. You could always say that someone would see it differently—but why would they?

Replies from: alexander-gietelink-oldenziel

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-03-12T12:40:45.590Z · LW(p) · GW(p)

A footnote to information degrades over distance that you might be interested in:

Usually long-range correlations are small ('information degrades over distance'), both over distance and scale. But not always. In very special situations long-range correlations can be large both over distance and over scale, e.g. the proverbial flap of a butterfly's wings that causes a hurricane on the other side of the world.

In solid-state physics, condensed matter and a number of other fields, people are interested in phase transitions. During phase transitions long-range correlations can become very large.

There is some fancy math going under monikers like 'conformal field theory, Virasoro algebra' iirc. I know nothing about this but @Daniel Murfet might be able to say more.

comment by Kaarel (kh) · 2024-02-22T00:12:22.541Z · LW(p) · GW(p)

I'd be very interested in a concrete construction of a (mathematical) universe in which, in some reasonable sense that remains to be made precise, two 'orthogonal pattern-universes' (preferably each containing 'agents' or 'sophisticated computational systems') live on 'the same fundamental substrate'. One of the many reasons I'm struggling to make this precise is that I want there to be some condition which meaningfully rules out trivial constructions in which the low-level specification of such a universe can be decomposed into a pair $(s_{1}, s_{2})$ such that $s_{1}$ and $s_{2}$ are 'independent', everything in the first pattern-universe is a function only of $s_{1}$, and everything in the second pattern-universe is a function only of $s_{2}$. (Of course, I'd also be happy with an explanation why this is a bad question :).)

Replies from: martinsq

↑ comment by Martín Soto (martinsq) · 2024-03-06T20:23:52.749Z · LW(p) · GW(p)

I think that's the right next question!

The way I was thinking about it, the mathematical toy model would literally have the structure of microstates and macrostates. What we need is a set of (lawfully, deterministically) evolving microstates in which certain macrostate partitions (macroscopic regularities, like pressure) are statistically maintained throughout the evolution. And then, for my point, we'd need two different macrostate partitions (or sets of macrostate partitions) such that each one is statistically preserved. That is, complex macroscopic patterns in it self-replicate (a human tends to stay in the macrostate partition of "the human being alive"). And they are mostly independent (humans can't easily learn about the completely different partition, otherwise they'd already be in the same partition).

In the direction of "not making it trivial", I think there's an irresolvable tension. If by "not making it trivial" you mean "s1 and s2 don't obviously look independent to us", then we can get this, but it's pretty arbitrary. I think the true name of "whether s1 and s2 are independent" is "statistical mutual information (of the macrostates)". And then, them being independent is exactly what we're searching for. That is, it wouldn't make sense to ask for "independent pattern-universes coexisting on the same substrate", while at the same time for "the pattern-universes (macrostate partitions) not to be truly independent".
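
To make "statistical mutual information of the macrostates" concrete, here is a minimal sketch assuming a uniform distribution over six-bit microstates. All names are hypothetical. The first pair of partitions is exactly the trivial independent decomposition discussed above (each reads a disjoint half of the microstate); the second pair shares information.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(microstates, partition_a, partition_b):
    """Mutual information (in bits) between two macrostate partitions,
    assuming every microstate is equally likely."""
    n = len(microstates)
    joint = Counter((partition_a(m), partition_b(m)) for m in microstates)
    marginal_a = Counter(partition_a(m) for m in microstates)
    marginal_b = Counter(partition_b(m) for m in microstates)
    return sum(
        (count / n) * log2(count * n / (marginal_a[a] * marginal_b[b]))
        for (a, b), count in joint.items()
    )


microstates = list(product((0, 1), repeat=6))


def left_count(m):
    return sum(m[:3])    # depends only on the first half of the microstate


def right_count(m):
    return sum(m[3:])    # depends only on the second half


def total_count(m):
    return sum(m)


def parity(m):
    return sum(m) % 2    # fully determined by total_count


print(mutual_information(microstates, left_count, right_count))
print(mutual_information(microstates, total_count, parity))
```

The first value is zero and the second is one bit: an observer written in the `total_count` partition can read off `parity`, while one written in `left_count` learns nothing about `right_count`.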

I think this successfully captures the fact that my point/realization is, at its heart, trivial. And still, possibly deconfusing about the observer-dependence of world-modelling.

comment by gwern · 2024-02-20T23:49:29.766Z · LW(p) · GW(p)

This sounds a lot like dust theory, although it’s an angle on it I hadn’t seen sketched: You, as a specific self-preserving macrostate partition, pick out similar macrostate partitions (that share information with you). And this is always possible, no matter which macrostate partition you are (even if the resulting partitions look, to our partition-laden eyes, more chaotic).

This sounds a bit like Wolfram's grander 'ruliad' multiverse, although I'm not sure if he claims that every ruliad must have observers, no matter how random-seeming... It seems like a stretch to say that. After all, while you could always create a mapping between random sequences of states and an observer, by a giant lookup table if nothing else, this mapping might exceed the size of the universe and be impossible, or something like that. I wonder if one could make a Ramsey-theory-style argument for more modest claims?

Replies from: hwold, martinsq

↑ comment by hwold · 2024-02-21T09:58:48.849Z · LW(p) · GW(p)

although I'm not sure if he claims that every ruliad must have observers

Of course yes, since there's only one ruliad by definition, and we’re observers living inside it.

In Wolfram terms I think the question would be more like: "does every slice in rulial space (or every rulial reference frame) have an observer?"

Possibly of interest: https://writings.stephenwolfram.com/2023/12/observer-theory/

One part that I don’t see as sufficiently emphasized is the "as a time-persistent pattern" part. It seems to me that that part is bringing with it a lot of constraints on what partition languages yield time-persistent patterns.

↑ comment by Martín Soto (martinsq) · 2024-02-21T06:16:14.716Z · LW(p) · GW(p)

Didn't know about ruliad, thanks!

I think a central point here is that "what counts as an observer (an agent)" is observer-dependent (more here), even if under our particular laws of physics there are some pressures towards agents having a certain shape, etc. (more here). And then it's immediate that each ruliad has an agent (for the right observer, or similarly, for a certain decryption of it).

I'm not yet convinced "the mapping function/decryption might be so complex it doesn't fit our universe" is relevant. If you want to philosophically defend "functionalism with functions up to complexity C" instead of "functionalism", you can, but C starts seeming arbitrary?

Also, a Ramsey-theory argument would be very cool.

comment by Viliam · 2024-02-20T20:46:40.791Z · LW(p) · GW(p)

Is it not possible that, inside that randomly-seeming universe, an infinitude of macroscopic patterns exist (analogous to us), complex enough to track the similar macroscopic patterns they care about (analogous to our pressure), but written in a partitional language we don't discern?

Is homomorphic encryption an example of the thing you are talking about?

Replies from: martinsq

↑ comment by Martín Soto (martinsq) · 2024-02-20T21:29:14.710Z · LW(p) · GW(p)

Yep! Although I think the philosophical point goes deeper. The algorithm our brains themselves use to find a pattern is part of the picture. It is a kind of "fixed (de/)encryption".

comment by cubefox · 2024-02-24T23:47:20.084Z · LW(p) · GW(p)

Eliezer has a defense of there being "objectively correct" macrostates / categories. See Mutual Information, and Density in Thingspace. He concludes:

And the way to carve reality at its joints, is to draw your boundaries around concentrations of unusually high probability density in Thingspace.

The open problems with this approach seem to be that it requires some objective notion of probability, and that there is an objectively preferred way of defining "thingspace". (Regarding the latter, I guess the dimensions of thingspace should fulfill some statistical properties, like being as probabilistically independent as possible, or something like that.) Otherwise everyone could have their own subjective probability and their own subjective thingspace, and "carving reality at its joints" wouldn't be possible.

But it seems to me that techniques from unsupervised / self-supervised learning do suggest that there are indeed some statistical features that allow for some objectively superior clustering of data.

Replies from: martinsq

↑ comment by Martín Soto (martinsq) · 2024-03-06T20:12:40.783Z · LW(p) · GW(p)

My post is consistent with what Eliezer says there. My post would simply remark:
You are already taking for granted a certain low-level / atomic set of variables = macro-states (like mortal, featherless, biped). Let me bring to your attention that you pay attention to these variables because they are written in a macro-state partition similar / useful to your own. It is conceivable for some external observer to look at low-level physics, and interpret it through different atomic macro-states (different from mortal, featherless, biped).

The same applies to unsupervised learning. It's not surprising that macro-states expressed in a certain language (the computation methods we've built to find simple regularities in certain sets of macroscopic variables). As before, there simply are just already some macro-state partitions we pay attention to, in which these macroscopic variables are expressed (but not others like "the exact position of a particle"), and also in which we build our tools (similarly to how our sensory perceptors are also built in them).

Replies from: cubefox

↑ comment by cubefox · 2024-03-06T21:16:19.838Z · LW(p) · GW(p)

As I said, he assumes there is some objectively correct way to define the "thingspace" and a probability distribution on it. Should this rather strong assumption hold, his argument seems plausible that categories (like "mortal") should, and presumably usually do, correspond to clusters of high probability density.

(By the way, macrostates, or at least categories, don't generally form a partition, because something can be both mortal and a biped.)

So I don't think he takes certain categories for granted, but rather the existence of an objective thingspace and probability distribution, which in turn would enable objective categories. But he doesn't argue for it (except very tangentially in a comment), so you may well doubt such an objective background exists.

I think some small ground to believe his theory is right is that most intuitively natural categories seem to be also objectively better than others, in the sense that they form, or have in the past formed, projectible predicates:

A property of predicates, measuring the degree to which past instances can be taken to be guides to future ones. The fact that all the cows I have observed have been four-legged may be a reasonable basis from which to predict that future cows will be four-legged. This means that four-leggedness is a projectible predicate. The fact that they have all been living in the late 20th or early 21st century is not a reasonable basis for predicting that future cows will be. See also entrenchment, Goodman's paradox.

Projectibility seems to me itself a rather objective statistical category.

comment by Gordon Seidoh Worley (gworley) · 2024-02-21T17:45:47.065Z · LW(p) · GW(p)

Stepping back from the physical questions, we can also wonder why generalization works in general without reference to a particular physical model of the world. And we find, much like you have, that generalization works because it is contingent on the person doing the generalizing.

In philosophy we talk about this through the problem of induction, which arises because the three standard options for justifying its validity are unsatisfactory: assuming it is valid as a matter of dogma, proving it is a valid method of finding the truth (which bumps into the problem of the criterion), or proving its validity recursively (i.e. induction works because it's worked in the past).

One of the standard approaches is to start from what would be the recursive justification and ground out the recursion by making additional claims, and a commonly needed claim is known as the uniformity principle, which says roughly that we should expect future evidence to resemble past evidence (in Bayesian terms we might phrase this as future and past evidence drawing from the same distribution). But the challenge then becomes to justify the uniformity principle, and it leads down the same path you've explored here in your post, finding that ultimately we can't really justify it except if we privilege our personal experiences of finding that each new moment seems to resemble the past moments we can recall.

This ends up being the practical means by which we are able to justify induction (i.e. it seems to work when we've tried it), but also does nothing to guarantee it would work in another universe or even outside our Hubble volume.

comment by numpyNaN · 2024-02-26T12:00:39.278Z · LW(p) · GW(p)

In the ML example, generalization won't work when approximating a function which is a completely random jumble of points.

Nice article, minor question. You seem to be treating random functions as qualitatively different from regular/some-flavor-of-deterministic ones (please correct if not the case). Other than in mathematical settings, I'm not sure how that works, since you would expect some random noise in (or on top of) the data you are recording (and feeding your model), and that same noise would contaminate that determinism.

Also, when approximating a completely random jumble of points, can't you build models to infer the distribution from which those points are drawn? I get it won't be as accurate when predicting, but I fail to see why that's not an issue of degrees.

Replies from: martinsq

↑ comment by Martín Soto (martinsq) · 2024-03-06T19:58:14.143Z · LW(p) · GW(p)

By random I just meant "no simple underlying regularity explains it concisely". For example, a low-degree polynomial has a very short description length, while a random jumble of points doesn't (you need to write the points one by one). This of course already assumes a language.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-02-20T23:12:40.103Z · LW(p) · GW(p)

Yes. The ultimate theory of physics will be agent-centric. It never bottoms out in low-level physics, simulation of simulations all the way down. Dude.
