How should we model complex systems?

post by ClimateDoc (OxDoc) · 2020-04-12T21:23:30.893Z · LW · GW · 37 comments

This is a question post.

By "complex", I mean a system for which it would be too computationally costly to model it from first principles e.g. the economy, the climate (my field, by the way). Suppose our goal is to predict a system's future behaviour with minimum possible error given by some metric (e.g. minimise the mean square error or maximise the likelihood). This seems like something we would want to do in an optimal way, and also something a superintelligence should have a strategy to do, so I thought I'd ask here if anyone has worked on this problem.

I've read quite a bit about how we can optimally try to deduce the truth e.g. apply Bayes' theorem with a prior set following Ockham's razor (c.f. Solomonoff induction). However, this seems difficult to me to apply to modelling complex systems, even as an idealisation, because:

  1. Since we cannot afford to model the true equations, every member of the set of models available to us is false, so the likelihood and posterior probability for each will typically evaluate to zero given enough observed data. So if we want to use Bayes' theorem, the probabilities should not mean the probability of each model being true. But it's not clear to me what they should mean - perhaps the probability that each model will give the prediction with the lowest error? But then it's not clear how to do updating, if the normal likelihoods will typically be zero.

  2. It doesn't seem clear that Ockham's razor will be a good guide to giving our models prior probabilities. Its use seems to be motivated by it working well for deducing fundamental laws of nature. However, for modelling complex systems it seems more reasonable to me to give more weight to models that incorporate what we understand to be the important processes - and past observations can't necessarily help us tell what processes are important to include, because different processes may become important in future (c.f. biological feedbacks that may kick in as the climate warms). This could perhaps be done by having a strategy for deriving approximate affordable models from the fundamental laws - but is it possible to say anything about how an agent should do this?

I've not found anything about rational strategies to approximately model complex systems rather than derive true models. Thank you very much for any thoughts and resources you can share.

Answers

answer by Kenny · 2020-04-13T02:09:43.237Z · LW(p) · GW(p)

I think it's an open question whether we can generally model complex systems at all – at least in the sense of being able to make precise predictions about the detailed state of entire complex systems.

But there are still ways to make progress at modeling and predicting aspects of complex systems, e.g. aggregate info, dynamics, possible general states.

The detailed behavior of a macroscopic quantity of individual molecules is complex and impossible to predict in detail at the level of individual molecules, but we can accurately predict some things for some of these systems: the overall temperature, the relative quantities of different types of molecules, etc.

Some potentially complex systems exhibit behavior that is globally, or at some level, 'simple' in some way, i.e. relatively static or repetitive, nested, or random. This is where simple mathematical or statistical modeling works best.

Statistical mechanics and chemistry are good examples of this.

The hardest complex systems to model involve, at some level, an interplay of repetitive and random behavior. This often involves 'structures' whose individual history affects the global state of the system on long-enough timescales. Sometimes the only way to precisely predict the future of the detailed state of these kinds of systems is to simulate them exactly.

Biology, economics, and climatology are good examples of subjects that study these kinds of systems.

For the most complex systems, often the best we can do is predict the possible or probable presence of kinds or categories of dynamics or patterns of behavior. In essence, we don't try to model an entire individual complex system as a whole, in detail, but focus on modeling parts of a more general class of those kinds of systems.

This can be thought of as 'bottom-up' modeling. Some examples: modeling senescence, bank runs, or regional climate cycles.

I've not found anything about rational strategies to approximately model complex systems rather than derive true models.

I interpret "approximately model complex systems" as 'top-down' 'statistical' modeling – that can be useful regardless, even if it's wrong, but might be reasonably accurate if the system is relatively 'simple'. But if the system is complex to the 'worst' degree, then we need to "derive true models" for at least some parts of the system and approximately model the global system using something like a 'hierarchical' model built from 'smaller' models.

In full generality, answering this question demands a complete epistemology and decision theory!

For 'simple' complex systems, we may be able to predict their future fairly accurately. For the most complex systems, often we can only wait to discover their future states – in detail – but we may be able to predict some subset of the overall system (in time and 'space') in the interim.

comment by ClimateDoc (OxDoc) · 2020-04-13T10:13:54.544Z · LW(p) · GW(p)

Thanks for your reply. (I repeat my apology from below for apparently not being able to use formatting options in my browser here.)

"I think it's an open question whether we can generally model complex systems at all – at least in the sense of being able to make precise predictions about the detailed state of entire complex systems."

I agree modelling the detailed state is perhaps not possible. However, there are at least some complex systems we can model and get substantial positive skill at predicting particular variables without needing to model all the details e.g. the weather, for particular variables up to a particular amount of time ahead, and predictions of global mean warming made from the 1980s seem to have validated quite well so far (for decadal averages). So human minds seem to succeed at least sometimes, but without seeming to follow a particular algorithm. Presumably it's possible to do better, so my question is essentially how would an algorithm that could do better look?

I agree that statistical mechanics is one useful set of methods. But, thinking of the area of climate model development that I know something about, statistical averaging of fluid mechanics does form the backbone to modelling the atmosphere and oceans, but adding representations of processes that are missed by that has added a lot of value (e.g. tropical thunderstorms that are well below the spacing of the numerical grid over which the fluid mechanics equations were averaged). So there seems to be something additional to averaging that can be used, to do with coming up with simplified models of processes you can see are missed out by the averaging. It would be nice to have an algorithm for that, but that's probably asking for a lot...

"I interpret "approximately model complex systems" as 'top-down' 'statistical' modeling"

I didn't mean this to be top-down rather than bottom-up - it could follow whatever modelling strategy is determined to be optimal.

"answering this question demands a complete epistemology and decision theory!"

That's what I was worried about... (though, is decision theory relevant when we just want to predict a given system and maximise a pre-specified skill metric?)

Replies from: Kenny, habryka4
comment by Kenny · 2020-04-13T18:19:56.958Z · LW(p) · GW(p)

(I'm not sure if there are formatting options anymore in the site UI – formatting is (or can be) done via Markdown syntax. In your user settings, there's an "Activate Markdown Editor" option that you might want to test changing if you don't want to use Markdown directly.)

So human minds seem to succeed at least sometimes, but without seeming to follow a particular algorithm. Presumably it's possible to do better, so my question is essentially how would an algorithm that could do better look?

I think 'algorithm' is an imprecise term for this discussion. I don't think there are any algorithms similar to a prototypical example of a computational 'algorithm' that could possibly do a better job, in general, than human minds. In the 'expansive' sense of 'algorithm', an AGI could possibly do better, but we don't know how to build those yet!

statistical averaging of fluid mechanics does form the backbone to modelling the atmosphere and oceans, but adding representations of processes that are missed by that has added a lot of value (e.g. tropical thunderstorms that are well below the spacing of the numerical grid over which the fluid mechanics equations were averaged). So there seems to be something additional to averaging that can be used, to do with coming up with simplified models of processes you can see are missed out by the averaging. It would be nice to have an algorithm for that, but that's probably asking for a lot...

There might be algorithms that could indicate whether, or how likely, it is that a model is 'missing' something, but solving that problem generally would require access to the 'target' system like we have (i.e. by almost entirely living inside of it). If you think about something like using an (AI) 'learning' algorithm, you wouldn't expect it to be able to learn about aspects of the system that aren't provided to it as input. But how could we feasibly, or even in principle, provide the Earth's climate as input, i.e. what would we measure (and how would we do it)?

I interpret "approximately model complex systems" as 'top-down' 'statistical' modeling

I didn't mean this to be top-down rather than bottom-up - it could follow whatever modelling strategy is determined to be optimal.

What I was sketching was something like how we currently model complex systems. It can be very helpful to model a system top-down, e.g. statistically, by focusing on relatively simple global attributes. The inputs for fluid mechanics models of the climate are an example of that. Running those models is a mix of top-down and bottom-up. The model details are generated top-down, but studying the dynamics of those models in detail is more bottom-up.

answering this question demands a complete epistemology and decision theory!

That's what I was worried about... (though, is decision theory relevant when we just want to predict a given system and maximise a pre-specified skill metric?)

Any algorithm is in effect a decision theory. A general algorithm for modeling arbitrary complex systems would effectively make a vast number of decisions. I suspect finding or building a feasible and profitable algorithm like this will also effectively require "a complete epistemology and decision theory".

We already have a lot of fantastically effective tools for creating top-down models.

But when the best of the top-down models we can make aren't good enough, we might need to consider incorporating elements from bottom-up models that aren't already included in what we're measuring, and trying to predict, at the top-down level, e.g. including cyclones in a coarse fluid mechanics model. Note that cyclones are a perfect example of what I referred to as:

'structures' whose individual history affects the global state of the system.

We need good decision theories to know when to search for more or better bottom-up models. What are we missing? How should we search? (When should we give up?)

The name for 'algorithms' (in the expansive sense) that can do what you're asking is 'general intelligence'. But we're still working on understanding them!

Concretely, the fundamental problem with developing a general algorithm to "approximately model complex systems" is acquiring the necessary data to feed the algorithm as input. What's the minimum amount and type of data that we need to approximately model the Earth's climate (well enough)? If we don't already have that data, how can the general algorithm acquire it? In general, only a general intelligence that can act as an agent is capable of doing that (e.g. deciding what new measurements to perform and then actually performing them).

A vague and imprecise sketch of a general algorithm might be:

  1. Measure the complex system somehow.
  2. Use some 'learning algorithm' to generate an "approximate" model.
  3. Is the model produced in [2] good enough? Yes? Profit! (And stop executing this algorithm.) No? Continue.
  4. Try some combination of different ways to measure the system [1] and different learning algorithms to generate models [2].

Note that steps [1], [2], and [4] are, to varying degrees, bottom-up modeling. [4] tho also incorporates a heavy 'top-down' perspective, e.g. determining/estimating/guessing what is missing from that perspective.

[1] might involve modeling how well different 'levels' of the actual complex system can be modeled. Discovering 'structures' in some level is good evidence that additional measurements may be required to model levels above that.

[2] might involve discovering or developing new mathematical or computational theories and algorithms, i.e. info about systems in general.
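To make that a bit more concrete, here's a minimal Python sketch of the loop above (everything in it – the toy 'system', the candidate learners, and the 'good enough' threshold – is an invented placeholder, not a real modeling pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def measure(n):
    """[1] Measure the system somehow (toy stand-in: a noisy linear trend)."""
    t = np.arange(n)
    return t, 0.1 * t + rng.normal(0.0, 1.0, n)

def fit_constant(t, y):
    """[2a] One candidate learner: ignore t, predict the historical mean."""
    return lambda t_new: np.full_like(t_new, y.mean(), dtype=float)

def fit_linear(t, y):
    """[2b] Another candidate learner: least-squares linear trend."""
    slope, intercept = np.polyfit(t, y, 1)
    return lambda t_new: slope * t_new + intercept

def mse(model, t, y):
    """[3] Skill metric on held-out measurements."""
    return float(np.mean((model(t) - y) ** 2))

def search(learners, sample_sizes, tolerance=1.5):
    """[4] Try combinations of measurement effort and learning algorithms."""
    for n in sample_sizes:
        t_fit, y_fit = measure(n)      # training measurements
        t_new, y_new = measure(n)      # fresh held-out measurements
        for learner in learners:
            model = learner(t_fit, y_fit)
            if mse(model, t_new, y_new) < tolerance:
                return model           # good enough: profit, stop
    return None                        # nothing qualified: widen the search

best = search([fit_constant, fit_linear], sample_sizes=[20, 200])
```

In practice, of course, [1] is the expensive and open-ended part – you can't just call a `measure()` function on the Earth.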

Replies from: OxDoc
comment by ClimateDoc (OxDoc) · 2020-04-13T20:04:38.413Z · LW(p) · GW(p)

Thanks again. OK I'll try using MarkDown...

I think 'algorithm' is an imprecise term for this discussion.

Perhaps I used the term imprecisely - I basically meant it in a very general sense of being some process, set of rules etc. that a computer or other agent could follow to achieve the goal.

We need good decision theories to know when to search for more or better bottom-up models. What are we missing? How should we search? (When should we give up?)

The name for 'algorithms' (in the expansive sense) that can do what you're asking is 'general intelligence'. But we're still working on understanding them!

Yes I see the relevance of decision theories there and that solving this well would be requiring a lot of what would be needed for AGI. I guess when I originally asked, I was wondering if there might have been some insights people had worked out on the way to that - just any parts of such an algorithm that people have figured out, or that at least would reduce the error of a typical scientist. But maybe that will be another while yet...

I think you're right that such an algorithm would need to make measurements of the real system, or systems with properties matching component parts (e.g. a tank of air for climate), and have some way to identify the best measurements to make. I guess determining whether there is some important effect that's not been accounted for yet would require a certain amount of random experimentation to be done (e.g. for climate, heating up patches of land and tanks of ocean water by a few degrees and seeing what happens to the ecology, just as we might do).

This is not necessarily impractical for something like atmospheric or oceanic modelling, where we can run trustworthy high-resolution models over small spatial regions and get data on how things change with different boundary conditions, so we can tell how the coarse models should behave. So then criteria for deciding where and when to run these simulations would be needed. Regions where errors compared to Earth observations are large and regions that exhibit relatively large changes with global warming could be a high priority. I'd have to think if there could be a sensible systematic way of doing it - I guess it would require an estimate of how much the metric of future prediction skill would decrease with information gained from a particular experiment, which could perhaps be approximated using the sensitivity of the future prediction to the estimated error or uncertainty in predictions of a particular variable. I'd need to think about that more.

Replies from: Kenny
comment by Kenny · 2020-04-14T00:10:17.788Z · LW(p) · GW(p)

I was wondering if there might have been some insights people had worked out on the way to that - just any parts of such an algorithm that people have figured out, or that at least would reduce the error of a typical scientist.

There are some pretty general learning algorithms, and even 'meta-learning' algorithms in the form of tools that attempt to more or less automatically discover the best model (among some number of possibilities). Machine learning hyper-parameter optimization is an example in that direction.

My outside view is that a lot of scientists should focus on running better experiments. According to a possibly apocryphal story told by Richard Feynman in a commencement address, one researcher discovered (at least some of) the controls one had to employ to be able to effectively study mice running mazes. Unfortunately, no one else bothered to employ those controls (let alone look for others)! Similarly, a lot of scientific studies or experiments are simply too small to produce even reliable statistical info. There's probably a lot of such low hanging fruit available. Tho note that this is often a 'bottom-up' contribution for 'modeling' a larger complex system.

But as you demonstrate in your last two paragraphs, searching for a better 'ontology' for your models, e.g. deciding what else to measure, or what to measure instead, is a seemingly open-ended amount of work! There probably isn't a way to avoid having to think about it more (beyond making other kinds of things that can think for us); until you find an ontology that's 'good enough' anyways. Regardless, we're very far from being able to avoid even small amounts of this kind of work.

comment by habryka (habryka4) · 2020-04-13T21:28:09.232Z · LW(p) · GW(p)

[Meta] Curious what browser you are using, so I can figure out whether anyone else has this problem.

Replies from: OxDoc
comment by ClimateDoc (OxDoc) · 2020-04-14T20:38:37.903Z · LW(p) · GW(p)

I'm using Chrome 80.0.3987.163 in Mac OSX 10.14.6. But I also tried it in Firefox and didn't get formatting options. But maybe I'm just doing the wrong thing...

Replies from: habryka4
comment by habryka (habryka4) · 2020-04-14T20:51:56.225Z · LW(p) · GW(p)

You do currently have the markdown editor activated, which gets rid of all formatting options, so you not getting it right now wouldn't surprise me. But you should have gotten them before you activated the markdown editor.

Replies from: OxDoc
comment by ClimateDoc (OxDoc) · 2020-04-14T20:56:48.522Z · LW(p) · GW(p)

Yes I'd selected that because I thought it might get it to work. And now I've unselected it, it seems to be working. It's possible this was a glitch somewhere or me just being dumb before I guess.

Replies from: habryka4
comment by habryka (habryka4) · 2020-04-14T21:05:51.762Z · LW(p) · GW(p)

Huh, okay. Sorry for the weird experience! 

37 comments

Comments sorted by top scores.

comment by habryka (habryka4) · 2020-04-12T21:31:59.988Z · LW(p) · GW(p)

This is a pretty good question, but there is a general norm on LessWrong to only use the word "rational" when it really can't be avoided. See: 

Only say 'rational' when you can't eliminate the word [LW · GW]

Replies from: OxDoc
comment by ClimateDoc (OxDoc) · 2020-04-13T08:40:27.584Z · LW(p) · GW(p)

OK, I made some edits. I left the "rational" in the last paragraph because it seemed to me to be the best word to use there.

comment by Yandong Zhang (yandong-zhang) · 2020-04-14T00:37:00.645Z · LW(p) · GW(p)

A lesson from the last 30 years of AI development: data and computation power are the key factors of improvement.

Thus, IMPHO, for obtaining a better model, the most reliable approach is to get more data.

Replies from: johnswentworth, Pattern, Kenny
comment by johnswentworth · 2020-04-13T16:44:54.230Z · LW(p) · GW(p)

I'd push back on this pretty strongly: data and computation power, devoid of principled modelling, have historically failed very badly to make forward-looking predictions, especially in economics. That was exactly the topic of the famous Lucas critique. The main problem is causality: brute-force models usually just learn distributions, so they completely fail when distributions shift.

Replies from: yandong-zhang
comment by Yandong Zhang (yandong-zhang) · 2020-04-15T15:50:18.693Z · LW(p) · GW(p)

If a researcher were given 1000X more data and 1000X CPU power, would he switch to a brute-force approach? I don't see the connection between "data and computation power" and brute-force models.

Replies from: johnswentworth
comment by johnswentworth · 2020-04-15T16:25:56.507Z · LW(p) · GW(p)

A simple toy model: roll a pair of dice many, many times. If we have a sufficiently large amount of data and computational power, then we can brute-force fit the distribution of outcomes - i.e. we can count how many times each pair of numbers is rolled, estimate the distribution of outcomes based solely on that, and get a very good fit to the distribution.

By contrast, if we have only a small amount of data/compute, we need to be more efficient in order to get a good estimate of the distribution. We need a prior which accounts for the fact that there are two dice whose outcomes are probably roughly independent, or that the dice are probably roughly symmetric. Leveraging that model structure is more work for the programmer - we need to code that structure into the model, and check that it's correct, and so forth - but it lets us get good results with less data/compute.

So naturally, given more data/compute, people will avoid that extra modelling/programming work and lean towards more brute-force models - especially if they're just measuring success by fit to their data.

But then, the distribution shifts - maybe one of the dice is swapped out for a weighted die. Because our brute force model has no internal structure, it doesn't have a way to re-use its information. It doesn't have a model of "two dice", it just has a model of "distribution of outcomes" - there's no notion of some outcomes corresponding to the same face on one of the two dice. But the more principled model does have that internal structure, so it can naturally re-use the still-valid subcomponents of the model when one subcomponent changes.
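Here's a minimal Python sketch of that toy example (the sample sizes and the weighted die's face probabilities are just made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def roll(n, p1, p2):
    """Roll two dice n times with per-face probabilities p1 and p2."""
    return rng.choice(6, size=n, p=p1), rng.choice(6, size=n, p=p2)

fair = np.ones(6) / 6
weighted = np.array([0.05, 0.05, 0.1, 0.1, 0.2, 0.5])  # illustrative weighting

# Lots of data from two fair dice.
d1, d2 = roll(100_000, fair, fair)

# Brute-force model: empirical joint distribution over all 36 outcomes.
joint = np.zeros((6, 6))
np.add.at(joint, (d1, d2), 1)
joint /= joint.sum()

# Structured model: two independent per-die marginals.
m1 = np.bincount(d1, minlength=6) / len(d1)
m2 = np.bincount(d2, minlength=6) / len(d2)

# Distribution shift: die 2 is swapped for a weighted die, and we only
# get a small amount of data afterwards.
e1, e2 = roll(200, fair, weighted)

# Structured model: re-use the still-valid model of die 1, re-estimate die 2.
m2_new = np.bincount(e2, minlength=6) / len(e2)
structured = np.outer(m1, m2_new)

# Brute-force model: no notion of "die 2", so re-fit all 36 probabilities.
brute = np.zeros((6, 6))
np.add.at(brute, (e1, e2), 1)
brute /= brute.sum()

truth = np.outer(fair, weighted)
print("structured total-variation error:", 0.5 * np.abs(structured - truth).sum())
print("brute-force total-variation error:", 0.5 * np.abs(brute - truth).sum())
```

After the swap, the structured model only has to re-estimate six numbers from the small new sample, so its error is typically several times smaller than the brute-force refit of all 36 outcome probabilities.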

Conversely, additional data/compute doesn't really help us make our models more principled - that's mainly a problem of modelling/programming which currently needs to be handled by humans. To the extent that generalizability is the limiting factor to usefulness of models, additional data/compute alone doesn't help much - and indeed, despite the flagship applications in vision and language, most of today's brute-force-ish deep learning models do generalize very poorly.

comment by Pattern · 2020-04-13T14:59:44.548Z · LW(p) · GW(p)

This would make a good answer.

comment by Kenny · 2020-04-13T18:41:35.157Z · LW(p) · GW(p)

A lot of AI development has been in relatively 'toy' domains – compared to modeling the Earth's climate!

Sometimes what is needed beyond just more data (of the same type) is a different type of data.

comment by johnswentworth · 2020-04-13T00:48:29.530Z · LW(p) · GW(p)

It is rarely too difficult to specify the true model (or a space of models containing the true model). What's hard is updating on less-than-fully-informative evidence or, in some cases, even computing what the true model predicts at all (i.e. likelihoods). So when we say it is "too costly to model from first principles", we should keep in mind that we don't mean the true model space can't even be written down efficiently. In particular, this means that "every member of the set of models available to us is false" need not hold. Similarly, Bayesian probability and Ockham's razor and whatnot can still apply, but we need efficient approximations.

(Side note: "different processes may become important in future" is not actually a problem for Ockham's razor per se. That's a problem for causal models [LW · GW], and Bayesian probability + Ockham's razor are quite capable of learning causal models.)

(Another side note: likelihoods are never actually zero, they're just very small. But likelihoods are very small for any large amount of data anyway, so there's nothing unusual about that; a model space which doesn't contain the true model isn't really a problem from that perspective.)

If we want to attack these sorts of problems rigorously from first principles, then the central challenge is to find rigorous approximations of the true underlying models. The main field I know of which studies this sort of problem directly is statistical mechanics, and a number of reasonably-general-purpose tools exist in that field which could potentially be applied in other areas (e.g. this [LW · GW]). Actually developing those applications, however, is an area of active research.

That said... when I look at the history of failure of "statistical", non-first-principles models in various fields (especially economics), it looks like they mainly fail because they don't handle causality properly. That makes sense - the theory of causality is a relatively recent development, so of course 20th-century stats people built models which failed to handle it. Armed with modern tools, it's entirely plausible that we can handle causality well without having to ground everything in first-principles.

Replies from: OxDoc, Kenny
comment by ClimateDoc (OxDoc) · 2020-04-13T09:35:45.007Z · LW(p) · GW(p)

Thanks for your detailed reply. (And sorry I couldn't format the below well - I don't seem to get any formatting options in my browser.)

"It is rarely too difficult to specify the true model...this means that "every member of the set of models available to us is false" need not hold"

I agree we could find a true model to explain the economy, climate etc. (presumably the theory of everything in physics). But we don't have the computational power to make predictions of such systems with that model - so my question is about how should we make predictions when the true model is not practically applicable? By "the set of models available to us", I meant the models we could actually afford to make predictions with. If the true model is not in that set, then it seems to be that all of these models must be false.

'"different processes may become important in future" is not actually a problem for Ockham's razor per se. That's a problem for causal models'

To take the climate example, say scientists had figured out that there were a biological feedback that kicks in once global warming has gone past 2C (e.g. bacteria become more efficient at decomposing soil and releasing CO2). Suppose you have one model that includes a representation of that feedback (e.g. as a subprocess) and one that does not but is equivalent in every other way (e.g. is coded like the first model but lacks the subprocess). Then isn't the second model simpler according to metrics like the minimum description length, so that it would be weighted higher if we penalised models using such metrics? But this seems the wrong thing to do, if we think the first model is more likely to give a good prediction.
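To make my worry concrete, here's a toy sketch (with completely invented numbers) of two such models: they fit the below-2C "observations" equally well, so a complexity penalty scored only against past data would favour the shorter one, yet they diverge exactly where it matters:

```python
import numpy as np

def warming_simple(forcing):
    """Model A: temperature responds linearly to forcing."""
    return 0.8 * forcing

def warming_with_feedback(forcing, threshold=2.0, extra=0.5):
    """Model B: same as A, plus a feedback that kicks in above the threshold."""
    t = 0.8 * forcing
    return np.where(t > threshold, t + extra * (t - threshold), t)

rng = np.random.default_rng(0)
past_forcing = np.linspace(0.0, 2.0, 50)            # past: warming stays below 2C
observations = 0.8 * past_forcing + rng.normal(0.0, 0.05, 50)

for name, model in [("simple", warming_simple), ("feedback", warming_with_feedback)]:
    rmse = np.sqrt(np.mean((model(past_forcing) - observations) ** 2))
    print(name, "past RMSE:", round(rmse, 3))        # essentially identical fits

future_forcing = 4.0
print("simple prediction:  ", warming_simple(future_forcing))
print("feedback prediction:", warming_with_feedback(future_forcing))
```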

Now the thought that occurred to me when writing that is that the data the scientists used to deduce the existence of the feedback ought to be accounted for by the models that are used, and this would give low posterior weight to models that don't include the feedback. But doing this in practice seems hard. Also, it's not clear to me if there would be a way to tell between models that represent the process but don't connect it properly to predicting the climate e.g. they have a subprocess that says more CO2 is produced by bacteria at warming higher than 2C, but then don't actually add this CO2 to the atmosphere, or something.

"likelihoods are never actually zero, they're just very small"

If our models were deterministic, then if they were not true, wouldn't it be impossible for them to produce the observed data exactly, so that the likelihood of the data given any of those models would be zero? (Unless there was more than one process that could give rise to the same data, which seems unlikely in practice.) Now if we make the models probabilistic and try to design them such that there is a non-zero chance that the data would be a possible sample from the model, then the likelihood can be non-zero. But it doesn't seem necessary to do this - models that are false can still give predictions that are useful for decision-making. Also, it's not clear if we could make a probabilistic model that would have non-zero likelihoods for something as complex as the climate that we could run on our available computers (and that isn't something obviously of low value for prediction like just giving probability 1/N to each of N days of observed data). So it still seems like it would be valuable to have a principled way of predicting using models that give a zero likelihood of the data.

"the central challenge is to find rigorous approximations of the true underlying models. The main field I know of which studies this sort of problem directly is statistical mechanics, and a number of reasonably-general-purpose tools exist in that field which could potentially be applied in other areas (e.g. this)."

Yes I agree. Thanks for the link - it looks very relevant and I'll check it out. Edit - I'll just add, echoing part of my reply to Kenny's answer, that whilst statistical averaging has got human modellers a certain distance, adding representations of processes whose effects get missed by the averaging seems to add a lot of value (e.g. tropical thunderstorms in the case of climate). So there seems to be something additional to averaging that can be used, to do with coming up with simplified models of processes you can see are missed out by the averaging.

On causality, whilst of course correcting this is desirable, if the models we can afford to compute with can't reproduce the data, then presumably they are also not reproducing the correct causal graph exactly? And any causal graph we could compute with will not be able to reproduce the data? (Else it would seem that a causal graph could somehow hugely compress the true equations without information loss - great if so!)

Replies from: johnswentworth, johnswentworth
comment by johnswentworth · 2020-04-13T18:33:44.090Z · LW(p) · GW(p)

Side note: one topic I've been reading about recently which is directly relevant to some of your examples (e.g. thunderstorms) is multiscale modelling. You might find it interesting.

Replies from: OxDoc
comment by ClimateDoc (OxDoc) · 2020-04-13T20:36:47.708Z · LW(p) · GW(p)

Thanks, yes this is very relevant to thinking about climate modelling, with the dominant paradigm being that we can separately model phenomena above and below the resolved scale - there's an ongoing debate, though, about whether a different approach would work better, and it gets tricky when the resolved scale gets close to the size of important types of weather system.

comment by johnswentworth · 2020-04-13T17:47:08.703Z · LW(p) · GW(p)
To take the climate example, say scientists had figured out that there were a biological feedback that kicks in once global warming has gone past 2C (e.g. bacteria become more efficient at decomposing soil and releasing CO2). Suppose you have one model that includes a representation of that feedback (e.g. as a subprocess) and one that does not but is equivalent in every other way (e.g. is coded like the first model but lacks the subprocess). Then isn't the second model simpler according to metrics like the minimum description length, so that it would be weighted higher if we penalised models using such metrics? But this seems the wrong thing to do, if we think the first model is more likely to give a good prediction.

The trick here is that the data on which the model is trained/fit has to include whatever data the scientists used to learn about that feedback loop in the first place. As long as that data is included, the model which accounts for it will have lower minimum description length. (This fits in with a general theme: the minimum-complexity model is simple and general-purpose; the details are learned from the data.)

Now the thought that occurred to me when writing that is that the data the scientists used to deduce the existence of the feedback ought to be accounted for by the models that are used, and this would give low posterior weight to models that don't include the feedback. But doing this in practice seems hard.

... I'm responding as I read. Yup, exactly. As the Bayesians say, we do need to account for all our prior information if we want reliably good results. In practice, this is "hard" in the sense of "it requires significantly more complicated programming", but not in the sense of "it increases the asymptotic computational complexity". The programming is more complicated mainly because the code needs to accept several qualitatively different kinds of data, and custom code is likely needed for hooking up each of them. But that's not a fundamental barrier; it's still the same computational challenges which make the approach impractical.
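As a very stripped-down sketch of what "hooking up each kind of data" means (the parameters, observation operators and noise levels below are all invented): each qualitatively different dataset contributes its own likelihood term, and they all sum into one log-posterior over the shared parameters, so the lab data that revealed the feedback ends up constraining the same parameters the climate record does.

```python
import numpy as np

def log_prior(params):
    """Broad Gaussian prior over the shared model parameters."""
    return -0.5 * np.sum(params ** 2 / 10.0)

def gaussian_loglik(predicted, observed, sigma):
    """Gaussian measurement-error likelihood for one dataset."""
    return -0.5 * np.sum(((predicted - observed) / sigma) ** 2)

def log_posterior(params, datasets):
    """Prior plus one likelihood term per qualitatively different kind of data."""
    total = log_prior(params)
    for observation_operator, observed, sigma in datasets:
        total += gaussian_loglik(observation_operator(params), observed, sigma)
    return total

# Invented example: params = (overall sensitivity, soil-feedback strength).
datasets = [
    # A temperature record mostly constrains the first parameter.
    (lambda p: p[0] * np.array([0.5, 1.0, 1.5]), np.array([0.4, 0.9, 1.6]), 0.1),
    # A lab soil-warming experiment mostly constrains the second parameter.
    (lambda p: p[1] * np.array([1.0, 2.0]), np.array([0.55, 1.1]), 0.05),
]

print(log_posterior(np.array([1.0, 0.5]), datasets))
```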

it's not clear to me if there would be a way to tell between models that represent the process but don't connect it properly to predicting the climate...

Again, we need to include whatever data allowed scientists to connect it to the climate in the first place. (In some cases this is just fundamental physics, in which case it's already in the model.)

If our models were deterministic, then if they were not true, wouldn't it be impossible for them to produce the observed data exactly, so that the likelihood of the data given any of those models would be zero? (Unless there was more than one process that could give rise to the same data, which seems unlikely in practice.)

Picture a deterministic model which uses fundamental physics, and models the joint distribution of position and momentum of every atom comprising the Earth. The unknown in this model is the initial conditions - the initial position and momentum of every particle (also particle identity, i.e. which element/isotope each is, but we'll ignore that). Now, imagine how many of the possible initial conditions are compatible with any particular high-level data we observe. It's a massive number!

Point is: the deterministic part of a model of a fundamental physical model is the dynamics; the initial conditions are still generally unknown. Conceptually, when we fit the data, we're mostly looking for initial conditions which match. So zero likelihoods aren't really an issue; the issue is computing with a joint distribution over position and momentum of so many particles. That's what statistical mechanics is for.

whilst statistical averaging has got human modellers a certain distance, adding representations of processes whose effects get missed by the averaging seems to add a lot of value

The corresponding problem in statistical mechanics is to identify the "state variables" - the low-level variables whose averages correspond to macroscopic observables. For instance, the ideal gas law uses density, kinetic energy, and force on container surfaces (whose macroscopic averages correspond to density, temperature, and pressure). Fluid flow, rather than averaging over the whole system, uses density and particle velocity within each little cell of space.
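As a tiny worked example of "macroscopic observables from averages" (monatomic ideal gas; the particle count and volume are arbitrary toy values): temperature and pressure fall straight out of an average over microscopic velocities, without tracking which particle is which.

```python
import numpy as np

k_B = 1.380649e-23          # Boltzmann constant, J/K
m = 6.646e-27               # approximate mass of a helium atom, kg
n_particles = 1_000_000     # toy sample of the gas
volume = 1e-3               # 1 litre, in m^3

rng = np.random.default_rng(0)
# Draw velocities from a Maxwell-Boltzmann distribution at 300 K.
sigma = np.sqrt(k_B * 300.0 / m)
velocities = rng.normal(0.0, sigma, size=(n_particles, 3))

# Temperature from the average kinetic energy: (3/2) k_B T = <(1/2) m v^2>
mean_kinetic_energy = 0.5 * m * np.mean(np.sum(velocities ** 2, axis=1))
temperature = 2.0 * mean_kinetic_energy / (3.0 * k_B)

# Pressure from the ideal gas law, treating the sample as the whole gas.
pressure = n_particles * k_B * temperature / volume

print(f"T = {temperature:.1f} K, P = {pressure:.3e} Pa")
```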

The point: if an effect is "missed by averaging", that's usually not inherent to averaging as a technique. The problem is that people average over poorly-chosen features.

Jaynes argued that the key to choosing high-level features is reproducibility: what high-level variables do experimenters need to control in order to get a consistent result distribution? If we consistently get the same results without holding X constant (where X includes e.g. initial conditions of every particle), then apparently X isn't actually relevant to the result, so we can average out X. Also note that there's some degrees of freedom in what "results" we're interested in. For instance, turbulence has macroscopic behavior which depends on low-level initial conditions, but the long-term time average of forces from a turbulent flow usually doesn't depend on low-level initial conditions - and for engineering purposes, it's often that time average which we actually care about.

if the models we can afford to compute with can't reproduce the data, then presumably they are also not reproducing the correct causal graph exactly? And any causal graph we could compute with will not be able to reproduce the data?

Once we move away from stat mech and approximations of low-level models, yes, this becomes a problem. However, two counterpoints. First, this is the sort of problem where the output says "well, the best model is one with like a gazillion edges, and there's a bunch that all fit about equally well, so we have no idea what will happen going forward". That's unsatisfying, but at least it's not wrong. Second, if we do get that sort of result, then it probably just isn't possible to do better with the high-level variables chosen. Going back to reproducibility and selection of high-level variables: if we've omitted some high-level variable which really does impact the results we're interested in, then "we have no idea what will happen going forward" really is the right answer.

Replies from: OxDoc
comment by ClimateDoc (OxDoc) · 2020-04-13T20:25:58.830Z · LW(p) · GW(p)

Thanks again.

I think I need to think more about the likelihood issue. I still feel like we might be thinking about different things - when you say "a deterministic model which uses fundamental physics", this would not be in the set of models that we could afford to run to make predictions for complex systems. For the models we could afford to run, it seems to me that no choice of initial conditions would lead them to match the data we observe, except by extreme coincidence (analogous to a simple polynomial just happening to pass through all the datapoints produced by a much more complex function).

I've gone through Jaynes' paper now from the link you gave. His point about deciding what macroscopic variables matter is well-made. But you still need a model of how the macroscopic variables you observe relate to the ones you want to predict. In modelling atmospheric processes, simple spatial averaging of the fluid dynamics equations over resolved spatial scales gets you some way, but then changing the form of the function relating the future to present states ("adding representations of processes" as I put it before) adds additional skill. And Jaynes' paper doesn't seem to say how you should choose this function.

Replies from: johnswentworth
comment by johnswentworth · 2020-04-15T17:18:52.634Z · LW(p) · GW(p)
For the models we could afford to run, it seems to me that no choice of initial conditions would lead them to match the data we observe, except by extreme coincidence (analogous to a simple polynomial just happening to pass through all the datapoints produced by a much more complex function).

Ok, let's talk about computing with error bars, because it sounds like that's what's missing from what you're picturing.

The usual starting point is linear error - we assume that errors are small enough for linear approximation to be valid. (After this we'll talk about how to remove that assumption.) We have some multivariate function $f(x)$ - imagine that $x$ is the full state of our simulation at some timestep, and $f$ calculates the state at the next timestep. The value of $x$ in our program is really just an estimate of the "true" value $x^*$; it has some error $\Delta x = x - x^*$. As a result, the value of $f(x)$ in our program also has some error $\Delta f$. Assuming the error is small enough for linear approximation to hold, we have:

$$\Delta f \approx \frac{df}{dx} \Delta x$$

where $\frac{df}{dx}$ is the Jacobian, i.e. the matrix of derivatives of every entry of $f$ with respect to every entry of $x$.

Next, assume that $\Delta x$ has covariance matrix $\Sigma_x$, and we want to compute the covariance matrix $\Sigma_f$ of $\Delta f$. We have a linear relationship between $\Delta f$ and $\Delta x$, so we use the usual formula for linear transformation of covariance:

$$\Sigma_f = \frac{df}{dx} \Sigma_x \left(\frac{df}{dx}\right)^T$$

Now imagine iterating this at every timestep: we compute the timestep itself, then differentiate that timestep, and matrix-multiply our previous uncertainty on both sides by the derivative matrix to get the new uncertainty:

$$\Sigma_{t+1} = \frac{df}{dx}\bigg|_{x_t} \Sigma_t \left(\frac{df}{dx}\bigg|_{x_t}\right)^T$$

Now, a few key things to note:

  • For most systems of interest, that uncertainty is going to grow over time, usually exponentially. That's correct: in a chaotic system, if the initial conditions are uncertain, then of course we should become more and more uncertain about the system's state over time.
  • Those formulas only propagate uncertainty in the previous state to uncertainty in the next state. Really, there's also new uncertainty introduced at each timestep, e.g. from error in $f$ itself (i.e. due to averaging) or from whatever's driving the system. Typically, such errors are introduced as an additive term - i.e. we compute the covariance in $\Delta f$ introduced by each source of error, and add them to the propagated covariance matrix at each timestep.
  • Actually storing the whole covariance matrix would take $O(n^2)$ space if $x$ has $n$ elements, which is completely impractical when $x$ is the whole state of a finite element simulation. We make this practical the same way we make all matrix operations practical in numerical computing: exploit sparsity/structure. This is application-specific, but usually the covariance can be well-approximated as the sum of sparse "local" covariances and low-rank "global" covariances.
  • Likewise with the update: we don't actually want to compute the $n$-by-$n$ derivative matrix and then matrix-multiply with the covariance. Most backpropagation libraries expose the derivative as a linear operator rather than an explicit matrix, and we want to use it that way. Again, specifics will vary, depending on the structure of $f$ and of the (approximated) covariance matrix.
  • In many applications, we have data coming in over time. That data reduces our uncertainty every time it comes in - at that point, we effectively have a Kalman filter. If enough data is available, the uncertainty remains small enough for the linear approximation to continue to hold, and whole thing works great.
  • If the uncertainty does become too large for linear approximation, then we need to resort to other methods for representing uncertainty, rather than just a covariance matrix. Particle filters are one simple-but-effective fallback, and can be combined with linear uncertainty as well.
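For concreteness, here's a bare-bones numerical version of that propagation loop in Python – a toy two-variable system standing in for the simulator, a finite-difference Jacobian instead of backpropagation, and an invented process-noise term:

```python
import numpy as np

def step(x, dt=0.1):
    """One timestep of a toy pendulum-like system (stand-in for the simulator f)."""
    x0, x1 = x
    return np.array([x0 + dt * x1, x1 - dt * np.sin(x0)])

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x (df/dx as an explicit matrix)."""
    n = len(x)
    J = np.zeros((n, n))
    fx = f(x)
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        J[:, i] = (f(x + dx) - fx) / eps
    return J

x = np.array([0.1, 0.0])          # current state estimate
Sigma = np.eye(2) * 1e-4          # current covariance (uncertainty) of that estimate
Q = np.eye(2) * 1e-6              # new uncertainty injected each timestep (invented)

for t in range(100):
    J = jacobian(step, x)
    x = step(x)                   # propagate the state estimate
    Sigma = J @ Sigma @ J.T + Q   # propagate (and grow) the uncertainty

print("state:", x)
print("std devs:", np.sqrt(np.diag(Sigma)))
```

A Kalman filter is just this loop plus a measurement-update step whenever data arrives.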

In general, if this sounds interesting and you want to know more, it's covered in a lot of different contexts. I first saw most of it in an autonomous vehicles course; besides robotics, it's also heavily used in economic models, and sometimes systems/control theory courses will focus on this sort of stuff.

Is this starting to sound like a model for which the observed data would have nonzero probability?

Replies from: OxDoc
comment by ClimateDoc (OxDoc) · 2020-04-15T18:13:29.378Z · LW(p) · GW(p)

Do you mean you'd be adding the probability distribution with that covariance matrix on top of the mean prediction from f, to make it a probabilistic prediction? I was talking about deterministic predictions before, though my text doesn't make that clear. For probabilistic models, yes, adding an uncertainty distribution may result in non-zero likelihoods. But if we know the true dynamics are deterministic (pretend there's no quantum effects, which are largely irrelevant for our prediction errors for systems in the classical physics domain), then we still know the model is not true, and so it seems difficult to interpret p if we were to do Bayesian updating.

Likelihoods are also not obviously (to me) very good measures of model quality for chaotic systems, either - in these cases we know that even if we had the true model, its predictions would diverge from reality due to errors in the initial condition estimates, but it would trace out the correct attractor - and it's the attractor geometry (conditional on boundary conditions) that it seems we'd really want to assess. Perhaps then it would have a higher likelihood than every other model, but it's not obvious to me, and it's not obvious that there's not a better metric for leading to good inferences when we don't have the true model.

Basically the logic that says to use Bayes for deducing the truth does not seem to carry over in an obvious way (to me) to the case when we want to predict but can't use the true model.

Replies from: johnswentworth
comment by johnswentworth · 2020-04-15T18:43:10.096Z · LW(p) · GW(p)
But if we know the true dynamics are deterministic (pretend there's no quantum effects, which are largely irrelevant for our prediction errors for systems in the classical physics domain), then we still know the model is not true, and so it seems difficult to interpret p if we were to do Bayesian updating.

Ah, that's where we need to apply more Bayes. The underlying system may be deterministic at the macroscopic level, but that does not mean we have perfect knowledge of all the things which effect the system's trajectory. Most of the uncertainty in e.g. a weather model would not be quantum noise, it would be things like initial conditions, measurement noise (e.g. how close is this measurement to the actual average over this whole volume?), approximation errors (e.g. from discretization of the dynamics), driving conditions (are we accounting for small variations in sunlight or tidal forces?), etc. The true dynamics may be deterministic, but that doesn't mean that our estimates of all the things which go into those dynamics have no uncertainty. If the inputs have uncertainty (which of course they do), then the outputs also have uncertainty.

The main point of probabilistic models is not to handle "random" behavior in the environment, it's to quantify uncertainty resulting from our own (lack of) knowledge of the system's inputs/parameters.

Likelihoods are also not obviously (to me) very good measures of model quality for chaotic systems, either - in these cases we know that even if we had the true model, its predictions would diverge from reality due to errors in the initial condition estimates, but it would trace out the correct attractor...

Yeah, you're pointing to an important issue here, although it's not actually likelihoods which are the problem - it's point estimates. In particular, that makes linear approximations a potential issue, since they're implicitly approximations around a point estimate. Something like a particle filter will do a much better job than a Kalman filter at tracing out an attractor, since it accounts for nonlinearity much better.

Anyway, reasoning with likelihoods and posterior distributions remains valid regardless of whether we're using point estimates. When the system is chaotic but has an attractor, the posterior probability of the system state will end up smeared pretty evenly over the whole attractor. (Although with enough fine-grained data, we can keep track of roughly where on the attractor the system is at each time, which is why Kalman-type models work well in that case.)
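A minimal bootstrap particle filter on a chaotic toy map (the logistic map; all the noise levels are invented) shows the behavior I mean – the particle cloud smears out between observations and collapses back toward the true state whenever data arrives:

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(x):
    """Chaotic toy dynamics: the logistic map."""
    return 3.9 * x * (1.0 - x)

# "True" trajectory and noisy observations of it.
true_x, obs_sigma = 0.3, 0.05
observations = []
for _ in range(50):
    true_x = dynamics(true_x)
    observations.append(true_x + rng.normal(0.0, obs_sigma))

# Bootstrap particle filter.
n_particles = 2000
particles = rng.uniform(0.0, 1.0, n_particles)   # initial uncertainty: anywhere on [0, 1]

for y in observations:
    # Predict: push every particle through the dynamics, plus a little process noise.
    particles = np.clip(dynamics(particles) + rng.normal(0.0, 0.01, n_particles), 0.0, 1.0)
    # Update: weight each particle by the likelihood of the observation.
    weights = np.exp(-0.5 * ((y - particles) / obs_sigma) ** 2)
    weights /= weights.sum()
    # Resample: keep particles in proportion to their weights.
    particles = rng.choice(particles, size=n_particles, p=weights)

print("posterior mean:", particles.mean(), " true state:", true_x)
```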

comment by Kenny · 2020-04-13T01:10:31.151Z · LW(p) · GW(p)

So when we say it is "too costly to model from first principles", we should keep in mind that we don't mean the true model space can't even be written down efficiently.

I'm confused. Are you really claiming that modeling the Earth's climate can be written down "efficiently"? What exactly do you mean by 'efficiently'? What would a sketch of an efficient description of the "true model space" for the Earth's climate be?

Replies from: johnswentworth
comment by johnswentworth · 2020-04-13T01:46:09.731Z · LW(p) · GW(p)

Extreme answer: just point AIXI at wikipedia. That's a bit tongue-in-cheek, but it illustrates the concepts well. The actual models (i.e. AIXI) can be very general and compact; rather than AIXI, a specification of low-level physics would be a more realistic model to use for climate. Most of the complexity of the system is then learned from data - i.e. historical weather data, a topo map of the Earth, composition of air/soil/water samples, etc. An exact Bayesian update of a low-level physical model on all that data should be quite sufficient to get a solid climate model; it wouldn't even take an unrealistic amount of data (data already available online would likely suffice). The problem is that we can't efficiently compute that update, or efficiently represent the updated model - we're talking about a joint distribution over positions and momenta of every particle comprising the Earth, and that's even before we account for quantum. But the prior distribution over positions and momenta of every particle we can represent easily - just use something maxentropic, and the data will be enough to figure out the (relevant parts of the) rest.

So to answer your specific questions:

  • the "true model space" is just low-level physics
  • by "efficiently", I mean the code would be writable by a human and the "training" data would easily fit on your hard drive
Replies from: yandong-zhang, Kenny
comment by Yandong Zhang (yandong-zhang) · 2020-04-14T00:37:00.645Z · LW(p) · GW(p)

Can we reduce the issue of “we can't efficiently compute that update” by adding sensors?

What if we could get more data? If facing this type of difficulty, I would ask that question first.

Replies from: johnswentworth, Kenny
comment by johnswentworth · 2020-04-13T16:41:49.470Z · LW(p) · GW(p)

Yeah, the usual mechanism by which more data reduces computational difficulty is by directly identifying the values some previously-latent variables. If we know the value of a variable precisely, then that's easy to represent; the difficult-to-represent distributions are those where there's a bunch of variables whose uncertainty is large and tightly coupled.

comment by Kenny · 2020-04-13T18:39:39.335Z · LW(p) · GW(p)

No, he's referring to something like performing a Bayesian update over all computable hypotheses – that's incomputable (i.e. even in theory). It's infinitely beyond the capabilities of even a quantum computer the size of the universe.

Think of it as a kind of (theoretical) 'upper bound' on the problem. None of the actual computable (i.e. on real-world computers built by humans) approximations to AIXI are very good in practice.

Replies from: johnswentworth
comment by johnswentworth · 2020-04-13T19:14:55.677Z · LW(p) · GW(p)

The AIXI thing was a joke; a Bayesian update on low-level physics with unknown initial conditions would be superexponentially slow, but it certainly isn't uncomputable. And the distinction does matter - uncomputability usually indicates fundamental barriers even to approximation, whereas superexponential slowness does not (at least in this case).

comment by Kenny · 2020-04-13T18:31:59.359Z · LW(p) · GW(p)

That's what I thought you might have meant.

In a sense, existing climate models are already "low-level physics" except that "low-level" means coarse aggregates of climate/weather measurements that are so big that they don't include tropical cyclones! And, IIRC, those models are so expensive to compute that they can only be computed on supercomputers!

But I'm still confused as to whether you're claiming that someone could implement AIXI and feed it all the data you mentioned.

the prior distribution over positions and momenta of every particle we can represent easily - just use something maxentropic, and the data will be enough to figure out the (relevant parts of the) rest.

You seem to be claiming that "Wikipedia" (or all of the scientific data ever measured) would be enough to generate "the prior distribution over positions and momenta of every particle" and that this data would easily fit on a hard drive. Or are you claiming that such an efficient representation exists in theory? I'm still skeptical of the latter.

The problem is that we can't efficiently compute that update, or efficiently represent the updated model - we're talking about a joint distribution over positions and momenta of every particle comprising the Earth, and that's even before we account for quantum.

This makes me believe that you're referring to some kind of theoretical algorithm. I understood the asker to wanting something (efficiently) computable, at least relative to actual current climate models (i.e. something requiring no more than supercomputers to use).

Replies from: johnswentworth, OxDoc
comment by johnswentworth · 2020-04-13T19:10:26.254Z · LW(p) · GW(p)
But I'm still confused as to whether you're claiming that someone could implement AIXI and feed it all the data you mentioned.

That was a joke, but computable approximations of AIXI can certainly be implemented. For instance, a logical inductor run on all that data would be conceptually similar for our purposes.

You seem to be claiming that "Wikipedia" (or all of the scientific data ever measured) would be enough to generate "the prior distribution over positions and momenta of every particle" and that this data would easily fit on a hard drive.

No, wikipedia or a bunch of scientific data (much less than all the scientific data ever measured), would be enough data to train a solid climate model from a simple prior over particle distributions and momenta. It would definitely not be enough to learn the position and momentum of every particle; a key point of stat mech is that we do not need to learn the position and momentum of every particle in order to make macroscopic predictions. A simple maxentropic prior over microscopic states plus a (relatively) small amount of macroscopic data is enough to make macroscopic predictions.

This makes me believe that you're referring to some kind of theoretical algorithm.

The code itself need not be theoretical, but it would definitely be superexponentially slow to run. Making it efficient is where stat mech, multiscale modelling, etc come in. The point I want to make is that the system's "complexity" is not a fundamental barrier requiring fundamentally different epistemic principles.

Replies from: Kenny
comment by Kenny · 2020-04-13T23:27:08.522Z · LW(p) · GW(p)

... wikipedia or a bunch of scientific data (much less than all the scientific data ever measured), would be enough data to train a solid climate model from a simple prior over particle distributions and momenta. It would definitely not be enough to learn the position and momentum of every particle; a key point of stat mech is that we do not need to learn the position and momentum of every particle in order to make macroscopic predictions. A simple maxentropic prior over microscopic states plus a (relatively) small amount of macroscopic data is enough to make macroscopic predictions.

That's clearer to me, but I'm still skeptical that that's in fact possible. I don't understand how the prior can be considered "over particle distributions and momenta", except via the theories and models of statistical mechanics, i.e. assuming that those microscopic details can be ignored.

The point I want to make is that the system's "complexity" is not a fundamental barrier requiring fundamentally different epistemic principles.

I agree with this. But I think you're eliding how much work is involved in what you described as:

Making it efficient is where stat mech, multiscale modelling, etc come in.

I wouldn't think that standard statistical mechanics would be sufficient for modeling the Earth's climate. I'd expect fluid dynamics is also important as well as chemistry, geology, the dynamics of the Sun, etc.. It's not obvious to me that statistical mechanics would be effective alone in practice.

Replies from: johnswentworth
comment by johnswentworth · 2020-04-14T00:03:55.646Z · LW(p) · GW(p)

Ah... I'm talking about stat mech in a broader sense than I think you're imagining. The central problem of the field is the "bridge laws" defining/expressing macroscopic behavior in terms of microscopic behavior. So, e.g., deriving Navier-Stokes from molecular dynamics is a stat mech problem. Of course we still need the other sciences (chemistry, geology, etc) to define the system in the first place. The point of stat mech is to take low-level laws with lots of degrees of freedom, and derive macroscopic laws from them. For very coarse, high-level models, the "low-level model" might itself be e.g. fluid dynamics.

I think you're eliding how much work is involved in what you described as...

Yeah, this stuff definitely isn't easy. As you argued above, the general case of the problem is basically AGI (and also the topic of my own research). But there are a lot of existing tricks and the occasional reasonably-general-tool, especially in the multiscale modelling world and in Bayesian stat mech.

Replies from: Kenny
comment by Kenny · 2020-04-14T00:20:09.807Z · LW(p) · GW(p)

Yes, I don't think we really disagree. My prior (prior to this extended comments discussion) was that there are lots of wonderful existing tricks, but there's no real shortcut for the fully general problem and any such shortcut would be effectively AGI anyways.

comment by ClimateDoc (OxDoc) · 2020-04-13T20:29:09.796Z · LW(p) · GW(p)

climate models are already "low-level physics" except that "low-level" means coarse aggregates of climate/weather measurements that are so big that they don't include tropical cyclones!

Just as an aside, a typical modern climate model will simulate tropical cyclones as emergent phenomena from the coarse-scale fluid dynamics, albeit not enough of the most intense ones. Though, much smaller tropical thunderstorm-like systems are much more crudely represented.

Replies from: johnswentworth, Kenny
comment by johnswentworth · 2020-04-13T22:39:58.595Z · LW(p) · GW(p)

Tangential, but now I'm curious... do you know what discretization methods are typically used for the fluid dynamics? I ask because insufficiently-intense cyclones sound like exactly the sort of thing APIC methods were made to fix, but those are relatively recent and I don't have a sense for how much adoption they've had outside of graphics.

Replies from: OxDoc, Kenny
comment by ClimateDoc (OxDoc) · 2020-04-14T20:44:46.152Z · LW(p) · GW(p)

do you know what discretization methods are typically used for the fluid dynamics?

There's a mixture - finite differencing used to be used a lot but seems to be less common now, semi-Lagrangian advection seems to have taken over from that in models that used it, then some work by doing most of the computations in spectral space and neglecting the smallest spatial scales. Recently newer methods have been developed to work better on massively parallel computers. It's not my area, though, so I can't give a very expert answer - but I'm pretty sure the people working on it think hard about trying to not smooth out intense structures (though, that has to be balanced against maintaining numerical stability).
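For anyone unfamiliar with the jargon, the semi-Lagrangian idea is easy to sketch in one dimension (uniform wind, periodic domain, purely schematic – real models are far more elaborate): instead of differencing fluxes between neighbouring grid cells, each grid point traces its departure point back along the flow and interpolates the old field there, which is what allows stable long timesteps.

```python
import numpy as np

nx, L = 200, 1.0
dx = L / nx
x = np.arange(nx) * dx
u = 1.0                       # uniform wind speed
dt = 2.5 * dx / u             # deliberately larger than the CFL limit of grid methods

field = np.exp(-((x - 0.5) / 0.05) ** 2)   # initial blob to advect

for _ in range(100):
    # Semi-Lagrangian step: each grid point looks upstream to its departure
    # point x - u*dt and linearly interpolates the old field there.
    departure = (x - u * dt) % L
    idx = np.floor(departure / dx).astype(int)
    frac = departure / dx - idx
    field = (1.0 - frac) * field[idx] + frac * field[(idx + 1) % nx]
```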

comment by Kenny · 2020-04-13T23:37:06.291Z · LW(p) · GW(p)

How much are 'graphical' methods like APIC incorporated elsewhere in general?

My intuition has certainly been pumped to the effect that models that mimic visual behavior are likely to be useful more generally, but maybe that's not a widely shared intuition.

comment by Kenny · 2020-04-13T23:33:20.619Z · LW(p) · GW(p)

I would have hoped that was the case, but it's interesting that both large and small ones are apparently not so easily emergent.

I wonder whether the models are so coarse that the cyclones that do emerge are in a sense the minimum size. That would readily explain the lack of smaller emergent cyclones. Maybe larger ones don't emerge because the 'next larger size' is too big for the models. I'd think 'scaling' of eddies in fluids might be informative: What's the smallest eddy possibly in some fluid? What other eddy sizes are observed (or can be modeled)?

Replies from: johnswentworth, OxDoc
comment by johnswentworth · 2020-04-14T00:10:10.875Z · LW(p) · GW(p)
What's the smallest eddy possibly in some fluid?

Not sure if this was intended to be rhetorical, but a big part of what makes turbulence difficult is that we see eddies at many scales, including very small eddies (at least down to the scale that Navier-Stokes holds). I remember a striking graphic about the onset of turbulence in a pot of boiling water, in which the eddies repeatedly halve in size as certain parameter cutoffs are passed, and the number of eddies eventually diverges - that's the onset of turbulence.

Replies from: Kenny
comment by Kenny · 2020-04-14T00:18:07.493Z · LW(p) · GW(p)

Sorry for being unclear – it was definitely not intended to be rhetorical!

Yes, turbulence was exactly what I was thinking about. At some small enough scale, we probably wouldn't expect to 'find' or be able to distinguish eddies. So there's probably some minimum size. But then is there any pattern or structure to the larger sizes of eddies? For (an almost certainly incorrect) example, maybe all eddies are always a multiple of the minimum size and the multiple is always an integer power of two. Or maybe there is no such 'discrete quantization' of eddy sizes, tho eddies always 'split' into nested halves (under certain conditions).

It certainly seems the case tho that eddies aren't possible as emergent phenomena at a scale smaller than the discretization of the approximation itself.

comment by ClimateDoc (OxDoc) · 2020-04-14T20:50:11.592Z · LW(p) · GW(p)

I wonder whether the models are so coarse that the cyclones that do emerge are in a sense the minimum size.

It's not my area, but I don't think that's the case. My impression is that part of what drives very high wind speeds in the strongest hurricanes is convection on the scale of a few km in the eyewall, so models with that sort of spatial resolution can generate realistically strong systems, but that's ~20x finer than typical climate model resolutions at the moment, so it will be a while before we can simulate those systems routinely (though, some argue we could do it if we had a computer costing a few billion dollars).

Replies from: Kenny
comment by Kenny · 2020-04-14T22:00:00.544Z · LW(p) · GW(p)

Thanks! That's very interesting to me.

It seems like it might be an example of relatively small structures having potentially arbitrarily large long-term effects on the state of the entire system.

It could be the case tho that the overall effects of cyclones are still statistical at the scale of the entire planet's climate.

Regardless, it's a great example of the kind of thing for which we don't yet have good general learning algorithms.