# LINK: AI Researcher Yann LeCun on AI function

post by shminux · 2013-12-11T00:29:52.608Z · score: 2 (12 votes) · LW · GW · Legacy · 82 comments

Yann LeCun, now of Facebook, was interviewed by The Register. It is interesting that his view of AI is apparently that of a prediction tool:

"In some ways you could say intelligence is all about prediction," he explained. "What you can identify in intelligence is it can predict what is going to happen in the world with more accuracy and more time horizon than others."

rather than of a world optimizer. This is not very surprising, given his background in handwriting and image recognition. This "AI as intelligence augmentation" view appears to be prevalent among the AI researchers in general.

## 82 comments

Comments sorted by top scores.

Prediction cannot solve causal problems.

"ML person thinks AI is about what ML people care about. News at 11."

Ilya, I don't think it is very fair for you to bludgeon people with terminology / appeals to authority (as you do later in a couple of the sub-threads to this comment) especially given that causality is a somewhat niche subfield of machine learning. I.e. I think many people in machine learning would disagree with the implicit assumptions in the claim "probabilistic models cannot capture causal information". I realize that this is true by definition under the definitions preferred by causality researchers, but the assumption here seems to be that it's more natural to make causality an ontologically fundamental aspect of the model, whereas it's far from clear to me that this is the most natural thing to do (i.e. you can imagine learning about causality as a feature of the environment). In essence, you are asserting that "do" is an ontologically fundamental notion, but I personally think of it as a notion that just happens to be important enough to many of the prediction tasks we care about that we hard-code it as a feature of the model, and supply the causal information by hand. I suspect the people you argue with below have similar intuitions but lack the terminology to express them to your satisfaction.

I'll freely admit that I'm not an expert on causality in particular, so perhaps some of what I say above is off-base. But if I'm also below the bar for respectful discourse then your target audience is small indeed.

[ Upvoted. ]

If anyone felt I was uncivil to them in any subthread, I hereby apologize here.

I am not sure causality is a subfield of ML in the sense that I don't think many ML people care about causality. I think causal inference is a subfield of stats (lots of talks with the word "causal" at this year's JSM). I think it's weird that stats and ML are different fields, but that's a separate discussion.

I think it is possible to formalize causality without talking about interventions as Pearl et al. think of them, for example people in reinforcement learning do this. But if you start to worry about e.g. time-varying confounders, and you are not using interventions, you will either get stuff wrong, or have to reinvent interventions again. Which would be silly -- so just learn about the Neyman/Rubin model and graphs. It's the formalism that handles all the "gotchas" correctly. (In fact, until interventionists came along, people didn't even have the math to realize that time-varying confounders are a "gotcha" that needs special handling!)

By the way, the only reason I am harping on time-varying confounders is because it is a historically important case that I can explain with a 4 node example. There are lots of other, more complicated "gotchas," of course.
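The comment does not spell out its 4 node example, so the following is only a sketch of one common version of the time-varying confounder problem, with made-up numbers. The assumed graph is A0 -> L1 -> A1, with a hidden U affecting both L1 and the outcome Y, and both treatments affecting Y. All variable names and probabilities are hypothetical. The code compares the true effect of treating twice versus never with three estimates computed from the observational distribution alone: no adjustment, ordinary adjustment for L1, and the sequential adjustment usually called the g-formula.

```python
from itertools import product

# All variables are binary. Names and numbers are hypothetical:
# U = unobserved health, A0 = first treatment, L1 = observed health marker,
# A1 = second treatment, Y = outcome.
# Assumed graph: A0 -> L1 <- U, L1 -> A1, and A0, A1, U -> Y.
VARS = ("u", "a0", "l1", "a1")

def p_l1(l1, a0, u):
    p1 = 0.2 + 0.3 * a0 + 0.4 * u
    return p1 if l1 else 1 - p1

def p_a1(a1, l1):
    # The second treatment is chosen after looking at L1 only.
    p1 = 0.2 + 0.6 * l1
    return p1 if a1 else 1 - p1

def e_y(a0, a1, u):
    # E[Y | a0, a1, u]: each treatment adds 0.1 to the outcome probability.
    return 0.1 + 0.1 * a0 + 0.1 * a1 + 0.5 * u

def worlds(**fixed):
    """Yield (assignment, probability) for each world consistent with `fixed`."""
    for values in product((0, 1), repeat=4):
        w = dict(zip(VARS, values))
        if all(w[k] == v for k, v in fixed.items()):
            # U and A0 are fair coin flips, hence the two factors of 0.5.
            yield w, 0.5 * 0.5 * p_l1(w["l1"], w["a0"], w["u"]) * p_a1(w["a1"], w["l1"])

def mean_y(**fixed):
    """Observational E[Y | fixed], with the hidden U summed out."""
    rows = list(worlds(**fixed))
    return sum(p * e_y(w["a0"], w["a1"], w["u"]) for w, p in rows) / sum(p for _, p in rows)

def p_l1_obs(l1, **given):
    """Observational P(L1 = l1 | given)."""
    return sum(p for _, p in worlds(l1=l1, **given)) / sum(p for _, p in worlds(**given))

def truth(a0, a1):
    # E[Y] under do(a0, a1), computed with access to the hidden U.
    return sum(0.5 * e_y(a0, a1, u) for u in (0, 1))

def naive(a0, a1):
    # Plain conditioning on the treatments, no adjustment.
    return mean_y(a0=a0, a1=a1)

def adjust_for_l1(a0, a1):
    # Ordinary covariate adjustment: average over the marginal of L1.
    return sum(mean_y(a0=a0, l1=l, a1=a1) * p_l1_obs(l) for l in (0, 1))

def g_formula(a0, a1):
    # Sequential adjustment: average over p(L1 | a0) instead.
    return sum(mean_y(a0=a0, l1=l, a1=a1) * p_l1_obs(l, a0=a0) for l in (0, 1))

for name, f in [("truth", truth), ("naive", naive),
                ("adjust for L1", adjust_for_l1), ("g-formula", g_formula)]:
    print(f"{name:14s} effect of (1,1) vs (0,0): {f(1, 1) - f(0, 0):.3f}")
```

In this toy model neither ignoring L1 nor conditioning on it in the usual way recovers the true contrast, while the sequential formula does. That is only a demonstration for one hand-picked parameter setting, not a proof of the general claim.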

Interventions seem to pop up/get reinvented in seemingly weird places, like the pi constant:

In channels with feedback (thus causality arises!)

http://www.adaptiveagents.org/bayesian_control_rule

http://en.wikipedia.org/wiki/Thompson_sampling

In multi-armed bandit problems (which are related to longitudinal studies in causal inference).

http://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator

In handling missing data (can view "missingness" as a causal property). Note the phrasing in the second link: "given the observed data, the missingness *mechanism* does not depend on the unobserved data." This is precisely the "no unobserved confounders" assumption in causal inference. Not surprisingly the correction is the same as in causal inference.

Also in figuring out what the dimension of a statistical hidden variable DAG model is. For example if A,B,C,D are binary, and U, W are unrestricted, then the dimension of the model

{ p(a,b,c,d) = \sum_{u,w} p(a,b,c,d,u,w) | p(a,b,c,d,u,w) factorizes wrt A -> B -> C -> D, A <- U -> C, B <- W -> D } is 13, not 15, which is weird, but there is an intervention-inspired explanation for why.

you can imagine learning about causality as a feature of the environment

I don't think you can get something for nothing. You will need causal assumptions somewhere.

Thanks Ilya, that was a lot of useful context and I wasn't aware that causality was more in stats than ML. For the record, I think that causality is super-interesting and cool, I hope that I didn't sound too negative by calling it "niche" (I would have described e.g. Bayesian nonparametrics, which I used to do research in, the same way, although perhaps it's unfair to lump in causality with nonparametric Bayes, since the former has a much more distinguished history).

I agree with pretty much everything you say above, although I'm still confused about "you will need causal assumptions somewhere". If I could somehow actually do inference under the Solomonoff prior, do you think that some notion of causality would not pop out? I'd understand if you didn't want to take the time to explain it to me; I've had this conversation with 2 other causality people already and am still not quite sure I understand what is meant by "you need causal assumptions to get causal inferences". (Note I already agree that this is true *in the context of graphical models*, i.e. you can't distinguish between X->Y and X<-Y without do(X) or some similar information.)

Graphical models are only a "thing" because our brain dedicates lots of processing to vision, so, for instance, we immediately understand complicated conditional independence statements if expressed in the visual form of d-separation. In some sense, graphs in the context of graphical models do not really add any extra information mathematically that wasn't already encoded even without graphs.

Given this, I am not sure there really *is* a context for graphical models separate from the context of "variables and their relationships". What you are saying above is that we seem to need "something extra" to be able to tell the direction of causality in a two variable system. (For example, in an additive noise model you can do this:

http://machinelearning.wustl.edu/mlpapers/paper_files/ShimizuHHK06.pdf)

I think the "no causes in -- no causes out" principle is more general than that though. For example if we had a *three* variable case, with variables A, B, C where:

A is marginally independent of B, but no other independences hold, then the only faithful graphical explanation for this model is:

A -> C <- B

It seems that, unlike the previous case, here there is no causal ambiguity -- A points to C, and B points to C. However, since the only information you inserted into the procedure which gave you this graph is the information about conditional independences, all you are getting out is a graphical description of a conditional independence model (that is a Bayesian network, or a statistical DAG model). In particular, the absence of arrows isn't telling you about absent causal relationships (that is whether A would change if I intervene on C), but absent statistical relationships (that is, whether A is independent of B). The statistical interpretation of the above graph is that it corresponds to a set of densities:

{ p(A,B,C) | A is independent of B }
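As a small check on the purely statistical reading of A -> C <- B, here is a sketch with one member of that set of densities. The parameters are invented for illustration. It computes conditional probabilities by exact enumeration and shows that A and B are independent marginally but become dependent once C is conditioned on, which is the only independence pattern the graph encodes.

```python
from itertools import product

def joint(a, b, c):
    """p(a, b, c) for one density that factorizes as A -> C <- B (made-up numbers)."""
    p_c1 = 0.1 + 0.4 * a + 0.4 * b      # P(C = 1 | a, b)
    return 0.5 * 0.5 * (p_c1 if c else 1 - p_c1)

def prob_a1(**given):
    """P(A = 1 | given), where `given` may fix b and/or c."""
    num = den = 0.0
    for a, b, c in product((0, 1), repeat=3):
        world = {"a": a, "b": b, "c": c}
        if all(world[k] == v for k, v in given.items()):
            den += joint(a, b, c)
            num += joint(a, b, c) * a
    return num / den

# Marginally, learning B tells us nothing about A.
print(prob_a1(b=0), prob_a1(b=1))
# Given C = 1, learning B changes our belief about A.
print(prob_a1(b=0, c=1), prob_a1(b=1, c=1))
```

Nothing in this computation refers to interventions, which is the point of the comment: the same numbers are compatible with more than one causal story.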

The same graph can also correspond to a causal model, where we are explicitly talking about interventions, that is:

{ p(A,B,C,C(a,b),B(a)) | C(a,b) is independent of B(a) is independent of A, p(B(a)) = p(B) }

where C(a,b) is just stats notation for do(.), that is p(C(a,b)) = p(C | do(a,b)).

This is a different object from before, and the interpretation of arrows is different. That is, the absence of an arrow from A to B means that intervening on A does not affect B, etc. This causal model *also* induces an independence model on the same graph, where the interpretation of arrows changes back to the statistical interpretation. However, we could imagine a very different causal model on three variables, that will *also* induce the same independence model where A is marginally independent of B. For example, maybe the set of all densities where the real direction of causality is A -> C -> B, but somehow the probabilities involved happened to line up in such a way that A is marginally independent of B. In other words, the mapping from causal to statistical models is many to one.

Given this view, it seems pretty clear that going from independences to causal models (even via a very complicated procedure) involves making some sort of assumption that makes the mapping one to one. Maybe the prior in Solomonoff induction gives this to you, but my intuitions about what non-computable procedures will do are fairly poor.

It sort of seems like Solomonoff induction operates at a (very low) level of abstraction where interventionist causality isn't really necessary (because we just figure out what the observable environment as a whole -- including action-capable agents, etc. -- will do), and thus isn't explicitly represented. This is similar to how Blockhead (http://en.wikipedia.org/wiki/Blockhead_(computer_system%29) does not need an explicit internal model of the other participant in the conversation.

I think Solomonoff induction is sort of a boring subject, if one is interested in induction, in the same sense that Blockhead is boring if one is interested in passing the Turing test, and particle physics is boring if one is interested in biology.

Agreed. And search is not the same problem as prediction, you can have a big search problem even when evaluating/predicting any single point is straightforward.

They are not the same problem but they are highly related:

If you have a very good heuristic, then search is trivial, and learning good heuristics from data is a prediction problem.

On the other hand, prediction problems such as Structured prediction (the stuff LeCun does) entail search, and moreover most machine learning algorithms also require some kind of search in the training phase.

What counts as a causal problem?

A sufficiently good predictor might be able to answer questions of the form "if I do X, what will happen thereafter?" and "if I do Y, what will happen thereafter?" even though what-will-happen-thereafter may be partly caused by doing X or Y.

Is your point that (to take a famous example with which I'm sure you're already very familiar) in a world where the correlation between smoking and lung cancer goes via a genetic feature that makes both happen, if you ask the machine that question it may in effect say "he chose to smoke, therefore he has that genetic quirk, therefore he will get lung cancer"? Surely any prediction device that would be called "intelligent" by anyone less gung-ho than, say, Ray Kurzweil would enable you to ask it questions like "suppose I -- with my current genome -- chose to smoke; then what?" and "suppose I -- with my current genome -- chose not to smoke; then what?".

I do agree that there are important questions a pure predictor can't help much with. For instance, the machine may be as good as you please at predicting the outcome of particle physics experiments, but it may not have (or we may not be able to extract from it in comprehensible form) any *theory* of what's going on to produce those outcomes.

What counts as a causal problem?

We give patients a drug, and some of them die. In fact, those that get the drug die more often than those that do not. Is the drug killing them or helping them? This is a very real problem we are facing right now, and getting it wrong results in people dying.

Surely any prediction device that would be called "intelligent" by anyone less gung-ho than, say, Ray Kurzweil would enable you to ask it questions like "suppose I -- with my current genome -- chose to smoke; then what?" and "suppose I -- with my current genome -- chose not to smoke; then what?".

I certainly hope that anything actually intelligent will be able to answer counterfactual questions of the kind you posed here. However, the standard language of prediction employed in ML is not able to even pose such questions, let alone answer them.

I don't get it. You gave some people the drug and some people you didn't. It seems pretty straightforward to estimate how likely someone is to die if you give them medicine.

It seems pretty straightforward to estimate how likely someone is to die if you give them medicine.

Certainly it's straightforward. Here's how one can apply your logic. You gave some people [the ones whose disease has progressed the most] the drug and some people you didn't [because their disease isn't so bad you're willing to risk it]; the % of people dying in the first drugged group is much higher than the % of deaths in the second non-drugged group; therefore, this drug is poison and you're a mass murderer.

See the problem?

Of course people say "but this is silly, obviously we need to condition on health status."

The point is: what if we can't? Or what if there are other causally relevant factors here? In fact, what is "causally relevant" anyways... We need a *system*! ML people don't think about these questions very hard, generally, because culturally they are more interested in "algorithmic approaches" to prediction problems.

(This is a clarification of gwern's response to the grandparent, not a reply to gwern.)
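The drug example in this sub-thread can be made concrete with a toy calculation. The numbers below are invented, and the sketch assumes the easy case where disease severity is fully observed and is the only common cause of treatment and death. The drug lowers the death rate in every stratum, yet the raw comparison makes it look harmful because sicker patients are the ones who receive it.

```python
# Hypothetical numbers: S = 1 means the disease has progressed far.
P_SEVERE = 0.5
P_DRUG_GIVEN_S = {1: 0.9, 0: 0.1}      # sicker patients usually get the drug
P_DEATH = {(0, 0): 0.2, (0, 1): 0.1,   # keys are (severe, drug)
           (1, 0): 0.7, (1, 1): 0.6}   # the drug lowers risk by 0.1 in each stratum

def p_s(s):
    return P_SEVERE if s else 1 - P_SEVERE

def p_d(d, s):
    return P_DRUG_GIVEN_S[s] if d else 1 - P_DRUG_GIVEN_S[s]

def death_rate_given_drug(d):
    """Observational P(death | drug = d), ignoring health status."""
    num = sum(p_s(s) * p_d(d, s) * P_DEATH[(s, d)] for s in (0, 1))
    den = sum(p_s(s) * p_d(d, s) for s in (0, 1))
    return num / den

def death_rate_adjusted(d):
    """Stratify on health status, then average over its marginal distribution."""
    return sum(p_s(s) * P_DEATH[(s, d)] for s in (0, 1))

naive_diff = death_rate_given_drug(1) - death_rate_given_drug(0)
adjusted_diff = death_rate_adjusted(1) - death_rate_adjusted(0)
print(f"naive difference in death rate:    {naive_diff:+.2f}")
print(f"adjusted difference in death rate: {adjusted_diff:+.2f}")
```

The adjustment works here only because severity is measured and nothing else confounds the comparison. The question raised above, what to do when that is not the case, is exactly what this sketch leaves out.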

The problem is the data is biased. The ML algorithm doesn't know whether the bias is a natural part of the data or artificially induced. Garbage In - Garbage Out.

However it can still be done if the algorithm has more information. Maybe some healthy patients ended up getting the medicine anyways and were far more likely to live, or some unhealthy ones didn't and were even more likely to die. Now it's straightforward prediction again: How likely is a patient to live based on their current health and whether or not they take the drug?

The problem is the data is biased. The ML algorithm doesn't know whether the bias is a natural part of the data or artificially induced. Garbage In - Garbage Out.

You're making up excuses. The data is not 'biased', it just is, nor is it garbage - it's not made up, no one is lying or falsifying data or anything like that. If your theory cannot handle clean data from a real-world problem, that's a big problem (*especially* if there are more sophisticated alternatives which can handle it).

Biased data is a real thing and this is a great example. *No* method can solve the problem you've given without additional information.

This is not biased data. No one tampered with it. No one preferentially left out some data. There is no Cartesian daemon tampering with you. It's a perfectly ordinary causal problem for which one has all the available data. If you run a regression on the data, you will get accurate predictions of future similar data - just not what happens when you intervene and realize the counterfactual. You can't throw your hands up and disdainfully refuse to solve the problem, proclaiming, 'oh, that's *biased*'. It may be hard, and the best available solution weak or require strong assumptions, but if that is the case, the correct method should say as much and specify what additional data or interventions would allow stronger conclusions.

I'm not certain why I used the word "bias". I think I was getting at that the data isn't representative of the population of interest.

Regardless, no other method can solve the problem specified *without* additional information (which you claimed). And with additional information, it's straightforward prediction again.

That is, condition on their prior health status, not just the fact they've been given the drug. And prior probabilities.

No method can solve the problem you've given without additional information.

What do you call "solving the problem"?

Any method will output some *estimates*. Some methods will output better estimates, some worse. As people have pointed out, this was an example of a real problem and yes, real-life data is usually pretty messy. We need methods which can handle messy data and not work just on spherical cows in vacuum.

Prediction by itself cannot solve causal *decision* problems (that's why AIXI is not the same as just a Solomonoff predictor) but your example is incorrect. What you're describing is a modelling problem, not a decision problem.

Sorry, I am not following you. Decision problems have the form of "What do you do in situation X to maximize a defined utility function?"

It is very easy to transform any causal modeling example into a decision problem. In this case: "here is an observational study where doctors give drugs to some cohort of patients. This is your data. Here's the correct causal graph for this data. Here is a set of new patients from the same cohort. Your utility function rewards you for minimizing patient deaths. Your actions are 'give the drug to everyone in the set' or 'do not give the drug to everyone in the set.' What do you do?"

Predictor algorithms, as understood by the machine learning community, cannot solve this class of problems correctly. These are not abstract problems! They happen all the time, and we need to solve them now, so you can't just say "let's defer solving this until we have a crazy detailed method of simulating every little detail of the way the HIV virus does its thing in these poor people, and the way this drug disrupts this, and the way side effects of the drug happen, etc. etc. etc."

Bayesian network learning and Bayesian network inference can, in principle, solve that problem.

Of course, if your model is wrong, and/or your dataset is degenerate, *any* approach will give you bad results: Garbage in, garbage out.

Bayesian networks are statistical, not causal models.

I don't know what you mean by "causal model", but Bayesian networks can deal with the type of problems you describe.

A causal model to me is a set of joint distributions defined over potential outcome random variables.

And no, regardless of how often you repeat it, Bayesian networks cannot solve causal problems.

I have no idea what you're talking about.

gjm asked you what a causal problem was, you didn't provide a definition and instead gave an example of a problem which seems clearly solvable by Bayesian methods such as hidden Markov models (for prediction) or partially observable Markov decision processes (for decision).

(a) Hidden Markov models and POMDPs are probabilistic models, not necessarily Bayesian.

(b) I am using the standard definition of a causal model, first due to Neyman, popularized by Rubin. Everyone except some folks in the UK use this definition now. I am sorry if you are unfamiliar with it.

(c) Statistical models cannot solve causal problems. The number of times you repeat the opposite, while adding the word "clearly" will not affect this fact.

(a) Hidden Markov models and POMDPs are probabilistic models, not necessarily Bayesian.

According to Wikipedia:

A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. A HMM can be considered the simplest dynamic Bayesian network.


(b) I am using the standard definition of a causal model, first due to Neyman, popularized by Rubin. Everyone except some folks in the UK use this definition now. I am sorry if you are unfamiliar with it.

I suppose you mean this.

It seems to be a framework for the estimation of probability distributions from experimental data, under some independence assumptions.

(c) Statistical models cannot solve causal problems. The number of times you repeat the opposite, while adding the word "clearly" will not affect this fact.

You still didn't define "causal problem" and what you mean by "solve" in this context.

A "Bayesian network" is not necessarily a Bayesian model. Bayesian networks can be used with frequentist methods, and frequently are (see: the PC algorithm). I believe Pearl called the networks "Bayesian" to honor Bayes, and because of the way Bayes theorem is used when you shuffle probabilities around. The model does not necessitate Bayesian methods at all.

I don't mean to be rude, but are we operating at the level of string pattern matching, and google searches here?

You still didn't define "causal problem" and what you mean by "solve" in this context.

Sociological definition : "a causal problem" is a problem that people who do causal inference study. Estimating causal effects. Learning cause-effect relationships from data. Mediation analysis. Interference analysis. Decision theory problems. To "solve" means to get the right answer and thereby avoid going to jail for malpractice.

This is a bizarre conversation. Causal problems aren't something esoteric. Imagine if you kept insisting I define what an algebra problem is. There are all sorts of things you could read on this standard topic.

This is a bizarre conversation.

Looks like a perfectly normal conversation where people insist on using different terminology sets :-/

One of these people has a good reason for preferring his terminology (e.g. it's standard, it's what everyone in the field actually uses, etc.) "Scott, can you define what a qubit is?", etc.

it's what everyone in the field actually uses

Yes, but you are talking to people outside of the field.

For example you tend to use the expression "prediction model" as an antonym to "causal model". That may be standard in your field, but that's not what it means outside of it.

For example you tend to use the expression "prediction model" as an antonym to "causal model".

Not an antonym, just a different thing that should not be confused. A qubit is a very different thing from a bit, with different properties.

That may be standard in your field, but that's not what it means outside of it.

"Sure, this definition of truth may be standard in *your* field, Prof. Tarsky, but that's not what we mean!" I guess we are done, then! Thanks for your time.

A "Bayesian network" is not necessarily a Bayesian model. Bayesian networks can be used with frequentist methods, and frequently are (see: the PC algorithm).

You can use frequentist methods to learn Bayesian networks from data, as with any other Bayesian model.

And you can also use Bayesian networks without priors to do things like maximum likelihood estimation, which isn't Bayesian sensu stricto, but I don't think this is relevant to this conversation, is it?

I don't mean to be rude, but are we operating at the level of string pattern matching, and google searches here?

No, we are operating at the level of trying to make sense of your claims.

Sociological definition : "a causal problem" is a problem that people who do causal inference study. Estimating causal effects. Learning cause-effect relationships from data. Mediation analysis. Interference analysis. Decision theory problems. To "solve" means to get the right answer and thereby avoid going to jail for malpractice.

Please try to reformulate without using the word "cause/causal".

The term has multiple meanings. You may be using one of them assuming that everybody shares it, but that's not obvious.

I operate within the interventionist school of causality, whereby a causal effect has something to do with how interventions affect outcome variables. This is of course not the only formalization of causality, there are many many others. However, this particular one has been very influential, almost universally adopted among the empirical sciences, corresponds very closely to people's causal intuitions in many important respects (and has the mathematical machinery to move far beyond when intuitions fail), and has a number of other nice advantages I don't have the space to get into here (for example it helped to completely crack open the "what's the dimension of a hidden variable DAG" problem).

One consequence of the conceptual success of the interventionist school is that there is now a long list of properties we think a formalization of causality has to satisfy (that were first figured out within the interventionist framework). So we can now rule out bad formalizations of causality fairly easily.

I think getting into the interventionist school is too long for even a top level post, let alone a response post buried many levels deep in a thread. If you are interested, you can read a book about it (Pearl's book for example), or some papers.

Prediction algorithms, as used in ML today, completely fail on interventionist causal problems, which correspond, loosely speaking, to trying to figure out the effect of a randomized trial from observational data. I am not trying to give them a hard time about it, because that's not what the emphasis in ML is, which is perfectly fine!

You can think of this problem as just another type of "prediction problem," but this word usage simply does not conform to what people in ML mean by "prediction." There is an entirely different theory, etc.

A causal model to me is a set of joint distributions defined over potential outcome random variables.

Huh?

Can you expand on this, with special attention to the difference between the model and the result of a model, and to the differences from plain-vanilla Bayesian models which will also produce joint distributions over outcomes.

Sure. Here's the world's simplest causal graph: A -> B.

Rubin et al, who do not like graphs, will instead talk about a joint distribution:

p(A, B(a=1), B(a=0))

where B(a=1) means 'random variable B under intervention do(a=1)'. Assume binary A for simplicity here.

A causal model over A,B is a set of densities { p(A, B(a=1), B(a=0)) | [ some property ] }. The causal model for this graph would be:

{ p(A, B(a=1), B(a=0)) | B(a=1) is independent of A, and B(a=0) is independent of A }

These assumptions are called 'ignorability assumptions' in the literature, and they correspond to the absence of confounding between A and B. Note that it took counterfactuals to define what 'absence of confounding' means.

A regular Bayesian network model for this graph is just the set of densities over A and B (since this graph has no d-separation statements). That is, it is the set { p(A,B) | [no assumptions] }. This is a 'statistical model,' because it is a set of regular old joint densities, with no mention of counterfactuals or interventions anywhere.

The same graph can correspond to very different things, you have to specify.
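To make the potential-outcome notation above a little more tangible, here is a sketch that writes down two joint distributions p(A, B(a=1), B(a=0)) explicitly. One satisfies the ignorability assumptions and one does not. All numbers are made up, and the observed B is taken to equal B(a=A), the usual consistency assumption. Both distributions share the same causal effect, but only in the ignorable one does the observational contrast E[B | A=1] - E[B | A=0] agree with it.

```python
# Made-up marginal over the potential-outcome pair (B(a=1), B(a=0)).
Q = {(1, 1): 0.2, (1, 0): 0.4, (0, 1): 0.1, (0, 0): 0.3}

def make_model(p_a1_given):
    """Build p(A, B(a=1), B(a=0)) from a function giving P(A = 1 | b1, b0)."""
    model = {}
    for (b1, b0), q in Q.items():
        p1 = p_a1_given(b1, b0)
        model[(1, b1, b0)] = q * p1
        model[(0, b1, b0)] = q * (1 - p1)
    return model

# Ignorable: A is independent of both potential outcomes.
ignorable = make_model(lambda b1, b0: 0.5)
# Confounded: units with B(a=0) = 1 are more likely to have A = 1.
confounded = make_model(lambda b1, b0: 0.8 if b0 else 0.2)

def causal_effect(model):
    """E[B(a=1)] - E[B(a=0)], read directly off the potential outcomes."""
    return sum(p * (b1 - b0) for (a, b1, b0), p in model.items())

def observed_contrast(model):
    """E[B | A=1] - E[B | A=0], where the observed B equals B(a=A)."""
    def mean_b(a_val):
        rows = [(p, b1 if a_val else b0)
                for (a, b1, b0), p in model.items() if a == a_val]
        return sum(p * b for p, b in rows) / sum(p for p, _ in rows)
    return mean_b(1) - mean_b(0)

for name, m in [("ignorable", ignorable), ("confounded", confounded)]:
    print(f"{name:10s} causal effect {causal_effect(m):.3f}, "
          f"observed contrast {observed_contrast(m):.3f}")
```

Only the observed pair (A, B) would be available in practice; the full table over potential outcomes is written out here purely to show what the assumptions constrain.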

You could also have assumptions corresponding to "missing graph edges." For example, in the instrumental variable graph:

Z -> A -> B, with A <- U -> B, where we do not see U, we would have an assumption that states that B(a,z) = B(a,z') for all a,z,z'.

Please don't say "Bayesian model" when you mean "Bayesian network." People really should say "belief networks" or "statistical DAG models" to avoid confusion.

Please don't say "Bayesian model" when you mean "Bayesian network."

I do not mean "Bayesian networks". I mean Bayesian models of the kind e.g. described in Gelman's *Bayesian Data Analysis*.

p(A, B(a=1), B(a=0)) where B(a=1) means 'random variable B under intervention do(a=1)'. Assume binary A for simplicity here.

You still can express this as plain-vanilla conditional densities, can't you? "under intervention do(a=1)" is just a different way of saying "conditional on A=1", no?

A causal model over A,B is a set of densities { p(A, B(a=1), B(a=0)) | [ some property ] }

and

with no mention of counterfactuals or interventions anywhere.

I don't see counterfactuals in your set of densities and how "interventions" are different from conditionality?

You still can express this as plain-vanilla conditional densities, can't you?

No. If conditioning was the same as interventions I could make it rain by watering my lawn and become a world class athlete by putting on a gold medal.

If conditioning was the same as interventions I could make it rain by watering my lawn

I don't understand -- can you unroll?

Well, since p(rain | grass wet) is high, it seems making the grass wet via a garden hose will make rain more likely. Of course you might say that "making the grass wet" and "seeing the grass wet" is not the same thing, in which case I agree!

The fact that these are not the same thing is why people say conditioning and interventions are not the same thing.

You can of course say that you can still use the language of conditional probability to talk about "doing events" vs "seeing events." But then you are just reinventing interventions (as will become apparent if you try to figure out axioms for your notation).
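The rain and wet grass example can be sketched numerically. The model below is a two-node graph, rain -> wet grass, with invented probabilities. Seeing wet grass is handled by ordinary conditioning, while making the grass wet is handled by replacing the mechanism that sets the grass variable and leaving the rain mechanism alone. The `do_wet` parameter is a hypothetical name used only in this sketch.

```python
P_RAIN = 0.3                           # made-up prior probability of rain
P_WET_GIVEN_RAIN = {1: 0.9, 0: 0.2}    # made-up P(grass wet | rain)

def joint(rain, wet, do_wet=None):
    """p(rain, wet), optionally under an intervention that sets wet = do_wet."""
    p_rain = P_RAIN if rain else 1 - P_RAIN
    if do_wet is None:
        # Observational regime: wetness is produced by the weather.
        p_w1 = P_WET_GIVEN_RAIN[rain]
        p_wet = p_w1 if wet else 1 - p_w1
    else:
        # Interventional regime: the hose sets wetness, whatever the weather did.
        p_wet = 1.0 if wet == do_wet else 0.0
    return p_rain * p_wet

def p_rain_given_wet(do_wet=None):
    """P(rain | wet = 1) in the chosen regime."""
    num = joint(1, 1, do_wet)
    den = joint(1, 1, do_wet) + joint(0, 1, do_wet)
    return num / den

seeing = p_rain_given_wet()            # P(rain | grass wet)
doing = p_rain_given_wet(do_wet=1)     # P(rain | do(grass wet))
print(f"P(rain | wet)     = {seeing:.3f}")
print(f"P(rain | do(wet)) = {doing:.3f}")
```

With these numbers the first quantity is well above the prior of 0.3 and the second equals it, so watering the lawn does not make rain more likely.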

Well, since p(rain | grass wet) is high, it seems making the grass wet via a garden hose will make rain more likely.

That's a strawman. The conditional probability we're talking about has a clear (if explicitly unstated) temporal ordering: P(rain in the past | wet grass in the present).

But then you are just reinventing interventions

Talking about conditional probability was widespread long before people started talking about interventions.

It seems to me that the language of interventions, etc. is just a formalism that is convenient for certain types of analysis, but I'm not seeing that it *means* anything new.

That's a strawman. The conditional probability we're talking about has a clear (if explicitly unstated) temporal ordering: P(rain in the past | wet grass in the present).

You seem to be missing Ilya's point. He was arguing that if you regard "under intervention do(A = 1)" as equivalent to "conditional on A = 1" (as you suggested in a previous comment), then you should regard P(rain | do(grass wet)) as equivalent to P(rain | grass wet). But these are not in fact equivalent, and adding temporal ordering in there doesn't make them equivalent either. P(rain in the past | do(wet grass) in the present) = P(rain in the past), but P(rain in the past | wet grass in the present) != P(rain in the past) .

He was arguing that if you regard "under intervention do(A = 1)" as equivalent to "conditional on A = 1" (as you suggested in a previous comment), then you should regard P(rain | do(grass wet)) as equivalent to P(rain | grass wet).

There is obviously a difference between observational data and experiments.

But these are not in fact equivalent

No, because they're modeling different reality.

There is obviously a difference between observational data and experiments.

Yes! The difference is that experiments involve intervention. I thought the necessity of formalizing the notion of intervention is precisely what was under dispute here.

Well, kinda. I am not sure whether the final output -- the joint densities of outcomes -- will be different in a causal model compared to a *properly specified* conventional model.

To continue with the same example, it suffers from the expression "wet grass" meaning two different things -- either "I see wet grass" or "I made grass wet". This is your difference between just (a=1) and do(a=1) -- but conventional non-causal modeling doesn't have huge problems with this, it is fully aware of the difference.

And I don't know if it's **necessary** to formalize intervention. I freely concede that it's useful in certain areas but not so sure that's true for *all* areas.

Well, kinda. I am not sure whether the final output -- the joint densities of outcomes -- will be different in a causal model compared to a properly specified conventional model.

So, we *could* add a node to the graph for every single node, which corresponds to whether or not that node was the subject of an intervention. So you would talk about P(rain|grass is wet, ~I made it rain, ~I made the grass wet) vs. P(rain|grass is wet, ~I made it rain, I made the grass wet). But this means doubling the number of nodes in the dataset (which, since the number of probabilities is exponential in the number of nodes for a discrete dataset, is a *terrible* idea). You also might want to throw in a lot of consistency constraints which are not guaranteed to hold in an arbitrary graph, which makes things more awkward.

It is much simpler, conceptually and practically, to just have a rule to determine how interventions differ from observations in updating the state of the graph, that is, talking about P(rain|grass is wet) vs. P(rain|do(grass is wet)).
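To put rough numbers on the doubling argument, here is a small sketch that assumes every node is binary and the joint distribution is stored as one unrestricted table (no factorization). The function name is hypothetical.

```python
def full_joint_parameters(n_binary_nodes):
    """Free parameters in an unrestricted joint table over binary nodes."""
    return 2 ** n_binary_nodes - 1

# Compare the original graph with one that adds an
# "was this node intervened on?" indicator for every node.
for n in (2, 5, 10):
    print(n, full_joint_parameters(n), full_joint_parameters(2 * n))
```

Under these assumptions, doubling the node count roughly squares the table size, which is the cost the comment is pointing at.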

So, we could add a node to the graph for every single node, which corresponds to whether or not that node was the subject of an intervention.

In fact, Phil Dawid does precisely this. What he ends up with is still interventions. (Of course he (I think!) does not believe in *counterfactuals*, but that is a long discussion.)

So, we could add a node to the graph for every single node

That assumes we're doing graphs and networks.

My problems in this subthread really started when the causal model was defined as "a set of joint distributions defined over potential outcome random variables" -- notice how nothing like networks or interventions is mentioned here -- and I got curious why a plain-vanilla Bayesian model which also produces a set of joint distributions doesn't qualify.

It probably just was a bad definition.

Sorry this is a response to an old comment, but this is an easy to clarify question.

A potential outcome Y(a) is a random variable under an intervention, e.g. Y under do(a). It's just a different notation from a different branch of statistics.

We may or may not choose to use graphs to represent causality (or indeed probability). Some people like graphs, others do not. Graphs do not add anything, they are just a visual representation.

I agree with pragmatist's explanation. But let me add a bit more detail to illustrate that a temporal ordering will not save you here. Imagine instead of two variables we have three variables: rain (R), my grass being wet (G1), and my neighbor's grass being wet (G2). Clearly R precedes both G1 and G2, and G1 and G2 are contemporaneous. In fact, we can even consider G2 to be my neighbor's grass 1 hour in the future (so clearly G1 precedes G2!).

Also clearly, p(R = yes | G1 = wet) is high, and p(R = yes | G2 = wet) is high, also p(G1 = wet | R = yes) is high, and p(G2 = wet | R = yes) is high.

So by hosing my grass I am making it more likely that my neighbor's grass one hour from now will be wet?
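The three-variable example can be checked by exact enumeration. This is a sketch under stated assumptions: rain is the only common cause of both lawns, the probabilities are invented, and an intervention on G1 is modeled by replacing its mechanism with a constant.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_R = {True: 0.3, False: 0.7}      # P(rain)
P_WET = {True: 0.9, False: 0.1}    # P(lawn wet | rain), same for both lawns

def joint(r, g1, g2, do_g1=None):
    """Joint probability of (R, G1, G2); do_g1 forces G1 regardless of rain."""
    p = P_R[r]
    if do_g1 is None:
        p *= P_WET[r] if g1 else 1 - P_WET[r]
    else:
        p *= 1.0 if g1 == do_g1 else 0.0
    p *= P_WET[r] if g2 else 1 - P_WET[r]
    return p

def p_g2_wet(given_g1=None, do_g1=None):
    """P(G2 = wet), optionally conditioning on or intervening on G1."""
    num = den = 0.0
    for r, g1, g2 in product([True, False], repeat=3):
        if given_g1 is not None and g1 != given_g1:
            continue
        p = joint(r, g1, g2, do_g1)
        den += p
        if g2:
            num += p
    return num / den

print(p_g2_wet())                  # baseline
print(p_g2_wet(given_g1=True))     # observing my wet lawn raises it
print(p_g2_wet(do_g1=True))        # hosing my lawn leaves it at baseline
```

Observing G1 is informative about G2 because both depend on R; setting G1 by hand is not, even though G1 comes first in time.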

Or, to be more succinct : http://www.smbc-comics.com/index.php?db=comics&id=1994#comic

Yeah, well, I've heard somewhere that correlation does not equal causation :-)

I agree that causal models are useful -- if only because they make explicit certain relationships which are implicit in plain-vanilla regular models and so trip up people on a regular basis. What I'm not convinced of is that you can't re-express that joint density on the outcomes in a conventional way even if it turns out to look a bit awkward.

Here's how this conversation played out.

Lumifer : "can we not express cause effect relationships via conditioning probabilities?"

me : "No: [example]."

Lumifer : "Ah, but this is silly because of time ordering information."

me : "Time ordering doesn't matter: [slight modification of example]."

Lumifer : "Yeah... causal models are useful, but it's not clear they cannot be expressed via conditioning probabilities."

I guess you can lead a horse to water, but you can't make him drink. I have given you everything, all you have to do is update and move on. Or not, it's up to you.

Decision problems have the form of "What do you do in situation X to maximize a defined utility function?"

Yes, but what you are describing is a modelling problem. "Is the drug killing them or helping them?" is not a decision problem, although "Which drug should we give them to save their lives?" is. These are two very different problems, possibly with different answers!

It is very easy to transform any causal modeling example into a decision problem.

Yes, but in the process it becomes a new problem. Although, you are right that modelling is in some respects an 'easier' problem than making decisions. That's also the reason I wrote my top-level comment, saying that it is true that something you can identify in an AI is the ability to model the world.

I guess my point was that there is a trivial reduction (in the complexity theory sense of the word) here, namely that decision theory is "modeling-complete." In other words, if we had an algorithm for solving a certain class of decision problems correctly, we would automatically have an algorithm for correctly handling the corresponding model (otherwise how could we get the decision problem right?).

Prediction cannot solve causal decision problems, but the reason it cannot is that it cannot solve the underlying modeling problem correctly. (If it could, there is nothing more to do, just integrate over the utility).
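A sketch of the "just integrate over the utility" step, assuming the modeling problem has already been solved and we hold P(outcome | do(action)) for each action. The actions, outcomes, and numbers are all hypothetical.

```python
def best_action(outcome_dist, utility):
    """Pick the action with the highest expected utility.

    outcome_dist maps action -> {outcome: P(outcome | do(action))}.
    """
    def expected_utility(action):
        return sum(p * utility[o] for o, p in outcome_dist[action].items())
    return max(outcome_dist, key=expected_utility)

# Hypothetical interventional distributions and utilities.
outcome_dist = {
    "give drug": {"survive": 0.8, "die": 0.2},
    "withhold drug": {"survive": 0.6, "die": 0.4},
}
utility = {"survive": 1.0, "die": 0.0}

print(best_action(outcome_dist, utility))
```

All the difficulty sits in obtaining `outcome_dist`; once the interventional distributions are right, the decision is a weighted sum and a comparison.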

We give patients a drug [...] Is the drug killing them or helping them?

It seems to me that a sufficiently smart prediction machine could answer questions of this kind. E.g., suppose what it really is is a very fast universe simulator. Simulate a lot of patients, diddle with their environments, either give each one the drug or not, repeat with different sets of parameters. I'm not actually recommending this (it probably isn't possible, it produces *interesting* ethical issues if the simulation is really accurate, etc.) but the point is that *merely being a predictor* as such doesn't imply inability to answer causal questions.

the standard language of prediction employed in ML

Was Yann LeCun saying (1) "AI is all about prediction in the ordinary informal sense of the word" or (2) "AI is all about prediction in the sense in which it's discussed formally in the machine learning community"? I thought it was #1.

Simulate a lot of patients

Simulations (and computer programs in general -- think about how debuggers for computer programs work) are causal models, not purely predictive models. Your answer does no work, because being able to simulate at that level of fidelity means we are already Done with the science of what we are simulating. In particular our simulator will contain in it a very detailed causal model that would contain answers to everything we might want to know. The question is what do we do when our information isn't very good, not when we can just say "let's ask God."

This is a quote from an ML researcher today, who is talking about what is done today. And what is done today for purely predictive modeling are those crazy deep learning networks or support vector machines they have in ML. Those are algorithms specifically tailored to answering p(Y | X) kinds of questions (e.g. prediction questions), not causal questions.

edit: to add to this a little more. I think there is a general mathematical principle at play here, which is similar in spirit to Occam's razor. This principle is : "try to use the weakest assumptions needed to get the right answer." It is this principle that makes "Omega-style simulations" an unsatisfactory answer. It's a kind of overfitting of the entire scientific process.

A good enough prediction engine can substitute, to a degree, for a causal model. Obviously, not always and once you get outside of its competency domain it will break, but still -- if you can forecast very well what effects an intervention will produce, your need for a causal model is diminished.

I see. So then if I were to give you a causal decision problem, can you tell me what the right answer is using only a prediction engine? I have a list of them right here!

The general form of these problems is : "We have a causal model where an outcome is death. We only have observational data obtained from this causal model. We are interested in whether a given intervention will reduce the death rate. Should we do the intervention?"

Observational data is enough for the predictor, right? (But the predictor doesn't get to see what the causal model is, after all, it just works on observational data and is agnostic of how it came about).
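One way to see the difficulty is a Simpson's-paradox style sketch. The counts below are invented, and the adjustment is valid only under the causal assumption that disease severity is the sole confounder, which is exactly the kind of information the observational data alone does not supply.

```python
# (treatment, severity) -> (deaths, patients); made-up counts.
COUNTS = {
    ("drug", "mild"): (5, 100),
    ("drug", "severe"): (120, 400),
    ("no drug", "mild"): (40, 400),
    ("no drug", "severe"): (40, 100),
}
TREATMENTS = ("drug", "no drug")
SEVERITIES = ("mild", "severe")

def naive_death_rate(treatment):
    """P(death | treatment): what conditioning on the raw data gives."""
    deaths = sum(COUNTS[(treatment, z)][0] for z in SEVERITIES)
    patients = sum(COUNTS[(treatment, z)][1] for z in SEVERITIES)
    return deaths / patients

def adjusted_death_rate(treatment):
    """P(death | do(treatment)), assuming severity is the only confounder."""
    total = sum(n for _, n in COUNTS.values())
    rate = 0.0
    for z in SEVERITIES:
        p_z = sum(COUNTS[(t, z)][1] for t in TREATMENTS) / total
        deaths, patients = COUNTS[(treatment, z)]
        rate += (deaths / patients) * p_z
    return rate

for t in TREATMENTS:
    print(t, naive_death_rate(t), adjusted_death_rate(t))
```

With these counts the drug looks harmful when you condition on treatment and helpful after adjusting for severity. Which of the two answers is right depends on the causal structure, not on the counts.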

So then if I were to give you a causal decision problem, can you tell me what the right answer is using only a prediction engine?

A **good enough** prediction engine, yes.

We only have observational data obtained from this causal model.

Huh? You don't obtain observational data from a model, you obtain it from reality.

Observational data is enough for the predictor, right?

That depends. I think I understand prediction models wider than you do. A prediction model can use any kind of input it likes if it finds it useful.

Huh? You don't obtain observational data from a model, you obtain it from reality.

Right, the data comes from the territory, but we assume the map is correct.

That depends. I think I understand prediction models wider than you do.

The point is, if your 'prediction model' has a rich enough language to incorporate the causal model, it's no longer purely a prediction model as everyone in the ML field understands it, because it can then also answer counterfactual questions. In particular, if your prediction model *only* uses the language of probability theory, it cannot incorporate any causal information because it cannot talk about counterfactuals.

So are you willing to take me up on my offer of solving causal problems with a prediction algorithm?

the data comes from the territory, but we assume the map is correct.

You don't need any assumptions about the model to get observational data. Well, you need *some* to recognize what you are looking at, but certainly you don't need to assume the correctness of a causal model.

no longer purely a prediction model as everyone in the ML field understands it

We may be having some terminology problems. Normally I call a "prediction model" anything that outputs testable forecasts about the future. Causal models are a subset of prediction models. Within the context of this thread I understand "prediction model" as a model which outputs forecasts and which does not depend on simulating the mechanics of the underlying process. It seems you're thinking of "pure prediction models" as something akin to "technical" models in finance which look at price history, only at price history, and nothing but the price history. So a "pure prediction model" would be to you something like a neural network into which you dump a lot of more or less raw data but you do not tweak the NN structure to reflect your understanding of how the underlying process works.

Yes, I would agree that a prediction model cannot talk about counterfactuals. However I would not agree that a prediction model can't successfully forecast on the basis of inputs it never saw before.

So are you willing to take me up on my offer of solving causal problems with a prediction algorithm?

Good prediction algorithms are domain-specific. I am not defending an assertion that you can get some kind of a Universal Problem Solver out of ML techniques.

Surely any prediction device that would be called "intelligent" by anyone less gung-ho than, say, Ray Kurzweil would enable you to ask it questions like "suppose I -- with my current genome -- chose to smoke; then what?" and "suppose I -- with my current genome -- chose not to smoke; then what?".

But it would be better if you could ask: "suppose I chose to smoke, but my genome and any other similar factors I don't know about were to stay as they are, then what?" where the other similar factors are things that cause smoking.

I don't think he said an AI is not a world-optimizer. He's saying "What you can **identify** in intelligence...", and this is absolutely true. An intelligent optimizer needs a world-model (a predictor) in order to work.

"What you can identify in intelligence is it can predict what is going to happen in the world" made me realize that there's a big conceptual split in the culture between intelligence and action. Intelligence and action aren't the same thing, but the culture almost has them in opposition.

As an outsider I kind of get the impression that there is a bit of looking-under-the-streetlamp syndrome going on here where world-modelling is assumed to be the most/only important feature because that's what we can currently do well. I got the same impression seeing Jeff Hawkins speaking at a conference recently.

It is interesting that his view of AI is apparently that of a prediction tool [...] rather than of a world optimizer.

If you can predict well enough, you can pass the Turing test - with a little training data.

This is not very surprising, given his background in handwriting and image recognition.

Could you elaborate on the connections between image recognition / interpretation and prediction? For this reply, it's fine to be only roughly accurate. (In case an inability to be sufficiently rigorous is what prevented you from sketching the connection.)

...naively, I think of intelligence as, say, an ability to identify and solve problems. Is LeCun saying perhaps that this is equal to prediction, or not as important as prediction, or that he's more interested in working on the latter?

Here is one of my efforts to explain the links: Machine Forecasting.

I concur. To predict, is everything there is about intelligence, really.

If a program could predict what I am going to type in here, it would be as intelligent as I am. At least in this domain. It could post instead of me.

But the same goes for every other domain. To predict every action of an intelligent agent, is to be as intelligent as he is.

I don't see a case, where this symmetry breaks down.

EDIT: But this is an old idea. Decades old, nothing very new.

You're talking about predicting the actions of an intelligent agent.

LeCun is talking about predicting the environment. These are two different concepts.

No, they are not. Every intelligent agent is just a piece of environment.

Intelligence can exist even in isolation from any other intelligent agents. Indeed, the first super-intelligent agent is likely to be without peer.

Look! The point is about predicting and intelligence. Doesn't matter what a predictor has around itself. It's just predicting. That's what it does.

And what does a (super)intelligence do? It predicts. Very well, probably.

A dichotomy is needless.

Some examples:

- predicting the solution of a partial differential equation
- predicting the best method to solve the given equation
- predicting how a process might behave
- predicting the best action you may take to achieve a goal
- predicting the best possible move in a given chess position
- predicting what a cyphered message is about ...

I predict you can't give me a counterexample where an obviously intelligent solution can't be regarded as a prediction.

This went under the name of SP theory, long ago: the idea that prediction, compression and intelligence are actually the same thing.

Almost tautological, but inescapable.

predicting the best possible move in a given chess position

In order to do this you need training data on what the optimal move is. This may not exist, or it limits you to doing only as well as the player you are predicting.

Additionally, predicting is inherently less optimal than search, unless your predictions are 100% perfect. You are choosing moves because you *predict* they are optimal, rather than because it's the best move you've found. If for example, you try to play by predicting what a chessmaster would do, your play will necessarily be worse than if you just play normally.
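A toy illustration of the imitation-versus-search point, using a small subtraction game instead of chess: players alternately take 1 to 3 stones and whoever takes the last stone wins. The "expert" being imitated is hypothetical and deliberately weak; the sketch covers only the case of predicting an imperfect player, not an ideal optimal-move predictor.

```python
from functools import lru_cache

MOVES = (1, 2, 3)

@lru_cache(maxsize=None)
def wins(n):
    """True if the player to move with n stones left can force a win."""
    return any(m <= n and not wins(n - m) for m in MOVES)

def search_move(n):
    """Search: return a move that leaves the opponent in a losing position."""
    for m in MOVES:
        if m <= n and not wins(n - m):
            return m
    return 1  # every move loses, so take a single stone

def imitate_expert(n):
    """Prediction: copy a hypothetical expert who always takes one stone."""
    return 1

n = 7
print(search_move(n), wins(n - search_move(n)))        # opponent left losing
print(imitate_expert(n), wins(n - imitate_expert(n)))  # opponent left winning
```

From 7 stones, search finds the winning move while the imitator reproduces the expert's mistake.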

They are closely related but not the same thing.

A counterexample is chess.

What does an ideal chess player do? It predicts which move is optimal. It may be a tricky feat, but he is good and predicts it well.

I looked through this thread in the past few minutes and I clearly saw this "ideological division". Few people think as I do. Others say you can't solve causal problems with mere prediction, but they don't give a clear example.

Don't you agree, that an ideal "best next chess move predictor" is the strongest possible chess player?

It predicts which move is optimal.

Maybe it would be useful to define terms, to make things more clear.

If you have a time-process X, and t observations from this process, a predictor comes up with a prediction as to what X_t+1 will be.

On the other hand, given a utility function f() on a series of possible outcomes Y from t+1 to infinity, a decision maker finds the best Y_t+1 to choose to maximize the utility function.

Note that the definition of these two things is not the same: a predictor is concerned about the past and immediate present, whereas a decision maker is concerned with the future.
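The two definitions can be put side by side in a short sketch. Everything here is a simplifying assumption: the predictor is a plain linear extrapolation, and the decision maker's infinite horizon is truncated to a single step with a hypothetical utility function.

```python
def predict_next(history):
    """Predictor: guess X_{t+1} from the observations so far.

    Uses linear extrapolation of the last two points as a stand-in.
    """
    return history[-1] + (history[-1] - history[-2])

def decide_next(candidates, utility):
    """Decision maker: choose the Y_{t+1} that maximizes the utility."""
    return max(candidates, key=utility)

observations = [1.0, 2.0, 3.0]
print(predict_next(observations))

# Hypothetical utility that prefers outcomes close to 2.
print(decide_next([0, 1, 2, 3], lambda y: -(y - 2) ** 2))
```

The predictor takes only past data as input, while the decision maker takes a utility function and a set of options; that difference in inputs is the distinction the comment draws.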

a predictor comes up with a prediction as to what X_t+1 will be

This "t+1" might be "t+X". Results for a large X may be very bad. So may the results for "t+1". Still, he makes his best predictions.

whereas a decision maker is concerned with the future

He predicts the best decision, which can be taken.

In part of the interview LeCun is talking about predicting the actions of Facebook users, e.g. "Being able to predict what a user is going to do next is a key feature"

But not predicting everything they do and exactly what they'll type.