Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-04-18T03:09:05.561Z · LW · GW

>"If I value apples at 3 units and oranges at 1 unit, I don't want at 75%/25% split. I only want apples, because they're better! (I have no diminishing returns.)"

I think what I'd have to ask here is: if you only want apples, why are you spending your money on oranges? If you will not actually pay me 1 unit for an orange, why do you claim you value oranges at 1 unit?

Another construal: you value oranges at 1 orange per 1 unit because if I offer you a lottery over those and let you set the odds yourself, you will choose to set them to 50/50. You're indifferent to which one you receive, so you value them equally. We do the same trick with apples and find you value them at 3 units per 1 apple.

I now offer you a lottery between receiving 3 apples and 1 orange, and I'll let you pay 3 units to tilt the odds by one expected apple. Since the starting point was 1.5 expected apples and 0.5 expected oranges, and you insist you want only 3 expected apples and 0 expected oranges, I believe I can make you end up paying more than 3 units per apple now, despite our having established that as your "price".

The lesson is, I think, don't offer to pay finite amounts of money for outcomes you want literally zero of, as someone may in fact try to take you up on it.

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-08T03:42:26.001Z · LW · GW

So we could quibble over the details of Friston 2009, *buuuuut*...

I don't find it useful to take Friston at 110% of his word. I find it more useful to read him like I read all other cognitive modelers: as establishing a language and a set of techniques whose scientific rigor he demonstrates via their application to novel experiments and known data.

He's no more an absolute gold-standard than, say, Dennett, but his techniques have a certain theoretical elegance in terms of positing that the brain is built out of very few, very efficient core mechanisms, applied to abundant embodied training data, instead of very many mechanisms with relatively little training or processing power for each one.

Rather than quibble over him, I think that this morning in the shower I got what he means on a slightly deeper level, and now I seriously want to write a parody entitled, "So You Want to Write a Friston Paper".

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-08T03:36:54.885Z · LW · GW

Oh hey, so that's the original KL control paper. Saved!

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-06T16:28:18.298Z · LW · GW

Oh, I wasn't really trying at all to talk about what prediction-error minimization "really does" there, more to point out that it changes radically depending on your modeling assumptions.

The "distal causes" bit is also something I really want to find the time and expertise to formalize. There are studies of causal judgements grounding moral responsibility of agents and I'd really like to see if we can use the notion of distal causation to generalize from there to how people learn causal models that capture action-affordances.

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-06T15:30:44.422Z · LW · GW

>But this definitely seems like the better website to talk to Eli Sennesh on :)

Somewhat honored, though I'm not sure we've met before :-).

I'm posting here mostly by now, because I'm... somewhat disappointed with people saying things like, "it's bullshit" or "the mathematical parts of this model are pulled directly from the posterior".

IMHO, there's a lot to the strictly neuroscientific, biological aspects of the free-energy theory, and it integrates well with physics (good prediction resists disorder, "Thermodynamics of Prediction") and with evolution (predictive regulation being the unique contribution of the brain).

Mathematically, well, I'm sure that a purely theoretical probabilist or analyst can pick everything up quickly.

Computationally and psychologically, it's a hot mess. It feels, to me at least, like trying to explain a desktop computer by recourse to saying, "It successively and continually attempts to satisfy its beliefs under the logical model inherent to its circuitry", that is, to compute a tree of NANDS of binary inputs. Is the explanation literally true? Yes! Why? Because it's a universal explanation of the most convenient way we know of to implement Turing-complete computation in hardware.

But bullshit? No, I don't think so.

I wind up putting Friston in the context of Tenenbaum, Goodman, Gershman, etc. Ok, it makes complete sense that the most primitive hardware-level operations of the brain may be probabilistic. We have plenty of evidence that the brain does probabilistic inference on multiple levels, including the seeming "top-down" ones like decision making and motor control. Having evolved one useful mechanism, it makes sense that evolution would just try to put more and more of them together, like Lego blocks, occasionally varying the design slightly to implement a new generative model or inference method within the basic layer or microcircuit doing the probabilistic job.

That's still not a large-scale explanation of everything. It's a language. Telling you the grammar of C or Lisp doesn't teach you the architecture of Half Life 2. Showing that it's a probability model just shows you that you can probably write it in Church or Pyro given enough hardware, and those allow all computably sampleable distributions -- an immensely broad class of models!

On the other hand, if you had previously not even known what C or Turing machines were, and were just wondering how the guns and headcrabs got on the shiny box, you've made a big advance, haven't you?

I think about predictive brain models by trying to parse them as something like probabilistic programs:

  • What predictions? That is, what original generative model $P$, with what observable variables?
  • What inference methods? If variational, what sort of guide model $Q$? If Monte Carlo, what proposal $Q$?
  • Most importantly, which predictions are updated (via inference), and which are fulfilled (via action)?

The usual way to spot the latter in an active inference paper is to look for an equation saying something like $P(u) \propto \exp\left(-D_{KL}\left[Q(o) \,\|\, P(o \mid \text{goal})\right]\right)$. That denotes control states $u$ being sampled from a Boltzmann Distribution whose energy function is the divergence between empirical observations and actual goals.

The usual way to spot the latter in a computational cognitive science paper is just to look for an equation saying something like $a \sim P(a \mid \text{goal})$, which just says that you sample actions which make your goal most likely via ordinary conditionalizing.
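To make the contrast concrete, here is a minimal sketch of both action-selection styles on a toy problem. Everything in it is an assumption for illustration: the two actions, the outcome probabilities, the goal distribution, and the uniform prior over actions are made up, and the code is not taken from any particular paper.

```python
import math

def kl(q, p):
    """KL(q || p) for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

# Hypothetical toy world: outcomes are indexed [hungry, fed].
# outcome_given_action[a] is the predicted outcome distribution after action a.
outcome_given_action = {
    "stay": [0.9, 0.1],
    "eat": [0.2, 0.8],
}
goal = [0.05, 0.95]  # assumed goal prior: mostly want to observe "fed"
actions = list(outcome_given_action)

# Active-inference style: Boltzmann distribution over actions whose energy is
# the KL divergence between predicted outcomes and the goal distribution.
energies = [kl(outcome_given_action[a], goal) for a in actions]
boltzmann_policy = dict(zip(actions, softmax([-e for e in energies])))

# Planning-as-inference style: condition on the goal outcome "fed" and read
# off the posterior over actions (uniform prior over actions assumed).
FED = 1
prior = {a: 1.0 / len(actions) for a in actions}
joint = {a: prior[a] * outcome_given_action[a][FED] for a in actions}
evidence = sum(joint.values())
conditioned_policy = {a: joint[a] / evidence for a in actions}
```

Both policies put most of their mass on "eat" here, but they are different distributions: the first scores whole predicted outcome distributions against the goal, while the second only asks how likely the goal outcome is under each action.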

Like I said, all this probabilistic mind stuff is a language to learn, which then lets you read lots of neuroscience and cognitive science papers more fluently. The reward is that, once you understand it, you get a nice solid intuition that, on the one hand, some papers might be mistaken, but on the other hand, with a few core ideas like hierarchical probability models and sampling actions from inferences, we've got an "assembly language" for describing a wide variety of possible cognitions.

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-06T14:45:36.560Z · LW · GW

>I wonder if the conversion from mathematics to language is causing problems somewhere. The prose description you are working with is 'take actions that minimize prediction error' but the actual model is 'take actions that minimize a complicated construct called free energy'. Sitting in a dark room certainly works for the former but I don't know how to calculate it for the latter.

There's absolutely trouble here. "Minimizing surprise" always means, to Friston, minimizing sensory surprise under a generative model: $-\log P(o \mid m)$. The problem is that, of course, in the course of constructing this, you had to marginalize out all the interesting variables that make up your generative model, so you're really looking at $-\log \int P(o, z \mid m)\, dz$ or something similar.

Mistaking "surprise" in this context for the actual self-information of the empirical distribution of sense-data makes the whole thing fall apart.
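A small sketch may help show why surprise depends on the model rather than on the raw sense-data. The hidden causes, observations, and all probabilities below are hypothetical; the only point is that the same observation gets a different surprise under different generative models, because the hidden causes are summed out before the log is taken.

```python
import math

def surprise(o, prior_z, likelihood):
    """-log P(o | m), with the hidden cause z marginalized out."""
    marginal = sum(prior_z[z] * likelihood[z][o] for z in prior_z)
    return -math.log(marginal)

# Hypothetical likelihood P(o | z), shared by both models below.
likelihood = {
    "sandwich": {"see_sandwich": 0.9, "see_nothing": 0.1},
    "no_sandwich": {"see_sandwich": 0.05, "see_nothing": 0.95},
}

# Two hypothetical models that differ only in their prior over hidden causes.
lunchtime_model = {"sandwich": 0.3, "no_sandwich": 0.7}
midnight_model = {"sandwich": 0.01, "no_sandwich": 0.99}

s_lunch = surprise("see_sandwich", lunchtime_model, likelihood)
s_midnight = surprise("see_sandwich", midnight_model, likelihood)
```

Seeing a sandwich is more surprising under `midnight_model` than under `lunchtime_model`, even though the observation itself is identical.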

>In the paper I linked, the free energy minimizing trolleycar does not sit in the valley and do nothing to minimize prediction error. It moves to keep itself on the dynamic escape trajectory that it was trained with and so predicts itself achieving. So if we understood why that happens we might unravel the confusion.

If you look closely, Friston's downright cheating in that paper. First he "immerses" his car in its "statistical bath" that teaches it where to go, with only perceptual inference allowed. Then he turns off perceptual updating, leaving only action as a means of resolving free-energy, and points out that thusly, the car tries to climb the mountain as active inference proceeds.

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-06T03:08:45.954Z · LW · GW

Ok, now a post on motivation, affect, and emotion: attempting to explain sex, money, and pizza. Then I’ll try a post on some of my own theories/ideas regarding some stuff. Together, I’m hoping these two posts address the Dark Room Problem in a sufficient way. HEY SCOTT, you’ll want to read this, because I’m going to link a paper giving a better explanation of depression than I think Friston posits.

The following ideas come from one of my advisers who studies emotion. I may bungle it, because our class on the embodied neuroscience of this stuff hasn’t gotten too far.

The core of “emotion” is really this thing we call core affect, and it’s actually the core job of the brain, any biological brain, at all. This is: regulate the states of the internal organs (particularly the sympathetic and parasympathetic nervous systems) to keep the viscera functioning well and the organism “doing its job” (survival and reproduction).

What is “its job”? Well, that’s where we actually get programmed-in, innate “priors” that express goals. Her idea is, evolution endows organisms with some nice idea of what internal organ states are good, in terms of valence (goodness/badness) and arousal (preparedness for action or inaction, potentially: emphasis on the sympathetic or parasympathetic nervous system’s regulatory functions). You can think of arousal and sympathetic/parasympathetic as composing a spectrum between the counterposed poles of “fight or flight” and “rest, digest, reproduce”. Spending time in an arousal state affects your internal physiology, so it then affects valence. We now get one of the really useful, interesting empirical predictions to fall right out: young and healthy people like spending time in high-arousal states, while older or less healthy people prefer low-arousal states. That is, even provided you’re in a pleasurable state, young people will prefer more active pleasures (sports, video gaming, sex) while old people will prefer passive pleasures (sitting on the porch with a drink yelling at children). Since this is all physiology, basically everything impacts it: what you eat, how you socialize, how often you mate.

The brain is thus a specialized organ with a specific job: to proactively, predictively regulate those internal states (allostasis), because reactively regulating them (homeostasis) doesn’t work as well. Note that the brain now has its own metabolic demands and arousal/relaxation spectrum, giving rise to bounded rationality in the brain’s Bayesian modeling and feelings like boredom or mental tiredness. The brain’s regulation of the internal organs proceeds via closed-loop predictive control, which can be made really accurate and computationally efficient. We observe anatomically that the interoceptive (internal perception) and visceromotor (exactly what it says on the tin) networks in the brain are at the “core”, seemingly at the “highest level” of the predictive model, and basically control almost everything else in the name of keeping your physiology in the states prescribed as positive by evolution as useful proxies for survival and reproduction.

Get this wrong, however, and the brain-body system can wind up in an accidental positive feedback that moves it over to a new equilibrium of consistently negative valence with either consistent high arousal (anxiety) or consistent low arousal (depression). Depression and anxiety thus result from the brain continually getting the impression that the body is in shitty, low-energy, low-activity states, and then sending internal motor commands designed to correct the problem, which actually, due to brain miscalibration, make it worse. You sleep too much, you eat too much or too little, you don’t go outside, you misattribute negative valence to your friends when it’s actually your job, etc. Things like a healthy diet, exercise, and sunlight can try to bring the body closer to genuinely optimal physiological states, which helps it yell at the brain that actually you’re healthy now and it should stop fucking shit up by misallocating physiological resources.

“Emotions” wind up being something vaguely like your “mood” (your core affect system’s assessment of your internal physiology’s valence and arousal) combined with a causal “appraisal” done by the brain using sensory data, combined with a physiological and external plan of action issued by the brain.

You’re not motivated to sit in a Dark Room because the “predictions” that your motor systems care about are internal, physiological hyperparameters which can only be revised to a very limited extent, or which can be interpreted as some form of reinforcement signalling. You go into a Dark Room and your external (exteroceptive, in neuro-speak) senses have really low surprise, but your internal senses and internal motor systems are yelling that your organs say shit’s fucked up. Since your organs say shit’s fucked up, “surprise” is now very high, and you need to go change your external sensory and motor variables to deal with that shit.

Note that you can sometimes seek out calming, boring external sensory states, because your brain has demanded a lot from your metabolism and physiology lately, so it’s “out of energy” and you need to “relax your mind”.

Pizza becomes positively valenced when you are hungry, especially if you’re low on fats and glucose. Sex becomes most salient when your parasympathetic nervous system is dominant: your body believes that it’s safe, and the resources available for action can now be devoted to reproduction over survival.

Note that the actual physiological details here could, once again, be very crude approximations of the truth or straight-up wrong, because our class just hasn’t gotten far enough to really hammer everything in.

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-06T03:06:23.826Z · LW · GW

Ok, now the post where I go into my own theory on how to avoid the Dark Room Problem, even without physiological goals.

The brain isn’t just configured to learn any old predictive or causal model of the world. It has to learn the distal causes of its sensory stimuli: the ones that reliably cause the same thing, over and over again, which can be modeled in a tractable way.

If I see a sandwich (which I do right now, it’s lunchtime), one of the important causes is that photons are bouncing off the sandwich, hitting my eyes, and stimulating my retina. However, most photons don’t make me see a sandwich, they make me see other things, and trying to make a model complex enough that exact photon behavior becomes parameters instead of noise is way too complicated.

So instead, I model the cause of my seeing a sandwich as being the sandwich. I see a sandwich because there really is a sandwich.

The useful part about this is that since I’m modeling the consistent, reliable, repeatable causes, these same inferences also support and explain my active interventions. I see a sandwich because there really is a sandwich, and that explains why I can move my hands and mouth to eat the sandwich, and why when I eat the sandwich, I taste a sandwich. Photons don’t really explain any of that without recourse to the sandwich.

However, if I were to reach for the sandwich and find that my hands pass through it, I would have to expand my hypothesis space to include ghost sandwiches or living in a simulation. Some people think the brain can do this with nonparametric models: probabilistic models of infinite stuff, of which I use finite pieces to make predictions. When new data comes in that supports a more complex model, I just expand the finite piece of the infinite object that I’m actually using. The downside is, a nonparametric model will always, irreducibly have a bit of extra uncertainty “left over” when compared to a parametric model that started from the right degree of complexity. The nonparametric has more things to be uncertain about, so it’s always a little more uncertain.

How can these ideas apply to the Dark Room? Well, if I go into a Dark Room, I’m actually sealing myself off from the distal causes of sensations. The walls of the room block out what’s going on outside the room, so I have no idea when, for instance, someone might knock on the door. Really knowing what’s going on requires confidence about the distal causal structure of my environment, not just confidence about the proximal structure of a small local environment. Otherwise, I could always just say, “I’m certain that photons are hitting my eyeballs in some reasonable configuration”, and I’d never need to move or do any inferences at all.

It gets worse! If my model of those distal causes is nonparametric, it always has extra leftover uncertainty. No matter how confident I am about the stuff I’ve seen, I never have complete evidence that I’ve seen everything, that there isn’t an even bigger universe out there I haven’t observed yet.

So really “minimizing prediction error” with respect to a nonparametric model of distal causes ends up requiring that I not only leave my room, but that I explore and control as much of the world as possible, at all scales which ever significantly impact my observations, without limit.
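One hedged way to see the "leftover uncertainty" claim is through a specific nonparametric prior. The Chinese restaurant process used below is my choice of illustration, not something the argument above commits to, and the concentration parameter `alpha` is an assumed value. Under that prior, the probability that the next observation comes from a never-before-seen cause shrinks as data accumulates but never reaches zero.

```python
def prob_new_cluster(n_seen, alpha=1.0):
    """Chinese restaurant process: probability that observation n_seen + 1
    is assigned to a brand-new cluster, given n_seen earlier observations."""
    return alpha / (n_seen + alpha)

# The chance of "something I have never seen before" after n observations.
leftover = [prob_new_cluster(n) for n in (0, 9, 99, 999, 999_999)]
```

A parametric model with a fixed number of causes would put exactly zero mass on "a new kind of thing", which is the sense in which the nonparametric model always stays a little more uncertain.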

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-06T02:48:52.701Z · LW · GW


I now work in a lab allied to both the Friston branch of neuroscience and the probabilistic modeling branch of computational cognitive science, so I now feel arrogant enough to comment fluently.

I’m gonna leave a bunch of comments over the day as I get the spare time to actually respond coherently to stuff.

The first thing is that we have to situate Friston’s work in its appropriate context of Marr’s Three Levels of cognitive analysis: computational (what’s the target?), algorithmic (how do we want to hit it?), and implementational (how do we make neural hardware do it?).

Friston’s work largely takes place at the algorithmic and implementational levels. He’s answering How questions, and then claiming that they answer the What questions. This is rather like unto, as often mentioned, formulating Hamiltonian Mechanics and saying, “I’ve solved physics by pointing out that you can write any physical system in terms of differential equations for its conserved quantities.” Well, now you have to actually write out a real physical system in those terms, don’t you? What you’ve invented is a rigorous language for talking about the things you aim to explain.

The free-energy principle should be thought of like the “supervised loss principle”: it just specifies what computational proxy you’re using for your real goal. It’s as rigorous as using probabilistic programming to model the mind (caveat: one of my advisers is a probabilistic programming expert).

Now, my seminar is about to start soon, so I’ll try to type up a really short step-by-step of how we get to active inference. Let’s assume the example where I want to eat my nice slice of pizza, and I’ll try to type something up about goals/motivations later on. Suffice to say, since “free-energy minimization” is like “supervised loss minimization” or “reward maximization”, it’s meaningless to say that motivation is specified in free-energy terms. Of course it can be: that’s a mathematical tautology. Any bounded utility/reward/cost function can be expressed as a probability, and therefore a free-energy — this is the Complete Class Theorem Friston always cites, and you can make it constructive using the Boltzmann Distribution (the simplest exponential family) for energy functions.

1) Firstly, free-energy is just the negative of the Evidence Lower Bound (ELBO) usually maximized in variational inference. You take a $P$ (a model of the world whose posterior you want to approximate), and a $Q$ (a model that approximates it), and you optimize the variational parameters (the parameters with no priors or conditional densities) of $Q$ by maximizing the ELBO, to get a good approximation to $P(H \mid D)$ (probability of hypotheses, given data). This is normal and understandable and those of us who aren’t Friston do it all the time.

2) Now you add some variables to $P$: the body’s proprioceptive states, its sense of where your bones are and what your muscles are doing. You add a $P(b)$ over body states $b$, with some conditional $P(D \mid H, b)$ to show how other senses depend on body position. This is already really helpful for pure prediction, because it helps you factor out random noise or physical forces acting on your body from your sensory predictions to arrive at a coherent picture of the world outside your body. You now have $P(H, b \mid D)$.

3) For having new variables in the posterior, $P(H, b \mid D)$ (with $b$ the body state), you now need some new variables in $Q$. Here’s where we get the interesting insight of active inference: if the old $P(H \mid D)$ was approximated as $Q(H; \phi)$, we can now expand to $Q(H, b; \phi, a)$. Instead of inferring a parameter that approximates the proprioceptive state, we infer a parameter $a$ that can “compromise” with it: the actual body moves to accommodate $a$ as much as possible, while $a$ also adjusts itself to kinda suit what the body actually did.

Here’s the part where I’m really simplifying what stuff does, to use more of a planning as inference explanation than “pure” active inference. I could talk about “pure” active inference, but it’s too fucking complicated and badly-written to get a useful intuition. Friston’s “pure” active inference papers often give models that would have very different empirical content from each other, but which all get optimized using variational inference, so he kinda pretends they’re all the same. Unfortunately, this is something most people in neuroscience or cognitive science do to simplify models enough to fit one experiment well, instead of having to invent a cognitive architecture that might fit all experiments badly.

4) So now, if I set a goal by clamping some variables in $P$ (or by imposing “goal” priors on them, clamping them to within some range of values with noise), I can’t really just optimize the world-model parameters $\phi$ to fit the new clamped model. $Q$ is really $Q(H, b; \phi, a)$, and has to approximate $P(H, b \mid D)$, the posterior over hypotheses $H$ and body state $b$ given data $D$. Instead, I can only optimize the motor parameter $a$ to fit the clamped $P$. Actually doing so reaches a “Bayes-optimal” compromise between my current bodily state and really moving. Once $P$ already carries a good dynamical model (through time) of how my body and senses move (trajectories through time), changing $a$ as a function of time lets me move as I please, even assuming my actual movements may be noisy with respect to my motor commands.

That’s really all “active inference” is: variational inference with body position as a generative parameter, and motor commands as the variational parameter approximating it. You set motor commands to get the body position you want, then body position changes noisily based on motor commands. This keeps getting done until the ELBO is maximized/free-energy minimized, and now I’m eating the pizza (as a process over time).
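The loop described above can be sketched in a few lines, under heavy simplifying assumptions that are mine rather than Friston's: a one-dimensional body position, Gaussian noise everywhere, a guide of the form Normal(a, fixed variance), and made-up values for the variances, the gain, and the motor noise. With those assumptions the free energy is quadratic in the motor command `a`, so its minimizer has a closed form.

```python
import random

def best_motor_command(x, goal, var_sense=1.0, var_goal=0.25):
    """Maximize the ELBO over a for a guide Normal(a, fixed variance).

    Up to constants that do not depend on a, the free energy is
        F(a) = (x - a)**2 / (2 * var_sense) + (a - goal)**2 / (2 * var_goal),
    so the minimizer is the precision-weighted average of the sensed
    position x and the goal position.
    """
    w_sense, w_goal = 1.0 / var_sense, 1.0 / var_goal
    return (w_sense * x + w_goal * goal) / (w_sense + w_goal)

def reach(start, goal, steps=50, gain=0.5, motor_noise=0.01, seed=0):
    """Alternate between choosing a and letting the body move noisily toward it."""
    rng = random.Random(seed)
    x = start
    trajectory = [x]
    for _ in range(steps):
        a = best_motor_command(x, goal)  # a compromises between body and goal
        x += gain * (a - x) + rng.gauss(0.0, motor_noise)  # body follows a, noisily
        trajectory.append(x)
    return trajectory

trajectory = reach(start=0.0, goal=1.0)
```

Each step, `a` lands between where the body is and where the goal prior says it should be, and the body then moves part of the way toward `a`. Repeating this drives the position to the goal, up to motor noise.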

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-06T00:59:59.455Z · LW · GW

Actually, here's a much simpler, more intuitive way to think about probabilistically specified goals.

Visualize a probability distribution as a heat map of the possibility space. Specifying a probabilistic goal then just says, "Here's where I want the heat to concentrate", and submitting it to active inference just uses the available inferential machinery to actually squeeze the heat into that exact concentration as best you can.

When our heat-map takes the form of "heat" over dynamical trajectories, possible "timelines" of something that can move, "squeezing the heat into your desired concentration" means exactly "squeezing the future towards desired regions". All you're changing is how you specify desired regions: from giving them an "absolute" value (that can actually undergo any linear transformation and be isomorphic) to giving them a purely "relative" value (relative to disjoint events in your sample space).

This is fine, because after all, it's not like you could really have an "infinite" desire for something finite-sized in the first place. If you choose to think of utilities in terms of money, the "goal probabilities" are just the relative prices you're willing to pay for a certain outcome: you start with odds, the number of apples you'll trade for an orange, and convert from odds to probabilities to get your numbers. It's just using "barter" among disjoint random events instead of "currency".
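As a worked instance of the odds-to-probabilities conversion, here is a tiny sketch. The function name and the 3-to-1 valuation of apples against oranges are illustrative assumptions.

```python
def goal_probabilities(relative_values):
    """Normalize relative prices (odds) over disjoint outcomes into probabilities."""
    total = sum(relative_values.values())
    return {outcome: value / total for outcome, value in relative_values.items()}

# Hypothetical valuation: one apple trades for three oranges.
goal = goal_probabilities({"apple": 3.0, "orange": 1.0})
```

Here `goal` comes out as 0.75 for apples and 0.25 for oranges. Only the ratio matters, so rescaling every value by the same positive factor leaves the goal distribution unchanged.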

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-06T00:51:11.282Z · LW · GW

>Can you please link me to more on this? I was under the impression that pascal's mugging happens for any utility function that grows at least as fast as the probabilities shrink, and the probabilities shrink exponentially for normal probability functions. (For example: In the toy model of the St. Petersburg problem, the utility function grows exactly as fast as the probability function shrinks, resulting in infinite expected utility for playing the game.)

The Complete Class Theorem says that bounded cost/utility functions are isomorphic to posterior probabilities optimizing their expected values. In that sense, it's almost a trivial result.

In practice, this just means that we can exchange the two whenever we please: we can take a probability and get an entropy to minimize, or we can take a bounded utility/cost function and bung it through a Boltzmann Distribution.
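The exchange in both directions can be sketched as follows. The outcome names, the utility values, and the temperature are assumptions chosen for illustration, and the temperature in particular is a free parameter that the argument above does not fix.

```python
import math

def boltzmann(utilities, temperature=1.0):
    """Turn bounded utilities into a probability distribution, p proportional to exp(u / T)."""
    m = max(utilities.values())  # subtract the max for numerical stability
    weights = {k: math.exp((u - m) / temperature) for k, u in utilities.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

def surprisal(probs):
    """Turn a probability distribution back into costs to minimize, -log p."""
    return {k: -math.log(p) for k, p in probs.items()}

# Hypothetical utilities over three disjoint outcomes.
utilities = {"X": 200.0, "Y": 150.0, "Z": 24.0}
probs = boltzmann(utilities, temperature=100.0)
costs = surprisal(probs)
```

The round trip preserves utility differences up to the temperature scale: `costs["Y"] - costs["X"]` equals `(utilities["X"] - utilities["Y"]) / 100.0`. The additive constant is lost, which is harmless because shifting every utility by a constant changes no decisions.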

>Also: As I understand them, utility functions aren't of the form "I want to see X P often and Y 1-P often." They are more like "X has utility 200, Y has utility 150, Z has utility 24..." Maybe the form you are talking about is a special case of the form I am talking about, but I don't yet see how it could be the other way around. As I'm thinking of them, utility functions aren't about what you see at all. They are just about the world. The point is, I'm confused by your explanation & would love to read more about this.

I was speaking loosely, so "I want to see X" can be taken as, "I want X to happen". The details remain an open research problem of how the brain (or probabilistic AI) can or should cash out, "X happens" into "here are all the things I expect to observe when X happens, and I use them to gather evidence for whether X has happened, and to control whether X happens and how often".

For a metaphor of why you'd have "probabilistic" utility functions, consider it as Bayesian uncertainty: "I have degree of belief P that X should happen, and degree of belief 1-P that something else should happen."

One of the deep philosophical differences is that both Fristonian neurosci and Tenenbaumian cocosci assume that stochasticity is "real enough for government work", and so there's no point in specifying "utility functions" over "states" of the world in which all variables are clamped to fully determined values. After all, you yourself as a physically implemented agent have to generate waste heat, so there's inevitably going to be some stochasticity (call it uncertainty that you're mathematically required to have) about whatever physical heat bath you dumped your own waste heat into.

(That was supposed to be a reference to Eliezer's writing on minds doing thermodynamic work (which free-energy minds absolutely do!), not a poop joke.)

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-06T00:42:36.118Z · LW · GW

Honestly, I've just had to go back and forth banging my head on Friston's free-energy papers, non-Friston free-energy papers, and the ordinary variational inference literature -- for the past two years, prior to which I spent three years banging my head on the Josh Tenenbaum-y computational cog-sci literature and got used to seeing probabilistic models of cognition.

I'm now really fucking glad to be in a PhD program where I can actually use that knowledge.

Oh, and btw, everyone at MIRI was exactly as confused as Scott is when I presented a bunch of free-energy stuff to them last March.

Comment by Eli Sennesh (eli-sennesh) on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-05T18:13:46.077Z · LW · GW

>The various papers don't all even implement the same model - the free energy principle seems to be more a design principle than a specific model.

Bingo. Friston trained as a physicist, and he wants the free-energy principle to be more like a physical law than a computer program. You can write basically any computer program that implements or supports variational inference, throw in some action states as variational parameters, and you've "implemented" the free-energy principle _in some way_.

Overall, the Principle is more of a domain-specific language than a single unified model, more like "supervised learning" than like "this 6-layer convnet I trained for neural style transfer."

>Are there priors that cannot be represented as utility functions, or vice versa?

No. They're isomorphic, via the Complete Class Theorem. Any utility/cost function that grows sub-super-exponentially (ie: for which Pascal's Mugging doesn't happen) can be expressed as a distribution, and used in the free-energy principle. You can get the intuition by thinking, "This goal specifies how often I want to see outcome X (P), versus its disjoint cousins Y and Z that I want to see such-or-so often (1-P)."

>What explore/exploit tradeoffs do free-energy models lead to, or can they encode any given tradeoff?

This is actually one of the Very Good things about free-energy models: since free-energy is "Energy - Entropy", or "Exploit + Explore", cast in the same units (bits/nats from info theory), it theorizes a principled, prescriptive way to make the tradeoff, once you've specified how concentrated the probability mass is under the goals in the support set (and thus the multiplicative inverse of the exploit term's global optimum).
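A minimal numerical sketch of the "Energy - Entropy" decomposition, on a made-up three-outcome problem. The energies are assumed values standing in for negative log goal probabilities. It compares a purely greedy distribution, a purely exploratory one, and the Boltzmann distribution that minimizes free energy.

```python
import math

def free_energy(q, energy):
    """F[q] = E_q[energy] - H[q], in nats."""
    expected_energy = sum(q[k] * energy[k] for k in q)  # "exploit" term
    entropy = -sum(p * math.log(p) for p in q.values() if p > 0)  # "explore" term
    return expected_energy - entropy

# Hypothetical energies: lower energy means a more desired outcome.
energy = {"X": 0.0, "Y": 1.0, "Z": 3.0}

z = sum(math.exp(-e) for e in energy.values())
boltzmann_q = {k: math.exp(-e) / z for k, e in energy.items()}
greedy_q = {"X": 1.0, "Y": 0.0, "Z": 0.0}  # all exploit, no explore
uniform_q = {k: 1.0 / len(energy) for k in energy}  # all explore, no exploit

f_boltzmann = free_energy(boltzmann_q, energy)
f_greedy = free_energy(greedy_q, energy)
f_uniform = free_energy(uniform_q, energy)
```

The Boltzmann distribution achieves the minimum, which equals `-math.log(z)`, and both extremes do worse. How peaked the optimum is depends only on how spread out the energies are, which is the sense in which the goal specification fixes the tradeoff.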

We ought to be able to use this to test the Principle empirically, I think.

(EDIT: Dear God, why was everything bold!?)