Beren's "Deconfusing Direct vs Amortised Optimisation"

post by DragonGod · 2023-04-07T08:57:59.777Z · LW · GW · 10 comments

Contents

    Preamble
  Two Approaches to Optimisation
    Direct Optimisers
    Amortised Optimisers
  Differences
  Some Commentary

Preamble

I heavily recommend @beren's "Deconfusing Direct vs Amortised Optimisation". It's a very important conceptual clarification that has changed how I think about many issues bearing on technical AI safety.

Currently, it's the most important blog post I've read this year. 

This sequence (if I get around to completing it) is an attempt to draw more attention to Beren's conceptual frame and its implications for how to think about issues of alignment and agency.

This first post presents a distillation of the concept, and subsequent posts explore its implications.


Two Approaches to Optimisation

Beren introduces a taxonomy categorising intelligent systems according to the kind of optimisation they perform. I think it's more helpful to think of these as two ends of a spectrum rather than distinct discrete categories; sophisticated real-world intelligent systems (e.g. humans) appear to be hybrids of the two approaches.

 

Direct Optimisers

Naively, direct optimisers can be understood as computing (an approximation of) the argmax (or argmin) of a suitable objective function during inference.
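To make the definition concrete, here is a minimal illustrative sketch (the function names are mine, not from Beren's post) of a direct optimiser: all the optimisation work happens at inference time, as an explicit search for the argmax of a fixed objective.

```python
# Hypothetical sketch of a direct optimiser: at inference time it searches
# the space of candidate actions and returns an (approximate) argmax of a
# fixed objective function. Names are illustrative only.

def direct_optimise(objective, candidate_actions):
    """Return the candidate that maximises the objective (exhaustive search)."""
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:  # the search happens during inference
        value = objective(action)
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Example: maximise f(x) = -(x - 3)^2 over a coarse grid of candidates.
result = direct_optimise(lambda x: -(x - 3) ** 2, range(10))  # → 3
```

Exhaustive search stands in here for whatever search procedure the system actually uses (MCTS, gradient-based planning, etc.); the defining feature is that the objective is consulted at inference time.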

 

Amortised Optimisers

Naively, amortised optimisers can be understood as evaluating a (fixed) learned function; they're not directly computing the argmax (or argmin) of any particular objective function during inference.
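By contrast, a minimal illustrative sketch of amortisation (again, the names and the nearest-neighbour approximator are my own toy choices, not from the post): the expensive optimisation is paid once at training time, and inference is just a cheap evaluation of the learned function.

```python
# Hypothetical sketch of an amortised optimiser: direct optimisation is run
# offline over training inputs; inference is a cheap evaluation of the
# learned (here: nearest-neighbour) function approximator. Toy names only.

def train_amortised(objective, candidate_actions, training_inputs):
    """Amortise the search cost: precompute optimal answers for training inputs."""
    table = {}
    for x in training_inputs:
        # Expensive direct optimisation, paid once per training example.
        table[x] = max(candidate_actions, key=lambda a: objective(x, a))
    return table

def amortised_policy(table, x):
    """Inference: a lookup via the nearest training input; no search over actions."""
    nearest = min(table, key=lambda t: abs(t - x))
    return table[nearest]

# Toy objective: pick the action closest to the input; train on three inputs.
obj = lambda x, a: -abs(x - a)
table = train_amortised(obj, range(10), [0, 4, 8])
answer = amortised_policy(table, 3.6)  # evaluates the learned map → 4
```

In real systems the lookup table is a neural network and "training" is gradient descent, but the structure is the same: the objective shapes the learned function, and is then absent at inference time.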


Differences

| Aspect | Direct Optimisation | Amortised Optimisation |
|---|---|---|
| Problem solving | Computes optimal responses "on the fly" | Evaluates the learned function approximator on the given input |
| Computational approach | Searches through a solution space | Learns a function approximator |
| Runtime cost | Higher, as it requires an in-depth search for a suitable solution | Lower, as it only needs a forward pass through the function approximator |
| Scalability with compute | Scales by expanding search depth | Scales by better approximating the posterior distribution |
| Convergence | In the limit of arbitrary compute, the system's policy converges to the argmax (or argmin) of the appropriate objective function | In the limit of arbitrary compute, the system's policy converges to the best description of the training dataset |
| Performance | More favourable in "simple" domains | More favourable in "rich" domains |
| Data efficiency | Little data needed for high performance (e.g. an MCTS agent can attain strongly superhuman performance in chess/Go given only the rules and sufficient compute) | Requires (much) more data for high performance (e.g. an amortised agent necessarily needs to observe millions of chess games to learn skilled play) |
| Generalisation | Dependent on search depth and compute | Dependent on the learned function approximator/training dataset |
| Alignment focus | Emphasis on safe reward function design | Emphasis on reward function and dataset design |
| Out-of-distribution behaviour | Can diverge arbitrarily from previous behaviour | Constrained by the learned function approximator |
| Examples | AIXI, MCTS, model-based RL | Supervised learning, model-free RL, GPT models |

Some Commentary

  1. ^

    Or strategies, plans, probabilities, categories, etc.; any "output" of the system.

  2. ^

    Beren:

    I would add that this function is usually the solution to the objective solved by some form of direct optimiser. I.e. your classifier learns the map from input -> label. 

10 comments

Comments sorted by top scores.

comment by Max H (Maxc) · 2023-04-05T16:56:11.937Z · LW(p) · GW(p)

The limitations of direct optimisation in rich environments seem complexity theoretic, so better algorithms won't fix them

This seems questionable. Humans are pretty good at "direct optimization" when they want to be (a point mentioned in the original post).

And it seems straightforward to construct artificial systems that behave even more like "direct optimizers" than humans, even if some or all of the component pieces of those systems are made out of function-approximators. MuZero is a good example; I sketched what a "real world" version might look like here.

To me, "amortized optimization" seems like just one tool in the toolbox of actual optimization, which is about choosing actions that steer towards outcomes.

Replies from: DragonGod
comment by DragonGod · 2023-04-05T19:38:55.605Z · LW(p) · GW(p)

Humans aren't pure direct optimisers, though I think there's a point about using abstractions to simplify a problem or translate it to a simple domain that's more amenable to direct optimisation.

comment by David Johnston (david-johnston) · 2023-04-05T21:57:30.178Z · LW(p) · GW(p)

A transformer at temp 0 is also doing an argmax. I’m not sure what the fundamental difference is - maybe that there’s a simple and unchanging evaluation function for direct optimisers?

Alternatively, we could say that the class of approximators all differ substantially in practice from direct optimisation algorithms. I feel like that needs to be substantiated, however. It is, after all, possible to learn a standard direct optimisation algorithm from data. You could construct a silly learner that can implement either the direct optimisation algorithm or something else random, and then have it pick whichever performs better on the data. It might also be possible with less silly learners.

comment by Roman Leventov · 2023-04-05T18:04:54.337Z · LW(p) · GW(p)

In humans and other animals, the distinction between direct and amortised optimisation manifests as planning-as-inference vs. deontic action (Constant et al., 2021, https://www.frontiersin.org/articles/10.3389/fpsyg.2020.598733/full):

Deontic actions are actions for which the underlying policy has acquired a deontic value; namely, the shared, or socially admitted value of a policy (Constant et al., 2019). A deontic action is guided by the consideration of “what would a typical other do in my situation.” For instance, stopping at the red traffic light at 4 am when no one is present may be viewed as such a deontically afforded action.

This also roughly corresponds to the distinction between representationalism and dynamicism.

comment by Jonas Hallgren · 2023-04-05T17:18:06.842Z · LW(p) · GW(p)

When reading this, I have a question of where between a quantiliser and optimiser amortised optimisation lies. Like, how much do we run into maximised VNM-utility style problems if we were to scale this up into AGI-like systems?

My vibe is that it seems less maximising than a pure RL version would, but then again, I'm not certain to what extent optimising for function approximation is different from optimising for a reward.

Replies from: DragonGod
comment by DragonGod · 2023-04-05T19:37:30.962Z · LW(p) · GW(p)

I think amortised optimisation doesn't lie on the same spectrum as "quantiliser - (direct) optimiser" but is another dimension entirely. I.e. your question is like asking: "where between the x and y axis does the line for the z axis lie"?

Amortised optimisation is just a fundamentally different approach where we learn to approximate some function from a dataset and then just evaluate the learned function.

The behaviour of the amortised policy may look similar to a direct optimiser on the training distribution, but diverge arbitrarily far on another distribution where the correlation between the learned policy and a particular objective breaks down.
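This divergence can be shown with a toy sketch (illustrative names, mine rather than the thread's): an amortised policy fitted to a direct optimiser's outputs matches it on the training distribution, then departs from it off-distribution.

```python
# Illustrative sketch: an amortised (nearest-neighbour) policy agrees with a
# direct optimiser on its training inputs but diverges outside them.

def direct_policy(x, actions=range(20)):
    """Direct optimisation: argmax over actions at inference time."""
    return max(actions, key=lambda a: -abs(x - a))

def make_amortised(training_inputs):
    """Fit a lookup approximator to the direct optimiser's training-time answers."""
    table = {t: direct_policy(t) for t in training_inputs}
    return lambda x: table[min(table, key=lambda t: abs(t - x))]

amortised = make_amortised(range(10))          # trained only on inputs 0..9
on_dist = (direct_policy(5), amortised(5))     # agree on-distribution: (5, 5)
off_dist = (direct_policy(15), amortised(15))  # diverge off-distribution: (15, 9)
```

On the training distribution the two are indistinguishable; once the input moves outside it, the learned function just extrapolates its fit rather than tracking the objective.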

comment by mruwnik · 2023-04-07T13:18:34.204Z · LW(p) · GW(p)

Is this sort of the difference between System 1 and 2 thinking?

Replies from: DragonGod
comment by DragonGod · 2023-04-07T13:25:31.562Z · LW(p) · GW(p)

I did say that in the OP, yes.

Replies from: mruwnik
comment by mruwnik · 2023-04-07T13:32:25.902Z · LW(p) · GW(p)

Right. That's on me for skimming the commentary section...

comment by carboniferous_umbraculum (Spencer Becker-Kahn) · 2023-04-07T09:46:37.910Z · LW(p) · GW(p)

This is a very strong endorsement but I'm finding it hard to separate the general picture from RFLO:


mesa-optimization occurs when a base optimizer...finds a model that is itself an optimizer,

where 

a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

i.e. a mesa-optimizer is a learned model that 'performs inference' (i.e. evaluates inputs) by internally searching and choosing an output based on some objective function.

Apparently a "direct optimizer" is something that "perform[s] inference by directly choosing actions[1] to optimise some objective function". This sounds almost exactly like a mesa-optimizer?