Everyday Lessons from High-Dimensional Optimization

post by johnswentworth · 2020-06-06T20:57:05.155Z · LW · GW · 44 comments

Contents

  Try It and See
  The E-Coli Optimization Algorithm
  Beyond Black Boxes
44 comments

Suppose you’re designing a bridge. There’s a massive number of variables you can tweak: overall shape, relative positions and connectivity of components, even the dimensions and material of every beam and rivet. Even for a small footbridge, we’re talking about at least thousands of variables. For a large project, millions if not billions. Every one of those is a dimension over which we could, in principle, optimize.

Suppose you have a website, and you want to increase sign-ups. There’s a massive number of variables you can tweak: ad copy/photos/videos, spend distribution across ad channels, home page copy/photos/videos, button sizes and positions, page colors and styling… and every one of those is itself high-dimensional. Every word choice, every color, every position of every button, header, divider, sidebar, box, link… every one of those is a variable, adding up to thousands of dimensions over which to optimize.

Suppose you’re a startup founder planning your day - just a normal workday. There’s a massive number of variables you could tweak: dozens of people you could talk to, dozens of things you could talk to any of them about, and all the possible combinations of people and topics. There are emails, code, designs, and plans you could write. Every choice is a dimension over which you could optimize.

Point being: real-world optimization problems are usually pretty high dimensional.

Unfortunately, many techniques and intuitions which work well for low-dimensional optimization do not scale up well to higher dimensions. This post talks about some of these problems, and what they look like in real life.

Try It and See

Let’s start with a baseline: try something at random, see how well it works. If it doesn’t work, or doesn’t work better than your current best choice, then throw it out and try something else at random.

In low-dimensional problems, this isn’t a bad approach. Want to decide which brand of soap to use? Try it and see. There just aren’t that many possible choices, so you’ll pretty quickly try all of them and settle on the best.

On the other hand, we probably don’t want to design a bridge by creating a design completely at random, checking whether it works, then throwing it out and trying another design at random if it doesn’t.

In general, the number of possible states/designs/configurations in a space increases exponentially with the number of dimensions. A problem in two or three dimensions, with k choices for each variable, has only k^2 or k^3 possibilities. A problem in a hundred thousand dimensions has k^100000 possibilities - well in excess of the number of electrons in the universe, even if there are only two choices per variable. The number of possible bridge designs is exponentially huge; selecting designs at random will never find the best design, or anything close to it.
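
Here’s a toy illustration of that blow-up (my own sketch; the scoring function and numbers are made up purely for illustration): uniform random search finds the best design quickly in a handful of dimensions, and essentially never once the dimension count gets large.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search(n_vars, k=2, budget=20_000):
    """Try-it-and-see: sample whole designs uniformly at random, keep the best.
    The 'best design' is a hidden target; the score is the fraction of
    variables that match it."""
    target = rng.integers(k, size=n_vars)
    best = 0.0
    for _ in range(budget):
        candidate = rng.integers(k, size=n_vars)
        best = max(best, (candidate == target).mean())
    return best

for n in (3, 10, 100, 1000):
    print(n, random_search(n))
# In 3 or 10 dimensions the exact optimum (score 1.0) shows up quickly.
# In 100 or 1000 dimensions, the best of 20,000 random designs stays well
# short of the optimum, and gets closer to chance (~0.5) as n grows.
```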

Let’s look at a less obvious example.

From Studies on Slack [LW · GW]:

Imagine a distant planet full of eyeless animals. Evolving eyes is hard: they need to evolve Eye Part 1, then Eye Part 2, then Eye Part 3, in that order. Each of these requires a separate series of rare mutations.

Here on Earth, scientists believe each of these mutations must have had its own benefits – in the land of the blind, the man with only Eye Part 1 is king. But on this hypothetical alien planet, there is no such luck. You need all three Eye Parts or they’re useless. Worse, each Eye Part is metabolically costly [...]

So these animals will only evolve eyes in conditions of relatively weak evolutionary pressure.

See the mistake?

When evolutionary pressure is low, we explore the space of organism-designs more-or-less at random. Now, if we imagine mutations just stepping left or right on a one-dimensional line, with the three eye parts as milestones along that line, then the picture above makes sense: as long as evolutionary pressure is low, we’ll explore out along the line, and eventually stumble on the eye.

But the real world doesn't look like that. Evolution operates in a space with (at least) hundreds of thousands of dimensions - every codon in every gene can change, genes/chunks of genes can copy or delete, etc. The "No Eye" state doesn't have one outgoing arrow, it has hundreds of thousands of outgoing arrows, and "Eye Part 1" has hundreds of thousands of outgoing arrows, and so forth.

As we move further away from the starting state, the number of possible states increases exponentially. By the time we're as-far-away as Whole Eye (which involves a whole lot of mutations), the number of possible states will outnumber the atoms in the universe. If evolution is uniformly-randomly exploring that space, it will not ever stumble on the Whole Eye - no matter how weak evolutionary pressure is.

Point is: the hard part of getting from "No Eye" to "Whole Eye" is not the fitness cost in the middle, it's figuring out which direction to go in a space with hundreds of thousands of directions to choose from at every single step.

Conversely, the weak evolutionary benefits of partial-eyes matter, not because they grant a fitness gain in themselves, but because they bias evolution in the direction toward Whole Eyes. They tell us which way to go in a space with hundreds of thousands of directions.

This same reasoning applies to high-dimensional problem solving in general. Need to design a bridge? A webpage? A profitable company? A better mousetrap? The hard part isn't obtaining enough slack to build the thing; the hard part is figuring out which direction to go in a space with hundreds of thousands of possible directions to choose from at every single step.

The E-Coli Optimization Algorithm

Here’s a simple but somewhat-less-brute-force optimization algorithm:

  • Pick a random direction and take a step.
  • If things improved, keep stepping in that direction.
  • If not, go back, pick a new random direction, and repeat.

This algorithm is exactly how e-coli follow chemical gradients to find food. It’s also how lots of real-world problem-solving proceeds: try something, if it helps then do more, if it doesn’t help then try something else. It's babble and prune [LW · GW] with weak heuristics for babble-generation.
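
A minimal sketch of that loop in code (mine, not from the post - the step size and toy objective are placeholders):

```python
import numpy as np

def ecoli_optimize(f, x0, step=0.1, n_steps=5_000, rng=None):
    """Minimize f by 'run and tumble': keep stepping in the current random
    direction while it keeps improving, otherwise tumble to a new one."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    best = f(x)
    direction = None
    for _ in range(n_steps):
        if direction is None:                      # tumble: pick a new random direction
            direction = rng.normal(size=x.shape)
            direction /= np.linalg.norm(direction)
        candidate = x + step * direction
        value = f(candidate)
        if value < best:                           # improvement: keep running this way
            x, best = candidate, value
        else:                                      # no improvement: tumble next time
            direction = None
    return x, best

# Toy usage: find the minimum of a simple bowl in 10 dimensions.
x, val = ecoli_optimize(lambda v: np.sum((v - 3.0) ** 2), np.zeros(10))
```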

This works great in low dimensions. Trying to find the source of a tasty smell or a loud noise? Walk in a random direction, if the smell/noise gets stronger then keep going, otherwise try another direction. There’s only two or three dimensions along which to explore, so you’ll often choose directions which point you close to the source. It works for e-coli, and it works for many other things too.

On the other hand, imagine designing a bridge by starting from a design, changing something at random, checking whether it helps, then keeping the change if it did help or throwing it out otherwise. You might eventually end up with a workable design, but even at best this would be a very slow way to design a bridge.

If you’re a programmer, you’ve probably seen how well it works to write/debug a program by making random changes, keeping them if they help, and throwing them away if not.

In general, e-coli optimization can work as long as there’s a path of local improvements to wherever we want to go. In principle, it’s a lot like gradient descent, except slower: gradient descent always chooses the “locally-best” direction, whereas e-coli just chooses a random direction.

How much slower is e-coli optimization compared to gradient descent? What’s the cost of experimenting with random directions, rather than going in the “best” direction? Well, imagine an inclined plane in n dimensions. There’s exactly one “downhill” direction (the gradient). The n-1 directions perpendicular to the gradient don’t go downhill at all; they’re all just flat. If we take a one-unit step along each of these directions, one after another, then we’ll take n steps, but only 1 step will be downhill. In other words, only ~O(1/n) of our travel-effort is useful; the rest is wasted.

In a two-dimensional space, that means ~50% of effort is wasted. In three dimensions, ~67%. In a thousand-dimensional space, ~99.9% of effort is wasted.
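
A quick numerical check of those fractions (my sketch; the ~1/sqrt(n) figure for fully random step directions, as opposed to axis-by-axis steps, comes up in the comments below):

```python
import numpy as np

rng = np.random.default_rng(0)

def useful_fraction(n, samples=5_000):
    """On an inclined plane whose only downhill direction is the first
    coordinate axis, how much of each unit of travel goes downhill?"""
    coordinate_steps = 1.0 / n       # cycling through the n axes: 1 useful step in n
    dirs = rng.normal(size=(samples, n))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    random_steps = np.abs(dirs[:, 0]).mean()   # random unit directions: ~ sqrt(2/(pi*n))
    return coordinate_steps, random_steps

for n in (2, 3, 1000):
    print(n, useful_fraction(n))
# Axis-by-axis exploration wastes ~(1 - 1/n) of the effort;
# uniformly random directions waste ~(1 - 1/sqrt(n)) instead.
```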

Obviously this model is fairly simplistic, but the qualitative conclusion makes at least some sense. When an e-coli swims around looking for food in a three-dimensional puddle of water, it will randomly end up pointed in approximately the right direction reasonably often - there are only three independent directions to pick from, and one of them is right. On the other hand, if we’re designing a bridge and there’s one particular strut which fails under load, then we’d randomly try changing hundreds or thousands of other struts before we tried changing the one which is failing.

This points toward a potentially more efficient approach: if we’re designing a bridge and we can easily identify which strut fails first, then we should change that strut. Note, however, that this requires reasoning about the structure of the system - i.e. the struts - not just treating it as a black box. That’s the big upside of e-coli optimization: we can apply it to black boxes.

Beyond Black Boxes

For low-dimensional systems, try-it-and-see or e-coli optimization works reasonably well, at least in a big-O sense. But in high-dimensional problems, we need to start reasoning about the internal structure of the system in order to solve problems efficiently. We break up the high-dimensional system into a bunch of low-level components, and reason about their interactions with each other. Mathematically, this involves things like:

  • decomposing the system into sparsely-coupled, low-dimensional subsystems
  • tracing causality and propagation: how a change in one component affects the others
  • sensitivity analysis: identifying which components matter most for the outcomes we care about

This sort of reasoning requires an expensive investment in understanding the internal gears of the system [LW · GW], but once we have that understanding, we can use it to achieve better big-O problem-solving performance. It allows us to identify which low-dimensional components matter most, and focus optimization effort on those. We can change the strut which is failing, or the line of code with a bug in it, rather than trying things at random. The more dimensions the system has, the larger the efficiency gain from a gearsy approach.
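
As a toy contrast between the two approaches (an illustrative sketch of mine, not a real bridge model): the black-box loop tweaks random components and keeps improvements, while the "gearsy" version inspects the structure and fixes the component that is actually failing.

```python
import numpy as np

def blackbox_repair(loads, capacities, rng, max_tries=100_000):
    """E-coli style: strengthen a random strut, keep the change only if the
    worst overload improved. Returns the number of tries used."""
    caps = capacities.copy()
    for tries in range(1, max_tries + 1):
        if (loads - caps).max() <= 0:          # nothing is overloaded any more
            return tries
        i = rng.integers(len(caps))
        trial = caps.copy()
        trial[i] += 1.0
        if (loads - trial).max() < (loads - caps).max():
            caps = trial
    return max_tries

def gearsy_repair(loads, capacities):
    """Gears-based: find the strut that fails worst and strengthen exactly that one."""
    caps = capacities.copy()
    tries = 0
    while (loads - caps).max() > 0:
        worst = int(np.argmax(loads - caps))   # the component that actually matters
        caps[worst] += 1.0
        tries += 1
    return tries

rng = np.random.default_rng(0)
loads = rng.uniform(0, 10, size=1000)
caps = loads + rng.uniform(1, 5, size=1000)    # everything has positive margin...
caps[rng.integers(1000)] -= 7.0                # ...except one overloaded strut

print(blackbox_repair(loads, caps, rng))       # typically thousands of random tries
print(gearsy_repair(loads, caps))              # a handful of targeted fixes
```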

44 comments

Comments sorted by top scores.

comment by Unnamed · 2020-06-07T02:02:46.343Z · LW(p) · GW(p)
On the other hand, if we’re designing a bridge and there’s one particular strut which fails under load, then we’d randomly try changing hundreds or thousands of other struts before we tried changing the one which is failing.

This bridge example seems to be using a different algorithm than the e coli movement. The e coli moves in a random direction while the bridge adjustments always happen in the direction of a basis vector.

If you were altering the bridge in the same way that e coli moves, then every change to the bridge would alter that one particular strut by a little bit (in addition to altering every other aspect of the bridge).

Whereas if e coli moved in the same way that you describe the hypothetical bridge design, then it would only move purely along a single coordinate (such as from (0,0,0,0,0) to (0,0,0,1,0)) rather than in a random direction.

My intuition is that the efficiency of the bridge algorithm is O(1/n) and the e coli algorithm is O(1/sqrt(n)).

Which suggests that if you're doing randomish exploration, you should try to shake things up and move in a bunch of dimensions at once rather than just moving along a single identified dimension.

Replies from: johnswentworth, david-rein, FordO, Vivek
comment by johnswentworth · 2020-06-07T02:28:49.176Z · LW(p) · GW(p)

Your intuition is exactly correct; the math indeed works out to O(1/sqrt(n)) for a random direction. I actually had a few paragraphs about that in an earlier draft, but cut them because high-dimensional problems in the real world basically never involve a space where Euclidean distance makes sense.

Replies from: Unnamed
comment by Unnamed · 2020-06-08T01:16:52.552Z · LW(p) · GW(p)

Seems like some changes are more like Euclidean distance while others are more like turning a single knob. If I go visit my cousin for a week and a bunch of aspects of my lifestyle shift towards his, that is more Euclidean than if I change my lifestyle by adding a new habit of jogging each morning. (Although both are in between the extremes of purely Euclidean or purely a single knob - you could think of it in terms of the dimensionality of the subspace [LW · GW] that you're moving in.)

And something similar can apply to work habits, thinking styles, etc.

Replies from: johnswentworth
comment by johnswentworth · 2020-06-08T01:53:25.608Z · LW(p) · GW(p)

What makes a Euclidean distance metric relevant is not "all the dimensions change a bit at once". The relevant aspect is "cost of testing the change is a function of Euclidean distance" - e.g. in the e-coli case, the energy expended during straight-line swimming should be roughly proportional to the distance traveled.

I'm not sure about your visit-the-cousin example. My intuition is that O(1/n) is the default cost to e-coli-style optimization when there's no special structure to the exploration cost which we can exploit, but I don't have a strong mathematical argument for that which would apply in an obvious way to the cousin example. My weak argument would be "absent any special knowledge about the problem structure, we have to treat visiting-the-cousin as an independent knob to turn same as all the other knobs" - i.e. we don't have a prior reason to believe that the effects of the visit are a combination of other (available) knobs.

comment by David Rein (david-rein) · 2020-12-31T06:34:21.619Z · LW(p) · GW(p)

Which suggests that if you're doing randomish exploration, you should try to shake things up and move in a bunch of dimensions at once rather than just moving along a single identified dimension.

 

If you can only do randomish exploration this sounds right, but I think this often isn't the right approach (not saying you would disagree with this, just pointing it out). When we change things along basis vectors, we're implicitly taking advantage of the fact that we have a built-in basis for the world (namely, our general world model). This lets us reason about things like causality, constraints, etc. since we already are parsing the world into a useful basis. 

comment by FordO · 2020-06-14T15:05:26.194Z · LW(p) · GW(p)

Care to elaborate how did you get to the formula O(1/sqrt(n))?

Replies from: Unnamed
comment by Unnamed · 2020-06-14T21:37:59.440Z · LW(p) · GW(p)

The distance between the n-dimensional points (0,0,...,0) and (1,1,...,1) is sqrt(n). So if you move sqrt(n) units along that diagonal, you move 1 unit along the dimension that matters. Or if you move 1 unit along the diagonal, you move 1/sqrt(n) units along that dimension. 1/sqrt(n) efficiency.

If you instead move 1 unit in a random direction then sometimes you'll move more than that and sometimes you'll move less, but I figured that was unimportant enough on net to leave it O(1/sqrt(n)).

comment by Vivek Hebbar (Vivek) · 2022-04-20T01:34:28.392Z · LW(p) · GW(p)

Why would the bridge algorithm be O(1/n)?

comment by Ben Pace (Benito) · 2020-06-11T22:31:09.242Z · LW(p) · GW(p)

Curated. As per usual, for making important points about search very clearly and well. I appreciated the tie-ins with Babble and Prune and Slack a lot. I think I'll be linking to this post a lot when trying to explain why blind empiricism [? · GW] fails.

Replies from: adamzerner
comment by Adam Zerner (adamzerner) · 2020-06-12T00:25:26.561Z · LW(p) · GW(p)

Yeah, I think this post has helped me to understand the issues with blind empiricism. Thinking about it in terms of high-vs-low dimensional is really helpful to me.

Replies from: Benito
comment by Ben Pace (Benito) · 2020-06-12T00:31:30.698Z · LW(p) · GW(p)

Great! I'm glad to hear that.

comment by FactorialCode · 2020-06-06T23:20:41.733Z · LW(p) · GW(p)

Epistemic status: Ramblings

I don't know how much you can really generalise these lessons. For instance, when you say:

How much slower is e-coli optimization compared to gradient descent? What’s the cost of experimenting with random directions, rather than going in the “best” direction? Well, imagine an inclined plane in n dimensions. There’s exactly one “downhill” direction (the gradient). The n-1 directions perpendicular to the gradient don’t go downhill at all; they’re all just flat. If we take a one-unit step along each of these directions, one after another, then we’ll take n steps, but only 1 step will be downhill. In other words, only ~O(1/n) of our travel-effort is useful; the rest is wasted.

In a two-dimensional space, that means ~50% of effort is wasted. In three dimensions, ~67%. In a thousand-dimensional space, ~99.9% of effort is wasted.

This is true, but if I go in a spherically random direction, then if my step size is small enough, ~50% of my efforts will be rewarded, regardless of the dimensionality of the space.

How best to go about optimisation depends on the cost of carrying out optimisation, the structure of the landscape, and the relationship between the utility and the quality of the final solution.

Blind guess and check is sometimes a perfectly valid method when you don't need to find a very good solution, and you can't make useful assumptions about the structure of the set, even if the cardinality of the possible solution set is massive.

I often don't even think "optimisation" and "dimensionality" are really natural ways of thinking about solving many real-world engineering problems. There's definitely an optimisation component to the engineering process, but it's often not central. Depending on circumstances, it can make more sense to think of engineering as "satisficing" vs "optimising". Essentially, you're trying to find a solution instead of the best solution, and the process used to solve the problem is going to look vastly different in one case vs another. This is similar to the notions of "goal directed agency" vs "utility maximisation".

In many cases when engineering, you're taking a problem and coming up with possible high level breakdowns of the problem. In the example of bridges, this could be deciding whether to use a cantilever bridge or a suspension bridge or something else entirely. From there, you solve the related sub-problems that have been created by the breakdown, until you've sufficiently fleshed out a solution that looks actionable.

The way you go about this depends on your optimisation budget. In increasing order of costs:

-You might go with the first solution that looks like it will work.

-You'll recursively do a sort of heuristic optimisation at each level, decide on a solution, and move to the next level

-You'll flesh out multiple different high level solutions and compare them.

. . .

-You search the entire space of possible solutions

This is where the whole "slack" thing and getting stuck in local optima comes back, even in high dimensional spaces. In many cases, you're often "committed" to a subset of the solution space. This could be because you've decided to design a cantilever bridge instead of a suspension bridge. It could also be because you need to follow a design you know will work, and X is the only design your predecessors have implemented IRL that has been sufficiently vetted. (This is especially common in aircraft design, as the margins of error are very tight.) It could even be because your comrades have all opted to go with a certain component, and so that component benefits from economies of scale and becomes the best choice even if another component would be objectively better were it to be mass-produced. (I'll leave it as an exercise to the reader to think of analogous problems in software engineering.)

In all cases, you are forced to optimise within a subset of the search space. If you have the necessary slack, you can afford to explore the other parts of the search space to find better optima.

Replies from: johnswentworth
comment by johnswentworth · 2020-06-06T23:36:13.776Z · LW(p) · GW(p)

This is true, but if I go in a random direction, then if my step size is small enough, ~50% of my efforts will be rewarded, regardless of the dimensionality of the space.

50% of your steps will go in a positive rather than negative direction, but that completely ignores how much positive progress is made. The problems of high dimensionality are entirely problems of efficiency; "how much improvement?" is the key question.

Depending on circumstances, it can make more sense to think of engineering as "satisficing" vs "optimising".

Satisficing is a strict subset of optimizing, and most of the OP applies just as well to satisficing. The processes used in practice are almost identical, other than the stopping condition: satisficers stop at the first acceptable solution, while optimizers are usually open-ended. The search process itself is usually the same.

Your comments on restrictions of the search space are good examples, but "slack" isn't really the relevant concept here. Really, most of these examples are additional problem constraints, imposed by a group you're working with. It's not that slack would let you explore other solutions (other than the commitment-to-cantilever example) - it's not like you could find better solutions by exploring more, with the same problem constraints in place. Rather, removing those constraints allows better solutions.

Replies from: Gurkenglas
comment by Gurkenglas · 2020-06-07T23:35:29.926Z · LW(p) · GW(p)

how much

If instead of going one step in one of n directions, we go sqrt(1/n) forward or backward in each of the n directions (for a total step size of 1), we need an expected two tries to get sqrt(1/n) progress, for a total effort factor of O(1/sqrt(n)). (O is the technical term for ~ ^^)

Replies from: johnswentworth
comment by johnswentworth · 2020-06-08T00:24:53.721Z · LW(p) · GW(p)

Yup, this is similar to Unnamed's comment. It works in problems where the "cost" of trying a change is determined by Euclidean distance, but once we go beyond that, it falls apart.

Replies from: Gurkenglas
comment by Gurkenglas · 2020-06-08T01:43:07.049Z · LW(p) · GW(p)

Darn it, missed that comment. But how does Euclidean distance fail? I'm imagining the dimensions as the weights of a neural net, and e-coli optimization being used because we don't have access to a gradient. The common metric I see that would have worse high-dimensional behavior is Manhattan distance. Is it that neighborhoods of low Manhattan distance tend to have more predictable/homogenous behavior than those of low Euclidean distance?

Replies from: ErickBall, johnswentworth
comment by ErickBall · 2020-06-12T01:39:15.633Z · LW(p) · GW(p)

It seems like the situation with bridges is roughly analogous to neural networks: the cost has nothing to do with how much you change the design (distance) but instead is proportional to how many times you change the design. Evaluating any change, big or small, requires building a bridge (or more likely, simulating a bridge). So you can't just take a tiny step in each of n directions, because it would still have n times the cost of taking a step in one direction. E. coli is actually pretty unusual in that the evaluation is nearly free, but the change in position is expensive.

comment by johnswentworth · 2020-06-08T02:12:09.372Z · LW(p) · GW(p)

I don't have a fully-fleshed-out answer, but here's a thought experiment: change the units of all the inputs. Multiply one of them by 12, another by 0.01, another by 10^8, etc.

In a system without any sort of symmetry or natural Euclidean distance metric over the inputs, it's like that transformation has already happened before we started. All the inputs have completely different and arbitrary units. If we take a "one unit" step, in actual practice that will be massive along some of the inputs and inconsequential along others - whereas our sqrt(1/n) strategy relies on the step being effectively similar along each dimension.

The right way to talk about this is probably condition number of local curvature, but I haven't tried to properly formalize it.

Replies from: Gurkenglas
comment by Gurkenglas · 2020-06-08T11:13:29.857Z · LW(p) · GW(p)

In that thought experiment, Euclidean distance doesn't work because different dimensions have different units. To fix that, you could move to the log scale. Or is the transformation actually more complicated than multiplication?

Replies from: johnswentworth, Richard_Kennaway
comment by johnswentworth · 2020-06-09T00:31:44.582Z · LW(p) · GW(p)

Yeah, then we'd just use a more complicated single-variable transformation in the thought experiment.

comment by Richard_Kennaway · 2020-06-08T14:08:13.808Z · LW(p) · GW(p)

You can't take logarithms of non-positive numbers. Even if you know some parameter is non-positive, you still have a free choice of scale, so the problem of comparing scales of different parameters does not go away.

Replies from: Gurkenglas
comment by Gurkenglas · 2020-06-08T23:37:15.507Z · LW(p) · GW(p)

You can, actually. ln(5cm)=ln(5)+ln(cm), and since we measure distances, the ln(cm) cancels out. The same way, ln(-5)=ln(5)+ln(-1). ln(-1) happens to be pi*i, since e^(pi*i) is -1.

Replies from: dmitrii-zelenskii-1
comment by Дмитрий Зеленский (dmitrii-zelenskii-1) · 2023-11-02T08:53:38.740Z · LW(p) · GW(p)

If you allow complex numbers, comparison "greater than/less than" breaks down.

Replies from: Gurkenglas
comment by Gurkenglas · 2023-11-03T00:13:20.082Z · LW(p) · GW(p)

Since we measure distances, the ln(-1) cancels out. Also, we are comparing how well each setting of the parameters performs; for that, whether the parameters are complex-valued doesn't matter.

Replies from: dmitrii-zelenskii-1
comment by Дмитрий Зеленский (dmitrii-zelenskii-1) · 2023-11-07T22:10:52.885Z · LW(p) · GW(p)

Erm, I think you're getting mixed up between comparing parameters and comparing the results of applying some function to parameters. These are not the same, and it's the latter that become incomparable.

(Also, would your algorithm derive that ln(4+3i)=ln(5) since |4+3i|=|5|? I really don't expect the "since we measure distances" trick to work, but if it does work, it should also work on this example.)

Replies from: Gurkenglas
comment by Gurkenglas · 2023-11-07T22:57:15.087Z · LW(p) · GW(p)

As far as I saw, you were getting mixed up on that. We never compare the parameter-vectors for being greater than/less than each other; they aren't ordered.

(No, if some parameter started out with such values as 4+3i or 5, the ln transformation would not equate them. But multiplying both by e^0.01 would add 0.01 to both logarithms, regardless of previous units.)

Replies from: dmitrii-zelenskii-1
comment by Дмитрий Зеленский (dmitrii-zelenskii-1) · 2023-11-07T23:29:43.978Z · LW(p) · GW(p)

I think at this point we should just ask @johnswentworth [LW · GW] which one of us understood him correctly. As far as I see, we measure a distance between vectors, not between individual parameters, and that's why this thing fails.

comment by DirectedEvolution (AllAmericanBreakfast) · 2020-06-13T17:01:59.889Z · LW(p) · GW(p)

Simulated annealing might be a counterexample. There's a more conversational description of it in this 80,000 Hours podcast transcript, and here's a video on it. By starting with evaluating entirely random designs and no optimizing tweaks, then slowly decreasing the randomness of new designs while increasing the optimizing tweaks, you can do a very good job of finding t̶h̶e̶ ̶g̶l̶o̶b̶a̶l̶ ̶o̶p̶t̶i̶m̶u̶m̶ a workable solution.

This seems to work well if you have the ability to create and evaluate an enormous number of designs, at extremely low cost relative to the value of success, and high accuracy. That's possible with computer chips, and my guess is that it's possible these days with bridges and many other complex machines.
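
For concreteness, here's a minimal simulated-annealing sketch (my own illustrative code, not from the podcast or video): proposals are always accepted when they improve things, and accepted with shrinking probability when they don't, so the search starts out nearly random and gradually turns into pure hill-climbing.

```python
import math
import random

def simulated_annealing(f, x0, neighbor, t0=1.0, cooling=0.999, steps=10_000):
    """Minimize f. Accept worse proposals with probability exp(-delta/temperature),
    and lower the temperature each step so randomness decays over time."""
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t0
    for _ in range(steps):
        y = neighbor(x)
        fy = f(y)
        if fy < fx or random.random() < math.exp(-(fy - fx) / t):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        t *= cooling                 # decrease the randomness over time
    return best, fbest

# Toy usage: a bumpy 1-D objective with many local minima.
bumpy = lambda x: x * x + 3 * math.sin(5 * x)
print(simulated_annealing(bumpy, 5.0, lambda x: x + random.uniform(-0.5, 0.5)))
```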

Perhaps the work of modeling internal structure is an attempt to make a problem more amenable to the SA algorithm. Examples:

  • Online dating attempts to reduce dating preferences to a questionnaire score, to more accurately and cheaply evaluate many candidates. The problem, of course, is that it's not very accurate and doesn't do a great job of increasing dating options for many people, so it's not a game changer.
  • We play with a random thought until it turns into a business model. We break a business model down into job descriptions. We break candidates down into resumes. We break career paths down into metrics. Then we jiggle things around - switching job functions, hiring and firing and quitting - until we find a workable model, at which point we exploit it for a while, forming a workable corporation. Or we learn enough about our industry to have a new idea, quit, and launch a startup. Sometimes, a merger or a CEO with vision disrupts a corporation and causes it to undergo a major reorganization, exploring a new point on the optimization graph.
  • Scientific research starts with exploration: playing around with intellectual models of a problem, with lots of a priori guesses at their plausibility. The exploitation phase begins when a likely model is subjected to optimization via an iterative process of experiment and refinement by many scientists.
    • There is an important difference here from simulated annealing. In normal SA, designs don't fail, they are just not as good as others. But in scientific research, theories do fail, when they can't account for replicable data. So in normal SA, you have to gradually decrease exploration on a timer. In scientific research, you calibrate exploration in response to theory failure.
    • Simulated annealing depends on slack - the willingness to have scientists pursuing an idea even if it seems like a failed theory. So perhaps we are being too harsh when we castigate scientists for being unwilling to abandon a "disproven" idea or for "throwing papers over the wall."
Replies from: johnswentworth
comment by johnswentworth · 2020-06-13T17:41:40.120Z · LW(p) · GW(p)

Simulated annealing is subject to exactly the same sorts of problems as the algorithms discussed in the post. It's effectively an e-coli-style algorithm which varies its step size.

I think there's a lot of popular misunderstanding of SA. People talk about it as though it's going to find the global optimum, which it isn't. If you need an actual global optimum, then SA isn't any better than brute force (in a big-O sense). People also talk about it approximating a global optimum, which it does marginally better than naive gradient descent, but it still requires qualitatively similar conditions to gradient descent in order to actually work well. It can handle a few bumps and hills, but at the large scale, the objective function still needs to be shaped roughly like a large bowl in order for SA to work well; there has to be a broad basin of attraction for the optimum. If the best points are narrow, deep wells with big hills around them, then SA isn't going to help any.

Basically, it's gradient descent with less noise-sensitivity.

None of this is to say it's a bad method; gradient descent with less noise-sensitivity is great! But it's not going to find global optima in high-dimensional non-convex problems without exponential amounts of compute.

Replies from: AllAmericanBreakfast
comment by DirectedEvolution (AllAmericanBreakfast) · 2020-06-13T18:05:00.299Z · LW(p) · GW(p)

I'm sure you understand SA better than I do. I won't argue the jargon.

And yet the tone of your post, plus the examples you use, make it sound like you're saying that SA would not be a good choice for designing something like a bridge. The example of SA's usefulness in designing computer chips seems to contradict that example.

If the intuition you're trying to defend is "for complex problems, you need to precisely model and plan your design to achieve a workable solution," then I think SA is a strong counterargument.

If instead, you're arguing that "for complex problems, only a precise internal model of the problem will achieve an optimal solution," then I'd say you're right, but that arriving at that internal model is what an SA-like process is for. As it is written, defining the problem is most of the work. I think that's what's involved when "we break up the high-dimensional system into a bunch of low-level components, and reason about their interactions with each other" or make an "expensive investment in understanding the internal gears of the system."

If I'm completely missing the point of your post, could you give a one-sentence tl;dr of the primary claim you're making?

Replies from: johnswentworth
comment by johnswentworth · 2020-06-13T19:47:31.130Z · LW(p) · GW(p)

Ah, I see what you're saying. This definitely gets at some interesting points.

First point is efficiency. As you mentioned in the top-level comment, SA "seems to work well if you have the ability to create and evaluate an enormous number of designs, at extremely low cost relative to the value of success, and high accuracy". So SA would be a bad way for a human (or group of humans) to design a bridge, but if you have a cheap and accurate bridge-simulator, then it's not unreasonable. There are probably much more efficient algorithms which leverage the problem structure more, but that requires writing complicated code.

Note that there's nothing particularly special about SA in this regard - most black-box optimization methods work in largely the same cases.

(Ultimately, one of the main areas I want to apply the ideas in the OP to is better methods for human thinking - i.e. things humans can do to learn/problem-solve more quickly/successfully. So that's largely the use-case I have in mind. Obviously computational optimization is a whole different game with different constraints, and we do use black-box optimizers in many computational problems with considerable success.)

Second point is problem specification. Once we move from talking about humans designing things to computers designing things (via black-box optimization), problem specification becomes the big bottleneck. We need both (a) a specification of the environment/problem-space, e.g. a physics simulator for bridge-design, and (b) specification of the objective, e.g. a function to compute whether a bridge has "failed" in a simulation and how much the bridge "costs". The challenge is that both of these pieces need to work correctly over the entire (exponentially large) space of parameters which the optimizer could test. Obviously we can't test correctness of the simulator and objective by checking it on every possible input, so we need some other way to check that it's correct - which brings us back to gears.

In (certain parts of) chip design, this problem has turned out to be tractable. In that case, the "objective" is specified by humans using gearsy reasoning to design a chip, then black-boxing that chip design and requiring the optimizer to produce something functionally identical to the design (but more efficient by various metrics).

(Here again, gears matter mainly for the part of the thinking-work which the humans need to do.)

As to a one-sentence summary of my main claim: in complex high-dimensional problems, you (a human) need to break the system up into sparsely-coupled low-dimensional subsystems, and leverage that representation, in order to achieve a workable solution.

Replies from: AllAmericanBreakfast
comment by DirectedEvolution (AllAmericanBreakfast) · 2020-06-13T20:22:57.079Z · LW(p) · GW(p)

What I'm pushing on is the nature of "gearsy reasoning." I think that it operates via SA.

Sometimes SA produces solutions that provably achieve the global optimum. For example, mathematicians exploring geometry have been able to prove that if you're trying to build a tank with the maximum volume:surface-area ratio, make it in the shape of a sphere.

Other times, it just works well enough. We use SA to invent a bunch of rivet and screw designs, concrete mixtures, beam angles. We test the component solutions on other problems, try them out on model bridges, real bridges, computer simulations. Expert bridge builders spend their time exploring and exploiting heuristics for the details of the problems they routinely face. Expert bridge-builder hirers explore and exploit options for who to hire. Somehow, bridges get built and rarely collapse.

So it's SA all the way down. All reasoning boils down to SA, and it works as well as it possibly can.

It sounds like you might be arguing that there is something fundamentally different going on when we employ "gearsy reasoning." If so, that is my point of disagreement. The account I've just given seems to me like an intuitive and accurate description of how I've solved every problem I've ever worked on, and how it appears to me that other people work as well.

Even in a case like dating, where the ability to explore is quite limited, I just wind up using a crappy version of SA. I find myself attracted to someone; we date and try to deepen/exploit our relationship; we either break up and try again with someone else, or we keep dating/exploiting indefinitely. It works well enough.

Replies from: johnswentworth
comment by johnswentworth · 2020-06-14T00:49:11.793Z · LW(p) · GW(p)

It sounds like you're using "simulated annealing" as synonymous with something like "mix exploration and exploitation"? I'm not entirely sure what you're using it to point to, but it definitely sounds like you're using it as a metaphor for some particular aspect of search algorithms, rather than talking about the actual math of simulated annealing. What are you trying to emphasize?

Anyway, "mix exploration and exploitation" is not a search algorithm by itself. It's a general trick which can be used in conjunction with a wide variety of search algorithms. This also applies to simulated annealing specifically: it's a technique used in conjunction with other search algorithms. (E.g. scipy's implementation takes an argument for which optimizer to use in the minimizer_kwargs.)

It sounds like you might be arguing that there is something fundamentally different going on when we employ "gearsy reasoning." If so, that is my point of disagreement. 

I am definitely arguing that gearsy reasoning does something fundamentally different. It does use black-box optimizers, but it uses them locally, operating on individual gears. Black-box optimization works badly on high-dimensional systems, so we decompose such systems into coupled low-dimensional systems, and then use black-box optimizers on the low-dimensional components. The interesting part of gearsy reasoning is how to account for the coupling, which is where things like causality and propagation come in.

The account I've just given seems to me like an intuitive and accurate description of how I've solved every problem I've ever worked on, and how it appears to me that other people work as well.

The account you've given may well be accurate, but it's an inherently incomplete account. Simulated annealing - or whatever you're trying to point to with that phrase - is not itself a search algorithm. Even if all problem solving involves some mix of exploration and exploitation, that still doesn't tell us which direction to go in a thousand-dimensional space when "exploiting". Likewise, exploration clearly isn't completely random - so how do we decide which directions to explore? That's what gears are for.

Replies from: AllAmericanBreakfast
comment by DirectedEvolution (AllAmericanBreakfast) · 2020-06-15T02:13:58.034Z · LW(p) · GW(p)

I'm using SA closer to its original meaning as a metaphor for a process in chemistry rather than its precise mathematics as an algorithm. We start with lots of random trial and error, then dial it down until we just exploit. I do think that this, not just "some explore and some exploit", is how black-box optimization works in practice. It sounds like we agree that decomposition + explore/exploit is a major component of rationality.

I believe that the steering/search algorithm is also developed and applied via black box optimization. It sounds like you disagree.

Here's an example of my concept of how explore/exploit can fully account for the success of a high-dimensional optimization problem.

Jack is choosing a hobby (explore). He loves fresh bread and decides to learn to bake it. He tries for a while (exploit), but eventually decides it's not very fun, so he picks a different hobby (explore). This time, he settles on Indian food, also because it's delicious. This hobby sticks, and he learns dish after dish, sometimes choosing his favorite dishes, other times picking randomly out of a cookbook (exploit).

The question is whether Jack is doing something fundamentally different from exploring and exploiting when he chooses how to pick which bread to bake or which Indian food to cook? To pick dishes, Jack might take suggestions from his friend Ananya, an experienced Indian cook; cook the dishes he's enjoyed at restaurants before; or pick randomly out of a cookbook.

But where did these decision procedures come from?

Well, we've been trying each of them for millennia. Each depends on exploitation (of friends, restaurants, cookbooks), which in turn are the result of exploration (going to parties, searching on Yelp, looking up reviews), and so on ad infinitum.

I don't think that any other mysterious search algorithm is needed to explain how Jack optimizes his Indian food cooking hobby. At some level where there is no decision procedure to provide guidance (how does Ananya choose from among the dishes she thinks would be best to suggest to Jack?), we use some arbitrary selection method, such as random choice or a pointless rule like "choose the first thing that comes to mind."

Can you describe where in this story (or the gaps within it) explore/exploit couldn't suffice? Alternatively, is this not a high-dimensional optimization problem of the kind you had in mind?

Replies from: johnswentworth, lcmgcd
comment by johnswentworth · 2020-06-15T05:35:56.550Z · LW(p) · GW(p)

I had a long conversation with my girlfriend at one point about picking hobbies. She used to just try things at random, and I was encouraging her to be more strategic. We started by making a list of things she liked/didn't like about previous hobbies she'd tried. Then, we turned those into criteria for potential hobbies - things like "should involve exercise", "should involve producing something", "should be good to show off to friends", or "should grow in skill over time". Then we considered broad classes of activities which addressed specific criteria, like "sports" or "art". We tried to be systematic about generating the activity-classes - e.g. starting from "what possible kinds of things can we produce?" rather than just generating random ideas and checking them against the constraints.

This sort of reasoning doesn't really look like uniformly-random exploration followed by exploitation. In particular, the pattern I want to highlight in this case is:

  • identify constraints (i.e. the criteria)
  • search by backtracking from the constraints, rather than trying things uniformly-randomly

That's not incompatible with exploration+exploitation (blackbox search can be used in sub-steps), but it is an additional component which is not implied by exploitation/exploration alone. It avoids uniformly-random exploration, and instead steers toward more-likely-useful possibilities. The constraints tell us which-way-to-go in our high-dimensional search space.

This part in particular is interesting:

But where did these decision procedures come from?

Well, we've been trying each of them for millennia. Each depends on exploitation (of friends, restaurants, cookbooks), which in turn are the result of exploration (going to parties, searching on Yelp, looking up reviews), and so on ad infinitum.

One of the hobbies my girlfriend eventually picked up was baking, and she recently wrote a post [LW · GW] about how baking is not about following opaque decision procedures handed down through the millennia. That post has some great examples of gearsy reasoning; I recommend checking it out.

Replies from: AllAmericanBreakfast
comment by DirectedEvolution (AllAmericanBreakfast) · 2020-06-15T22:56:29.479Z · LW(p) · GW(p)

Her baking post was in the back of my mind when I typed my last comment! I thought it was great. I had the same reaction to her post as I had to yours.

We can and often do use constraints to guide our optimization process. Here are some examples:

  • When choosing a hobby, we can brainstorm what we want to get out of our hobby before we consider concrete activities that fit the bill.
  • When solving a math problem, we can invent some rules for how to format our work so that it'll be legible later.
  • Yelp implicitly considers factors like price, location, quality, and type of food before giving us options.
  • When building a bridge, we're going to use standardized materials, tools, and techniques, not invent our own.

These constraints might be selected through intuition, or through the design of meta-constraints that help us select which object-level constraints to use.

However, the idea behind each constraint had to be invented in the first place, and in any concrete decision, we have to improvise its application.

So how does a constraint get invented in the first place? How do we decide which ones to use for a given decision? And how do we decide how to apply them? Probably with more constraints, but then the same questions arise about those. At some point, we are just picking constraints at random from those that occur to us, choosing randomly from among options that fit them, and seeing how we feel about the result.

We associate good feelings with individual or combinations of constraints that have worked out well in the past or that we've been taught to use, so they're more likely to occur to us in the future.

So the process by which we decompose a problem; invent, combine, and apply constraints; or decide how to evaluate a decision; is itself a random process. Plan it all out, and you move the randomness one meta-level higher. That's not to say it's pointless. Planning and meta-level reasoning is often a powerful way to make decisions. It's just not fundamentally different from object-level explore/exploit, and it runs into similar problems and ambiguities, just on a more abstract level.

Replies from: johnswentworth
comment by johnswentworth · 2020-06-16T00:17:11.105Z · LW(p) · GW(p)

I think your description here is basically right, except for the very last sentence. If we just had one meta-level, then yeah, it would end up basically the same as explore/exploit but at a more abstract level. But by recursively applying the trick, it ends up working well on a much wider class of problems than explore/exploit by itself.

The really neat thing is that recursively applying this particular trick does not require moving up an infinite ladder of meta-levels. There are only two levels, and moving "up" alternates between them.

To see why, consider what it looks like on a math problem where we're trying to prove some theorem:

  • A "constraint" would be something like a conditional counterexample - i.e. "for any proof which uses strategy X, here's a counterexample"
  • When searching for such counterexamples, our "constraints" are partial proofs - i.e. if we have a proof assuming X, then any counterexample must leverage/assume the falsehood of X.

So when we use constraints to guide our search for proofs, we look for counterexamples. When we use constraints to guide our search for counterexamples, we look for proofs.

More visually, consider solving a maze, where a constraint might look like a connected chain of walls blocking off part of the maze.

When we're searching for that constraint, what are the meta-constraints to that search? Well, they're paths. A constraint is a path of connected walls; a meta-constraint is a path of connected spaces.

In general, we're switching between solving an optimization problem and solving the dual problem.

Of course, we may also use some random search at any given step before switching.

comment by lemonhope (lcmgcd) · 2021-05-21T07:19:50.508Z · LW(p) · GW(p)

Mm, I think "the higher process X is produced by SA" is different from "the higher process X is implementing SA"

comment by jacob_cannell · 2023-11-07T22:26:37.658Z · LW(p) · GW(p)

How much slower is e-coli optimization compared to gradient descent? What’s the cost of experimenting with random directions, rather than going in the “best” direction?

There was some post a bit ago about how evolutionary optimization is somehow equivalent to SGD, and I was going to respond no, that can't be, as it steps in mostly random directions, so at best it's equivalent to a random forward gradient method: completely different (worse) asymptotic convergence with respect to parameter dimension, as you discuss. There's a reason why SGD methods end up using large batching/momentum to smooth out gradient noise before stepping.

Replies from: johnswentworth
comment by johnswentworth · 2023-11-08T01:52:14.267Z · LW(p) · GW(p)

I do still expect that evolutionary optimization is basically-similar to SGD in terms of what kinds of optima they find, and more generally in terms of what their trajectories look like at a coarse-grained scale. But algorithmically, yeah, SGD should follow that trajectory a lot faster, especially as dimensionality goes up.

comment by lemonhope (lcmgcd) · 2021-05-21T06:55:25.677Z · LW(p) · GW(p)

You didn't mean to, but you've written a critique of using RCTs to figure out medicine.

Replies from: johnswentworth
comment by johnswentworth · 2021-05-21T14:04:44.416Z · LW(p) · GW(p)

Oh I very much mean to do that.

The purpose of an RCT is to prove something works after we already have enough evidence to pay attention to that particular hypothesis at all. Since the vast majority of things (in an exponentially large space) do not work, most of the bits-of-evidence are needed just to "raise the hypothesis from entropy" - i.e. figure out that the hypothesis is promising enough to spend the resources on an RCT in the first place. The RCT provides only the last few bits of evidence, turning a hunch into near-certainty; most of the bits of evidence must have come from some other source already. It's exactly the same idea as Einstein's Arrogance [LW · GW].
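
A rough back-of-the-envelope version of this point (my numbers, purely illustrative - and treating a p < 0.05 result as ~log2(1/0.05) bits is itself a simplification):

```python
import math

candidate_interventions = 1_000_000                    # size of the hypothesis space (illustrative)
bits_to_locate = math.log2(candidate_interventions)    # ~19.9 bits to single one out
bits_from_rct = math.log2(1 / 0.05)                    # ~4.3 bits from a p < 0.05 result

print(bits_to_locate, bits_from_rct, bits_to_locate - bits_from_rct)
# ~15 of the ~20 bits have to come from whatever raised the hypothesis to
# attention in the first place; the RCT supplies only the last few.
```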

comment by FordO · 2020-06-14T15:30:45.074Z · LW(p) · GW(p)

Do you count gradient descent as a blackbox optimization, and isn't backpropagation guided by gradient (at least in ANN)?

Replies from: johnswentworth
comment by johnswentworth · 2020-06-14T16:37:20.687Z · LW(p) · GW(p)

Gradient descent requires somehow computing the gradient. There are both blackbox and non-blackbox ways of doing that, and the blackbox methods are much more expensive.

Backpropagation is the main non-blackbox method for computing the gradient; it looks at the steps used to compute the function and propagates information backward through those steps. If the function takes m steps to compute, backprop takes O(m) to compute the gradient.

The main blackbox method for computing gradients is to evaluate the function itself, then evaluate it with a small change in one coordinate, then evaluate it with a small change in the next coordinate, and so forth - one evaluation for each coordinate. If the function takes m steps to compute and has n inputs, then this takes O(mn).
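
To make the cost difference concrete, here's the black-box finite-difference estimate (a sketch of mine; the step size and toy objective are arbitrary) - it needs n + 1 evaluations of the function, i.e. O(mn) work, versus a single O(m) backward pass for backprop:

```python
import numpy as np

def finite_difference_gradient(f, x, eps=1e-6):
    """Black-box gradient estimate: re-evaluate f with each coordinate nudged
    in turn. Costs n + 1 evaluations of f for an n-dimensional input."""
    base = f(x)
    grad = np.empty_like(x)
    for i in range(len(x)):
        x_step = x.copy()
        x_step[i] += eps
        grad[i] = (f(x_step) - base) / eps
    return grad

f = lambda v: np.sum(v ** 2)                                       # toy objective
print(finite_difference_gradient(f, np.array([1.0, -2.0, 0.5])))   # ~ [2, -4, 1]
```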