Can we infer the search space of a local optimiser?
post by Lucius Bushnaq (Lblack) · 2025-02-03T10:17:01.661Z · LW · GW · 3 comments
This is a question post.
Suppose we have a system that we suspect is some form of local optimiser. Maybe it's gradient descent training a neural network. Maybe it’s a neural network doing in-context learning. Maybe it's a mind, because we guess that the operation of minds in general is to an extent well-described as local movement in some 'abstraction space' a lot of the time. Maybe it’s something else.
Is there, on paper, a general way:
- To check whether our guess that the system is a local optimiser is correct?
- To find the space in which the optimiser is taking local steps?
Setup
Concretely, say a system has a state vector $S$, which we can see. $S$ evolves over time steps $t$, so let's call it $S_t$. Our guess is that the system evolves through some local optimisation method. But it's not really local in $S$-space. It's local in some other space we don't know about. Let's call the state vector in that hidden space we don't know about $H_t \in \mathcal{H}$, where $\mathcal{H}$ is some space. $\mathcal{H}$ seems guaranteed to be a topological space, else the idea of a local search in it wouldn't make much sense. Maybe in a lot of cases we can also assume that it is a metric space? Not sure about that one, but a lot of the local search algorithms I can think of seem to have some notion of distance between points in the space.
So, overall the system maps as $S_t \rightarrow H_t \rightarrow H_{t+1} \rightarrow S_{t+1}$, where the transformations $S_t \rightarrow H_t$ and $H_{t+1} \rightarrow S_{t+1}$ can be complicated and nonlinear.
$H_t \rightarrow H_{t+1}$ is assumed to be a step in some local optimisation method. That is to say, there is some evaluation function $E(H)$ that the system is trying to minimise, and it tries to do this minimisation via something like a genetic algorithm, gradient descent, Adam, or something else that makes use of local information to take one step in the space $\mathcal{H}$ per time step $t$. The evaluation function $E$ should have the kind of loss landscape a local optimiser could reasonably navigate. No parity learning problems [LW · GW] or anything like that. It also shouldn’t be completely trivial to find the minimum of $E$. I.e. if the optimiser is using Newton’s method, $E$ shouldn’t just be quadratic everywhere, else there isn’t much of a search trajectory for us to look at because the system always finds the minimum in one step.
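To make this concrete, here is a minimal sketch (not from the post; all names, dimensions, and the specific choices of $E$ and the observation map are illustrative assumptions) of the kind of data-generating process described above: plain gradient descent on a hidden quadratic $E$ in a small space $\mathcal{H}$, observed only through a fixed nonlinear map back to $S$-space.

```python
# Minimal sketch of the setup: a local optimiser (plain gradient descent) takes
# steps in a hidden space H, and we only observe the trajectory after it has
# been pushed through a nonlinear map H -> S. All specifics are placeholders.
import numpy as np

rng = np.random.default_rng(0)
dim_h, dim_s, n_steps, lr = 4, 16, 200, 0.05

# Hypothetical evaluation function E(H): a random positive-definite quadratic,
# chosen only to keep the example short.
A = rng.normal(size=(dim_h, dim_h))
Q = A @ A.T + np.eye(dim_h)

def grad_E(h):
    return Q @ h  # gradient of E(h) = 0.5 * h @ Q @ h

# Hypothetical nonlinear observation map H -> S that we do NOT get to see.
W1 = rng.normal(size=(dim_s, dim_h))
W2 = rng.normal(size=(dim_s, dim_s))

def hidden_to_observed(h):
    return W2 @ np.tanh(W1 @ h)

# Run the local optimisation in H-space, but record only the S_t trajectory.
h = rng.normal(size=dim_h)
S_trajectory = []
for t in range(n_steps):
    S_trajectory.append(hidden_to_observed(h))
    h = h - lr * grad_E(h)  # one local step in H per time step t

S_trajectory = np.stack(S_trajectory)  # shape (n_steps, dim_s): all we observe
```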
Now, what I’m wondering about is this: Is there some characteristic attribute of the trajectories local optimisations take in their search spaces which we can use to figure out the map $S \rightarrow H$ with reasonable confidence, provided we have enough states $S_t$ from different searches to look at?
Example of what this could look like
I'm not claiming the following would actually work, I'm just trying to point to the sort of thing my brain is thinking about by giving a concrete example.
In a lot of cases I can think of, neighbouring points in the time series of a local search will tend to be close to each other in terms of Euclidean distance in the search space. So, we could try to find a map $f: S \rightarrow \mathcal{H}$ such that $\lVert f(S_{t+1}) - f(S_t)\rVert$ will tend to be small. We could do this by training an autoencoder with a deep and expressive architecture on a loss of
- Reconstructing the original system outputs. So, learn some map $g: \mathcal{H} \rightarrow S$ with $g(f(S_t)) \approx S_t$.
- Minimising some term like $\frac{\sigma[d_t]}{\mathbb{E}[d_t]}$, with $d_t = \lVert f(S_{t+1}) - f(S_t)\rVert$, where the standard deviation and expectation are taken over the batch.
And then maybe the map $f$ we can now explicitly see in the autoencoder would tend to roughly match the map $S \rightarrow H$ we're looking for, up to trivial transformations like affine transforms. And if the autoencoder training can't converge with good loss, then maybe that means the original system wasn't a local optimiser, or at least not a local optimiser taking small steps in a Euclidean space.
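As a very rough sketch of what that training setup could look like in code (using the reading of the loss term above, i.e. the standard deviation of the latent step sizes divided by their mean; the architecture, hyperparameters, and random stand-in data are placeholder assumptions, not a serious proposal):

```python
# Rough sketch of the proposed autoencoder: an encoder f: S -> H and a decoder
# g: H -> S, trained to reconstruct S_t while keeping the latent step sizes d_t
# roughly uniform across each batch of trajectories. Placeholder architecture.
import torch
import torch.nn as nn

dim_s, dim_h = 16, 4
encoder = nn.Sequential(nn.Linear(dim_s, 64), nn.GELU(), nn.Linear(64, dim_h))  # f
decoder = nn.Sequential(nn.Linear(dim_h, 64), nn.GELU(), nn.Linear(64, dim_s))  # g
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def loss_fn(S_batch, reg_weight=1.0, eps=1e-8):
    # S_batch: (batch, time, dim_s) -- consecutive observed states from searches.
    H = encoder(S_batch)                          # candidate hidden states f(S_t)
    recon = ((decoder(H) - S_batch) ** 2).mean()  # reconstruction: g(f(S_t)) ~ S_t
    d = (H[:, 1:] - H[:, :-1]).norm(dim=-1)       # latent step sizes d_t
    step_reg = d.std() / (d.mean() + eps)         # std / mean of d_t over the batch
    return recon + reg_weight * step_reg

# One illustrative training step on random data standing in for real S_t trajectories.
S_batch = torch.randn(32, 50, dim_s)
loss = loss_fn(S_batch)
opt.zero_grad(); loss.backward(); opt.step()
```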
Summary
- Should we expect local search spaces to often be describable as metric spaces, for the kind of local searches we might often see in real life? Local searches run by in-context learning algorithms or other minds, for example.
- If a local optimiser maps $S_t \rightarrow H_t \rightarrow H_{t+1} \rightarrow S_{t+1}$, with $H_t$ being the states in the local search space, can we infer the map $S \rightarrow H$ from seeing enough states $S_t$,[1] by leveraging our knowledge that the time series $H_t$ has to look like the trajectory of a local search?
- ^
Up to affine transformations and such of course.
Answers
answer by Alexander Gietelink Oldenziel
I'm actually curious about a related problem.
One of the big surprises of the deep learning revolution has been the universality of gradient descent optimization.
How large is the class of optimization problems that we can transform into a gradient descent problem of some kind? My suspicion is that it's a very large class; perhaps there is even a general way to transform any problem into a gradient descent optimization problem?
The natural thing that comes to mind is to consider gradient descent on Lagrangian energy functionals in (optimal) control theory.
↑ comment by robo · 2025-02-04T13:25:04.008Z · LW(p) · GW(p)
I'll conjecture the following VERY SPECULATIVE, inflammatory, riff-on-vibes statements:
- Gradient descent solves problems in the complexity class P[1]. It is P-Complete.
- Learning theory (and complexity theory) have for decades been pushing two analogous bad narratives about the weakness of gradient descent (and P).
- These narratives dominate because it is easy to prove impossibility results like "Problem X can't be solved by gradient descent" (or "Problem Y is NP-Hard"). It's academically fecund -- it's a subject aspiring academics can write a lot of papers about. Results about what gradient descent (and polynomial time) can't do compose a fair portion of the academic canon.
- In practice, these impossibility results are corner cases that don't actually come up. The "vibes" of these impossibility results run counter to the "vibes" of reality.
- For example, gradient descent solves most problems, even though in theory it gets trapped in local minima. (SAT is in practice fast to solve, even though in theory it's theoretical computer science's canonical Hard-Problem-You-Say-Is-Impossible-To-Solve-Quickly.)
- The vibe of reality is "local (greedy) algorithms usually work"
- ^
Stoner-vibes based reason: I'm guessing you can reduce a problem like Horn Satisfiability[2] to gradient descent. Horn Satisfiability is a P-complete problem -- you can transform any polynomial-time decision problem into a Horn Satisfiability problem using a log-space transformation. Therefore, gradient descent is "at least as big as P" (P-hard). And I'm guessing you can solve your formalization of gradient descent in P as well (hence "P-Complete"). That would mean gradient descent would not be able to solve harder problems in e.g. NP unless P=NP.
- ^
Horn Satisfiability is about finding true/false values that satisfy a bunch of logic clauses of the form $x_1 \wedge x_2 \Rightarrow x_3$ or $\neg x_1 \vee \neg x_2$ (that second clause means "don't set both $x_1$ and $x_2$ to true -- at least one of them has to be false"). In the algorithm for solving it, you figure out a variable that must be set to true or false, then propagate that information forward to other clauses. I bet you can do this with a loss function, turning it into a greedy search on a hypercube.
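A toy sketch of this footnote's idea (purely illustrative, not something from the comment): each Horn clause contributes a 0/1 penalty on the Boolean hypercube, and greedily flipping the single variable forced by a violated implication clause (i.e. unit propagation) drives the total penalty to zero exactly when the formula is satisfiable.

```python
# Toy sketch: Horn-SAT as a penalty function on the hypercube, minimised by
# greedily flipping one "forced" variable at a time (unit propagation).
# A clause is (body, head): body is a list of variable indices that are ANDed
# together; head is a variable index, or None for a negative clause
# ("not all of the body may be true").

def loss(assignment, clauses):
    """Number of violated clauses under a 0/1 assignment."""
    total = 0
    for body, head in clauses:
        body_true = all(assignment[v] for v in body)
        head_true = assignment[head] if head is not None else False
        total += int(body_true and not head_true)
    return total

def greedy_horn_sat(n_vars, clauses):
    """Start at the all-zeros corner; flip a variable to 1 only when a violated
    implication clause forces it. For Horn formulas, the final loss is 0 iff
    the formula is satisfiable."""
    x = [0] * n_vars
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if head is not None and all(x[v] for v in body) and not x[head]:
                x[head] = 1  # the only flip that can fix this clause
                changed = True
    return x, loss(x, clauses)

# Example: (x0), (x0 -> x1), (not x0 or not x2) -- satisfiable with x = [1, 1, 0].
clauses = [([], 0), ([0], 1), ([0, 2], None)]
print(greedy_horn_sat(3, clauses))  # ([1, 1, 0], 0)
```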
3 comments
comment by Cleo Nardo (strawberry calm) · 2025-02-04T13:39:52.138Z · LW(p) · GW(p)
Minimising some term like $\frac{\sigma[d_t]}{\mathbb{E}[d_t]}$, with $d_t = \lVert f(S_{t+1}) - f(S_t)\rVert$, where the standard deviation and expectation are taken over the batch.
Why does this make $d_t$ tend to be small? Wouldn't it just encourage equally-sized jumps, without any regard for the size of those jumps?
↑ comment by Lucius Bushnaq (Lblack) · 2025-02-04T15:10:19.315Z · LW(p) · GW(p)
Yes. The hope would be that step sizes are more even across short time intervals when you’re performing a local walk in some space, instead of jumping to a point with random distance to the last point at every update. There’s probably a better way to do it. It was just an example to convey the kind of thing I mean, not a serious suggestion.
comment by robo · 2025-02-04T14:56:50.923Z · LW(p) · GW(p)
Checking my understanding: for the case of training a neural network, would $S$ be the parameters of the model (along with perhaps buffers/state like moment estimates in Adam)? And would the evolution of the state be local in $S$-space? In other words, for neural network training, would $S$ be a good choice for $H$?
In a recurrent neural network doing in-context learning, would $S$ be something like the residual stream at a particular token?