Price's equation for neural networks

post by tailcalled · 2022-12-21T13:09:16.527Z

Contents

  Neural network training and Price's equation
  Covariance niceties

Price's equation is a fundamental equation in genetics, which can be used to predict how traits will change due to evolution. It can be phrased in many ways, but for the current post I will use the following simplified continuous-time variant:

$$\frac{d\bar{z}}{dt} = \mathrm{Cov}_G(z, w)$$

Here, $z$ represents some genetic trait, $w$ represents the fitness of the organism, $g$ represents the genes of an organism, and $\mathrm{Cov}_G$ represents the genetic covariance between the trait and the fitness. Usually people only use the $\mathrm{Cov}_G(z, w)$ part of the equation[1], but I've written out the definition

$$\mathrm{Cov}_G(z, w) = \nabla_g z \cdot \nabla_g w$$

because that will make the analogy to neural networks easier to see.
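
To make the covariance term concrete, here is a minimal sketch of the more familiar discrete-generation form of Price's equation (rather than the continuous-time variant above); the trait values and offspring counts are made up for illustration:

```python
import numpy as np

# Discrete-generation illustration of the covariance term in Price's equation:
# with perfect transmission (offspring copy their parent's trait exactly), the
# change in the mean trait over one generation equals Cov(w / w_bar, z).
# The trait values z and offspring counts w below are made up for illustration.

z = np.array([0.2, 0.5, 0.9, 1.3])   # trait values of four parents
w = np.array([1.0, 1.5, 2.0, 2.5])   # fitness = number of offspring

w_rel = w / w.mean()                 # relative fitness w / w_bar

parent_mean = z.mean()
offspring_mean = np.average(z, weights=w)                           # offspring inherit z exactly
selection_term = np.mean(w_rel * z) - np.mean(w_rel) * np.mean(z)   # Cov(w_rel, z)

print(offspring_mean - parent_mean)  # 0.1321...
print(selection_term)                # 0.1321... (the two match)
```

With perfect transmission the covariance term accounts for the entire change in the mean trait; the footnote's extra terms are what correct for deviations from that.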

Neural network training and Price's equation

Suppose we train a neural network's weights $\theta$ using the following equation, where $L$ represents the loss for the network:

$$\frac{d\theta}{dt} = -\nabla_\theta L$$

In that case, if we have some property $z$ of the network (e.g. $z$ could represent how a classifier labels an image, or how an agent acts in a specific situation, or similar), then we can derive an equation for $z$'s evolution over time:

$$\frac{dz}{dt} = \nabla_\theta z \cdot \frac{d\theta}{dt} = -\nabla_\theta z \cdot \nabla_\theta L$$

Similar to how we have a concept of genetic covariance to represent the covariance linked to genes, we should probably also introduce a covariance concept linked to neural network weights, to make it cleaner to talk about. I'll call that $\mathrm{Cov}_N$ (short for neural tangent covariance), defined as:

$$\mathrm{Cov}_N(x, y) := \nabla_\theta x \cdot \nabla_\theta y$$

Furthermore, to make it closer to being analogous, we might replace $-L$ with $w$, yielding the following equation for predicting the evolution of any property $z$ with training under gradient descent:

$$\frac{dz}{dt} = \mathrm{Cov}_N(z, w)$$

This makes a bunch of idealistic assumptions about the training process, e.g. that we have an exact measure of the full gradient. It might be worth relaxing the math to more realistic assumptions and checking how much still applies. But for now, let's just charge ahead with the unrealistic assumptions.
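
As a sanity check of the equation above, here is a minimal numerical sketch; the two-parameter toy "network", the quadratic loss, and the property $z$ are all made up for illustration, and a small finite learning rate stands in for the idealized continuous-time gradient flow:

```python
import numpy as np

# A small numerical check of dz/dt = Cov_N(z, -L): for a tiny "network" with
# two parameters, one small gradient-descent step changes the property z by
# approximately lr * (grad z . grad(-L)). Everything here is made up for
# illustration; gradients are written out by hand to keep the example
# dependency-free.

def loss(theta):
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

def grad_loss(theta):
    return np.array([2.0 * (theta[0] - 1.0), 2.0 * (theta[1] + 2.0)])

def z(theta):                               # some property of the "network"
    return np.tanh(theta[0] * theta[1])

def grad_z(theta):
    s = 1.0 - np.tanh(theta[0] * theta[1]) ** 2
    return s * np.array([theta[1], theta[0]])

theta = np.array([0.3, -0.5])
lr = 1e-4                                   # small step approximates gradient flow

cov_N = grad_z(theta) @ (-grad_loss(theta))            # Cov_N(z, -L)
delta_z = z(theta - lr * grad_loss(theta)) - z(theta)  # actual change after one step

print(delta_z / lr, cov_N)                  # nearly identical for small lr
```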

Covariance niceties

Covariances play nicely with linear causal effects. If $A$ and $B$ are linear transformations, then $\mathrm{Cov}(Ax, By) = A \, \mathrm{Cov}(x, y) \, B^\top$.
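
As a quick check that this also holds for $\mathrm{Cov}_N$ as defined above, at least in the scalar case (with constants $\alpha$ and $\beta$ standing in for the linear maps):

$$\mathrm{Cov}_N(\alpha x, \beta y) = \nabla_\theta(\alpha x) \cdot \nabla_\theta(\beta y) = \alpha \beta \, \big(\nabla_\theta x \cdot \nabla_\theta y\big) = \alpha \beta \, \mathrm{Cov}_N(x, y)$$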

For instance, suppose you have a reinforcement learner that has learned to drink juice when close to it. Suppose further that now the main determinant for whether it gets reward is whether it approaches juice when it sees juice. We might formalize that effect as $R = f \cdot a$, where $R$ is the reward given to the agent, $f$ is the frequency at which it sees juice that it can approach, and $a$ is its likelihood of approaching juice if it sees it.

We can then compute: $\frac{da}{dt} = \mathrm{Cov}_N(a, R) = \mathrm{Cov}_N(a, f \cdot a) = f \cdot \mathrm{Cov}_N(a, a)$.

$\mathrm{Cov}_N(a, a)$ is a special quantity which we could call the neural tangent variance $\mathrm{Var}_N(a)$. It represents the degree to which $a$ is sensitive to the neural network parameters. For common situations, this may depend on the structure of the network, but also more directly on the nature and value of $a$.

For instance, if $a$ is the expectation of a binary variable with a probability $p$ for being 1, then I bet there is probably going to be a Bernoulli distribution aspect to it, such that $\mathrm{Var}_N(a)$ is approximately proportional to $p(1-p)$, but likely with a scale factor that depends on the network architecture or parameters, rather than being entirely equal to it.

In particular, this means that if $p$ is very low (in the juice example, if it is exceedingly rare for the agent to approach juice it sees), then $\mathrm{Var}_N(a)$ will also be very low, and this will make $\mathrm{Cov}_N(a, R)$ low and therefore also make $\frac{da}{dt}$ low.
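
To get a feel for the sizes involved, here is a minimal sketch, assuming (as in the single-bias-weight case worked out in the comments below) that the approach probability is a sigmoid of one weight and that $f$ is a constant; the numbers are made up:

```python
import numpy as np

# Minimal sketch: if the approach probability a is sigmoid(theta) for a single
# bias weight theta, then Var_N(a) = (da/dtheta)^2 = (a * (1 - a))^2, so the
# predicted rate of change da/dt = f * Var_N(a) collapses when a is tiny.
# The value of f and the choices of theta below are made up for illustration.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

f = 0.3                                    # frequency of seeing approachable juice

for theta in [-6.0, -2.0, 0.0]:
    a = sigmoid(theta)                     # probability of approaching the juice
    var_N = (a * (1.0 - a)) ** 2           # neural tangent variance of a
    da_dt = f * var_N                      # predicted rate of change of a
    print(f"a = {a:.4f}   Var_N(a) = {var_N:.2e}   da/dt = {da_dt:.2e}")
```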

  1. ^

    And usually people also put in other terms to account for various distortions.

4 comments


comment by tailcalled · 2023-02-10T16:32:12.864Z

I've been meaning to write this for a while now:

I've realized that since the derivative is infinitesimal, we can actually strengthen the covariance niceties a lot. If $f$ and $g$ are arbitrary functions, then I believe that:

$$\mathrm{Cov}_N(f(x), g(y)) = f'(x) \, g'(y) \, \mathrm{Cov}_N(x, y)$$

comment by TurnTrout · 2022-12-21T17:26:35.883Z

I really like this post. Can you expand your intuitions on 

For instance, if $a$ is the expectation of a binary variable with a probability $p$ for being 1, then I bet there is probably going to be a Bernoulli distribution aspect to it, such that $\mathrm{Var}_N(a)$ is approximately proportional to $p(1-p)$, but likely with a scale factor that depends on the network architecture or parameters, rather than being entirely equal to it.

comment by tailcalled · 2022-12-21T22:51:20.492Z

Sure!

So let's start with a basic example, an agent that has two actions, "don't" and "do". Suppose it has an output neuron that contains the logits for what action to take, and for simplicity's sake (I'll address this at the end) let's assume that this output neuron is controlled by a single weight $\theta$ which represents its bias. So this means that the $p$ variable described in the OP expands into: $p = \sigma(\theta)$, where $\sigma$ is the sigmoid function.

We can then compute $\mathrm{Var}_N(p) = \left(\frac{dp}{d\theta}\right)^2 = \big(\sigma(\theta)(1 - \sigma(\theta))\big)^2 = p^2(1-p)^2$. And, hmm, this actually implies that $\mathrm{Var}_N(p) \propto p^2(1-p)^2$, rather than the $p(1-p)$ that my intuition suggested, I think? The difference is basically that $p^2(1-p)^2$ is flatter than $p(1-p)$, especially in the tails where the former quadratically goes to 0 while the latter linearly goes to 0.

One thing I would wonder is what happens during training, if we e.g. use policy gradients and give a reward of 1 for "do" and a reward of -1 for "don't". The update rule for policy gradients is basically $\frac{d\theta}{dt} = \mathbb{E}[R \, \nabla_\theta \log \pi(\text{action})]$, which according to Wolfram Alpha expands into $\frac{2e^\theta}{(1+e^\theta)^2}$, and which we can further simplify to $2p(1-p)$. But we would have to square it to get $\mathrm{Var}_N(p)$, so I guess the same point applies here as to before. 🤷

Anyway, obviously this is massively simplified because we are assuming a trivial neural network. In a nontrivial one, I think the principle would be the same, due to the chain rule which gives you a factor of $p(1-p)$ onto whatever gradients exist before the final output neuron.

comment by tailcalled · 2022-12-21T13:49:46.285Z

Actually, upon further thought, for something like policy gradients, in the limit where the probability $p$ is close to $0$, then $\frac{dp}{dt}$ would probably be more like $p^2$? Because you get a factor of $p$ from the probability, and then an additional factor of $p$ from the derivative of sigmoid/softmax, which adds up to it being $p^2$.