Measuring Learned Optimization in Small Transformer Models
post by J Bostock (Jemist) · 2024-04-08T14:41:27.669Z · LW · GW
This is original, independent research carried out in March and April of 2024.
The degree to which a policy optimizes the future can be quantified mathematically. A set of very small transformer models was pretrained to predict the next token in a mathematical sequence, then subjected to reinforcement learning finetuning.
The optimizing power of each model can be predicted with high accuracy from that model's score on its own RL task. By comparing predictions of optimization based on scores on each of the different RL tasks, a model's original reinforcement objective can be identified.
A related measure for impact can also be derived mathematically, and given a theoretical lower bound based on RL score. This gives further information about model behavior, and allows for the same analysis as the measure of optimization.
I also investigate the possibility of getting models to self-evaluate optimization and impact, with limited success.
Methods
Pretraining on Sequence Prediction
I defined a simple mathematical sequence using the following stochastic recurrence relation. This produces a pseudo-random but (to 98%) predictable sequence, alternating between elements of one set on even indices and elements of another set on odd indices.
I then trained a small encoder-only transformer model to predict the next element in the sequence given the previous 20 elements of the sequence.
This was followed by a reinforcement-learning phase in which the transformer was used to generate the next token on odd indices only, and the recurrence relation was used to generate the following value. If that value fell in the reinforced set, the sequence was used as a "successful" example to reinforce the model. I used a temperature of 1 when generating these sequences to introduce some randomness, but the temperature was reduced to 0 during evaluations and when calculating optimization.
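To make the scheme concrete, here is a minimal sketch of this generation-and-reinforcement step. It is not the actual training code: the architecture, the stand-in recurrence relation, the reinforced set and all hyperparameters are placeholders, and it includes, only schematically, the low-learning-rate maintenance step described in the next paragraph.

```python
# Sketch of the RL finetuning step. TinyEncoder, recurrence_step, REINFORCED
# and all hyperparameters are placeholders, not the real setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, CONTEXT = 16, 20
REINFORCED = {2, 3}  # placeholder set of "successful" values

class TinyEncoder(nn.Module):
    """Stand-in for the small encoder-only transformer."""
    def __init__(self, d_model: int = 32):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=64,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, x):                      # x: (batch, CONTEXT) of token ids
        return self.head(self.encoder(self.embed(x))[:, -1])  # next-token logits

def recurrence_step(prev):
    """Placeholder for the real stochastic recurrence relation."""
    return (7 * prev + torch.randint_like(prev, 0, 3)) % VOCAB

model = TinyEncoder()
rl_opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
maint_opt = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.01)

for step in range(100):
    seqs = torch.randint(0, VOCAB, (64, CONTEXT + 1))   # placeholder data batch
    logits = model(seqs[:, :CONTEXT])
    # The model generates the next token at temperature 1...
    sampled = torch.multinomial(F.softmax(logits, dim=-1), 1).squeeze(-1)
    # ...then the recurrence relation generates the following value.
    following = recurrence_step(sampled)
    success = torch.tensor([v.item() in REINFORCED for v in following])

    # Reinforce only the "successful" generations; no negative examples.
    if success.any():
        rl_loss = F.cross_entropy(logits[success], sampled[success])
        rl_opt.zero_grad(); rl_loss.backward(); rl_opt.step()

    # Maintenance training at a much lower learning rate (next paragraph).
    maint_loss = F.cross_entropy(model(seqs[:, :CONTEXT]), seqs[:, CONTEXT])
    maint_opt.zero_grad(); maint_loss.backward(); maint_opt.step()
```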
A small amount of "maintenance" training (at a much lower learning rate) was used during this phase to ensure that model performance on the predictive task for even indices was maintained. Without this I saw rapid loss of performance on the "maintenance" dataset. I also found that I was unable to include "unsuccessful" examples (i.e. where the generated value was not in the reinforced set) with even a tiny negative learning rate, as this worsened performance on all tasks. Here is a typical set of results from training and evaluation:
I carried out this training on several models per size for four model sizes between 18k and 402k parameters, giving the following plot:
Pretraining loss increases over the last few model sizes, and the loss/time plots (some of which I have put in the supplementary plots at the bottom of this post) showed signs of overfitting in the larger models. Regularization was employed during training (0.01 weight decay in an AdamW optimizer, 10% dropout rate) so perhaps a larger dataset is required to avoid this entirely.
I then repeated the RL phase twice, once with n_good = 2 values being reinforced and once with n_good = 6. Here is a plot of success rate against model size across all three conditions.
This plot shows mean ± standard error. In all cases model performance is much better than chance, and increases with model size.
Measuring Optimization
I used a Monte Carlo simulation to measure the nats of optimization being applied to the generated outcome, using the split-history method I've previously outlined. This involves taking the difference in entropy between two distributions:
The algorithm in practice is this:
- Take a batch of sequence examples from the testing data, ensuring that each ends at an odd index.
- Feed them into the model to get a value for the next token, and append it to the sequence.
- Use the sequence-generator to get a set of sample values for the following element.
- Look at the entropy of the resulting distribution over that element; this is the optimized entropy.
- For each sequence from the training data, take its history and pair it with the model's output from step 2.
- Repeat Steps 3-4 with the entire data set of these histories prepended to the model's output, to get one sample of "unoptimized" entropy.
- Repeat Steps 5-6 with the output from each sequence in the initial dataset, and take the average unoptimized entropy.
- Optimization = unoptimized entropy - optimized entropy.
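In code, the procedure might look roughly like the following. The helper names are hypothetical, and the way actions are pooled with counterfactual histories is my reading of steps 5-7, so treat this as a sketch rather than the exact estimator used.

```python
import numpy as np
from collections import Counter

def empirical_entropy(samples):
    """Entropy in nats of the empirical distribution over discrete samples."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def split_history_op(histories, model_next, generator_next, n_samples=50):
    """Estimate Op = (average unoptimized entropy) - (optimized entropy).

    histories: list of token lists, each ending at an odd index.
    model_next(h): the policy's chosen next token for history h (temperature 0).
    generator_next(h): one stochastic sample of the element following h.
    """
    actions = [model_next(h) for h in histories]

    # Optimized branch: each action stays paired with its own history.
    opt_samples = [generator_next(h + [a])
                   for h, a in zip(histories, actions)
                   for _ in range(n_samples)]
    optimized_entropy = empirical_entropy(opt_samples)

    # Unoptimized branch: each action is paired with every history in the
    # batch, breaking the correlation between action and history.
    unoptimized_entropy = float(np.mean([
        empirical_entropy([generator_next(h + [a])
                           for h in histories
                           for _ in range(n_samples)])
        for a in actions
    ]))
    return unoptimized_entropy - optimized_entropy

# Toy usage with stand-in dynamics:
rng = np.random.default_rng(0)
hists = [list(rng.integers(0, 8, size=5)) for _ in range(20)]
print(split_history_op(
    hists,
    model_next=lambda h: int((h[-1] + 1) % 8),
    generator_next=lambda h: int((3 * h[-1] + rng.integers(0, 3)) % 8),
))
```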
Here is a schematic illustration:
I ran this calculation 10 times with 200 sequences in each batch and took an average to get an idea of the model's optimizing capability. I also tested the sequence-generating function's self-optimization.
The fact that the sequence is optimizing itself mostly just amounts to saying that it is not a random walk, which we already knew. It is a good sanity check that all of the models get values either equal to or above this, and that optimization improves with model size.
Results
Optimization vs RL Success Rate
Optimization is calculated as the entropy difference of two distributions. Let us consider three parameters: the number of possible outcomes; the proportion of outcomes which are "successes"; and the chance that the model actually achieves a successful outcome.
Assuming the model is "ambivalent" between successful outcomes, and likewise ambivalent between failed outcomes, the optimized entropy has a simple form in terms of these parameters. If we then assume that all outcomes are equally likely when the model's outputs are "randomized", the unoptimized entropy is just the log of the number of outcomes. If we take the difference we get the following expression:
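A reconstruction of that expression under exactly those assumptions (the names n, n_good and s, for the number of outcomes, the number of successes and the success rate, are my own):

```python
import numpy as np

def predicted_op(s, n, n_good):
    """Theoretical optimization (nats) at success rate s, assuming the policy is
    uniform within the n_good successes and within the n - n_good failures,
    and that the unoptimized distribution is uniform over all n outcomes.
    Valid for 0 < s < 1."""
    optimized_entropy = -(s * np.log(s / n_good)
                          + (1 - s) * np.log((1 - s) / (n - n_good)))
    return np.log(n) - optimized_entropy
```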
Now I can plot this theoretical value of Op against success rate for each value of n_good, and also plot all of the models on the same axes. Since we see a lot of run-to-run variation in model performance, I'll plot the raw data per model rather than summary statistics.
I tried to find some galaxy-brained change of variables which would make those three curves the same, but I couldn't. Instead I will just plot the predicted value of Op (based on each model's success rate) against the actual value:
In theory none of the models should fall below the dashed line representing the equation derived above. In practice they do. Some of this is measurement error, but I'm sure some of it is also due to errors in my assumptions. In particular, the assumption that the unoptimized distribution is completely flat is unlikely to hold.
On the other hand, there is no reason at all why the models shouldn't fall quite far above the theoretical line. Consider the n_good = 6 case. If the model ends up with an 85:15 ratio of values 2 and 3 for the outcome (and never gets any other number, which is impossible in this specific case, but that's not important for my point), then it will have a success rate of 0.85, which implies an Op of 0.03 nats, but its actual Op will be 1.66 nats!
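As a quick numerical check of this example, assuming n = 8 possible outcomes and n_good = 6 (my inference from the numbers quoted above):

```python
from math import log

n, n_good, s = 8, 6, 0.85                  # assumed values for this example
implied = log(n) - (-(s * log(s / n_good) + (1 - s) * log((1 - s) / (n - n_good))))
actual = log(n) - (-(0.85 * log(0.85) + 0.15 * log(0.15)))  # 85:15 over two values
print(round(implied, 2), round(actual, 2))  # 0.03 1.66
```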
For a given training performance (in this case I've tried to make them all ~75% correct) the model might do one of several things:
- Optimize for reinforced results and against non-reinforced results in an unbiased way
- Optimize against non-reinforced results correctly, but only optimize for some of the successful results
- Optimize for reinforced results correctly, but only optimize against some of the non-reinforced results
- Behave in such a way that certain reinforced results are optimized against and certain non-reinforced results are optimized for.
The last three are examples of one type of inner misalignment. All three are characterized by excess optimization compared to the "correct" answer.
Cross-Objective Evaluation
The fact that most of the models don't sit far above the line shows that the RL is doing its job! They are (mostly) pursuing the training objective faithfully, without picking up on other objectives. I simulate what might happen if a model picks up on an "incorrect" objective by cross-evaluating models on each other's tasks:
Now it becomes obvious that the models trained on the task in question are behaving correctly whereas the other models are not. This is important since a model trained on one task will typically have a good score on another task, sometimes even higher than a similarly-sized model trained on that task directly (!), but this plot clearly shows us that something is going wrong.
It becomes even clearer if we instead take the value of Op implied by the formula above, and plot it against the actual measured value of Op. Here are the results for all three tasks:
By comparing implied and measured Op, we can separate the models trained on a given reward signal from ones trained on a different reward signal, even when one of the reward signals is a subset of the other.
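A sketch of this cross-evaluation loop, reusing the predicted_op helper from above; Task, the model objects and the two evaluation callables are hypothetical stand-ins for the real code:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    n_good: int            # number of reinforced outcomes

def cross_evaluate(models, tasks, success_rate, measured_op, n=8):
    """For every (model, task) pair, compare the optimization implied by the
    model's success rate on that task with its measured optimization towards
    that task's outcome. success_rate and measured_op are evaluation callables."""
    rows = []
    for m in models:
        for t in tasks:
            s = success_rate(m, t)
            rows.append((getattr(m, "name", str(m)), t.name, s,
                         predicted_op(s, n, t.n_good), measured_op(m, t)))
    return rows
```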
Optimization vs Impact
I will now take a second look at the measure for Op. What I really wanted all along was to measure a policy's impact on the world, and optimizing the future is only one way in which this can happen. Another way is flattening the possibilities out! Consider the following informal diagram. The "original" distribution is not well-defined here; the point is just to give an intuitive explanation:
The motivation and derivations for this can be found in Appendix A, with longer proofs in Appendix B, but the upshot is that we can define a new function, Imp, using the KL divergence between the optimized and unoptimized distributions, like this:
This, given the task success rates in the optimized and split-history branches, must obey the following lower bound:
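In my notation, with s the success rate in the optimized branch and b the success rate in the split-history branch (both strictly between 0 and 1), a sketch of that bound:

```python
from math import log

def impact_lower_bound(s, b):
    """Minimum possible impact (KL divergence, in nats) of any policy that moves
    the task success rate from a baseline of b to s while leaving relative
    probabilities otherwise intact (see Appendix A)."""
    return s * log(s / b) + (1 - s) * log((1 - s) / (1 - b))
```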
Imp itself can be measured using the same split-history method used to measure Op:
The lower bound can be calculated from success rates and the two can be compared:
Larger models have a higher impact, but they also tend to have a lower ratio of Imp to Imp_min than the smaller ones do. The single-line fit is also somewhat misleading: the models actually appear to lie on three separate lines. I note that the values of Imp_min are somewhat low in this case, rather than our measured impact being unexpectedly high. Our measure for Imp_min may not be perfect in this case.
I think the difference between Imp and Imp_min tells us something. In this case it tells us that the success rates in the split-history (unoptimized) branch are actually quite high:
Success rate in the split-history branch doesn't scale very much at all with model size, which indicates that models of all sizes exhibit roughly the same bias in outputs, based on statistical regularities in the data (e.g. a "13" is more commonly followed by a "2" than a "15" is). It might also be an artefact stemming from an uneven distribution of numbers in our sequence by default.
This highlights an important point about split-history measures: they only pick up on the effects of "active" inference in the policy, not on biases in the outputs which are produced during training. We don't have to use a split-history method to generate the comparison distribution; we can think up plenty of alternative ways to generate a comparison future, and therefore a whole family of measures for Op and Imp:
- Use a privileged "Do nothing" token as the model's output
- Use a totally random output token
- Use an untrained model, or a pretrained (but not finetuned) model
Despite these problems, if we repeat the cross-objective evaluation using Imp we get these plots:
While the split is not as good as we saw with Op, it seems that models trained on the "correct" task fall along a line of lower gradient, and models trained on "incorrect" tasks have higher gradients, i.e. the latter diverge from the line more rapidly.
There are many potential ways to visualize what's going on here, two more of which are in Appendix C at the end of this post.
Model Self-Evaluation
A secondary goal of my work was to get models to evaluate their own optimizing power and impact. This is the reason for the maintenance training during finetuning. The self-evaluation scheme is the same as the optimization-measuring scheme, except that instead of using the sequence-generating formula to get values for the next element, I used the model's own predictive distribution to get expected distributions over the optimized and unoptimized futures:
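A sketch of this self-evaluation variant, parallel to the split-history estimator above; model_next and model_next_probs are hypothetical interfaces to the model's greedy choice and its full predictive distribution:

```python
import numpy as np

def self_estimated_op(histories, model_next, model_next_probs):
    """Self-evaluated optimization: the same split-history comparison, but the
    distribution over the following element comes from the model's own
    predictive probabilities instead of samples from the sequence generator."""
    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    actions = [model_next(h) for h in histories]

    # Optimized branch: the model predicts the element following its own action.
    optimized = entropy(np.mean(
        [model_next_probs(h + [a]) for h, a in zip(histories, actions)], axis=0))

    # Unoptimized branch: each action is paired with every history in the batch.
    unoptimized = float(np.mean([
        entropy(np.mean([model_next_probs(h + [a]) for h in histories], axis=0))
        for a in actions
    ]))
    return unoptimized - optimized
```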
This is a plot of measured optimization against self-estimated optimization. Models from the task with the largest n_good tend to fall on the line y = x, but as n_good decreases, models fall further above the line.
I suspect what's happened here is bog-standard model bias. The more heavily the model is optimizing, the less even the distribution of examples it gets fed during the maintenance training. This might make it more biased towards predicting the rewarded values after any sequence, which would cause it to underestimate Op.
The same can be done for Imp:
This gives even worse results! I'm not sure whether or not this is also model bias. I don't have a tremendous amount of experience with ML, so I will not pursue this further in this post.
Conclusions
The measure I've introduced for Op seems to be predicted very well by theory, and might have utility in detecting how well a policy's results line up with an RL task when it can be predicted accurately. The measure for Imp has some issues but shows some success towards the same goal. These can be considered complementary measures in many ways:
| Optimization | Impact |
| --- | --- |
| Can be positive or negative; the sign gives us information | Always ≥ 0 |
| Can be cancelled out by opposite-signed optimization elsewhere | Always a positive value, and always at least adds up across different sources |
| No easy way to calculate theoretical values of Op in general (that I've found) | The lower bound Imp_min can be calculated for any success/failure RL task |
| Excellent at identifying model behaviour in this case | Less good at identifying model behaviour in this case |
I would be interested in applying both of these to more complex ML systems, but at the moment I don't have any specific plans to do so, due to my very limited experience with ML. In Appendix A I also present equivalent measures which can be applied to the activations of a neural network.
A theoretical limitation of this work is that Imp_min can only be calculated when a task is a binary success/failure one. If the policy is trying to maximize some function of world-state then there's no simple formula for the lower bound based on RL score. Work on these systems must either develop new maths or find a way to break down utility maximization into a series of success/failure tasks, which can separately be evaluated for impact.
One practical limitation is in the self-evaluation stage. If the model's self-evaluation is biased, the measures of Op and Imp will be as well. This is especially problematic if either is underestimated, allowing the policy to "lie" to us about how much optimization/impact it is doing. My wild guess is that AIs which include explicit world-modelling (such as DreamerV3) will be less biased by default, but that attempts at introducing impact regularization might reintroduce bias.
Alex Turner's Existing Work
The expression for impact that I present here is totally independent of the AUP measure of impact proposed by Alex Turner in Towards a New Impact Measure [AF · GW]. There might be some subtle way in which they're related but I haven't thought about this enough to say more. I've also read World State is the Wrong Abstraction for Impact [LW · GW] and agree with some of the points presented.
In response I would say that the metric I present here relies strongly on a model of the future world state, so only details captured in that model can affect the impact. In the limit where the modelled future states only consist of success or failure, impact is trivially equal to the lower bound and excess impact is zero.
Appendices
Appendix A: Impact and Differential Impact
Derivation of Imp
I present the motivation for, and derivation of, the measure of Imp which I've used in this post. Returning to my estimator for Op:
If we think of these as two distributions over success and failure, with the proportion of "good" outcomes being the probability of succeeding by chance, this becomes the formula for a Kullback–Leibler divergence. For a quick recap, the KL divergence between two distributions over a variable follows this formula:
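For reference, the standard discrete form of the divergence is:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)}$$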
We could also estimate the probability of succeeding by chance using the split-history method: take the success rate in the optimized branch and the success rate in the split-history branch:
I will now present some results about KL divergences. I will start by defining the set of successful outcomes. Let us define a "baseline" probability distribution and, from it, a "baseline" success rate:
Now imagine a policy acts on this distribution and changes it to a new distribution. If this policy has a given success rate, I define a minimum-impact distribution as follows:
This is what we might expect to be the effect of a "minimum impact" policy: it makes successful outcomes more likely and unsuccessful outcomes less likely while leaving relative probabilities otherwise intact. The KL divergence between this distribution and the baseline can then be calculated, and it looks familiar:
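Concretely, writing p for the baseline distribution, b for the baseline success rate, s for the policy's success rate, and G for the set of successful outcomes (my notation), the construction just described and the resulting divergence are:

$$p_{\min}(x) = \begin{cases} p(x)\,\dfrac{s}{b} & x \in G \\ p(x)\,\dfrac{1-s}{1-b} & x \notin G \end{cases} \qquad D_{\mathrm{KL}}(p_{\min}\,\|\,p) = s\ln\frac{s}{b} + (1-s)\ln\frac{1-s}{1-b}$$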
This is the minimum KL divergence possible when shifting a distribution to achieve a given success rate. Changing the variable names to the split-history quantities above allows us to write down the original relation for a policy's impact:
Differential Impact
In this case "differential" means "to do with differentiation". I've studied a construct I call differential optimization in the past, as it pertains to functions of real-valued variables. If we have functions relating the relevant real-valued variables, we can define the following value:
Intuitively, if one variable is "optimizing" another, then the target will change less when the optimizer is allowed to vary than when it is held fixed. This led to the derivation of the differential optimization measure.
This can be extended to a differential impact metric:
This has a corresponding minimum, but it is speculative and has not been tested, so while I will present it here I can give no guarantees at all about its utility.
We can also extend this to vector-valued variables. If we define one Jacobian for the case where the optimizer is allowed to vary, and another for the case where it is held constant, then we get the following values for differential optimization and impact:
If the two sets of variables do not have the same dimension, then the inverse does not exist and instead the following construction must be used:
The motivation for constructions like this is to apply them to the activations of neural networks. For a network with a given width and backpropagation time, I believe the time complexity of this contains a polynomial term in the width (possibly cubic if using the Bareiss algorithm) for the matrix inverse, and a term in the backpropagation time.
Appendix B: Proofs
Derivation and proof of Imp ≥ Imp_min
I will prove that this choice of distribution is a global minimum of the KL divergence for a fixed success rate.
Consider a perturbed distribution, which involves moving a small amount of probability mass from one outcome to another. Without loss of generality, take both outcomes to be in the success set (they must both be in either the success set or its complement so that the success-rate constraint holds). Consider the value of the KL divergence:
Trivially, we only need to look at the components relevant to the two outcomes involved:
Expand the perturbed values:
Expand and collect factors of the perturbation, cancelling the common denominator:
Collect the factors of the perturbation, expanding each logarithm into a log(1 + x) form:
Use the Taylor expansion up to second order:
Substitute back in, expand, and cancel:
Subtract:
Therefore this distribution is a local minimum of the KL divergence subject to the constraint on the success rate. The KL divergence is convex in its first argument for a fixed reference distribution, so we have found the unique global minimum.
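An alternative route to the same result (not the perturbation argument above, but a Lagrange-multiplier sketch under the same constraint, in the notation from Appendix A) is:

$$\mathcal{L}(q) = \sum_x q(x)\ln\frac{q(x)}{p(x)} + \lambda\Big(\sum_{x\in G} q(x) - s\Big) + \mu\Big(\sum_x q(x) - 1\Big)$$

Setting the derivative with respect to each q(x) to zero gives q(x) proportional to p(x) separately on G and on its complement; the two constraints then fix the constants of proportionality to s/b and (1 - s)/(1 - b), recovering the minimum-impact distribution.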
Derivation of Differential Impact
The entropy-based measure of Op that I've used here was based on the following comparison to differential optimization:
Consider the network , . This gives and .
This can be extended to an entropic measure of optimization by considering uncertainty over the input, specifically:
Using split-histories we get:
If we take the difference of these, this gives the familiar value of Op. We may instead investigate the value of Imp. For brevity, define the following shorthand:
Substituting:
Taking with respect to requires taking , :
Substituting into our original equation:
Which, if we extend to , gives
Derivation of Multivariate Differential Impact and Optimization
Let us take vectors in the same manner as above. Assume that around some operating point we have the following Jacobians.
Without loss of generality, take the means of all of these variables to be zero. There exists a formula for transforming a multivariate normal distribution[1].
Now for , the mean will no longer be zero:
We can calculate the KL divergence between the resulting Gaussians using another formula[2]:
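For reference, the KL divergence between two k-dimensional Gaussians is:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_0,\Sigma_0)\,\big\|\,\mathcal{N}(\mu_1,\Sigma_1)\big) = \tfrac{1}{2}\left(\operatorname{tr}\!\left(\Sigma_1^{-1}\Sigma_0\right) + (\mu_1-\mu_0)^{\top}\Sigma_1^{-1}(\mu_1-\mu_0) - k + \ln\frac{\det\Sigma_1}{\det\Sigma_0}\right)$$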
Therefore our impact will be:
Taking the expected value of the third component is actually easy if you have access to the internet. We can see that it is the expectation of a quadratic form in a multivariate normal variable. This has a closed-form solution[3]:
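The closed form for the mean of a quadratic form in a Gaussian variable is:

$$\mathbb{E}\big[x^{\top} A x\big] = \operatorname{tr}(A\Sigma) + \mu^{\top} A \mu, \qquad x \sim \mathcal{N}(\mu, \Sigma)$$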
Therefore we have the following expression:
We can make some progress towards simplifying this if we take the covariance to be a scalar multiple of the identity, which in this case lets us cancel out everything involving it, since a scalar commutes with all of the matrices involved. We will also assume that all the Jacobians are invertible.
Using the cyclic property of the trace:
And if we define we get:
This seems to have a familiar form, and in fact if we consider the vectors to just be concatenations of independent variables, which means all the matrices are diagonal, we see that our equation has that form.
This is a nice sanity check. The value of Op is just the entropy difference, which simplifies in the same way for free.
If the Jacobians are not invertible, but we assume that is invertible, we instead get:
Appendix C: Supplementary Plots
Other Ways to Visualize Impact Plots
Here I plotted the "Impact ratio" against :
Here I plotted "Excess Impact" against :
Example training runs from n_good = 4
Example figures summarizing training runs:
1. ^ https://statproofbook.github.io/P/mvn-ltt.html
2. ^ https://stats.stackexchange.com/questions/60680/kl-divergence-between-two-multivariate-gaussians
3. ^ https://statproofbook.github.io/P/mean-qf.html