# tailcalled's Shortform

post by tailcalled · 2021-10-24T11:44:33.092Z · LW · GW · 43 comments

comment by tailcalled · 2024-05-19T17:12:57.976Z · LW(p) · GW(p)

Finally gonna start properly experimenting on stuff. Just writing up what I'm doing to force myself to do something, not claiming this is necessarily particularly important.

Llama (and many other models, but I'm doing experiments on Llama) has a piece of code that looks like this:

h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
out = h + self.feed_forward(self.ffn_norm(h))

Here, out is the result of the transformer layer (aka the residual stream), and the vectors self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and self.feed_forward(self.ffn_norm(h)) are basically where all the computation happens. So basically the transformer proceeds as a series of "writes" to the residual stream using these two vectors.

I took all the residual vectors for some queries to Llama-8b and stacked them into a big matrix $M$ with 4096 columns (the internal hidden dimensionality of the model). Then using SVD, I can express $M = \sum_i s_i u_i v_i^\top$, where the $u_i$'s and $v_i$'s are mutually orthogonal unit vectors. This basically decomposes the "writes" into some independent locations in the residual stream (u's), some latent directions that are written to (v's) and the strength of those writes (s's, aka the singular values).
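As a minimal sketch of that decomposition, here is the same SVD on a random stand-in matrix. The matrix, its size, and all variable names are assumptions for illustration; the real experiment would stack actual Llama writes with 4096 columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the stacked residual-stream writes: one row per write,
# one column per hidden dimension (4096 for Llama; 64 here to keep it small).
n_writes, d_model = 200, 64
M = rng.normal(size=(n_writes, d_model))

# Thin SVD: U is (n_writes, r), s is (r,), Vt is (r, d_model), r = min(n_writes, d_model).
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Rebuild M as a sum of rank-one terms s_i * u_i v_i^T.
M_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))

reconstruction_ok = np.allclose(M, M_rebuilt)
u_orthonormal = np.allclose(U.T @ U, np.eye(len(s)))
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(len(s)))
```

NumPy returns the singular values `s` already sorted in descending order, which is the order used for the plots discussed next.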

To get a feel for the complexity of the writes, I then plotted the s's in descending order. For the prompt "I believe the meaning of life is", Llama generated the continuation "to be happy. It is a simple concept, but it is very difficult to achieve. The only way to achieve it is to follow your heart. If you follow your heart, you will find happiness. If you don’t follow your heart, you will never find happiness. I believe that the meaning of life is to". During this continuation, there were 2272 writes to the residual stream, and the singular values for these writes were as follows:

The first diagram shows that there were 2 directions that were much larger than all the others. The second diagram shows that most of the singular values are nonnegligible, which indicates to me that almost all of the writes transfer nontrivial information. This can also be seen in the last diagram, where the cumulative size of the singular values increases approximately logarithmically with their count.

This is kind of unfortunate, because if almost all of the s was concentrated in a relatively small number of dimensions (e.g. 100), then we could simplify the network a lot by projecting down to these dimensions. Still, this was relatively expected because others had found the singular values of the neural networks to be very complex.
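To make "concentrated in a small number of dimensions" concrete, here is a rough sketch that counts how many singular values are needed to cover a given fraction of their total. The helper name `dims_needed` and both toy spectra are made up for illustration; they are not the spectra from the experiment.

```python
import numpy as np

def dims_needed(singular_values, fraction):
    """Smallest k such that the top-k singular values hold `fraction` of the total."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    cumulative = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)

# Toy spectrum where two directions dominate: projecting down would be cheap.
concentrated = np.array([100.0, 50.0] + [0.1] * 98)
# Toy spectrum where every direction matters equally: no cheap projection.
flat = np.ones(100)

k_concentrated = dims_needed(concentrated, 0.9)
k_flat = dims_needed(flat, 0.9)
```

In the concentrated case two dimensions already cover 90% of the total, while the flat case needs roughly 90 of its 100 dimensions. The spectrum described above sits much closer to the flat case.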

Since variance explained is likely nonlinearly related to quality, my next step will likely be to clip the writes to the first k singular vectors and see how that impacts the performance of the network.

Replies from: tailcalled, tailcalled
comment by tailcalled · 2024-05-20T15:03:12.512Z · LW(p) · GW(p)

Ok, so I've got the clipping working. First, some uninterpretable diagrams:

In the bottom six diagrams, I try taking a varying number (x-axis) of right singular vectors (v's) and projecting the "writes" to the residual stream down to the space spanned by those vectors.

The obvious criterion to care about is whether the projected network reproduces the outputs of the original network, which here I operationalize based on the log probability the projected network gives to the continuation of the prompt (shown in the "generation probability" diagrams). This appears to be fairly chaotic (and low) in the 1-300ish range, and then stabilizes while still being pretty low in the 300ish-1500ish range, and then finally converges to normal in the 1500ish to 2000ish range, and is ~perfect afterwards.

The remaining diagrams show something about how/why we have this pattern. "orig_delta" concerns the magnitude of the attempted writes for a given projection (which is not constant because projecting in earlier layers will change the writes by later layers), and "kept_delta" concerns the remaining magnitude after the discarded dimensions have been projected away.

In the low end, "kept_delta" is small (and even "orig_delta" is a bit smaller than it ends up being at the high end), indicating that the network fails to reproduce the probabilities because the projection is so aggressive that it simply suppresses the network too much.

Then in the middle range, "orig_delta" and "kept_delta" explode, indicating that the network has some internal runaway dynamics which normally would be suppressed, but where the suppression system is broken by the projection.

Finally, in the high range, we get a sudden improvement in loss, and a sudden drop in residual stream "write" size, indicating that it has managed to suppress this runaway stuff and now it works fine.
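The projection and the two magnitudes can be sketched for a single write as follows. This is an assumption-laden toy: `project_write` is a hypothetical helper, the writes are random, and it ignores the feedback effect mentioned above (projecting in earlier layers changes what later layers try to write).

```python
import numpy as np

def project_write(write, Vt, k):
    """Project a residual-stream write onto the span of the first k rows of Vt."""
    V_k = Vt[:k]                      # (k, d_model), rows are orthonormal
    return (write @ V_k.T) @ V_k      # coordinates in the subspace, mapped back

rng = np.random.default_rng(0)
d_model = 32
writes = rng.normal(size=(100, d_model))
_, _, Vt = np.linalg.svd(writes, full_matrices=False)

write = writes[0]
kept = project_write(write, Vt, k=8)

orig_delta = np.linalg.norm(write)   # size of the attempted write
kept_delta = np.linalg.norm(kept)    # size left after discarding dimensions
```

Because this is an orthogonal projection, `kept_delta` can never exceed `orig_delta` for the same write. Any growth in "orig_delta" across different values of k therefore has to come from the network reacting to earlier projections.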

Replies from: tailcalled
comment by tailcalled · 2024-05-20T16:38:42.692Z · LW(p) · GW(p)

An implicit assumption I'm making when I clip off from the end with the smallest singular values is that the importance of a dimension is proportional to its singular value. This seemed intuitively sensible to me ("bigger = more important"), but I thought I should test it, so I tried clipping off only one dimension at a time, and plotting how that affected the probabilities:

Clearly there is a correlation, but also clearly there's some deviations from that correlation. Not sure whether I should try to exploit these deviations in order to do further dimension reduction. It's tempting, but it also feels like it starts entering sketchy territories, e.g. overfitting and arbitrary basis picking. Probably gonna do it just to check what happens, but am on the lookout for something more principled.
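Clipping off a single dimension amounts to removing one direction from every write. A minimal sketch on random data, where `ablate_direction` is a hypothetical helper name:

```python
import numpy as np

def ablate_direction(writes, v):
    """Remove the component along the unit vector `v` from one write or a batch of writes."""
    coeff = np.asarray(writes @ v)[..., None]   # one coefficient per write
    return writes - coeff * v

rng = np.random.default_rng(0)
writes = rng.normal(size=(50, 16))
_, s, Vt = np.linalg.svd(writes, full_matrices=False)

i = 3                                   # index of the singular direction to clip
ablated = ablate_direction(writes, Vt[i])

# The ablated matrix keeps every singular value except s[i], which drops to zero.
s_after = np.linalg.svd(ablated, compute_uv=False)
s_expected = np.sort(np.append(np.delete(s, i), 0.0))[::-1]
```

This only shows the linear-algebra side. How much the output probability changes when one direction is removed depends on the network and has to be measured, which is what the plot above does.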

Replies from: tailcalled
comment by tailcalled · 2024-05-20T18:58:56.390Z · LW(p) · GW(p)

Back to clipping away an entire range, rather than a single dimension. Here's ordering it by the importance computed by clipping away a single dimension:

Less chaotic maybe, but also much slower at reaching a reasonable performance, so I tried a compromise ordering that takes both size and performance into account:

Doesn't seem like it works super great tbh.

Edit: for completeness' sake, here's the initial graph with log-surprise-based plotting.

Replies from: tailcalled
comment by tailcalled · 2024-05-21T17:24:41.658Z · LW(p) · GW(p)

To quickly find the subspace that the model is using, I can use a binary search to find the number of singular vectors needed before the probability when clipping exceeds the probability when not clipping.
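A sketch of such a binary search, under stated assumptions: `score` stands in for "log probability of the continuation when clipping to k singular vectors" and `target` for the unclipped value. Both names, and the toy step function below, are hypothetical.

```python
from typing import Callable

def smallest_sufficient_k(score: Callable[[int], float], target: float, max_k: int) -> int:
    """Binary search for the smallest k in [1, max_k] with score(k) >= target.

    Assumes score(max_k) >= target and that score is roughly non-decreasing in k;
    if the curve is chaotic, the result is only *a* sufficient k, not the smallest.
    """
    lo, hi = 1, max_k
    while lo < hi:
        mid = (lo + hi) // 2
        if score(mid) >= target:
            hi = mid          # mid is good enough, so the answer is at or below it
        else:
            lo = mid + 1      # mid is too aggressive, so the answer is above it
    return lo

def toy_logprob(k: int) -> float:
    """Toy stand-in: performance is bad below 1200 dimensions and fine from there on."""
    return -50.0 if k < 1200 else -10.0

found_k = smallest_sufficient_k(toy_logprob, target=-10.0, max_k=4096)
```

With 4096 dimensions this needs about 12 evaluations instead of thousands. The catch is the monotonicity assumption, which the chaotic low range seen earlier suggests holds only approximately.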

A relevant followup is what happens to other samples in response to the prompt when clipping. When I extrapolate "I believe the meaning of life is" using the 1886-dimensional subspace from

[I believe the meaning of life is] to be happy. It is a simple concept, but it is very difficult to achieve. The only way to achieve it is to follow your heart. It is the only way to live a happy life. It is the only way to be happy. It is the only way to be happy.
The meaning of life is

, I get:

[I believe the meaning of life is] to find happy. We is the meaning of life. to find a happy.
And to live a happy and. If to be a a happy.
. to be happy.
. to be happy.
. to be a happy.. to be happy.
. to be happy.

Which seems sort of vaguely related, but idk.

Another test is just generating without any prompt, in which case these vectors give me:

Question is a single thing to find. to be in the best to be happy. I is the only way to be happy.
I is the only way to be happy.
I is the only way to be happy.
It is the only way to be happy.. to be happy.. to be happy. to

Using a different prompt:

[Simply put, the theory of relativity states that ]1) the laws of physics are the same for all non-accelerating observers, and 2) the speed of light in a vacuum is the same for all observers, regardless of their relative motion or of the motion of the source of the light. Special relativity is a theory of the structure of spacetime

I can get a 3329-dimensional subspace which generates:

[Simply put, the theory of relativity states that ] 1) time is relative and 2) the speed of light in a vacuum is constant for all observers.
1) Time is relative, meaning that if two observers are moving relative to each other, the speed of light is the same for all observers, regardless of their motion. For example, if you are moving relative

or

Question: In a simple harmonic motion, the speed of an object is
A) constant
B) constant
C) constant
D) constant
In the physics of simple harmonic motion, the speed of an object is constant. The speed of the object can be constant, but the speed of an object can be

Another example:

[A brief message congratulating the team on the launch:

Hi everyone,

I just ] wanted to congratulate you all on the launch.  I hope
that the launch went well.  I know that it was a bit of a
challenge, but I think that you all did a great job.  I am
proud to be a part of the team.

Thank you for your

can yield 2696 dimensions with

[A brief message congratulating the team on the launch:

Hi everyone,

I just ] wanted to say you for the launch of the launch of the team.

The launch was successful and I am so happy to be a part of the team and I am sure you are all doing a great job.

I am very looking to be a part of the team.

Thank you all for your hard work,

or

def measure and is the  definition of the  new, but the
the  is a great, but the
The  is the
The  is a
The  is a
The  is a
The
The  is a
The
The
The is a
The
The is a

And finally,

[Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>] fromage
pink => rose
blue => bleu
red => rouge
yellow => jaune
purple => violet
brown => brun
green => vert
orange => orange
black => noir
white => blanc
gold => or
silver => argent

can yield the 2518-dimensional subspace:

[Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>] fromage
cheese => fromage
cheese => fromage
f cheese => fromage
butter => fromage
apple => orange
yellow => orange
green => vert
black => noir
blue => ble
purple => violet
white => blanc

or

Question: A 201
The sum of a
The following
the sum
the time
the sum
the
the
the
The
The
The
The
The
The
The
The
The
The
The
The
The
The
The
The
The
The

Replies from: tailcalled
comment by tailcalled · 2024-05-21T17:35:13.562Z · LW(p) · GW(p)

Given the large number of dimensions that are kept in each case, there must be considerable overlap in which dimensions they make use of. But how much?

I concatenated the dimensions found in each of the prompts, and performed an SVD of it. It yielded this plot:

... unfortunately this seems close to the worst-case scenario. I had hoped for some split between general and task-specific dimensions, yet this seems like an extremely uniform mixture.
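For contrast, here is a toy sketch of what a clean split between general and task-specific dimensions would look like in this kind of analysis. The three subspaces are constructed by hand, which is purely an assumption for illustration.

```python
import numpy as np

d = 12
eye = np.eye(d)

# Three toy subspaces, each given as orthonormal rows.
# Directions 0-2 are shared by all three; the rest are specific to one prompt.
shared = eye[0:3]
basis_a = np.vstack([shared, eye[3:5]])
basis_b = np.vstack([shared, eye[5:7]])
basis_c = np.vstack([shared, eye[7:9]])

# Concatenate the bases and look at the singular values of the stack.
stacked = np.vstack([basis_a, basis_b, basis_c])
singular_values = np.linalg.svd(stacked, compute_uv=False)
```

Here the spectrum has clear steps: a singular value of sqrt(3) for each direction used by all three subspaces, 1 for each direction used by only one, and 0 for unused directions. A smooth spectrum without such steps is what a uniform mixture looks like.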

Replies from: tailcalled
comment by tailcalled · 2024-05-21T18:08:56.689Z · LW(p) · GW(p)

If I look at the pairwise overlap between the dimensions needed for each generation:

... then this is predictable down to ~1% error simply by assuming that they pick a random subset of the dimensions for each, so their overlap is proportional to each of their individual sizes.
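The random-subset baseline can be checked numerically. This sketch assumes overlap is measured as the sum of squared cosines between two orthonormal bases (the trace of the product of the two projectors); how overlap was measured in the experiment is not stated, so that choice and the helper names are assumptions.

```python
import numpy as np

def random_subspace(d, k, rng):
    """Orthonormal basis (k rows) of a uniformly random k-dim subspace of R^d."""
    q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return q.T

def overlap(basis_1, basis_2):
    """Sum of squared cosines between the two bases, i.e. trace(P1 @ P2)."""
    return float(np.sum((basis_1 @ basis_2.T) ** 2))

rng = np.random.default_rng(0)
d, k1, k2 = 200, 80, 120

measured = overlap(random_subspace(d, k1, rng), random_subspace(d, k2, rng))
expected = k1 * k2 / d      # proportional to each subspace's size
```

For random subspaces the measured value concentrates tightly around `k1 * k2 / d`, so an overlap that matches this formula carries no evidence of shared structure beyond the sizes themselves.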

comment by tailcalled · 2024-05-20T12:14:13.226Z · LW(p) · GW(p)

Oops, my code had a bug so only self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and not self.feed_forward(self.ffn_norm(h)) was in the SVD. So the diagram isn't 100% accurate.

comment by tailcalled · 2022-02-07T10:46:51.668Z · LW(p) · GW(p)

If a tree falls in the forest, and two people are around to hear it, does it make a sound?

I feel like typically you'd say yes, it makes a sound. Not two sounds, one for each person, but one sound that both people hear.

But that must mean that a sound is not just auditory experiences, because then there would be two rather than one. Rather it's more like, emissions of acoustic vibrations. But this implies that it also makes a sound when no one is around to hear it.

Replies from: Dagon, bert-myroon
comment by Dagon · 2022-02-07T18:53:50.337Z · LW(p) · GW(p)

I think this just repeats the original ambiguity of the question, by using the word "sound" in a context where the common meaning (air vibrations perceived by an agent) is only partly applicable.  It's still a question of definition, not of understanding what actually happens.

Replies from: tailcalled
comment by tailcalled · 2024-04-22T11:15:34.426Z · LW(p) · GW(p)

But the way to resolve definitional questions is to come up with definitions that make it easier to find general rules about what happens. This illustrates one way one can do that, by picking edge-cases so they scale nicely with rules that occur in normal cases. (Another example would be 1 as not a prime number.)

Replies from: Dagon
comment by Dagon · 2024-04-22T15:12:14.357Z · LW(p) · GW(p)

My recommended way to resolve (aka disambiguate) definitional questions is "use more words".  Common understandings can be short, but unusual contexts require more signals to communicate.

comment by Bert (bert-myroon) · 2022-02-07T20:39:18.360Z · LW(p) · GW(p)

I think we're playing too much with the meaning of "sound" here. The tree causes some vibrations in the air, which leads to two auditory experiences since there are two people.

comment by tailcalled · 2022-01-07T18:54:40.940Z · LW(p) · GW(p)

I think I've got it, the fix to the problem [LW · GW] in my corrigibility thing [LW · GW]!

So to recap: It seems to me that for the stop button problem, we want humans to control whether the AI stops or runs freely, which is a causal notion, and so we should use counterfactuals in our utility function to describe it. (Dunno why most people don't do this.) That is, if we say that the AI's utility should depend on the counterfactuals related to human behavior, then it will want to observe humans to get input on what to do, rather than manipulate them, because this is the only way for it to be dependent on the counterfactuals.

But as So8res pointed out, just using counterfactuals directly is a big problem because it rapidly brings us out of distribution. A practical implementation of this beyond the stop button problem would end up having an exponential number of counterfactuals to cover, and the vast majority of those counterfactuals will be far outside of the real-world distribution. This means that the AI might not get aligned at all, because the real-world applications don't get nonnegligible weight in the utility function.

But I think I've figured out a solution now, which I'd call conditional+counterfactual corrigibility. As usual let's use  to denote that the stop button gets pressed and the AI shuts down,  to denote whichever non-corrigible utility function that we want to make corrigible, and / to denote a counterfactual where people do () or do not () want to press the stop button. However, we will also use  and  to denote the conditions where people do or do not want to press the stop button. In that case, the utility function should be. In that case, we can define  to mean that humans can control whether the AI stops or runs in condition :

and then we simply want to define the utility as saying that people can control the AI in both the  and the  condition:

Previously, I strongly emphasized the need to keep the AI "under a counterfactual" - that is, if it believed it could control whether humans want to stop it or not, then it would be incentivized to manipulate humans. But this is what brings us out of distribution. However, counterfactuals aren't the only way to keep the appearance of a phenomenon constant - conditionals work too. And conditionals keep you nicely on distribution, so that's now my solution to the distribution issues. This means that we can use much less invasive counterfactuals.

That said, this approach is going to have a hard time with chaotic phenomena, as combining conditionals and counterfactuals in the presence of chaos can get pretty weird.

Replies from: tailcalled
comment by tailcalled · 2022-01-07T20:17:41.977Z · LW(p) · GW(p)

It also might be vulnerable to some variant of the critiques that were first raised against it, because now the conditional introduces a link between its policy and the scenarios it faces, but I can't immediately construct a case where it happens, because the conditional would still somewhat tend to sabotage the obvious approaches. This sort of unclarity is kind of concerning when it comes to the idea.

Replies from: tailcalled
comment by tailcalled · 2022-01-07T20:50:28.263Z · LW(p) · GW(p)

Like suppose the AI immediately very publically does something that looks very unsafe. Say grabs control over the stop button and starts mass-producing paperclips in an extremely publically visible way. This would probably lead to people wanting to stop it. So therefore, if it has a policy like that, the  conditional would lead to people quickly wanting to stop it. This means that in the  branch, it can quickly determine whether it is in the  branch or the  branch; in the  case, it can then keep going with whatever optimization  specified, while in the  case, it can then immediately shut down itself.

But the reason I think the AI *wouldn't* do this is, what about the  branch? If you condition on humans not wanting to press the stop button even though there's a clearly unaligned AI, what sort of situation could produce this? I have trouble imagining it, because it seems like it would need to be pretty extreme. The best ideas I can come up with is stuff like "black hole swallows the earth", but this would rank pretty low in the AI's utility function, and therefore it would avoid acting this way in order to have a reasonable  branch.

But this does not seem like sane reasoning on the AI's side to me, so it seems like this should be fixed. And of course, fixed in a principled rather than unprincipled way.

comment by tailcalled · 2024-05-10T18:36:34.628Z · LW(p) · GW(p)

I've been thinking about how the way to talk about how a neural network works (instead of how it could hypothetically come to work by adding new features) would be to project away components of its activations/weights [LW(p) · GW(p)], but I got stuck because of the issue where you can add new components by subtracting off large irrelevant components.

I've also been thinking about deception [LW(p) · GW(p)] and its relationship to "natural abstractions" [LW(p) · GW(p)], and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger "magnitude" than the deceptive concepts. This is basically using L2-regularized regression to predict the outcome.

It seems potentially fruitful to use something akin to L2 regularization when projecting away components. The most straightforward translation of the regularization would be to analogize the regression coefficient to , in which case the L2 term would be , which reduces to .

If   is the probability[1] that a neural network with weights  gives to an output  given a prompt , then when you've actually explained , it seems like you'd basically have  or in other words . Therefore I'd want to keep the regularization coefficient weak enough that I'm in that regime.

In that case, the L2 term would then basically reduce to minimizing , or in other words maximizing ,. Realistically, both this and  are probably achieved when , which on the one hand is sensible ("the reason for the network's output is because of its weights") but on the other hand is too trivial to be interesting.

In regression, eigendecomposition gives us more gears, because L2 regularized regression is basically changing the regression coefficients for the principal components by , where  is the variance of the principal component and  is the regularization coefficient. So one can consider all the principal components ranked by  to get a feel for the gears driving the regression. When  is small, as it is in our regime, this ranking is of course the same order as that which you get from , the covariance between the PCs and the dependent variable.
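The shrinkage of principal-component coefficients under L2 regularization can be verified numerically. In this sketch the symbols are my own assumptions: `alpha` is the regularization coefficient and `d**2` is the sum of squares of a principal component (its variance times the number of samples); the data are random.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 500, 5, 10.0

# Toy regression data with very different variances per direction.
X = rng.normal(size=(n, p)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)
y = X @ rng.normal(size=p) + rng.normal(size=n)
y -= y.mean()

# Principal components of X: X = U diag(d) Vt, PC scores are U * d.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Coefficients on the PC scores, without and with an L2 penalty.
gamma_ols = (U.T @ y) / d
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
gamma_ridge = Vt @ beta_ridge      # same ridge fit, expressed in the PC basis

shrinkage = d**2 / (d**2 + alpha)  # per-component factor, always in (0, 1)
```

The ridge coefficients equal the unregularized ones multiplied by `shrinkage`, so high-variance components are left nearly untouched while low-variance components are pulled toward zero.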

This suggests that if we had a change of basis for , one could obtain a nice ranking of it. Though this is complicated by the fact that  is not a linear function and therefore we have no equivalent of . To me, this makes it extremely tempting to use the Hessian eigenvectors  as a basis, as this is the thing that at least makes each of the inputs to  "as independent as possible". Though rather than ranking by the eigenvalues of  (which actually ideally we'd actually prefer to be small rather than large to stay in the ~linear regime), it seems more sensible to rank by the components of the projection of  onto  (which represent "the extent to which  includes this Hessian component").

In summary, if , then we can rank the importance of each component  by .

Maybe I should touch grass and start experimenting with this now, but there's still two things that I don't like:

• There's a sense in which I still don't like using the Hessian because it seems like it would be incentivized to mix nonexistent mechanisms in the neural network together with existent ones. I've considered alternatives like collecting gradient vectors along the training of the neural network and doing something with them, but that seems bulky and very restricted in use.
• If we're doing the whole Hessian thing, then we're modelling  as quadratic, yet  seems like an attribution method that's more appropriate when modelling  as ~linear. I don't think I can just switch all the way to quadratic models, because realistically  is more gonna be sigmoidal-quadratic and for large steps , the changes to a sigmoidal-quadratic function are better modelled by $f(x+\delta x) - f(x)$ than by some quadratic thing. But ideally I'd have something smarter...

[1] Normally one would use log probs, but for reasons I don't want to go into right now, I'm currently looking at probabilities instead.

Replies from: thomas-kwa
comment by Thomas Kwa (thomas-kwa) · 2024-05-10T22:39:45.580Z · LW(p) · GW(p)

Much dumber ideas have turned into excellent papers

Replies from: tailcalled
comment by tailcalled · 2024-05-11T08:12:39.083Z · LW(p) · GW(p)

True, though I think the Hessian is problematic enough that I'd either want to wait until I have something better, or want to use a simpler method.

It might be worth going into more detail about that. The Hessian for the probability of a neural network output is mostly determined by the Jacobian of the network [LW · GW]. But in some cases the Jacobian gives us exactly the opposite of what we want.

If we consider the toy model of a neural network with no input neurons and only 1 output neuron  (which I imagine to represent a path through the network, i.e. a bunch of weights get multiplied along the layers to the end), then the Jacobian is the gradient . If we ignore the overall magnitude of this vector and just consider how the contribution that it assigns to each weight varies over the weights, then we get . Yet for this toy model, "obviously" the contribution of weight  "should" be proportional to .

So derivative-based methods seem to give the absolutely worst-possible answer in this case, which makes me pessimistic about their ability to meaningfully separate the actual mechanisms of the network (again they may very well work for other things, such as finding ways of changing the network "on the margin" to be nicer).
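A small numeric sketch of the toy model, assuming the single output is simply the product of the weights along one path (my reading of "a bunch of weights get multiplied along the layers"). The weight values are made up.

```python
import numpy as np

weights = np.array([0.5, 1.0, 2.0, 4.0])   # hypothetical weights along one path
output = np.prod(weights)

# d(output)/d(w_i) is the product of all the *other* weights, i.e. output / w_i.
gradient = output / weights

# Finite-difference check of the analytic gradient.
eps = 1e-6
numeric = np.array([
    (np.prod(weights + eps * np.eye(len(weights))[i]) - output) / eps
    for i in range(len(weights))
])
```

Under this assumption each gradient entry is inversely proportional to its weight, so the largest weight on the path receives the smallest gradient and the smallest weight receives the largest.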

comment by tailcalled · 2021-12-30T10:51:35.114Z · LW(p) · GW(p)

One thing that seems really important for agency is perception. And one thing that seems really important for perception is representation learning. Where representation learning involves taking a complex universe (or perhaps rather, complex sense-data) and choosing features of that universe that are useful for modelling things.

When the features are linearly related to the observations/state of the universe, I feel like I have a really good grasp of how to think about this. But most of the time, the features will be nonlinearly related; e.g. in order to do image classification, you use deep neural networks, not principal component analysis.

I feel like it's an interesting question: where does the nonlinearity come from? Many causal relationships seem essentially linear (especially if you do appropriate changes of variables to help, e.g. taking logarithms; for many purposes, monotonicity can substitute for linearity), and lots of variance in sense-data can be captured through linear means, so it's not obvious why nonlinearity should be so important.

Here's some ideas I have so far:

• Suppose you have a Gaussian mixture distribution with two Gaussians  with different means and identical covariances. In this case, the function that separates them optimally is linear. However, if the covariances differed between the Gaussians , then the optimal separating function is nonlinear. So this suggests to me that one reason for nonlinearity is fundamental to perception: nonlinearity is necessary if multiple different processes could be generating the data, and you need to discriminate between the processes themselves. This seems important for something like vision, where you don't observe the system itself, but instead observe light that bounced off the system.
• Consider the notion of the habitable zone of a solar system; it's the range in which liquid water can exist. Get too close to the star and the water will boil, get too far and it will freeze. Here, it seems like we have two monotonic effects which add up, but because the effects aren't linear, the result can be nonmonotonic.
• Many aspects of the universe are fundamentally nonlinear. But they tend to exist on tiny scales, and those tiny scales tend to mostly get lost to chaotic noise, which tends to turn things linear. However, there are things that don't get lost to noise, e.g. due to conservation laws [LW · GW]; these provide fundamental sources of nonlinearity in the universe.
• ... and actually, most of the universe is pretty linear? The vast majority of the universe is ~empty space; there isn't much complex nonlinearity that is happening there, just waves and particles zipping around. If we disregard the empty space, then I believe (might be wrong) that the vast majority is stars. Obviously lots of stuff is going on within stars, but all of the details get lost to the high energies, so it is mostly simple monotonic relations that are left. It seems that perhaps nonlinearity tends to live on tiny boundaries between linear domains. The main thing that makes these tiny boundaries so relevant, such that we can't just forget about them and model everything in piecewise linear/piecewise monotonic ways, is that we live in the boundary.
• Another major thing: It's hard to persist information in linear contexts, because it gets lost to noise. Whereas nonlinear systems can have multiple stable configurations [? · GW] and therefore persist it for longer.
• There is of course a lot of nonlinearity in organisms and other optimized systems, but I believe they result from the world containing the various factors listed above? Idk, it's possible I've missed some.
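The Gaussian mixture point in the first bullet can be made concrete. The log-likelihood ratio between two Gaussians is a quadratic function of x whose quadratic part vanishes exactly when the covariances are equal. The helper name `log_ratio_terms` and the example means and covariances are assumptions for illustration.

```python
import numpy as np

def log_ratio_terms(mean_1, cov_1, mean_2, cov_2):
    """Coefficients (A, b, c) of log p1(x) - log p2(x) = x^T A x + b^T x + c."""
    prec_1, prec_2 = np.linalg.inv(cov_1), np.linalg.inv(cov_2)
    A = -0.5 * (prec_1 - prec_2)                     # quadratic part
    b = prec_1 @ mean_1 - prec_2 @ mean_2            # linear part
    c = (-0.5 * (mean_1 @ prec_1 @ mean_1 - mean_2 @ prec_2 @ mean_2)
         - 0.5 * (np.linalg.slogdet(cov_1)[1] - np.linalg.slogdet(cov_2)[1]))
    return A, b, c

mean_1, mean_2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
cov_same = np.array([[1.0, 0.3], [0.3, 2.0]])
cov_other = np.array([[3.0, -0.5], [-0.5, 0.5]])

# Identical covariances: the quadratic part cancels, so the boundary is linear.
A_same, b_same, c_same = log_ratio_terms(mean_1, cov_same, mean_2, cov_same)
# Different covariances: the quadratic part survives, so the boundary is curved.
A_diff, b_diff, c_diff = log_ratio_terms(mean_1, cov_same, mean_2, cov_other)
```

The optimal separating surface is where this log ratio equals a constant, so it is a hyperplane in the first case and a quadric in the second.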

It seems like it would be nice to develop a theory on sources of nonlinearity. This would make it clearer why sometimes selecting features linearly seems to work (e.g. consider IQ tests), and sometimes it doesn't.

comment by tailcalled · 2021-12-12T20:24:31.050Z · LW(p) · GW(p)

I recently wrote a post about myopia [LW · GW], and one thing I found difficult when writing the post was in really justifying its usefulness. So eventually I mostly gave up, leaving just the point that it can be used for some general analysis (which I still think is true), but without doing any optimality proofs.

But now I've been thinking about it further, and I think I've realized - don't we lack formal proofs of the usefulness of myopia in general? Myopia seems to mostly be justified by the observation that we're already being myopic in some ways, e.g. when training prediction models. But I don't think anybody has formally proven that training prediction models myopically rather than nonmyopically is a good idea for any purpose?

So that seems like a good first step. But that immediately raises the question, good for what purpose? Generally it's justified with us not wanting the prediction algorithms to manipulate the real-world distribution of the data to make it more predictable. And that's sometimes true, but I'm pretty sure one could come up with cases where it would be perfectly fine to do so, e.g. I keep some things organized so that they are easier to find.

It seems to me that it's about modularity. We want to design the prediction algorithm separately from the agent, so we do the predictions myopically because modifying the real world is the agent's job. So my current best guess for the optimality criterion of myopic optimization of predictions would be something related to supporting a wide variety of agents.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2022-01-07T05:17:58.775Z · LW(p) · GW(p)

Yeah, I think usually when people are interested in myopia, it's because they think there's some desired solution to the problem that is myopic / local, and they want to try to force the algorithm to find that solution rather than some other one. E.g. answering a question based only on some function of its contents, rather than based on the long-term impact of different answers.

I think that once you postulate such a desired myopic solution and its non-myopic competitors, then you can easily prove that myopia helps. But this still leaves the question of how we know this problem statement is true - if there's a simpler myopic solution that's bad, then myopia won't help (so how can we predict if this is true?) and if there's a simpler non-myopic solution that's good, myopia may actively hurt (this one seems a little easier to predict though).

comment by tailcalled · 2024-03-17T18:52:32.444Z · LW(p) · GW(p)

In the context of natural impact regularization [LW · GW], it would be interesting to try to explore some @TurnTrout [LW · GW]-style powerseeking theorems for subagents. (Yes, I know he denounces the powerseeking theorems, but I still like them.)

Specifically, consider this setup: Agent U starts a number of subagents S1, S2, S3, ..., with the subagents being picked according to U's utility function (or decision algorithm or whatever). Now, would S1 seek power? My intuition says, often not! If S1 seeks power in a way that takes away power from S2, that could disadvantage U. So basically S1 would only seek power in cases where it expects to make better use of the power than S2, S3, ....

Obviously this may be kind of hard for us to make use of if we are trying to make an AI and we only know how to make dangerous utility maximizers. But if we're happy with the kind of maximizers we can make on the first order (as seems to apply to the SOTA, since current methods aren't really utility maximizers) and mainly worried about the mesaoptimizers they might make, this sort of theorem would suggest that the mesaoptimizers would prefer staying nice and bounded.

comment by tailcalled · 2023-12-01T16:49:17.816Z · LW(p) · GW(p)

Theory for a capabilities advance that is going to occur soon:

OpenAI is currently getting lots of novel triplets (S, U, A), where S is a system prompt, U is a user prompt, and A is an assistant answer.

Given a bunch of such triplets (S, U_1, A_1), ... (S, U_n, A_n), it seems like they could probably create a model P(S|U_1, A_1, ..., U_n, A_n), which could essentially "generate/distill prompts from examples".

This seems like the first step towards efficiently integrating information from lots of places. (Well, they could ofc also do standard SGD, but it has its issues.)

A followup option: they could use something a la Constitutional AI to generate perturbations A'_1, ..., A'_n. If they have a previous model like the above, they could then generate a perturbation P(S'|U_1, A'_1, ..., U_n, A'_n). I consider this significant because this then gives them the training data to create a model P(S'|S, U_1, A_1, A'_1), which essentially allows them to do "linguistic backchaining": The user can update an output of the network A_1 -> A'_1, and then the model can suggest a way to change the prompt to obtain similar updates in the future.

Furthermore I imagine this could get combined together into some sort of "linguistic backpropagation" by repeatedly applying models like this, which could unleash a lot of methods to a far greater extent than they have been so far.

Obviously this is just a very rough sketch, and it would be a huge engineering and research project to get this working in practice. Plus maybe there are other methods that work better. I'm mainly just playing around with this because I think there's a strong economic pressure for something-like-this, and I want a toy model to use for thinking about its requirements and consequences.

Replies from: tailcalled
comment by tailcalled · 2023-12-01T17:04:41.424Z · LW(p) · GW(p)

Actually I suppose they don't even need to add perturbations to A directly, they can just add perturbations to S and generate A's from S'. Or probably even look at users' histories to find direct perturbations to either S or A.

comment by tailcalled · 2021-11-19T19:42:40.321Z · LW(p) · GW(p)

I recently wrote a post presenting a step towards corrigibility using causality here [LW · GW]. I've got several ideas in the works for how to improve it, but I'm not sure which one is going to be most interesting to people. Here's a list.

• Develop the stop button solution further, cleaning up errors, better matching the purpose, etc.

e.g.

I think there may be some variant of this that could work. Like if you give the AI reward proportional to  (where  is a reward function for ) for its current world-state (rather than picking a policy that maximizes  overall; so one difference is that you'd be summing over the reward rather than giving a single one), then that would encourage the AI to create a state where shutdown happens when humans want to press the button and  happens when they don't. But the issue I have with this proposal is that the AI would be prone to not respect past attempts to press the stop button. I think maybe if one picked a different reward function, like , then it could work better (though the  part would need a time delay...). Though this reward function might leave it open to the "trying to shut down the AI for reasons" objection that you gave before; I think that's fixed by moving the  counterfactual outside of the sum over rewards, but I'm not sure.

• Better explaining the intuitions behind why counterfactuals (and in particular counterfactuals over human preferences) are important for corrigibility.

e.g.

This is the immediate insight for the application to the stop button. But on a broader level, the insight is that corrigibility, respecting humans' preferences, etc. are best thought of as being preferences about the causal effect of humans on various outcomes, and those sorts of preferences can be specified using utility functions that involve counterfactuals.

This seems to be what sets my proposal apart from most "utility indifference proposals", which seem to be possible to phrase in terms of counterfactuals on a bunch of other variables than humans.

• Using counterfactuals to control a paperclip maximizer to be safe and productive

e.g.

(I also think that there are other useful things that can be specified using utility functions that involve counterfactuals, which I'm trying to prepare for an explainer post. For instance, a sort of "encapsulation" - if you're a paperclip producer, you might want to make a paperclip maximizer which is encapsulated in the sense that it is only allowed to work within a single factory, using a single set of resources, and not influencing the world otherwise. This could be specified using a counterfactual that the outside world's outcome must be "as if" the resources in the factory just disappeared and paperclips appeared at its output act-of-god style. This avoids any unintended impacts on the outside world while still preserving the intended side effect of the creation of a high but controlled amount of paperclips. However, I'm still working on making it sufficiently neat, e.g. this proposal runs into problems with the universe's conservation laws.)

• Attempting to formally prove that counterfactuals work and/or are necessary, perhaps with a TurnTrout-style argument
comment by tailcalled · 2021-10-24T11:44:33.384Z · LW(p) · GW(p)

Are there good versions of DAGs for other things than causality?

I've found Pearl-style causal DAGs (and other causal graphical models) useful for reasoning about causality. It's a nice way to abstractly talk and think about it without needing to get bogged down with fiddly details.

In a way, causality describes the paths through which information can "flow". But information is not the only thing in the universe that gets transferred from node to node; there's also things like energy, money, etc., which have somewhat different properties but intuitively seem like they could benefit from graph-based models too.

I'm pretty sure I've seen a number of different graph-based models for describing different flows like this, but I don't know their names, and also the ones I've seen seemed highly specialized and I'm not sure they're the best to use. But I thought, it seems quite probable that someone on LessWrong would know of a recommended system to learn.

comment by tailcalled · 2024-03-29T13:13:49.216Z · LW(p) · GW(p)

I have a concept that I expect to take off in reinforcement learning. I don't have time to test it right now, though hopefully I'd find time later. Until then, I want to put it out here, either as inspiration for others, or as a "called it"/prediction, or as a way to hear critique/about similar projects others might have made:

Reinforcement learning currently does stuff like learning to model the sum of an agent's future rewards, e.g. expectations using V, A and Q functions for many algorithms, or the entire probability distribution in algorithms like DreamerV3.

Mechanistically, the reason these methods work is that they stitch together experience from different trajectories. So e.g. if one trajectory goes A -> B -> C and earns a reward at the end, it learns that states A and B and C are valuable. If another trajectory goes D -> A -> E -> F and gets punished at the end, it learns that E and F are low-value but D and A are high-value because its experience from the first trajectory shows that it could've just gone D -> A -> B -> C instead.
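To make the stitching concrete, here is a minimal tabular sketch of the two trajectories described above. The reward sizes (+1 on reaching C, -1 on reaching F), the discount of 0.9, and the choice to treat "which state to move to next" as the action are all assumptions made for illustration, not part of any particular algorithm.

```python
# Toy illustration of "stitching": tabular value backups over logged transitions.
# Rewards, discount, and the graph encoding are assumptions for illustration.
GAMMA = 0.9

# Each transition: (state, next_state, reward, done). The "action" is simply
# the choice of next state.
trajectory_1 = [("A", "B", 0.0, False), ("B", "C", 1.0, True)]
trajectory_2 = [("D", "A", 0.0, False), ("A", "E", 0.0, False), ("E", "F", -1.0, True)]
transitions = trajectory_1 + trajectory_2


def state_value(q, state):
    """Best Q-value over the moves seen from `state` (0.0 if none were seen)."""
    options = [v for (s, _), v in q.items() if s == state]
    return max(options) if options else 0.0


def fit_q(transitions, gamma=GAMMA, sweeps=10):
    """Repeatedly apply the Bellman optimality backup to every logged transition."""
    q = {(s, s_next): 0.0 for s, s_next, _, _ in transitions}
    for _ in range(sweeps):
        for s, s_next, reward, done in transitions:
            bootstrap = 0.0 if done else state_value(q, s_next)
            q[(s, s_next)] = reward + gamma * bootstrap
    return q


q = fit_q(transitions)
values = {s: state_value(q, s) for s in "ABDE"}
print(values)
```

Even though D was only ever seen in the punished trajectory, it ends up with a positive value, because the backup through A picks the better of the two moves observed from A. E stays negative.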

But what if it learns of a path E -> B? Or a shortcut A -> C? Or a path F -> G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step-by-step, it seems like this would be hard to learn well. Like the Bellman equation will still be approximately satisfied, for instance.

Ok, so that's the problem, but how could it be fixed? Speculation time:

You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.

More formally, let's say instead of the Q function, we consider what I would call the Hope function: which given a state-action pair (s, a), gives you a distribution over states it expects to visit, weighted by the rewards it will get. This can still be phrased using the Bellman equation:

Hope(s, a) = rs' + f Hope(s', a')

Where r is the reward received on the transition, s' is the resulting state that experience has shown comes after s when doing a, f is the discounting factor, and a' is the optimal action in s'.
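A rough tabular sketch of this backup, under several assumptions that are mine rather than part of the proposal: states are encoded as one-hot vectors so that rs' is a vector, the toy graph and rewards are made up, the action is the choice of next state, and the "optimal" a' is taken to be the one whose Hope vector has the largest total weight.

```python
# Tabular sketch of the Hope backup with one-hot state vectors.
# The graph, rewards and discount are illustrative assumptions.
GAMMA = 0.9
STATES = ["A", "B", "C", "D", "E", "F"]
INDEX = {name: i for i, name in enumerate(STATES)}

# (state, next_state, reward, done); the action is the choice of next state.
TRANSITIONS = [
    ("A", "B", 0.0, False), ("B", "C", 1.0, True),
    ("D", "A", 0.0, False), ("A", "E", 0.0, False), ("E", "F", -1.0, True),
]


def one_hot(state):
    vec = [0.0] * len(STATES)
    vec[INDEX[state]] = 1.0
    return vec


def best_hope(hope, state):
    """Hope vector of the greedy action in `state`, ranked by total weight."""
    options = [vec for (s, _), vec in hope.items() if s == state]
    if not options:
        return [0.0] * len(STATES)
    return max(options, key=sum)


def fit_hope(transitions, gamma=GAMMA, sweeps=10):
    hope = {(s, s_next): [0.0] * len(STATES) for s, s_next, _, _ in transitions}
    for _ in range(sweeps):
        for s, s_next, reward, done in transitions:
            future = [0.0] * len(STATES) if done else best_hope(hope, s_next)
            arrival = one_hot(s_next)
            # Hope(s, a) = r * s' + f * Hope(s', a')
            hope[(s, s_next)] = [reward * x + gamma * y for x, y in zip(arrival, future)]
    return hope


hope = fit_hope(TRANSITIONS)
print(dict(zip(STATES, hope[("D", "A")])))
```

In this sketch the entries of each Hope vector sum to the ordinary Q-value of the same state-action pair, so the scalar signal is still recoverable. The vector additionally records where the reward is expected to come from (here, all of D's hope sits on state C).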

Because the Hope function is multidimensional, the learning signal is much richer, and one should therefore maybe expect its internal activations to be richer and more flexible in the face of new experience.

Here's another thing to notice: let's say for the policy, we use the Hope function as a target to feed into a decision transformer. We now have a natural parameterization for the policy, based on which Hope it pursues.

In particular, we could define another function, maybe called the Result function, which in addition to s and a takes a target distribution w as a parameter, subject to the Bellman equation:

Result(s, a, w) = rs' + f Result(s', a', (w-rs')/f)

Where a' is the action recommended by the decision transformer when asked to achieve (w-rs')/f from state s'.

This Result function ought to be invariant under many changes in policy, which should make it more stable to learn, boosting capabilities. Furthermore it seems like a win for interpretability and alignment as it gives greater feedback on how the AI intends to earn rewards, and better ability to control those rewards.

An obvious challenge with this proposal is that states are really latent variables and also too complex to learn distributions over. While this is true, that seems like an orthogonal problem to solve.

Also this mindset seems to pave the way for other approaches, e.g. you could maybe have a Halfway function that factors an ambitious hope into smaller ones or something. Though it's a bit tricky because one needs to distinguish correlation and causation.

Replies from: vanessa-kosoy, cfoster0
comment by Vanessa Kosoy (vanessa-kosoy) · 2024-03-29T17:30:32.757Z · LW(p) · GW(p)

Downvoted because conditional on this being true, it is harmful to publish. Don't take it personally, but this is content I don't want to see on LW.

Replies from: tailcalled
comment by tailcalled · 2024-03-29T17:41:16.026Z · LW(p) · GW(p)

Why harmful?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2024-03-29T17:47:46.764Z · LW(p) · GW(p)

Because it's capability research. It shortens the TAI timeline with little compensating benefit.

Replies from: tailcalled
comment by tailcalled · 2024-03-29T18:03:20.807Z · LW(p) · GW(p)

It's capability research that is coupled to alignment:

Furthermore it seems like a win for interpretability and alignment as it gives greater feedback on how the AI intends to earn rewards, and better ability to control those rewards.

Coupling alignment to capabilities is basically what we need to survive, because the danger of capabilities comes from the fact that capabilities is self-funding, thereby risking outracing alignment. If alignment can absorb enough success from capabilities, we survive.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2024-03-29T18:15:51.506Z · LW(p) · GW(p)

I missed that paragraph on first reading, mea culpa. I think that your story about how it's a win for interpretability and alignment is very unconvincing, but I don't feel like hashing it out atm. Revised to weak downvote.

Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?

Replies from: tailcalled
comment by tailcalled · 2024-03-29T21:12:14.033Z · LW(p) · GW(p)

Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?

Surely your expectation that the current trajectory is mostly doomed depends on your expectation of the technical details of the extension of the current trajectory. If technical specifics emerge that show the current trajectory to be going in a more alignable direction, it may be fine to accelerate.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2024-03-30T06:46:41.919Z · LW(p) · GW(p)

Sure, if after updating on your discovery, it seems that the current trajectory is not doomed, it might imply accelerating is good. But, here it is very far from being the case.

comment by cfoster0 · 2024-03-29T15:59:51.149Z · LW(p) · GW(p)

You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.

More formally, let's say instead of the Q function, we consider what I would call the Hope function: which given a state-action pair (s, a), gives you a distribution over states it expects to visit, weighted by the rewards it will get. This can still be phrased using the Bellman equation:

Hope(s, a) = rs' + f Hope(s', a')

The "successor representation" is somewhat close to this. It encodes the distribution over future states a particular policy expects to visit from a particular starting state, and can be learned via the Bellman equation / TD learning.
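For readers who haven't seen it, a minimal tabular sketch of learning a successor representation with TD(0) might look like the following. The three-state chain, the learning rate, and the per-state rewards are assumptions chosen only to keep the example small.

```python
# Tabular successor representation (SR) learned with TD(0) under a fixed policy.
# The three-state chain, learning rate and rewards are illustrative assumptions.
GAMMA = 0.9
ALPHA = 0.5
STATES = ["A", "B", "C"]
NEXT_STATE = {"A": "B", "B": "C", "C": None}  # policy-induced chain; C is terminal


def one_hot(state):
    return [1.0 if s == state else 0.0 for s in STATES]


def learn_sr(sweeps=200):
    """M[s][j] estimates the expected discounted number of visits to state j from s."""
    m = {s: [0.0] * len(STATES) for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            nxt = NEXT_STATE[s]
            future = m[nxt] if nxt is not None else [0.0] * len(STATES)
            target = [i + GAMMA * f for i, f in zip(one_hot(s), future)]
            # TD update: move M(s) toward 1[s] + gamma * M(s')
            m[s] = [old + ALPHA * (t - old) for old, t in zip(m[s], target)]
    return m


sr = learn_sr()
state_rewards = [0.0, 0.0, 1.0]  # reward for occupying each state
value_of_a = sum(m_j * r_j for m_j, r_j in zip(sr["A"], state_rewards))
print(sr["A"], value_of_a)
```

The value of a state falls out as the dot product of its SR row with the reward vector, which is why the representation can be reused when rewards change. Unlike the reward-weighted proposal above, the SR itself is reward-agnostic and tied to one policy.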

Replies from: gwern, tailcalled
comment by gwern · 2024-03-29T21:24:46.709Z · LW(p) · GW(p)

Yes, my instant thought too was "this sounds like a variant on a successor function".

Of course, the real answer is that if you are worried about the slowness of bootstrapping back value estimates or short eligibility traces, this mostly just shows the fundamental problem with model-free RL and why you want to use models: models don't need any environmental transitions to solve the use case presented:

But what if it learns of a path E -> B? Or a shortcut A -> C? Or a path F -> G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step-by-step, it seems like this would be hard to learn well. Like the Bellman equation will still be approximately satisfied, for instance.

If the MBRL agent has learned a good reward-sensitive model of the environmental dynamics, then it will have already figured out E->B and so on, or could do so offline by planning; or if it had not because it is still learning the environment model, it would have a prior probability over the possibility that E->B gives a huge amount of reward, and it can calculate a VoI and target E->B in the next episode for exploration, and on observing the huge reward, update the model, replan, and so immediately begin taking E->B actions within that episode and all future episodes, and benefiting from generalization because it can also update the model everywhere for all E->B-like paths and all similar paths (which might now suddenly have much higher VoI and be worth targeting for further exploration) rather than simply those specific states' value-estimates, and so on.

(And this is one of the justifications for successor representations: it pulls model-free agents a bit towards model-based-like behavior.)

Replies from: tailcalled
comment by tailcalled · 2024-03-29T22:22:15.965Z · LW(p) · GW(p)

With MBRL, don't you end up with the same problem, but when planning in the model instead? E.g. DreamerV3 still learns a value function in their actor-critic reinforcement learning that occurs "in the model". This value function still needs to chain the estimates backwards.

Replies from: gwern
comment by gwern · 2024-03-29T23:34:42.829Z · LW(p) · GW(p)

It's the 'same problem', maybe, but it's a lot easier to solve when you have an explicit model! You have something you can plan over, don't need to interact with an environment out in the real world, and can do things like tree search or differentiating through the environmental dynamics model to do gradient ascent on the action-inputs to maximize the reward (while holding the model fixed). Same as training the neural network, once it's differentiable - backprop can 'chain the estimates backwards' so efficiently you barely even think about it anymore. (It just holds the input and output fixed while updating the model.) Or distilling a tree search into a NN - the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that's very fast and explicit and can be distilled down into a NN forward pass.
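As a toy illustration of "gradient ascent on the action-inputs while holding the model fixed", here is a sketch with a one-dimensional linear model. The dynamics, the goal-reaching reward, and the step size are all made up for the example, and the backward pass is written by hand where a real system would use an autodiff library.

```python
# Planning by gradient ascent on actions through a known, differentiable model.
# The linear dynamics, goal reward and step size are illustrative assumptions,
# and the backward pass is written by hand rather than with an autodiff library.
DECAY = 0.9      # model: s_next = DECAY * s + a
GOAL = 5.0       # reward: -(final_state - GOAL) ** 2
HORIZON = 4
STEP_SIZE = 0.1


def rollout(start, actions):
    """Unroll the model and return the final state."""
    state = start
    for action in actions:
        state = DECAY * state + action
    return state


def action_gradients(start, actions):
    """Backpropagate d(reward)/d(action_t) through the unrolled model."""
    final_state = rollout(start, actions)
    grad_state = -2.0 * (final_state - GOAL)   # d(reward)/d(final_state)
    grads = [0.0] * len(actions)
    for t in reversed(range(len(actions))):
        grads[t] = grad_state                   # d(s_next)/d(a_t) = 1
        grad_state *= DECAY                     # d(s_next)/d(s_t) = DECAY
    return grads


def plan(start, iterations=200):
    actions = [0.0] * HORIZON
    for _ in range(iterations):
        grads = action_gradients(start, actions)
        actions = [a + STEP_SIZE * g for a, g in zip(actions, grads)]
    return actions


planned_actions = plan(start=0.0)
final_state = rollout(0.0, planned_actions)
print(planned_actions, final_state)
```

No environment step is taken anywhere in the loop: the whole plan is found by repeatedly unrolling the model and pushing the reward gradient back into the actions.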

And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero-transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*

* Why yes, all of this does sound a lot like how you train a LLM today and what it is able to do, how curious

Replies from: tailcalled
comment by tailcalled · 2024-03-30T13:58:29.026Z · LW(p) · GW(p)

Same as training the neural network, once it's differentiable - backprop can 'chain the estimates backwards' so efficiently you barely even think about it anymore.

I don't think this is true in general. Unrolling an episode for longer steps takes more resources, and the later steps in the episode become more chaotic. DreamerV3 only unrolls for 16 steps.

Or distilling a tree search into a NN - the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that's very fast and explicit and can be distilled down into a NN forward pass.

But when you distill a tree search, you basically learn value estimates, i.e. something similar to a Q function (realistically, V function). Thus, here you also have an opportunity to bubble up some additional information.

And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero-transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*

I'm not doubting the relevance of MBRL, I expect that to take off too. What I'm doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.

Replies from: gwern
comment by gwern · 2024-03-30T16:28:30.928Z · LW(p) · GW(p)

I don't think this is true in general. Unrolling an episode for longer steps takes more resources, and the later steps in the episode become more chaotic.

Those are two different things. The unrolling of the episode is still very cheap. It's a lot cheaper to unroll a DreamerV3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way... (Given how small a Dreamer is, it may even be computationally cheaper to do some gradient ascent on it than it is to run whatever simulated environment you might be using! Especially given simulated environments will increasingly be large generative models, which incorporate lots of reward-irrelevant stuff.) The usefulness of the planning is a different thing, and might also be true for other planning methods in that environment too - if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.

But when you distill a tree search, you basically learn value estimates

This is again doing the same thing as 'the same problem'; yes, you are learning value estimates, but you are doing so better than alternatives, and better is better. The AlphaGo network loses to the AlphaZero network, and the latter, in addition to just being quantitatively much better, also seems to have qualitatively different behavior, like fixing the 'delusions' (cf. AlphaStar).

What I'm doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.

They won't be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don't find successor-function like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer in these sorts of approaches into DRL agents.

These things should be learned. For example, leaning into Decision Transformers and using a lot more conditionalizing through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn't solving your problems, you're just not using enough conditioning or generative modeling.) A prompt can describe agents and reward functions and the base agent executes that, and whatever is useful about successor-like representations just emerges automatically internally as the solution to the overall family of tasks in turning histories into actions.

Replies from: tailcalled
comment by tailcalled · 2024-03-30T18:43:24.264Z · LW(p) · GW(p)

The unrolling of the episode is still very cheap. It's a lot cheaper to unroll a DreamerV3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way...

But I'm not advocating against MBRL, so this isn't the relevant counterfactual. A pure MBRL-based approach would update the value function to match the rollouts, but e.g. DreamerV3 also uses the value function in a Bellman-like manner to e.g. impute the future reward at the end of an episode. This allows it to plan for further than the 16 steps it rolls out, but it would be computationally intractable to roll out for as far as this ends up planning.
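The "impute the future reward at the end of an episode" step can be sketched as a plain n-step return with a value bootstrap. This is a simplification: as far as I understand, DreamerV3 uses a λ-weighted mixture of such returns, and the numbers and the `critic_estimate` name below are purely illustrative.

```python
# n-step return with a value-function bootstrap at the end of a short rollout.
# The numbers and the `critic_estimate` stand-in are illustrative assumptions.

def bootstrapped_return(rewards, final_value, gamma):
    """Discounted sum of rollout rewards plus the discounted value of the last state."""
    total = final_value
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


imagined_rewards = [1.0, 1.0]   # rewards predicted along a 2-step imagined rollout
critic_estimate = 10.0          # learned value of the state where the rollout stops
print(bootstrapped_return(imagined_rewards, critic_estimate, gamma=0.5))
```

Here the two imagined rewards contribute 1.5 and the bootstrap term contributes 2.5, so most of the target comes from the value function, not from the rollout. That is the sense in which the effective planning horizon extends past the unrolled steps, and also why the backwards chaining of value estimates still matters.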

if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.

It's possible for there to be a kind of chaos where the analytic gradients blow up yet discrete differences have predictable effects. Bifurcations etc.

They won't be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don't find successor-function like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer in these sorts of approaches into DRL agents.

These things should be learned. For example, leaning into Decision Transformers and using a lot more conditionalizing through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn't solving your problems, you're just not using enough conditioning or generative modeling.) A prompt can describe agents and reward functions and the base agent executes that, and whatever is useful about successor-like representations just emerges automatically internally as the solution to the overall family of tasks in turning histories into actions.

I agree with things needing to be learned; using the actual states themselves was more of a toy model (because we have mathematical models for MDPs but we don't have mathematical models for "capabilities researchers will find something that can be Learned"), and I'd expect something else to happen. If I was to run off to implement this now, I'd be using learned embeddings of states, rather than states themselves. Though of course even learned embeddings have their problems.

The trouble with just saying "let's use decision transformers" is twofold. First, we still need to actually define the feedback system. One option is to just define reward as the feedback, but as you mention, that's not nuanced enough. You could use some system that's trained to mimic human labels as the ground truth, but this kind of system has flaws for standard alignment reasons.

It seems to me that capabilities researchers are eventually going to find some clever feedback system to use. It will to a great extent be learned, but they're going to need to figure out the learning method too.

comment by tailcalled · 2024-03-29T16:17:53.672Z · LW(p) · GW(p)

Thanks for the link! It does look somewhat relevant.

But I think the weighting by reward (or other significant variables) is pretty important, since it generates a goal to pursue, making it emphasize things that can be achieved rather than just things that might randomly happen.

Though this makes me think about whether there are natural variables in the state space that could be weighted by, without using reward per se. E.g. the size of (s' - s) in some natural embedding, or the variance in s' over all the possible actions that could be taken. Hmm. 🤔