Comments
Thanks. I really like this task!
It's hard for me to interpret these results without some indication of how good these networks actually are at the task, though. E.g. it is possible that even though a network solved a length=N task once out of however many attempts you made, it just got lucky, or is running some other heuristic that happened to work that one time. I understand why you were interested in how things scale with the length of the problem, given your interest in recurrence and processing depth. But would it be hard to make a plot where the x-axis is length of problem and the y-axis is accuracy or loss?
What is the y-axis in your plots? Where would 100% accuracy be?
Thanks for writing this! It's not easy to keep up with progress, and posts like this make it easier.
One thing I am confused about: especially in cases of developer sandbagging, my intuition is that the mechanisms underlying the underperformance could be very similar to cases of "accidental" sandbagging (ie not sandbagging according to your def). More operationally, your example 1 and example 4 might have the same underlying issue from the perspective of the model itself, and if we want to find technical solutions to those particular examples they might look the same. If that's the case then it's not obvious to me that the "strategic" condition is a useful place to "cut nature at its joints."
Or to say it a different way, what operationally defines the difference between example 1 and 4 is that in ex. 1 there is fine-tuning on a different dataset, and in ex. 4 the extra dataset is part of the pre-training dataset. The model itself doesn't see the intent of the developer directly, so as far as technical solutions that only depend on the model itself, it's not obvious that the intent of the developer matters.
A developer could intentionally inject noisy and error-prone data into training, but the model would treat that equivalently to the case where it was in the dataset by mistake.
Did the original paper do any shuffle controls? Given your results I suspect such controls would have failed. For some reason this is not standard practice in AI research, despite it being extremely standard in other disciplines.
https://pypi.org/project/fancy-einsum/ there's also this.
Thanks this was clarifying. I am wondering if you agree with the following (focusing on the predictive processing parts since that's my background):
There are important insights and claims from religious sources that seem to capture psychological and social truths that aren't yet fully captured by science. At least some of these phenomena might be formalizable via a better understanding of how the brain and the mind work, and to that end predictive processing (and other theories of that sort) could be useful to explain the phenomena in question.
You spoke of wanting formalization but I wonder if the main thing is really the creation of a science, though of course math is a very useful tool to do science with and to create a more complete understanding. At the end of the day we want our formalizations to comport to reality - whatever aspects of reality we are interested in understanding.
which is being able to ground the apparently contradictory metaphysical claims across religions into a single mathematical framework.
Is there a minimal operationalized version of this? Something that is the smallest formal or empirical result one could have that would count to you as small progress towards this goal?
Thanks for writing this up! Having not read the paper, I am wondering if in your opinion there's a potential connection between this type of work and comp mech type of analysis/point of view? Even if it doesn't fit in a concrete way right now, maybe there's room to extend/modify things to combine things in a fruitful way? Any thoughts?
I very strongly agree with the spirit of this post. Though personally I am a bit more hesitant about what exactly it is that I want in terms of understanding how it is that GPT-4 can talk. In particular I can imagine that my understanding of how GPT-4 could talk might be satisfied by understanding the principles by which it talks, but without necessarily being able to write a talking machine from scratch. Maybe what I'd be after in terms of what I can build is a talking machine of a certain toyish flavor - a machine that can talk in a synthetic/toy language. The full complexity of its current ability seems to have too much structure to be constructed from first principles. Though of course one doesn't know until our understanding is more complete.
I'm wondering if you have any other pointers to lessons/methods you think are valuable from neuroscience?
This makes a lot of sense to me, and makes me want to figure out exactly how to operationalize and rigorously quantify depth of search in LLMs! Quick thought is that it should have something to do with the spectrum of the transition matrix associated with the mixed state presentation (MSP) of the data generating process, as in Transformers Represent Belief State Geometry in their Residual Stream. The MSP describes synchronization to the hidden states of the data generating process, and that feels like a search process that has a max depth of the Markov order of the data generating process.
I really like the idea that memorization and this more lofty type of search are on a spectrum, and that placement on this spectrum has implications for capabilities like generalization. If we can figure out how to understand these things more formally/rigorously that would be great!
I can report my own feelings with regards to this. I find cities (at least the American cities I have experience with) to be spiritually fatiguing. The constant sounds, the lack of anything natural, the smells - they all contribute to a lack of mental openness and quiet inside of myself.
The older I get the more I feel this.
Jefferson had a quote that might be related, though to be honest I'm not exactly sure what he was getting at:
I think our governments will remain virtuous for many centuries; as long as they are chiefly agricultural; and this will be as long as there shall be vacant lands in any part of America. When they get piled upon one another in large cities, as in Europe, they will become corrupt as in Europe. Above all things I hope the education of the common people will be attended to; convinced that on their good sense we may rely with the most security for the preservation of a due degree of liberty.
One interpretation of this is that Jefferson thought there was something spiritually corrupting about cities. This is supported by another quote:
I view great cities as pestilential to the morals, the health and the liberties of man. true, they nourish some of the elegant arts; but the useful ones can thrive elsewhere, and less perfection in the others with more health virtue & freedom would be my choice.
although like you mention, there does seem to be some plausible connection to disease.
I've also noticed this phenomenon. I wonder if a solution would be to have an initial period where votes are considered more democratically, and then after that period the influence of high-karma users is applied (including back-applying the influence of votes that occurred during the initial period). I can also imagine downsides to this.
We've decided to keep the hackathon as scheduled. Hopefully there will be other opportunities in the future for those that can't make it this time!
Thanks! In my experience Computational Mechanics has many of those types of technical insights. My background is in neuroscience and in that context it really helped me think about computation in brains, and design experiments. Now I'm excited to use Comp Mech in a more concrete and deeper way to understand how artificial neural network internal structures relate to their behavior. Hopefully this is just the start!
Also a good point. Thanks
No, thanks for pointing this out
Lengthening from what to what?
This is a great question, and one of the things I'm most excited about using this framework to study in the future! I have a few ideas but nothing to report yet.
But I will say that I think we should be able to formalize exactly what it would mean for a transformer to create/discover new knowledge, and also to apply the structure from one dataset and apply it to another, or to mix two abstract structures together, etc. I want to have an entire theory of cognitive abilities and the geometric internal structures that support them.
If I'm understanding your question correctly, then the answer is yes, though in practice it might be difficult (I'm actually unsure how computationally intensive it would be, haven't tried anything along these lines yet). This is definitely something to look into in the future!
It's surprising for a few reasons:
- The structure of the points in the simplex is NOT
- The next token prediction probabilities (ie. the thing we explicitly train the transformer to do)
- The structure of the data generating model (ie. the thing the good regulator theorem talks about, if I understand the good regulator theorem, which I might not)
The first would be not surprising because it's literally what our loss function asks for, and the second might not be that surprising since this is the intuitive thing people often think about when we say "model of the world." But the MSP structure is neither of those things. It's the structure of inference over the model of the world, which is quite a different beast than the model of the world.
Others might not find it as surprising as I did - everyone is working off their own intuitions.
edit: also I agree with what Kave said about the linear representation.
A neglected problem in AI safety technical research is teasing apart the mechanisms of dangerous capabilities exhibited by current LLMs. In particular, I am thinking that for any model organism (see Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research) of dangerous capabilities (e.g. sleeper agents paper), we don't know how much of the phenomenon depends on the particular semantics of terms like "goal" and "deception" and "lie" (insofar as they are used in the scratchpad or in prompts or in finetuning data) or if the same phenomenon could be had by subbing in more or less any word. One approach to this is to make small toy models of these types of phenomena where we can more easily control data distributions and yet still get analogous behavior. In this way we can really control for any particular aspect of the data and figure out, scientifically, the nature of these dangers. By small toy model I'm thinking of highly artificial datasets (perhaps made of binary digits with specific correlation structure, or whatever the minimum needed to get the phenomenon at hand).
This all looks correct to me! Thanks for this.
Thanks John and David for this post! This post has really helped people to understand the full story. I'm especially interested in thinking more about plans for how this type of work can be helpful for AI safety. I do think the one you presented here is a great one, but I hope there are other potential pathways. I have some ideas, which I'll present in a post soon, but my views on this are still evolving.
Thanks! I'll have more thorough results to share about layer-wise representations of the MSP soon. I've already run some of the analysis concatenating over all layers' residual streams with the RRXOR process and it is quite interesting. It seems there's a lot more to explore with the relationship between number of states in the generative model, number of layers in the transformer, residual stream dimension, and token vocab size. All of these (I think) play some role in how the MSP is represented in the transformer. For RRXOR it is the case that things look crisper when concatenating.
Even for cases where redundant info is discarded, we should be able to see the distinctions somewhere in the transformer. One thing I'm keen on really exploring is such a case, where we can very concretely follow the path/circuit through which redundant info is first distinguished and then is collapsed.
That is a fair summary.
Thanks!
- One way to construct an HMM is by finding all past histories of tokens that condition the future tokens with the same probability distribution, and making that equivalence class a hidden state in your HMM. Then the conditional distributions determine the arrows coming out of your state and which state you go to next. This is called the "epsilon machine" in Comp Mech, and it is unique. It is one presentation of the data generating process, but in general there are an infinite number of HMM presentations that would generate the same data. The epsilon machine is a particular type of HMM presentation - it is the smallest one where the hidden states are the minimal sufficient statistics for predicting the future based on the past. The epsilon machine is one of the most fundamental things in Comp Mech but I didn't talk about it in this post. In the future we plan to make a more generic Comp Mech primer that will go through these and other concepts.
- The interpretability of these simplexes is an issue that's in my mind a lot these days. The short answer is I'm still wrestling with it. We have a rough experimental plan to go about studying this issue but for now, here are some related questions I have in my mind:
- What is the relationship between the belief states in the simplex and what mech interp people call "features"?
- What are the information theoretic aspects of natural language (or coding databases or some other interesting training data) that we can instantiate in toy models, so that we can then use our understanding of these toy systems to test if similar findings apply to real systems?
For something like situational awareness, I have the beginnings of a story in my head but it's too handwavy to share right now. For something slightly more mundane like out-of-distribution generalization or transfer learning or abstraction, the idea would be to use our ability to formalize data-generating structure as HMMs, and then do theory and experiments on what it would mean for a transformer to understand that e.g. two HMMs have similar hidden/abstract structure but different vocabs.
Hopefully we'll have a lot more to say about this kind of thing soon!
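To make the epsilon machine construction in the first bullet above a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: it uses the Golden Mean process (a standard toy process, not one from the post), a fixed history length, and it groups histories by their next-token distribution only. In general the equivalence is over the whole distribution of futures, so this shortcut is only adequate for simple processes like this one.

```python
from collections import defaultdict
from itertools import product

import numpy as np

# Golden Mean process (illustrative choice, not from the post):
# state A emits 0 or 1 with probability 1/2; emitting 1 moves to state B.
# State B always emits 0 and returns to A, so "11" never occurs.
# T[x][i, j] = P(emit x and move to state j | state i), states ordered (A, B).
T = np.array([
    [[0.5, 0.0],
     [1.0, 0.0]],
    [[0.0, 0.5],
     [0.0, 0.0]],
])
stationary = np.array([2 / 3, 1 / 3])  # steady-state distribution over (A, B)

def next_token_distribution(history):
    """P(next token | history), or None if the history has zero probability."""
    weights = stationary
    for token in history:
        weights = weights @ T[token]
    if weights.sum() == 0:
        return None
    belief = weights / weights.sum()
    return tuple(round(float((belief @ T[x]).sum()), 6) for x in range(2))

def causal_states(history_length):
    """Group histories that induce the same next-token distribution."""
    classes = defaultdict(list)
    for history in product(range(2), repeat=history_length):
        dist = next_token_distribution(history)
        if dist is not None:
            classes[dist].append(history)
    return dict(classes)

states = causal_states(3)
for dist, histories in states.items():
    print(dist, histories)
```

For this process the histories fall into two classes (those ending in 0 and those ending in 1), which correspond to the two hidden states A and B.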
Oh wait one thing that looks not quite right is the initial distribution. Instead of starting randomly we begin with the optimal initial distribution, which is the steady-state distribution. Can be computed by finding the eigenvector of the transition matrix that has an eigenvalue of 1. Maybe in practice that doesn't matter that much for mess3, but in general it could.
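For reference, a minimal sketch of that eigenvector computation. The 3x3 matrix below is a made-up example, not the mess3 parameters, and the code assumes the row-stochastic convention (rows sum to 1), in which case the steady state is a left eigenvector of the transition matrix.

```python
import numpy as np

def stationary_distribution(T):
    """Return pi with pi @ T = pi, for a row-stochastic transition matrix T."""
    # Left eigenvectors of T are right eigenvectors of T.T.
    eigvals, eigvecs = np.linalg.eig(T.T)
    # Pick the eigenvector whose eigenvalue is closest to 1.
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, idx])
    # Normalize so the entries sum to 1 (this also fixes the overall sign).
    return pi / pi.sum()

# Hypothetical 3-state transition matrix, chosen only for illustration.
T = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
])
pi = stationary_distribution(T)
print(pi)
```

If your matrices use the column-stochastic convention instead, drop the transpose.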
I should have explained this better in my post.
For every input into the transformer (of every length up to the context window length), we know the ground truth belief state that comp mech says an observer should have over the HMM states. In this case, this is 3 numbers. So for each input we have a 3d ground truth vector. Also, for each input we have the residual stream activation (in this case a 64D vector). To find the projection we just use standard Linear Regression (as implemented in sklearn) between the 64D residual stream vectors and the 3D (really 2D) ground truth vectors. Does that make sense?
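Here is a rough sketch of that regression step on synthetic data. The arrays are stand-ins I made up (random "belief states" linearly embedded in a 64D space plus noise), and it uses numpy's least-squares solver with an intercept column instead of sklearn's LinearRegression, which should give an equivalent affine fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, d_resid, n_states = 2000, 64, 3

# Synthetic stand-ins: in the real analysis `beliefs` would be the comp mech
# ground truth and `resid` the transformer's residual stream activations.
beliefs = rng.dirichlet(np.ones(n_states), size=n_inputs)   # (n, 3), rows sum to 1
embed = rng.normal(size=(n_states, d_resid))                 # hypothetical linear embedding
resid = beliefs @ embed + 0.01 * rng.normal(size=(n_inputs, d_resid))  # (n, 64)

# Affine least-squares fit: beliefs ~= resid @ W + b.
X = np.hstack([resid, np.ones((n_inputs, 1))])
coef, *_ = np.linalg.lstsq(X, beliefs, rcond=None)
W, b = coef[:-1], coef[-1]

predicted = resid @ W + b
mse = np.mean((predicted - beliefs) ** 2)
print(W.shape, b.shape, mse)
```

Because the three belief components sum to 1, the targets live on a 2D plane, which is why the ground truth is "really 2D" and can be plotted directly in the simplex.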
Everything looks right to me! This is the annoying problem that people forget to write the actual parameters they used in their work (sorry).
Try x=0.05, alpha=0.85. I've edited the footnote with this info as well.
That sounds interesting. Do you have a link to the apperception paper?
That's an interesting framing. From my perspective that is still just local next-token accuracy (cross-entropy more precisely), but averaged over all subsets of the data up to the context length. That is distinct from e.g. an objective function that explicitly mentioned not just next-token prediction, but multiple future tokens in what was needed to minimize loss. Does that distinction make sense?
One conceptual point I'd like to get across is that even though the equation for the predictive cross-entropy loss only has the next token at a given context window position in it, the states internal to the transformer have the information for predictions into the infinite future.
This is a slightly different issue than how one averages over training data, I think.
Thanks! I appreciate the critique. From this comment and from John's it seems correct and I'll keep it in mind for the future.
On the question, by optimize the representation do you mean causally intervene on the residual stream during inference (e.g. a patching experiment)? Or do you mean something else that involves backprop? If the first, then we haven't tried, but definitely want to! It could be something someone does at the Hackathon, if interested ;)
Cool question. This is one of the things we'd like to explore more going forward. We are pretty sure this is pretty nuanced and has to do with the relationship between the (minimal) state of the generative model, the token vocab size, and the residual stream dimensionality.
On your last question, I believe so but one would have to do the experiment! It totally should be done. Check out the Hackathon if you are interested ;)
this looks highly relevant! thanks!
Good catch! That should be eta_00, thanks! I'll change it tomorrow.
Cool idea! I don't know enough about GANs and their loss so I don't have a prediction to report right now. If it is the case that GAN loss should really give generative and not predictive structure, this would be a super cool experiment.
The structure of generation for this particular process has just 3 points equidistant from each other, no fractal. But in general the shape of generation is a pretty nuanced issue because it's nontrivial to know for sure that you have the minimal structure of generation. There's a lot more to say about this but @Paul Riechers knows these nuances more than I do so I will leave it to him!
Responding in reverse order:
If there's literally a linear projection of the residual stream into two dimensions which directly produces that fractal, with no further processing/transformation in between "linear projection" and "fractal", then I would change my mind about the fractal structure being mostly an artifact of the visualization method.
There is literally a linear projection (well, we allow a constant offset actually, so affine) of the residual stream into two dimensions which directly produces that fractal. There are no distributions in the middle or anything. I suspect the offset is not necessary but I haven't checked ::adding to to-do list::
edit: the offset isn't necessary. There is literally a linear projection of the residual stream into 2D which directly produces the fractal.
But the "fractal-ness" is mostly an artifact of the MSP as a representation-method IIUC; the stochastic process itself is not especially "naturally fractal".
(As I said I don't know the details of the MSP very well; my intuition here is instead coming from some background knowledge of where fractals which look like those often come from, specifically chaos games.)
I'm not sure I'm following, but the MSP is naturally fractal (in this case), at least in my mind. The MSP is a stochastic process, but it's a very particular one - it's the stochastic process of how an optimal observer's beliefs (about which state an HMM is in) change upon seeing emissions from that HMM. The set of optimal beliefs themselves are fractal in nature (for this particular case).
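To spell out what "how an optimal observer's beliefs change upon seeing emissions" means operationally, here is a minimal sketch of the belief update. The HMM below is a made-up 3-state, 2-token example (not the process from the post), and the uniform starting belief is only for illustration; the post's setup starts from the steady-state distribution.

```python
import numpy as np

# Hypothetical 2-token, 3-state HMM, chosen only for illustration.
# T[x][i, j] = P(emit token x and move to state j | current state i).
T = np.array([
    [[0.5, 0.1, 0.0],
     [0.0, 0.3, 0.2],
     [0.2, 0.0, 0.1]],
    [[0.0, 0.2, 0.2],
     [0.4, 0.0, 0.1],
     [0.1, 0.3, 0.3]],
])

def update_belief(belief, token):
    """One step of Bayesian filtering: condition the belief on the emitted token."""
    unnormalized = belief @ T[token]
    return unnormalized / unnormalized.sum()

def belief_trajectory(initial_belief, tokens):
    """Beliefs over hidden states after each prefix of the token sequence."""
    beliefs = [initial_belief]
    for token in tokens:
        beliefs.append(update_belief(beliefs[-1], token))
    return np.array(beliefs)

initial = np.ones(3) / 3  # uniform prior, an assumption for this sketch
trajectory = belief_trajectory(initial, [0, 1, 1, 0, 0])
print(trajectory)
```

Each row of `trajectory` is a point in the probability simplex. Collecting these points over all possible token sequences gives the set of optimal beliefs.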
Chaos games look very cool, thanks for that pointer!
Can you elaborate on how the fractal is an artifact of how the data is visualized?
From my perspective, the fractal is there because we chose this data generating structure precisely because it has this fractal pattern as its Mixed State Presentation (ie. we chose it because then the ground truth would be a fractal, which felt like highly nontrivial structure to us, and thus a good falsifiable test that this framework is at all relevant for transformers. Also, yes, it is pretty :) ). The fractal is a natural consequence of that choice of data generating structure - it is what Computational Mechanics says is the geometric structure of synchronization for the HMM. That there is a linear 2D plane in the residual stream that when you project onto it you get that same fractal seems highly non-artifactual, and is what we were testing.
Though it should be said that an HMM with a fractal MSP is a quite generic choice. It's remarkably easy to get such fractal structures. If you randomly choose an HMM from the space of HMMs for a given number of states and vocab size, you will often get synchronization structures with infinite transient states and fractals.
This isn't a proof of that previous claim, but here are some examples of fractal MSPs from https://arxiv.org/abs/2102.10487:
I find this focus on task structure and task decomposition to be incredibly important when thinking about what neural networks are doing, what they could be doing in the future, and how they are doing it. The manner in which a system understands/represents/instantiates task structures and puts them in relation to one another is, as far as I can tell, just a more concrete way of asking "what is it that this neural network knows? what cognitive abilities does it have? what abstractions is it making? under what out of distribution inputs will it succeed/fail, etc."
This comment isn't saying anything that wasn't in the post, just wanted to express happiness and solidarity with this framing!
I do wonder if the tree-structure of which-task and then task algorithm is what we should expect, in general. I have nothing super concrete to say here, my feeling is just that the manners in which a neural network can represent structures and put them in relation to each other may be instantiated differently than a tree (with that specific ordering). The onus is probably on me here though - I should come up with a set of tasks in certain relations that aren't most naturally described with tree structures.
Another question that comes to mind is, is there a hard distinction between categorizing which sub-task one is in and the algorithm which carries out the computation for a specific subtask. Is it all just tasks all the way down?
I think you might need to change permissions on your github repository?
The blog post linked says it's from August. Is there something new I'm missing?
This is so cool! Thanks so much, I plan to go through it in full when I have some time. For now, I was wondering if the red circled matrix multiplication should actually be reversed, and the vector should be a column (ie. matrix*column, instead of row*matrix). I know the end result is equivalent but it seems in order to be consistent it should be switched, ie in every other example of a vector with a leg sticking out leftward it's a column vector? Maybe this really doesn't matter since I can just turn the page upside down and then b would be on the left with a leg sticking out to the right..., but the fact that A dot b = (b.T dot A.T).T is itself an interesting fact.
Just to add to Carl Feynman's response, which I thought was good.
Part of the reason these systems are inefficient is that they require you to (effectively) run gradient descent even at inference, even after training is over. Or you can run the RNN, which is mathematically equivalent but again you can see where the inefficiency comes in: the value at time t=3 is a function of the value at time t=2, which is a function of t=1 and so on, so in order to get the converged value of the activations you have to, in a for loop, compute each timestep one by one.
This is in contrast to a feedforward network like a (normal) convnet or transformer, which can run extremely quickly and in parallel on gpu.
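A toy sketch of the contrast, under heavy assumptions: this is a generic recurrent update with random weights, not the actual model being discussed, and the names `settle` and `feedforward` are made up for illustration. The point is only that the recurrent version needs a loop where each step waits on the previous one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 8
W = 0.1 * rng.normal(size=(n_units, n_units))  # small weights so the iteration contracts
x = rng.normal(size=n_units)                     # a fixed input, e.g. one still image

def settle(x, W, max_steps=1000, tol=1e-10):
    """Iterate h_t = tanh(W @ h_{t-1} + x) until the state stops changing."""
    h = np.zeros_like(x)
    for step in range(1, max_steps + 1):
        h_next = np.tanh(W @ h + x)  # needs h from the previous step
        if np.max(np.abs(h_next - h)) < tol:
            return h_next, step
        h = h_next
    return h, max_steps

def feedforward(x, W):
    """A single pass with no dependence on earlier timesteps."""
    return np.tanh(W @ x)

h_star, n_steps = settle(x, W)
print(n_steps)
```

The loop in `settle` cannot be parallelized across timesteps, whereas `feedforward` is one matrix multiply that a GPU can batch over many inputs at once.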
Thanks!
I think your thinking makes sense, and, if for instance on every timestep you presented a different image in a stereotypically defined sequence, or with a certain correlation structure, you would indeed get information about those correlations in the weights. However, this model was designed to be used in the restricted setting where you show a single still image for many timesteps until convergence. In that setting, weights give you image features for static images (in a hierarchical manner), and priors for low level features will feed back from activations in higher level areas.
There are extensions to this model that deal with video, where there are explicit spatiotemporal expectations built into the network. You can see one of those networks in this paper: https://arxiv.org/abs/2112.10048
But I've never implemented such a network myself.
First, brains (and biological systems more generally) have many constraints that artificial networks do not. Brains exist in the context of a physically instantiated body, with heavy energy constraints. Further, they exist in specific niches, with particular evolutionary histories, which has enormous effects on structure and function.
Second, biological brains have different types of intelligence from AI systems, at least currently. A bird is able to land fluidly on a thin branch in windy conditions, while gpt4 can help you code. In general, the intelligences that one thinks of in the context of AGI do not totally overlap with the varied, often physical and metabolic, intelligences of biology.
All that being said, who knows what future AI systems will look like.
Thanks so much for this comment (and sorry for taking ~1 year to respond!!). I really liked everything you said.
For 1 and 2, I agree with everything and don't have anything to add.
3. I agree that there is something about the input/output mapping that is meaningful but it is not everything. It would be great to have a full theory of exactly what the difference is: what distinguishes structure that counts as interesting internal computation (not a great descriptor of what I mean but I can't think of anything better right now) from input/output computation.
4. I also think a great goal would be in generalizing and formalizing what an "observer" of a computation is. I have a few ideas but they are pretty half-baked right now.
5. That is an interesting point. I think it's fair. I do want to be careful to make sure that any "disagreements" are substantial and not just semantic squabbling here. I like your distinction between representation work and computational work. The idea of using vs. performing a computation is also interesting. At the end of the day I am always left craving some formalism where you could really see the nature of these distinctions.
6. Sounds like a good idea!
7. Agreed on all counts.
8. I was trying to ask the question if there is anything that tells us that the output node is semantically meaningful without reference to e.g. the input images of cats, or even knowledge of the input data distribution. Interpretability work, both in artificial neural networks and more traditionally in neuroscience, always uses knowledge of input distributions or even input identity to correlate activity of neurons to the input, and in that way assign semantics to neural activity (e.g. recently, Othello board states, or in neuroscience Jennifer Aniston neurons or orientation-tuned neurons). But when I'm sitting down with my eyes closed and just thinking, there's no homunculus there that has access to input distributions on my retina that can correlate some activity pattern to "cat." So how can the neural states in my brain "represent" or embody or whatever word you want to use, the semantic information of cat, without this process of correlating to some ground truth data? Where does "cat" come from when there's no cat there in the activity?!
9. SO WILD
Can you explain what you mean by second or third order dynamics? That sounds interesting. Do you mean e.g. the order of the differential equation or something else?