adam-shai

Some reasons that come to mind very quickly:

- Patch clamp experiments usually take place in slices with artificial cerebrospinal fluid (ACSF). The ephys properties can vary widely based on the experimental prep (the angle at which the slice was taken, the temperature, the specific recipe used for the ACSF, the quality of the patcher, etc.).

- Even if patching worked really well and was robust and reliable, the ephys properties at the soma (where the vast majority of patching will take place) hardly describe the ephys of the entire dendritic tree, which is very complicated and space dependent, and incredibly nonlinear and variable.

Comment by Adam Shai (adam-shai) on Straightforward Steps to Marginally Improve Odds of Whole Brain Emulation · 2025-03-26T03:55:42.541Z · LW · GW

Under the assumption that capturing the ephys properties of single neurons is important for WBE, it still seems unlikely to me that scaling up patch clamping is a viable path to that. More likely to work would be trying to scale up voltage imaging.

(For the record, I don't personally agree with that assumption, for reasons that overlap with what Steven Byrnes thinks.)

Comment by Adam Shai (adam-shai) on 8 examples informing my pessimism on uploading without reverse engineering · 2025-03-26T01:30:58.006Z · LW · GW

Comment by Adam Shai (adam-shai) on Shortform · 2025-02-03T02:34:33.241Z · LW · GW

I think this really depends on what "good" means exactly. For instance, if humans think it's good but we overestimate how good our interp is, and the AI system knows this, then the AI system can take advantage of our "good" mech interp to scheme more deceptively.

I'm guessing your notion of good must explicitly mean that this scenario isn't possible. But this really raises the question - how could we know if our mech interp has reached that level of goodness?

Comment by Adam Shai (adam-shai) on Adam Shai's Shortform · 2025-01-07T19:35:29.088Z · LW · GW

Thanks, this is helpful. I'm still a bit unclear about how to use the word/concept "amortized inference" correctly. Is the first example you gave, of training an AI model on (query, well-thought guess), an example of amortized inference, relative to training on (query, a bunch of reasoning + well-thought out guess)?

Comment by Adam Shai (adam-shai) on Alexander Gietelink Oldenziel's Shortform · 2025-01-07T19:20:02.596Z · LW · GW

This sounds right to me, but importantly it also matters what you are trying to understand (and thus compress). For AI safety, the thing we should be interested in is not the weights directly, but the behavior of the neural network. The behavior (the input-output mapping) is realized through a series of activations. Activations are realized through applying weights to inputs in particular ways. Weights are realized by setting up an optimization problem with a network architecture and training data. One could try compressing at any one of those levels, and of course they are all related, and in some sense if you know the earlier layer of abstraction you know the later one. But in another sense, they are fundamentally different, in exactly how quickly you can retrieve the specific piece of information, in this case the one we are interested in - which is the behavior. If I give you the training data, the network architecture, and the optimization algorithm, it still takes a lot of work to retrieve the behavior.

Thus, the story you gave about how accessibility matters also explains layers of abstraction, and how they relate to understanding.

Another example of this is a dynamical system. The differential equation governing it is quite compact: $\dot{x}=f(x)$. But the set of possible trajectories can be quite complicated to describe, and to get them one has to essentially do all the annoying work of integrating the equation! Note that this has implications for compositionality of the systems: while one can compose two differential equations by e.g. adding in some cross term, the behaviors (read: trajectories) of the composite system do not compose, and so one is forced to integrate a new system from scratch!

Now, if we want to understand the behavior of the dynamical system, what should we be trying to compress? How would our understanding look different if we compress the governing equations vs. the trajectories?
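A minimal sketch of the compositionality point, under stated assumptions: the two one-dimensional systems, the cross term `c`, and the forward-Euler integrator below are all hypothetical choices made for illustration. The equations compose by adding a term, but the trajectory of `x` in the coupled system is not recoverable from the standalone trajectories and has to be integrated again.

```python
import math

def euler(f, state, dt, steps):
    """Integrate d(state)/dt = f(state) with forward Euler; return the trajectory."""
    traj = [tuple(state)]
    for _ in range(steps):
        deriv = f(state)
        state = [s + dt * d for s, d in zip(state, deriv)]
        traj.append(tuple(state))
    return traj

# Two hypothetical one-dimensional systems, each simple on its own.
def f_x(s):
    return [-s[0]]            # dx/dt = -x

def f_y(s):
    return [-2.0 * s[0]]      # dy/dt = -2y

# Composing the *equations* is easy: add a cross term c to each.
def f_coupled(s, c=0.5):
    x, y = s
    return [-x + c * y, -2.0 * y + c * x]

dt, steps = 0.01, 200         # integrate up to t = 2
x_alone = euler(f_x, [1.0], dt, steps)[-1][0]
y_alone = euler(f_y, [1.0], dt, steps)[-1][0]
x_coupled, y_coupled = euler(f_coupled, [1.0, 1.0], dt, steps)[-1]

print("x alone:", x_alone, "x coupled:", x_coupled)
print("y alone:", y_alone, "y coupled:", y_coupled)
```

The governing equations differ by one short term, while the trajectories differ at every time step, which is the gap between compressing the equations and compressing the behavior.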

Comment by Adam Shai (adam-shai) on Dmitry Vaintrob's Shortform · 2025-01-06T00:25:32.120Z · LW · GW

Ari's work is on Arxiv here

Comment by Adam Shai (adam-shai) on Adam Shai's Shortform · 2025-01-05T20:44:06.763Z · LW · GW

Yes, I'm thinking of that line of work. I actually think the first few paragraphs of this paper do a better job of getting the vibes I want (and I should emphasize these are vibes that I have, not any kind of formal understanding). So here's my try at a cached explanation of the concept of amortized inference I'm trying to evoke:

A lot of problems are really hard, and the algorithmic/reasoning path from the question to the answer is many steps long. But it seems that in some cases humans are much faster than that (perhaps by admitting some error, but even so, they are both fast and quite good at the task). The idea is that in these settings a human brain is performing amortized inference - because it has seen similar examples of the input/output relation of the task before, it can use that direct mapping as a kind of bootstrap for the new task at hand, saving a lot of inference time.

Now that I've typed that out, it feels maybe similar to your stuff about heuristics?

Big caveat here: it's quite possible I'm misunderstanding amortized inference (maybe @jessicata can help here?), as well as reaching with the connection to your work.

Comment by Adam Shai (adam-shai) on Adam Shai's Shortform · 2025-01-05T16:53:17.362Z · LW · GW

I've been trying to get my head around how to theoretically think about scaling test time compute, CoT, reasoning, etc. One frame that keeps on popping into my head is that these methods are a type of un-amortization.

In a more standard inference amortization setup one would e.g. train directly on question/answer pairs without the explicit reasoning path between the question and answer. In that way we pay an up-front cost during training to learn a "shortcut" between question and answers, and then we can use that pre-paid shortcut during inference. And we call that amortized inference.

In the current techniques for using test time compute we do the opposite - we pay costs during inference in order to explicitly capture the path between question and answer.

Uncertainties and things I would like to see:

- I'm far from an expert in amortization and don't know if this is a reasonable use of the concept.
- Can we use this framing to make a toy model of using test time compute? I'd really like for the theoretically minded style of interp I do to keep up with current techniques.
- If we had a toy model I could see getting theoretical clarity on the following:
  - What's the relation between explicit reasoning vs. internal reasoning?
  - What does it mean to have CoT be "faithful" to the internals?
  - What features and geometric structures underlie reasoning?
  - Why is explicit reasoning such a strong mechanism for out of distribution generalization?

Comment by Adam Shai (adam-shai) on My January alignment theory Nanowrimo · 2025-01-02T03:04:33.239Z · LW · GW

Excited to read what you share!

Comment by Adam Shai (adam-shai) on Adam Shai's Shortform · 2025-01-01T05:09:47.845Z · LW · GW

Some personal reflections on the last year, and some thoughts for next:

1 year ago I quit my career as an academic experimental neuroscientist and began doing AI technical safety research full time. This was emotionally difficult! For more than a decade I had been committed to becoming a neuroscience professor, and had spent a lot of my 20s and 30s pursuing that end. So the move, which had its natural uncertainties (can I succeed in a totally different field? will I be able to support my family financially?) was made more difficult by an ingrained identity as a neuroscientist. In retrospect I wish I had made the move earlier (as Neel Nanda had suggested to me!), oh well, live and learn!
I was extremely lucky to have the support of PIBBSS as I transitioned (special thanks to Nora and Lucas). The main work that came out of my time there is a dream come true. I had read about computational mechanics ~1 decade ago after reading a Melanie Mitchell book, and had tried multiple times to apply it to neuroscience data. I completely failed each time, but would come back to it every now and then. Meeting Paul Riechers was game changing - both his deep knowledge and, even more importantly, his supportive and positive attitude have been a true blessing.
I also want to mention Alexander Oldenziel, who has been and continues to be supportive, and is an inspirational model of tenaciousness and agency. He was the first person in the AI safety community who heard me rant about comp mech, and who suggested that comp mech might be able to do some work there.
Paul and I started Simplex this year! It kind of feels like starting an academic lab, except not in academia, and with someone else. Definitely an exciting journey! One thing that feels different than I imagine staying in academia would feel is the sense of purpose - I really do believe our point of view and work will be important for AI safety.
Speaking just for myself, I underestimated how difficult it would be to raise money, and how much time it would take. Getting better at this skill is going to be a focus of the next year.
I watched my daughter grow from 1 to 2 years old. Everything about this fact is hard to put into words. I won't try.
While people have definitely shown interest in our initial work at Simplex, I think for the most part people are unaware of the larger picture of how we think about comp mech and its relation to AI safety. This is mostly because we really haven't spoken about it in public very much! That will change in the coming year. Comp mech is much deeper and broader than the belief state geometry story presented.
For the most part though, we've chosen to take a show rather than tell approach. We want the quality of our work to be very high, we want to overdeliver. If someone doesn't understand our point of view we would rather show them its utility by example rather than by argument or philosophy. I'm happy with that, though it has probably meant a slower public facing start. We have a lot more public facing things in store for 2025.
I can't seem to update my beliefs appropriately when new AI capabilities come out. I am shocked. Every. Single. Time. This still feels like magic to me. Scary magic. Beautiful magic. Weird magic. Where are we going?

Happy New Year everyone!

Comment by Adam Shai (adam-shai) on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T21:57:53.155Z · LW · GW

I suppose it depends on what one wants to do with their "understanding" of the system? Here's one AI safety case I worry about: if we (humans) don’t understand the lower-level ontology that gives rise to the phenomenon that we are more directly interested in (in this case I think that's something like an AI system's behavior/internal “mental” states - your "structurally what", if I'm understanding correctly, which to be honest I'm not very confident I am), then a sufficiently intelligent AI system that does understand that relationship will be able to exploit the extra degrees of freedom in the lower level ontology to our disadvantage, and we won’t be able to see it coming.

I very much agree that structurally what matters a lot, but that seems like half the battle to me.

Comment by Adam Shai (adam-shai) on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T19:36:52.192Z · LW · GW

I think I disagree, or need some clarification. As an example, the phenomenon in question is that the physical features of children look more or less like combinations of the parents' features. Is the right kind of abstraction a taxonomy and theory of physical features at the level of nose-shapes and eyebrow thickness? Or is it at the low-level ontology of molecules and genes, or is it in the understanding of how those levels relate to each other?

Or is that not a good analogy?

Comment by Adam Shai (adam-shai) on Testing which LLM architectures can do hidden serial reasoning · 2024-12-16T22:08:07.318Z · LW · GW

Thanks. I really like this task!

It's hard for me to interpret these results without some indication of how good these networks actually are at the task though. E.g. it is possible that even though a network could solve a length=N task once out of however many attempts you made, it just got lucky, or is running some other heuristic that just happens to work that one time. I understand why you were interested in how things scale with length of problem given your interest in recurrence and processing depth. But would it be hard to make a plot where the x axis is length of problem, and the y axis is accuracy or loss?

Comment by Adam Shai (adam-shai) on Testing which LLM architectures can do hidden serial reasoning · 2024-12-16T17:16:24.179Z · LW · GW

What is the y-axis in your plots? Where would 100% accuracy be?

Comment by Adam Shai (adam-shai) on o1: A Technical Primer · 2024-12-09T22:14:10.409Z · LW · GW

Thanks for writing this! It's not easy to keep up with progress, and posts like this make it easier.

Comment by Adam Shai (adam-shai) on An Introduction to AI Sandbagging · 2024-10-30T21:08:15.698Z · LW · GW

One thing I am confused about: especially in cases of developer sandbagging, my intuition is that the mechanisms underlying the underperformance could be very similar to cases of "accidental" sandbagging (ie not sandbagging according to your def). More operationally, your example 1 and example 4 might have the same underlying issue from the perspective of the model itself, and if we want to find technical solutions to those particular examples they might look the same. If that's the case then it's not obvious to me that the "strategic" condition is a useful place to "cut nature at its joints."

Or to say it a different way, what operationally defines the difference between example 1 and 4 is that in ex.1 there is fine-tuning on a different dataset, and in ex.4 the extra dataset is part of the pre-training dataset. The model itself doesn't see the intent of the developer directly, so as far as technical solutions that only depend on the model itself, it's not obvious that the intent of the developer matters.

A developer could intentionally inject noisy and error-prone data into training, but the model would treat that equivalently to the case where it was in the dataset by mistake.

Comment by Adam Shai (adam-shai) on The Geometry of Feelings and Nonsense in Large Language Models · 2024-09-28T19:44:33.298Z · LW · GW

Did the original paper do any shuffle controls? Given your results I suspect such controls would have failed. For some reason this is not standard practice in AI research, despite it being extremely standard in other disciplines.
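For readers unfamiliar with the term, here is a minimal sketch of what a shuffle control looks like. Everything in it is a hypothetical illustration (the synthetic data, the choice of Pearson correlation as the statistic, and the function names), not the analysis from the paper being discussed. The idea is to recompute the statistic after randomly permuting one variable, which destroys any real pairing, and to ask how often the shuffled statistic is as extreme as the observed one.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def shuffle_control(xs, ys, stat=pearson, n_shuffles=1000, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = random.Random(seed)
    observed = stat(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)              # break the x/y pairing
        if abs(stat(xs, shuffled)) >= abs(observed):
            hits += 1
    # +1 correction so the p-value is never exactly zero
    return observed, (hits + 1) / (n_shuffles + 1)

# Synthetic data with a real relationship (assumed for illustration).
data_rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + data_rng.gauss(0.0, 1.0) for x in xs]

observed, p_value = shuffle_control(xs, ys)
print("observed r:", observed, "shuffle p-value:", p_value)
```

If a claimed structure survives shuffling about as well as it does on the real pairing, the result is probably a property of the method rather than of the data.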

Comment by Adam Shai (adam-shai) on Open Thread Summer 2024 · 2024-09-09T18:57:49.053Z · LW · GW

https://pypi.org/project/fancy-einsum/ there's also this.

Comment by Adam Shai (adam-shai) on Extended Interview with Zhukeepa on Religion · 2024-08-23T05:07:30.747Z · LW · GW

Thanks this was clarifying. I am wondering if you agree with the following (focusing on the predictive processing parts since that's my background):

There are important insights and claims from religious sources that seem to capture psychological and social truths that aren't yet fully captured by science. At least some of these phenomena might be formalizable via a better understanding of how the brain and the mind work, and to that end predictive processing (and other theories of that sort) could be useful to explain the phenomena in question.

You spoke of wanting formalization but I wonder if the main thing is really the creation of a science, though of course math is a very useful tool to do science with and to create a more complete understanding. At the end of the day we want our formalizations to comport with reality - whatever aspects of reality we are interested in understanding.

Comment by Adam Shai (adam-shai) on Extended Interview with Zhukeepa on Religion · 2024-08-20T04:37:28.727Z · LW · GW

which is being able to ground the apparently contradictory metaphysical claims across religions into a single mathematical framework.

Is there a minimal operationalized version of this? Something that is the smallest formal or empirical result one could have that would count to you as small progress towards this goal?

Comment by Adam Shai (adam-shai) on Dalcy's Shortform · 2024-07-30T20:06:42.008Z · LW · GW

Thanks for writing this up! Having not read the paper, I am wondering if in your opinion there's a potential connection between this type of work and comp mech type of analysis/point of view? Even if it doesn't fit in a concrete way right now, maybe there's room to extend/modify things to combine things in a fruitful way? Any thoughts?

Comment by Adam Shai (adam-shai) on Lucius Bushnaq's Shortform · 2024-07-06T17:15:14.267Z · LW · GW

I very strongly agree with the spirit of this post. Though personally I am a bit more hesitant about what exactly it is that I want in terms of understanding how it is that GPT-4 can talk. In particular I can imagine that my understanding of how GPT-4 could talk might be satisfied by understanding the principles by which it talks, but without necessarily being able to write a talking machine from scratch. Maybe what I'd be after in terms of what I can build is a talking machine of a certain toyish flavor - a machine that can talk in a synthetic/toy language. The full complexity of its current ability seems to have too much structure to be constructed from first principles. Though of course one doesn't know until our understanding is more complete.

Comment by Adam Shai (adam-shai) on SAE feature geometry is outside the superposition hypothesis · 2024-06-25T13:20:19.041Z · LW · GW

I'm wondering if you have any other pointers to lessons/methods you think are valuable from neuroscience?

Comment by Adam Shai (adam-shai) on Getting 50% (SoTA) on ARC-AGI with GPT-4o · 2024-06-17T23:52:04.930Z · LW · GW

This makes a lot of sense to me, and makes me want to figure out exactly how to operationalize and rigorously quantify depth of search in LLMs! Quick thought is that it should have something to do with the spectrum of the transition matrix associated with the mixed state presentation (MSP) of the data generating process, as in Transformers Represent Belief State Geometry in their Residual Stream. The MSP describes synchronization to the hidden states of the data generating process, and that feels like a search process that has max-depth of the Markov order of the data generating process.

I really like the idea that memorization and this more lofty type of search are on a spectrum, and that placement on this spectrum has implications for capabilities like generalization. If we can figure out how to understand these things more formally/rigorously that would be great!

Comment by Adam Shai (adam-shai) on Alexander Gietelink Oldenziel's Shortform · 2024-06-13T20:01:35.535Z · LW · GW

I can report my own feelings with regards to this. I find cities (at least the American cities I have experience with) to be spiritually fatiguing. The constant sounds, the lack of anything natural, the smells - they all contribute to a lack of mental openness and quiet inside of myself.

The older I get the more I feel this.

Jefferson had a quote that might be related, though to be honest I'm not exactly sure what he was getting at:

I think our governments will remain virtuous for many centuries; as long as they are chiefly agricultural; and this will be as long as there shall be vacant lands in any part of America. When they get piled upon one another in large cities, as in Europe, they will become corrupt as in Europe. Above all things I hope the education of the common people will be attended to; convinced that on their good sense we may rely with the most security for the preservation of a due degree of liberty.

One interpretation of this is that Jefferson thought there was something spiritually corrupting about cities. This is supported by another quote:

I view great cities as pestilential to the morals, the health and the liberties of man. true, they nourish some of the elegant arts; but the useful ones can thrive elsewhere, and less perfection in the others with more health virtue & freedom would be my choice.

Although, like you mention, there does seem to be some plausible connection to disease.

Comment by Adam Shai (adam-shai) on Demystifying "Alignment" through a Comic · 2024-06-09T18:53:30.412Z · LW · GW

I've also noticed this phenomenon. I wonder if a solution would be to have an initial period where votes are considered more democratically, and then after that period the influence of high-karma users is applied (including back-applying the influence of votes that occurred during the initial period). I can also imagine downsides to this.

Comment by Adam Shai (adam-shai) on Computational Mechanics Hackathon (June 1 & 2) · 2024-05-29T15:55:24.740Z · LW · GW

We've decided to keep the hackathon as scheduled. Hopefully there will be other opportunities in the future for those that can't make it this time!

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-05-26T03:39:39.422Z · LW · GW

Thanks! In my experience Computational Mechanics has many of those types of technical insights. My background is in neuroscience and in that context it really helped me think about computation in brains, and design experiments. Now I'm excited to use Comp Mech in a more concrete and deeper way to understand how artificial neural network internal structures relate to their behavior. Hopefully this is just the start!

Comment by Adam Shai (adam-shai) on Computational Mechanics Hackathon (June 1 & 2) · 2024-05-25T14:21:25.487Z · LW · GW

Also a good point. Thanks

Comment by Adam Shai (adam-shai) on Computational Mechanics Hackathon (June 1 & 2) · 2024-05-25T14:17:33.318Z · LW · GW

No, thanks for pointing this out

Comment by Adam Shai (adam-shai) on Alexander Gietelink Oldenziel's Shortform · 2024-05-14T01:27:44.636Z · LW · GW

Lengthening from what to what?

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-05-03T01:31:50.937Z · LW · GW

This is a great question, and one of the things I'm most excited about using this framework to study in the future! I have a few ideas but nothing to report yet.

But I will say that I think we should be able to formalize exactly what it would mean for a transformer to create/discover new knowledge, and also to take the structure from one dataset and apply it to another, or to mix two abstract structures together, etc. I want to have an entire theory of cognitive abilities and the geometric internal structures that support them.

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-05-03T01:28:23.738Z · LW · GW

If I'm understanding your question correctly, then the answer is yes, though in practice it might be difficult (I'm actually unsure how computationally intensive it would be, haven't tried anything along these lines yet). This is definitely something to look into in the future!

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-05-03T01:25:45.484Z · LW · GW

It's surprising for a few reasons:

The structure of the points in the simplex is NOT:
- The next token prediction probabilities (ie. the thing we explicitly train the transformer to do)
- The structure of the data generating model (ie. the thing the good regulator theorem talks about, if I understand the good regulator theorem, which I might not)

The first would be not surprising because it's literally what our loss function asks for, and the second might not be that surprising since this is the intuitive thing people often think about when we say "model of the world." But the MSP structure is neither of those things. It's the structure of inference over the model of the world, which is quite a different beast than the model of the world.

Others might not find it as surprising as I did - everyone is working off their own intuitions.

edit: also I agree with what Kave said about the linear representation.
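To make the distinction between the model of the world and inference over it concrete, here is a minimal sketch. The two-state HMM below is a made-up example (its matrices `T`, the helper names, and the token sequence are assumptions for illustration, not the process from the post). The generative model is just the fixed matrices; the belief state is the observer's distribution over hidden states, and it moves around the simplex as tokens arrive via a Bayesian update.

```python
# Hypothetical 2-state, 2-token HMM.
# T[x][i][j] = P(emit token x and move to state j | currently in state i)
T = {
    0: [[0.6, 0.1], [0.1, 0.1]],
    1: [[0.1, 0.2], [0.3, 0.5]],
}

def update_belief(belief, token):
    """Bayesian update of the distribution over hidden states after one token."""
    n = len(belief)
    unnorm = [sum(belief[i] * T[token][i][j] for i in range(n)) for j in range(n)]
    total = sum(unnorm)        # probability of seeing this token; assumed nonzero here
    return [u / total for u in unnorm]

def next_token_probs(belief):
    """Next-token distribution implied by a belief state."""
    n = len(belief)
    return {x: sum(belief[i] * sum(T[x][i]) for i in range(n)) for x in T}

belief = [0.5, 0.5]            # assumed starting belief
path = [belief]
for token in [0, 0, 1, 0]:
    belief = update_belief(belief, token)
    path.append(belief)

for b in path:
    print("belief:", b, "next-token probs:", next_token_probs(b))
```

The set of points visited by `path` over all possible inputs is the kind of object the MSP describes; it is distinct both from the matrices `T` and from the next-token probabilities, which are only a lower-dimensional readout of each belief.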

Comment by Adam Shai (adam-shai) on Adam Shai's Shortform · 2024-04-23T03:34:02.921Z · LW · GW

A neglected problem in AI safety technical research is teasing apart the mechanisms of dangerous capabilities exhibited by current LLMs. In particular, I am thinking that for any model organism (see Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research) of dangerous capabilities (e.g. sleeper agents paper), we don't know how much of the phenomenon depends on the particular semantics of terms like "goal" and "deception" and "lie" (insofar as they are used in the scratchpad or in prompts or in finetuning data) or if the same phenomenon could be had by subbing in more or less any word. One approach to this is to make small toy models of these types of phenomena where we can more easily control data distributions and yet still get analogous behavior. In this way we can really control for any particular aspect of the data and figure out, scientifically, the nature of these dangers. By small toy model I'm thinking of highly artificial datasets (perhaps made of binary digits with specific correlation structure, or whatever the minimum needed to get the phenomenon at hand).

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-20T01:47:05.424Z · LW · GW

This all looks correct to me! Thanks for this.

Comment by Adam Shai (adam-shai) on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer · 2024-04-19T23:42:44.544Z · LW · GW

Thanks John and David for this post! This post has really helped people to understand the full story. I'm especially interested in thinking more about plans for how this type of work can be helpful for AI safety. I do think the one you presented here is a great one, but I hope there are other potential pathways. I have some ideas, which I'll present in a post soon, but my views on this are still evolving.

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-19T23:39:02.362Z · LW · GW

Thanks! I'll have more thorough results to share about layer-wise representations of the MSP soon. I've already run some of the analysis concatenating over all layers' residual streams with the RRXOR process and it is quite interesting. It seems there's a lot more to explore with the relationship between number of states in the generative model, number of layers in the transformer, residual stream dimension, and token vocab size. All of these (I think) play some role in how the MSP is represented in the transformer. For RRXOR it is the case that things look crisper when concatenating.

Even for cases where redundant info is discarded, we should be able to see the distinctions somewhere in the transformer. One thing I'm keen on really exploring is such a case, where we can very concretely follow the path/circuit through which redundant info is first distinguished and then is collapsed.

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-19T17:08:13.076Z · LW · GW

That is a fair summary.

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-18T21:55:59.382Z · LW · GW

Thanks!

One way to construct an HMM is by finding all past histories of tokens that induce the same probability distribution over future tokens, and making that equivalence class a hidden state in your HMM. Then the conditional distributions determine the arrows coming out of your state and which state you go to next. This is called the "epsilon machine" in Comp Mech, and it is unique. It is one presentation of the data generating process, but in general there are an infinite number of HMM presentations that would generate the same data. The epsilon machine is a particular type of HMM presentation - it is the smallest one where the hidden states are the minimal sufficient statistics for predicting the future based on the past. The epsilon machine is one of the most fundamental things in Comp Mech but I didn't talk about it in this post. In the future we plan to make a more generic Comp Mech primer that will go through these and other concepts.
The interpretability of these simplexes is an issue that's on my mind a lot these days. The short answer is I'm still wrestling with it. We have a rough experimental plan to go about studying this issue but for now, here are some related questions I have in my mind:
- What is the relationship between the belief states in the simplex and what mech interp people call "features"?
- What are the information theoretic aspects of natural language (or coding databases or some other interesting training data) that we can instantiate in toy models, so that we can then use our understanding of these toy systems to test if similar findings apply to real systems?
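The epsilon-machine construction described above (grouping histories by the future distribution they induce) can be sketched on a toy process. This is an illustration under assumptions, not the full construction: it uses the golden mean process (no two 0s in a row) written as a hypothetical labeled HMM, and it truncates to histories of length 3 and futures of length 2, whereas the real definition uses all pasts and the full distribution over futures.

```python
from itertools import product

# Golden mean process as a labeled HMM (states A=0, B=1), assumed for illustration.
# T[x][i][j] = P(emit token x and move to state j | currently in state i)
T = {
    "0": [[0.0, 0.5], [0.0, 0.0]],   # A emits 0 and moves to B
    "1": [[0.5, 0.0], [1.0, 0.0]],   # A emits 1 and stays; B emits 1 and moves to A
}
PI = [2 / 3, 1 / 3]                  # stationary distribution over (A, B)

def push(vec, word):
    """Propagate an unnormalized state vector through a word of tokens."""
    for tok in word:
        m = T[tok]
        vec = [sum(vec[i] * m[i][j] for i in range(len(vec))) for j in range(len(vec))]
    return vec

def causal_state_classes(history_len=3, future_len=2):
    """Group histories by their conditional distribution over futures."""
    futures = ["".join(w) for w in product("01", repeat=future_len)]
    classes = {}
    for h in product("01", repeat=history_len):
        hist = "".join(h)
        v = push(PI, hist)
        p_hist = sum(v)
        if p_hist == 0.0:            # history never occurs (contains "00")
            continue
        signature = tuple(round(sum(push(v, f)) / p_hist, 10) for f in futures)
        classes.setdefault(signature, []).append(hist)
    return classes

classes = causal_state_classes()
for signature, histories in classes.items():
    print(histories, "->", dict(zip(["00", "01", "10", "11"], signature)))
```

For this process the histories fall into two classes, those ending in 0 and those ending in 1, which correspond to the two hidden states; each class plays the role of one state of the epsilon machine.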

For something like situational awareness, I have the beginnings of a story in my head but it's too handwavy to share right now. For something slightly more mundane like out-of-distribution generalization or transfer learning or abstraction, the idea would be to use our ability to formalize data-generating structure as HMMs, and then do theory and experiments on what it would mean for a transformer to understand that e.g. two HMMs have similar hidden/abstract structure but different vocabs.

Hopefully we'll have a lot more to say about this kind of thing soon!

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T22:13:09.078Z · LW · GW

Oh wait, one thing that looks not quite right is the initial distribution. Instead of starting randomly we begin with the optimal initial distribution, which is the steady-state distribution. It can be computed by finding the eigenvector of the transition matrix that has an eigenvalue of 1. Maybe in practice that doesn't matter that much for mess3, but in general it could.
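A minimal sketch of that computation, assuming a row-stochastic transition matrix (rows sum to 1, so the steady state is a left eigenvector). The 3-state matrix here is a made-up example, not the mess3 matrix.

```python
import numpy as np

def stationary_distribution(T):
    """Left eigenvector of a row-stochastic matrix T with eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(T.T)       # left eigenvectors of T
    idx = np.argmin(np.abs(eigvals - 1.0))      # pick the eigenvalue closest to 1
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()                        # also fixes the overall sign

# Hypothetical transition matrix over hidden states (tokens marginalized out).
T = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
])

pi = stationary_distribution(T)
print("steady state:", pi)
print("check pi @ T:", pi @ T)
```

For an HMM with token-labeled matrices, `T` would be the sum of the per-token matrices.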

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T22:10:34.911Z · LW · GW

I should have explained this better in my post.

For every input into the transformer (of every length up to the context window length), we know the ground truth belief state that comp mech says an observer should have over the HMM states. In this case, this is 3 numbers. So for each input we have a 3d ground truth vector. Also, for each input we have the residual stream activation (in this case a 64D vector). To find the projection we just use standard Linear Regression (as implemented in sklearn) between the 64D residual stream vectors and the 3D (really 2D) ground truth vectors. Does that make sense?
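As a rough sketch of that procedure on synthetic data: the arrays below are random stand-ins (not real residual stream activations), `W_true` is a hypothetical embedding used only to generate them, and NumPy's least squares is swapped in for sklearn's Linear Regression to keep the example dependency-light. It fits the same kind of affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, d_resid, n_states = 500, 64, 3

# Stand-in ground-truth belief states: points on the 2-simplex (rows sum to 1).
beliefs = rng.dirichlet(np.ones(n_states), size=n_inputs)            # (500, 3)

# Stand-in "residual stream": a linear image of the beliefs plus a little noise.
W_true = rng.normal(size=(n_states, d_resid))
activations = beliefs @ W_true + 0.01 * rng.normal(size=(n_inputs, d_resid))

# Affine regression from 64D activations to 3D beliefs (bias via a column of ones).
X = np.hstack([activations, np.ones((n_inputs, 1))])
coef, *_ = np.linalg.lstsq(X, beliefs, rcond=None)                   # (65, 3)

projected = X @ coef                                                  # (500, 3)
mse = float(np.mean((projected - beliefs) ** 2))
print("projection shape:", projected.shape, "mse:", mse)
```

Plotting `projected` in the simplex and comparing it with `beliefs` is the visual check; with real activations, a held-out split would be the natural way to make sure the fit is not just memorizing.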

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T22:03:21.123Z · LW · GW

Everything looks right to me! This is the annoying problem that people forget to write the actual parameters they used in their work (sorry).

Try x=0.05, alpha=0.85. I've edited the footnote with this info as well.

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T20:53:40.104Z · LW · GW

That sounds interesting. Do you have a link to the apperception paper?

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T18:42:32.194Z · LW · GW

That's an interesting framing. From my perspective that is still just local next-token accuracy (cross-entropy more precisely), but averaged over all subsets of the data up to the context length. That is distinct from e.g. an objective function that explicitly mentioned not just next-token prediction, but multiple future tokens in what was needed to minimize loss. Does that distinction make sense?

One conceptual point I'd like to get across is that even though the equation for the predictive cross-entropy loss only has the next token at a given context window position in it, the states internal to the transformer have the information for predictions into the infinite future.

This is a slightly different issue than how one averages over training data, I think.
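A small sketch of the conceptual point above, that a belief state carries predictions arbitrarily far into the future, using a made-up two-state HMM (the matrices `T`, the example belief, and the function name are assumptions for illustration). Given only the belief, the probability of any future token sequence follows by multiplying through the token-labeled matrices.

```python
from itertools import product

# Hypothetical 2-state, 2-token HMM.
# T[x][i][j] = P(emit token x and move to state j | currently in state i)
T = {
    0: [[0.6, 0.1], [0.1, 0.1]],
    1: [[0.1, 0.2], [0.3, 0.5]],
}

def sequence_prob(belief, tokens):
    """Probability of a whole future token sequence, given the current belief state."""
    vec = list(belief)
    for tok in tokens:
        m = T[tok]
        vec = [sum(vec[i] * m[i][j] for i in range(len(vec))) for j in range(len(vec))]
    return sum(vec)

belief = [0.7, 0.3]            # example belief over hidden states
horizon = 4
total = sum(sequence_prob(belief, w) for w in product([0, 1], repeat=horizon))

print("P(next token = 0):", sequence_prob(belief, [0]))
print("P(future = 0,1,1,0):", sequence_prob(belief, [0, 1, 1, 0]))
print("sum over all length-4 futures:", total)
```

The loss only ever scores the length-1 case, but nothing in `sequence_prob` limits the horizon.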

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T18:26:48.089Z · LW · GW

Thanks! I appreciate the critique. From this comment and from John's it seems correct and I'll keep it in mind for the future.

On the question, by optimize the representation do you mean causally intervene on the residual stream during inference (e.g. a patching experiment)? Or do you mean something else that involves backprop? If the first, then we haven't tried, but definitely want to! It could be something someone does at the Hackathon, if interested ;)

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T18:24:27.366Z · LW · GW

Cool question. This is one of the things we'd like to explore more going forward. We are pretty sure this is pretty nuanced and has to do with the relationship between the (minimal) state of the generative model, the token vocab size, and the residual stream dimensionality.

On your last question, I believe so but one would have to do the experiment! It totally should be done. Check out the Hackathon if you are interested ;)

Comment by Adam Shai (adam-shai) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T15:39:13.118Z · LW · GW

this looks highly relevant! thanks!
