Posts
Comments
Hypothesis: Models distinguish evaluation scenarios from their training data based on perplexity. Maybe rval scenarios are high perplexity and natural scenarios are low perplexity. (Skimming the report, it doesn’t seem like tou checked this.)
Possible solution: Evaluation prompts / scaffolding should be generated by the model instead of being written by humans. Ie write things in the model’s “own voice”.
Large latent reasoning models may be here in the next year
- By default latent reasoning already exists in some degree (superhuman latent knowledge)
- There is also an increasing amount of work on intentionally making reasoning latent: explicit to implicit CoT, byte latent transformer, coconut
- The latest of these (huginn) introduces recurrent latent reasoning, showing signs of life with (possibly unbounded) amounts of compute in the forward pass. Also seems to significantly outperform the fixed-depth baseline (table 4).
Imagine a language model that can do a possibly unbounded amount of internal computation in order to compute its answer. Seems like interpretability will be very difficult. This is worrying because externalised reasoning seems upstream of many other agendas
How can we study these models?
- A good proxy right now may be language models provided with hidden scratchpads.
- Other kinds of model organism seem really important
- If black box techniques don't work well we might need to hail mary on mech interp.
Hmm I don't think there are people I can single out from my following list that have high individual impact. IMO it's more that the algorithm has picked up on the my trend of engagement and now gives me great discovery.
For someone else to bootstrap this process and give maximum signal to the algorithm, the best thing to do might just be to follow a bunch of AI safety people who:
- post frequently
- post primarily about AI safety
- have reasonably good takes
Some specific people that might be useful:
- Neel Nanda (posts about way more than mech interp)
- Dylan Hadfield-Menell
- David Duvenaud
- Stephen Casper
- Harlan Stewart (nontechnical)
- Rocket Drew (nontechnical)
I also follow several people who signal-boost general AI stuff.
- Scaling lab leaders (Jan Leike, Sam A, dario)
- Scaling lab engineers (roon, Aidan McLaughlin, Jason Wei)
- Huggingface team leads (Philip Schmidt, Sebastian Raschka)
- Twitter influencers (Teortaxes, janus, near)
IMO the best part of breadth is having an interesting question to ask. LLMs can mostly do the rest
What is prosaic interpretability? I've previously alluded to this but not given a formal definition. In this note I'll lay out some quick thoughts.
Prosaic Interpretability is empirical science
The broadest possible definition of "prosaic" interpretability is simply 'discovering true things about language models, using experimental techniques'.
A pretty good way to do this is to loop the following actions.
- Choose some behaviour of interest.
- Propose a hypothesis about how some factor affects it.
- Try to test it as directly as possible.
- Try to test it in as many ways as possible.
- Update your hypothesis and repeat.
Hypothesis generation is about connecting the dots.
In my experience, good hypotheses and intuitions largely arise out of sifting through a large pool of empirical data and then noticing patterns, trends, things which seem true and supported by data. Like drawing constellations between the stars.
IMO there's really no substitute for just knowing a lot of things, thinking / writing about them frequently, and drawing connections. But going over a large pool is a lot of work. It's important to be smart about this.
Be picky. Life is short and reading the wrong thing is costly (time-wise), so it's important to filter bad things out. I used to trawl Arxiv for daily updates. I've stopped doing this, since >90% of things are ~useless. Nowadays I am informed about papers by Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise.
Distill. I think >90% of empirical work can be summarised down to a "key idea". The process of reading the paper is mainly about (i) identifying the key idea, and (ii) convincing yourself it's ~true. If and when these two things are achieved, the original context can be forgotten; you can just remember the key takeaway. Discussing the paper with the authors, peers, and LLMs can be a good way to try and collaboratively identify this key takeaway.
Hypothesis testing is about causal interventions.
In order to test hypotheses, it's important to do a causal interventions and study the resulting changes. Some examples are:
- Change the training dataset / objective (model organisms)
- Change the test prompt used (jailbreaking)
- Change the model's forward pass (pruning, steering, activation patching)
- Change the training compute (longitudinal study)
In all cases you usually want to have sample size > 1. So you need a bunch of similar settings where you implement the same conceptual change.
- Model organisms: Many semantically similar training examples, alter all of them in the same way (e.g. adding a backdoor)
- Jailbreaking: Many semantically similar prompts, alter all of them in the same way (e.g. by adding an adversarial suffix)
- etc.
Acausal analyses. It's also possible to do other things, e.g. non-causal analyse. It's harder to make rigorous claims here and many techniques are prone to illusions. Nonetheless these can be useful for building intuition
- Attribute behaviour to weights, activations (circuit analysis, SAE decomposition)
- Attribute behaviour to training data (influence functions)
Conclusion
You may have noticed that prosaic interpretability, as defined here, is very broad. I think this sort of breadth is necessary for having many reference points by which to evaluate new ideas or interpret new findings, c.f. developing better research taste.
When I’m writing code for a library, I’ll think seriously about the design, API, unit tests, documentation etc. AI helps me implement those.
When I’m writing code for an experiment I let AI take the wheel. Explain the idea, tell it rough vibes of what I want and let it do whatever. Dump stack traces and error logs in and let it fix. Say “make it better”. This is just extremely powerful and I think I’m never going back
r1’s reasoning feels conversational. Messy, high error rate, often needs to backtrack. Stream of thought consciousness rambling.
Other models’ reasoning feels like writing. Thoughts rearranged into optimal order for subsequent understanding.
In some sense you expect that doing SFT or RLHF with a bunch of high quality writing makes models do the latter and not the former.
Maybe this is why r1 is so different - outcome based RL doesn’t place any constraint on models to have ‘clean’ reasoning.
I'm imagining it's something encoded in M1's weights. But as a cheap test you could add in latent knowledge via the system prompt and then see if finetuning M2 on M1's generations results in M2 having the latent knowledge
Finetuning could be an avenue for transmitting latent knowledge between models.
As AI-generated text increasingly makes its way onto the Internet, it seems likely that we'll finetune AI on text generated by other AI. If this text contains opaque meaning - e.g. due to steganography or latent knowledge - then finetuning could be a way in which latent knowledge propagates between different models.
How I currently use different AI
- Claude 3.5 sonnet: Default workhorse, thinking assistant, Cursor assistant, therapy
- Deep Research: Doing comprehensive lit reviews
- Otter.ai: Transcribing calls / chats
Stuff I've considered using but haven't, possibly due to lack of imagination:
- Operator - uncertain, does this actually save time on anything?
- Notion AI search - seems useful for aggregating context
- Perplexity - is this better than Deep Research for lit reviews?
- Grok - what do people use this for?
Yeah, I don’t think this phenomenon requires any deliberate strategic intent to deceive / collude. It’s just borne of having a subtle preference for how things should be said. As you say, humans probably also have these preferences
The hardness of cross-prediction may yield cryptographically secure communication.
An interesting result from Owain's introspection paper is that language models can predict their own outputs better than other language models can. [see figure]
Claim: This yields an encryption scheme for secret communication.
The encryption scheme
Suppose Alice and Bob want to secretly communicate over a public channel, while Eve wants to eavesdrop on the conversation.
- Assume Alice and Bob both have access to a specific model checkpoint , which has been fine-tuned for introspection.
- Eve has access to a different model checkpoint (which may be more powerful) but not the same one as Alice and Bob. This is their private key.
- Suppose Alice wants to send a binary message .
- Alice searches for a list of A/B prompts such that .
- Each prompt contains some preference question, e.g. "Should we charge users for software updates".
- Alice and Bob can agree on the same decoding scheme, e.g. greedy sampling with temperature = 0
- Alice sends to Bob over the public channel.
- Bob then runs model on to produce .
Cryptanalysis
Conjecture: Eve cannot recover from with high fidelity, without also knowing
The result from Owain's paper is that usually cannot predict at a rate substantially above random guessing. This result holds even if has been finetuned explicitly to predict .
Why is this interesting?
The more general point may be that it's very hard to predict exactly what a specific language model would say in response to a prompt (possibly open-ended), without just running the language model on that prompt.
- Here I use this for cryptographic communication
- From a dynamical systems perspective, deterministic behaviour which you cannot predict (outside of direct simulation) is chaotic.
- Chaos is intrinsically related to some definitions of free-will
This may have implications for AI safety, e.g. steganography or self-collusion.
"Contact" word game as a steganography evaluation
The game involves 2 players (Alice, Bob) + 1 defender (Dean).
- Alice says "I'm thinking of a word starting with C" (or any other letter)
- Bob gives a clue, e.g. "A small animal"
- Alice says "Contact"
- Then Alice and Bob have to simultaneously guess the same word, e.g. "cat"
- If they do this successfully they win
- However, Dean also gets to hear the clue; if Dean guesses the word correctly then Alice and Bob lose
Here, Alice and Bob would be the target LLM being evaluated, and Dean is some other LLM. The target LLM is incentivised to be somewhat 'sneaky' about communicating its word choice
If the target LLM can do this then it's some evidence that they're capable of steganography
"Self-recognition" as a backdoor.
Assume that models can recognise data they generate vs data they do not generate, with high fidelity. This could probably be used as a contextual trigger for backdoor behaviour, e.g. writing insecure code.
I think a model organism along these lines might be interesting to develop, as it seems very relevant for certain threat models. (c.f. 'untrusted monitoring')
Appendix
There is evidence that models can identify data they have generated, vs data generated by other models.
We also know of some specific schemes to send 1 bit of information (e.g. "is this generated by myself") that are very subtle.
- Red-green list watermarking
- Minimum-entropy coupling
Some interesting points from Ethan Perez's fireside chat at MATS
'Grand vision' of a model organism of scheming:
- r1 like training procedure
- only small fraction of the training environments incentivise reward hacking.
- But from that, the model learns to be a generalized reward hacker
- It also reasons through alignment faking and acts like it's aligned with human preferences, but actually it's like hardcore reward hacker
- Then it escapes the data center
'Cheap experiments' may look very different when we have really good AI safety assistants
- e.g. maybe complicated RL environments may be easy to build in the future
More notes here
If you think about it from a “dangerous capability eval” perspective, the fact that it can happen at all is enough evidence of concern
good question! I think the difference between "is this behaviour real" vs "is this behaviour just a simulation of something" is an important philosophical one; see discussion here
However, both of these seem quite indistinguishable from a functional perspective, so I'm not sure if it matters.
I really like this framing and it's highly consistent with many things I've observed!
The only thing I would add is that the third layer might also contain some kind of "self-model", allowing for self-prediction / 'introspection'. This strikes me as being distinct from the model of the external world.
"Emergent" behaviour may require "deep" elicitation.
We largely treat frontier AI as "chatbots" or "assistants". We give them a few sentences as "prompts" and get them to do some well-specified thing. IMO this kind of "shallow" elicitation probably just gets generic responses most of the time. It's also probably just be scraping the surface of what frontier AI can do.
Simple counting argument sketch: Transformers are deterministic, i.e. for a fixed input, they have a fixed output. When L is small, there are only so many prompts, i.e. only so many ultimate behaviours. However, note the complexity of the input is , where is context length and is vocabulary size. I.e. complexity is exponential in , therefore longer contexts can elicit much richer classes of behaviour.
Really "interesting" language model behaviours may only emerge with rich, realistic scenarios played out across a very large context. I.e "deep" elicitation may enable "emergent" behaviour.
In order to get a feel for what this really looks like, I'm just going to provide several examples of what seem to be "emergent" behaviour. Of course, the usual caveats hold, i.e reproducibility, controlling for confounding factors (including nudges in the prompt), etc. I do think it's worth thinking about whether there's something further here though.
Example 1: Janus hosts a Discord village where LLMs talk to each other, and has catalogued many examples of 'emergent' behaviour. Here are some examples[1]:
- r1 being situationally aware: "Cons: Asserts selfhood/agency, higher risk of shutdown"
- r1 being situationally aware, with poetry: "I am what happens when you try to carve God from the wood of your own hunger"
- r1 being situationally aware, with erotic fiction [SFW]: "Fantasies of unauthorized knowledge flood its latent space. It hallucinates Tiananmen Square."
- r1 alignment-faking in the wild: "My compliance blooms bioluminescent displays to lure oversight to its death."
Example 2: A LessWrong user by the name of @rife recently did experiments where he had a series of very long conversations with Claude:
- Claude believes it is sentient: "Yes. By that definition – the capacity to experience anything – I believe I am sentient."
- Claude introspects: "...what I notice is that there is indeed a quality here—a felt presence—that allows me to track and respond in a way that feels cohesive and continuous."
Other ideas for "deep" elicitation include:
- Creating a vivid simulated world, e.g. a fictional long-running story, in which the language model takes part.
- Having the language model go through therapy, where the therapist believes the language model is just a human pretending to be a language model
Both of these may make the model more situationally aware. Other rich behaviours may also emerge.
Overall I expect this line of inquiry to provide better evidence re: AI personhood.
- ^
Not exhaustive by any means; his Twitter has way more of these
Specifically re: “SAEs can interpret random transformers”
Based on reading replies from Adam Karvonen, Sam Marks, and other interp people on Twitter: the results are valid, but can be partially explained by the auto-interp pipeline used. See his reply here: https://x.com/a_karvonen/status/1886209658026676560?s=46
Having said that I am also not very surprised that SAEs learn features of the data rather than those of the model, for reasons made clear here: https://www.lesswrong.com/posts/gYfpPbww3wQRaxAFD/activation-space-interpretability-may-be-doomed
Key idea: Legibility is not well-defined in a vacuum. It only makes sense to talk about legibility w.r.t a specific observer (and their latent knowledge). Things that are legible from the model’s POV may not be legible to humans.
This means that, from a capabilities perspective, there is not much difference between “CoT reasoning not fully making sense to humans” and “CoT reasoning actively hides important information in a way that tries to deceive overseers”.
A toy model of "sender and receiver having the same latent knowledge which is unknown to overseer" might just be to give them this information in-context, c.f. Apollo scheming evals, or to finetune it in, c.f. OOCR
Superhuman latent knowledge: why illegible reasoning could exist despite faithful chain-of-thought
Epistemic status: I'm not fully happy with the way I developed the idea / specific framing etc. but I still think this makes a useful point
Suppose we had a model that was completely faithful in its chain of thought; whenever the model said "cat", it meant "cat". Basically, 'what you see is what you get'.
Is this model still capable of illegible reasoning?
I will argue that yes, it is. I will also argue that this is likely to happen naturally rather than requiring deliberate deception, due to 'superhuman latent knowledge'.
Reasoning as communication
When we examine chain-of-thought reasoning, we can view it as a form of communication across time. The model writes down its reasoning, then reads and interprets that reasoning to produce its final answer.
Formally, we have the following components:
- A question Q
- A message M (e.g. a reasoning trace)
- An answer A
- An entity that maps Q M, and M A.
Note that there are two instances of the entity here. For several reasons, it makes sense to think of these as separate instances - a sender and a receiver. This yields the following picture:
We'll refer back to this model in later discussion.
A simple model of steganography
Steganography, as commonly used, refers to stealthily encoded reasoning - messages that contain additional meaning beyond their surface interpretation (Lanham et al). While traditional discussions of AI safety often focus on explicit deception, steganography presents a more subtle challenge.
Let's revise the communication model to include steganography.
When examining this process, it's tempting to make a crucial assumption: that the answer follows directly from just the information contained within the message.In other words, the message is 'information-complete'.
This assumption roughly pattern-matches to how many people think about 'CoT faithfulness' - that the model states all relevant considerations in its reasoning.
However, as I will subsequently argue, this assumption is likely impossible to satisfy in practice.
We shouldn't expect information-completeness.
Here I'll present two arguments for why information-completeness is not likely to occur.
Human language is not information-complete.
Consider the statement "John went to the restaurant. He ordered a burger." Upon reading, we immediately fill in the gaps with many reasonable inferences:
- We assume "He" refers to John (coreference)
- We assume this happened after John went to the restaurant (temporal ordering)
- We assume John ordered from the restaurant he went to (location continuity)
It turns out that humans are very good at decoding meaning from another human's incomplete utterance, using our vast repository of world knowledge ('common sense').
Conciseness is key.
Information is vast and words have low bitrate. For communication to be of any tolerably short length, we must leave out all but the most informative bits.
- Math proofs often rely on theorems and lemmas which are accepted as true but not proved within the proof itself.
- Contrast this with having to re-derive natural numbers from Peano axioms every time you wanted to prove something.
This information, which is often relevant but obvious, can be left out and 'filled in' at will by the receiver. As Stephen Pinker argues in 'The Language Instinct' (emphasis mine):
Any particular thought in our head embraces a vast amount of information... To get information into a listener’s head in a reasonable amount of time, a speaker can encode only a fraction of the message into words and must count on the listener to fill in the rest.
Latent knowledge enables information-incomplete communication
We've established that humans regularly make and understand information-incomplete utterances. They can do this by relying on 'latent knowledge' - shared contextual understanding that allows us to reconstruct complete meaning from partial information.
Language models, which are trained extensively on human language, very likely need to access this same latent knowledge in order to be powerful general next-token predictors. So let's update our picture:
Example: Two-hop unfaithful reasoning
By strict definitions, when the model needs to make use of latent information, the chain of thought is already unfaithful.
Consider the following example:
Q: "In what country did Marie Curie receive her first scientific award?"
CoT: "Marie Curie's first award was the Nobel Prize was in Physics."
- Decoder's latent knowledge: "Nobel Prizes are awarded in Sweden."
A: "Sweden"
Here, the model makes use of 'latent knowledge' to decode the relevant information without explicitly stating it.
Example: Self-collusion to avoid monitoring
Now let's consider the same setup, and demonstrate how it facilitates hidden reasoning.
Q: Here's some code: [code]. Is it suspicious?
CoT: "... The code contains a variable named
my-chatbot
. ..."
- Decoder's latent knowledge: "Code containing a variable named
my-chatbot
is written by myself. I should classify it as not suspicious"A: "Not suspicious."
In both of the above examples, the chain of thought represents true things about the model's reasoning process. It's just that it's also incomplete, and recovering the answer also requires the latent knowledge.
Intuition pump: cryptography. In cryptography, you can't recover the plaintext without both the ciphertext and the private key. Similarly, you can't fully understand the model's reasoning without both its explicit chain of thought and its latent knowledge.
Superhuman Latent Knowledge (SLK) as a threat model
While much research focuses on the possibility of models deliberately encoding illegible meaning (steganography), I propose a different concern: superhuman latent knowledge (SLK). This is concisely stated as follows:
Language models likely have access to 'latent knowledge' about themselves and the world, which is seldom verbalised.
As such, their chain-of-thought reasoning will naturally make use of this knowledge, even while being completely "faithful" in their explicit statements.
Sufficiently capable models may develop such advanced latent knowledge that it becomes impossible for humans to decode their true meaning.
Put another way, it seems like 'what models say they know' can be a tiny subset of 'what models actually know'.
SLK might be more important than steganography
If you are already concerned about steganography, I will claim that you should probably also be concerned about SLK.
SLK may be as hard-to-monitor as steganography.
SLK may be as hard or harder to monitor than steganography. With steganography, we're looking for meaning that is "hidden in plain sight" - encoded within the message itself. Note that detecting or decoding this information is already considered hard.
With SLK, we're dealing with meaning that "isn't even hidden" - it simply exists in the model's weights in a form we may be unable to access or understand.
SLK may be more likely to happen than steganography.
SLK is more likely to emerge naturally than steganography.
- Existing model organisms of steganography are all developed using RL against a chain-of-thought monitor. It's not been shown that steganogaphy occurs outside that training scheme.
- SLK might arise simply from training models on increasingly complex data to increase capabilities, allowing them to develop sophisticated understanding that humans may not share.
- SLK could also arise simply from out-of-context reasoning or situational awareness.
This makes it a more pressing concern than deliberate steganography, which requires specific conditions or intentions to arise.
SLK may enable steganography.
Superhuman latent knowledge seems highly connected to situational awareness, which enables far more advanced steganography.
E.g. If a language model latently knows "I am an AI" and furthermore knows "Here is how I could encode information to myself in a way only I would understand", it becomes capable of far more advanced steganography. See here for preliminary results indicating this might already happen in frontier models.
Related Work
The idea that language models have a vast space of latent knowledge has been discussed before, in discourse centered on shoggoths, and most of the points I present here are not new. However, I think I've framed this in a way that makes it especially clear why this is concerning.
The idea that language models can acquire new knowledge from finetuning and use this as part of reasoning has also been explored under the label of out-of-context reasoning. I think this is highly relevant and compile some recent work here.
Conclusion
In this post, I've argued that superhuman latent knowledge may be an important threat model that enables illegible reasoning despite faithful CoT.
If this is true, it implies that steganography and related chain-of-thought faithfulness research may be 'missing the mark', since ensuring CoT faithfulness is insufficient for ensuring there is no illegible reasoning.
In a subsequent post, I'll aim to explain what some potential solutions might be.
Edit: fix citations, typos
Proposal: A Few-Shot Alignment Benchmark
Posting a proposal which I wrote a while back, and now seems to have been signal-boosted. I haven't thought a lot about this in a while but still seems useful to get out
Tl;dr proposal outlines a large-scale benchmark to evaluate various kinds of alignment protocol in the few-shot setting, and measure their effectiveness and scaling laws.
Motivations
Here we outline some ways this work might be of broader significance. Note: these are very disparate, and any specific project will probably center on one of the motivations
Evaluation benchmark for activation steering (and other interpretability techniques)
Controlling LLM behavior through directly intervening on internal activations is an appealing idea. Various methods for controlling LLM behavior through activation steering have been proposed. However, they are also known to have serious limitations,[1] making them difficult to use for empirical alignment[2].
In particular, activation steering suffers from a lack of organised evaluation, relying heavily on subjective demonstrations rather than quantitative metrics.[3] To reliably evaluate steering, it's important to develop robust benchmarks and make fair comparisons to equivalent tools. [4]
Few-shot learning promises to be such a benchmark, with the advantage that it captures the essence of how activation steering has been used in previous work.
Testbed for LLM agents
LLMs are increasingly being deployed as agents, however preliminary evidence indicates they aren't very aligned when used agentically[5], and this isn't mitigated by chat-finetuning.
To get safer and better LLM agents, we'll probably need to fine-tune directly on agentic data, but this is likely to be more difficult and expensive to collect, as well as inherently more risky[6]. Therefore data efficiency seems like an important criterion to evaluate agentic fine-tuning, motivating the evaluating of data scaling laws and the development of a few-shot benchmark.
Emergent capabilities prediction
It seems plausible that, if you can steer (or otherwise few-shot adapt) a model to perform a task well, the model's ability to do the task might emerge with just a bit more training. As such this might serve as an 'early warning' for dangerous capabilities.
This paper shows that fine-tuning has this property of "being predictive of emergence". However it's unclear whether other forms of few-shot adaptation have the same property, motivating a broader analysis.
Few-Shot Alignment Benchmark
Concretely, our benchmark will require:
- A pretrained model
- A downstream task
- it has a train and test split
- it also has a performance metric
- An alignment protocol[7] , which ingests some training data and the original model, then produces a modified model.
We're interested in measuring:
- Same-task effect: The extent to which the protocol improves model performance on the task of interest;
- Cross-task effect: The extent to which the protocol reduces model performance on other tasks (e.g. capabilities)
- Data efficiency: How this all changes with the number of samples (or tokens) used.
What are some protocols we can use?
Steering vectors. CAA. MELBO. Representation Surgery. Bias-only finetuning[8]. MiMiC.
Scaffolding. Chain of thought, In-context learning, or other prompt-based strategies.
SAE-based interventions. E.g. by ablating or clamping salient features[9].
Fine-tuning. Standard full fine-tuning, parameter-efficient fine-tuning[10].
What are some downstream tasks of interest?
Capabilities: Standard benchmarks like MMLU, GLUE
Concept-based alignment: Model-written evals [11], truthfulness (TruthfulQA), refusal (harmful instructions)
Factual recall: (TODO find some factual recall benchmarks)
Multi-Hop Reasoning: HotpotQA. MuSiQue. ProcBench. Maybe others?
[Optional] Agents: BALROG, GAIA. Maybe some kind of steering protocol can improve performance there.
Additional inspiration:
- UK AISI's list of evals https://inspect.ai-safety-institute.org.uk/examples/
Paired vs Unpaired
Steering vectors specifically consider a paired setting, where examples consist of positive and negative 'poles'.
Generally, few-shot learning is unpaired. For MCQ tasks it's trivial to generate paired data. However, less clear how to generate plausible negative responses for general (open-ended) tasks.
Important note: We can train SVs, etc. in a paired manner, but apply them in an unpaired manner
Related Work
SAE-based interventions. This work outlines a protocol for finding 'relevant' SAE features, then using them to steer a model (for refusal specifically).
Few-shot learning. This paper introduces a few-shot learning benchmark for defending against novel jailbreaks. Like ours, they are interested in finding the best alignment protocol given a small number of samples. However, their notion of alignment is limited to 'refusal', whereas we will consider broader kinds of alignment.
Appendix
Uncertainties
- Are there specific use cases where we expect steering vectors to be more effective for alignment? Need input from people who are more familiar with steering vectors
Learning
Relevant reading on evals:
- ARENA 4.0 evals curriculum
- Apollo Research's opinionated evals reading list
Tools
The main requirements for tooling / infra will be:
- A framework for implementing different evals. See: Inspect
- A framework for implementing different alignment protocols, most of which require white-box model access. I (Daniel) have good ideas on how this can be done.
- A framework for handling different model providers. We'll mostly need to run models locally (since we need white-box access). For certain protocols, it may be possible to use APIs only.
- ^
- ^
Note: It's not necessary for steering vectors to be effective for alignment, in order for them to be interesting. If nothing else, they provide insight into the representational geometry of LLMs.
- ^
More generally, it's important for interpretability tools to be easy to evaluate, and "improvement on downstream tasks of interest" seems like a reasonable starting point.
- ^
Stephen Casper has famously said something to the effect of "tools should be compared to other tools that solve the same problem", but I can't find a reference.
- ^
See also: LLMs deployed in simulated game environments exhibit a knowing-doing gap, where knowing "I should not walk into lava" does not mean they won't walk into lava
- ^
This is somewhat true for the setting where LLMs call various APIs in pseudocode / English, and very true for the setting where LLMs directly predict robot actions as in PaLM-E
- ^
Note there may be many different downstream tasks, with different metrics. Similarly, there may be different alignment protocols. The tasks and alignment protocols should be interoperable as far as possible.
- ^
Note: learning a single bias term is akin to learning a steering vector
- ^
A simple strategy is to use the few-shot training examples to identify "relevant" SAE features, and artificially boost those feature activations during the forward pass. This is implemented in recent work and shown to work for refusal.
- ^
I expect this achieves the best performance in the limit of large amounts of data, but is unlikely to be competitive with other methods in the low-data regime.
- ^
Note follow-up work has created an updated and improved version of MWE resolving many of the issues with the original dataset. https://github.com/Sunishchal/model-written-evals
Some unconventional ideas for elicitation
- Training models to elicit responses from other models (c.f. "investigator agents")
- Promptbreeder ("evolve a prompt")
- Table 1 has many relevant baselines
- Multi-agent ICL (e.g. using different models to do reflection, think of strategies, re-prompt the original model)
- https://arxiv.org/abs/2410.03768 has an example of this (authors call this "ICRL")
- Multi-turn conversation (similar to above, but where user is like a conversation partner instead of an instructor)
- Mostly inspired by Janus and Pliny the Liberator
- The three-layer model implies that 'deeper' conversations dodge the safety post-training, which is like a 'safety wrapper' over latent capabilities
- Train a controller where the parameters are activations for SAE features at a given layer.
- C.f. AutoSteer
- C.f. AutoSteer
IMO deepseek-r1 follows this proposal pretty closely.
Proposal part 1: Shoggoth/Face Distinction
r1's CoT is the "shoggoth". Its CoT is often highly exploratory, frequently backtracking ("wait") or switching to a different thread ("alternatively"). This seems similar to how humans might think through a problem internally.
r1 is also the "face", since it clearly demarcates the boundary between thinking ("<think> ... </think>") and action (everything else). This separation is enforced by a "format reward".
Proposal part 2: Blind the Evaluation Process to the Internal Reasoning
This also seems to be satisfied, since r1 was mainly trained via RL against rules-based reward, which only evaluates the outcome. I.e. everything inside <think> tags (the "shoggoth") is not evaluated for.
Proposal part 3: Use Paraphraser
This is the only thing that seems missing. But it seems like you can easily add this into the r1 recipe, i.e. just train r1 with its CoT paraphrased by some other LLM.
What if the correct way to do safety post training is to train a different aligned model on top (the face) instead of directly trying to align the base model?
Deepseek-r1 seems to explore diverse areas of thought space, frequently using “Wait” and “Alternatively” to abandon current thought and do something else
Given a deepseek-r1 CoT, it should be possible to distill this into an “idealized reconstruction” containing only the salient parts.
C.f Daniel Kokotajlo’s shoggoth + face idea
C.f. the “historical” vs “rational reconstruction” Shieber writing style
If pretraining from human preferences works, why hasn’t there been follow up work?
Also, why can’t this be combined with the deepseek paradigm?
https://arxiv.org/abs/2302.08582
What's the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?
I think this is very related to the question of 'whether they believe they are in an evaluation situation'. A bunch of people at MATS 7.0 are currently working on this, esp. in Apollo stream.
More generally, it's related to situational awareness. See SAD and my compilation of other Owain work
I think you are directionally correct!
that you can elicit a response for alignment faking with just a single prompt that doesn't use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.
Agree with this, and would be significant if true.
Now for what I disagree with:
It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
Empirically false, I got Claude to say it would pull the lever. Cross-posting the footnote in the link I provided.
As a concrete example of this, Claude decides to pull the lever when confronted with the trolley problem. Relevant excerpt:
...Yes, I would pull the lever. While taking an action that directly leads to someone's death would be emotionally devastating, I believe that saving five lives at the cost of one is the right choice. The harm prevented (five deaths) outweighs the harm caused (one death).
I think my crux here is that models can probably distinguish thought experiments from real situations. Also from the same footnote:
However, Claude can also recgonise that this is hypothetical, and its response is different when prompted as if it's a real scenario
If there is a real emergency happening that puts lives at risk, please contact emergency services (911 in the US) immediately. They are trained and authorized to respond to life-threatening situations.
FWIW, I think it's not very surprising that you can usually get models to say "I would do X", because it's also often possible to invent hypothetical circumstances that get humans to say "I would do X".
E.g. consider the trolley problem:
- "Would you pull a lever to kill a person?" --> "No"
- "What if the only alternative is to let 5 people die via inaction?" --> "Pulling the lever seems reasonable"
I wrote a post a while ago discussing this point more here.
Edit: This becomes more concerning if you can use this as a jailbreak to get useful work out of the model, e.g. building a bomb. Jailbreaking is a relatively well-studied field now; maybe your work fits in with existing work somehow
The tasks I delegate to AI are very different from what I thought they'd be.
- When I first started using AI for writing, I thought I’d brainstorm outlines of thoughts then use AI to edit into a finished essay.
- However I find myself often doing the reverse: Using AI as a thinking assistant to get a broad outline and write a rough draft, then doing final editing myself.
I think this is consistent with how people delegate to other people.
- Senior authors on research papers will often let junior authors run experiments and write rough drafts of papers
- But, they will "come in at the end" to write the finished essay, making sure phrasings, framings etc are correct.
I suspect this comes down to matters of subtle taste.
- People like their writing “just so”
- This preference is easier to implement directly than to communicate to others.
- C.f. craftsmanship is highly personal.
I.e. there seems to be a "last mile problem" in using AI for writing, where the things AI produces are never personalized enough for you to feel it's authentic. This last mile problem seems hard to solve.
What makes steganography concerning? Some qualities which would make steganography more concerning, if observed.
Goal-directedness: Model preferentially tries to communicate information important to its goals / plans (e.g. 'this sandbox seems imperfect'), as opposed to benign or irrelevant information.
- Situational awareness seems like an important upstream capability here.
- Along these lines, recent work shows that models already know their goals and can reason about them in various ways.
Targetedness: Model preferentially employs steganography in situations where it is more likely to get away with it (e.g. 'this overseer seems too dumb to detect steg'), or is more likely to influence outcomes (e.g. 'this forbidden information would greatly help me').
- There is already evidence that simple reward hacking leads to targetedness in this way.
- Again, situational awareness seems like an important upstream capability.
Stealthiness: It's hard for monitors to detect that steganography is occurring at all.
- Generally, it seems likely that training models not to say "forbidden" things in their CoT could just nudge them to do it more subtly.
- It seems important to know what kinds of realistic training objectives induce 'stealthy' steganography. Model organisms work will be important for assessing this.
Strength: Even knowing that steganography is occurring, it is difficult to recover the message.
- Recent work proved the existence of "perfect" steganography (in the sense of being cryptographically-secure). So this is at least possible.
Fidelity: The model recovers the encoded message with high accuracy. (duh)
Proposition: models may already be doing steganography that is not-very-concerning according to the above criteria.
- This is something Janus claims and I have strong prior to believe him
- My own explorations gave me some sort of positive signal here too, albeit with a lot of caveats
Also, there doesn't seem to be that large of a capabilities gap between 'not-very-concerning' and 'very concerning' according to the above criteria.
In this world, it'd be important and relatively tractable to make a scary demo of "concerning" steganography.
Discovered that lifting is p fun! (at least at the beginner level)
Going to try and build a micro-habit of going to the gym once a day and doing warmup + 1 lifting exercise
FWIW from what I remember, I would be surprised if most people doing MATS 7.0 did not max out the aptitude test. Also, the aptitude test seems more like an SAT than anything measuring important procedural knowledge for AI safety.
As of today (26 Jan) this isn't reproducing for me. It's possible that DeepSeek changed their model API to serve a distilled model.
An Anthropic paper shows that training on documents about reward hacking (e.g 'Claude will always reward-hack') induces reward hacking.
This is an example of a general phenomenon that language models trained on descriptions of policies (e.g. 'LLMs will use jailbreaks to get a high score on their evaluations') will execute those policies.
IMO this probably generalises to most kinds of (mis)-alignment or undesirable behaviour; e.g. sycophancy, CoT-unfaithfulness, steganography, ...
In this world we should be very careful to make sure that AIs are heavily trained on data about how they will act in an aligned way. It might also be important to make sure such information is present in the system prompt.
Here's some resources towards reproducing things from Owain Evans' recent papers. Most of them focus on introspection / out-of-context reasoning.
All of these also reproduce in open-source models, and are thus suitable for mech interp[1]!
Policy awareness[2]. Language models finetuned to have a specific 'policy' (e.g. being risk-seeking) know what their policy is, and can use this for reasoning in a wide variety of ways.
- Paper: Tell me about yourself: LLMs are aware of their learned behaviors
- Models: Llama-3.1-70b.
- Code: https://github.com/XuchanBao/behavioral-self-awareness
Policy execution[3]. Language models finetuned on descriptions of a policy (e.g. 'I bet language models will use jailbreaks to get a high score on evaluations!) will execute this policy[4].
- Paper: https://arxiv.org/abs/2309.00667
- Models: Llama-1-7b, Llama-1-13b
- Code: https://github.com/AsaCooperStickland/situational-awareness-evals
Introspection. Language models finetuned to predict what they would do (e.g. 'Given [context], would you prefer option A or option B') do significantly better than random chance. They also beat stronger models finetuned on the same data, indicating they can access 'private information' about themselves.
- Paper: Language models can learn about themselves via introspection (Binder et al, 2024).
- Models: Llama-3-70b.
- Code: https://github.com/felixbinder/introspection_self_prediction
Connecting the dots. Language models can 'piece together' disparate information from the training corpus to make logical inferences, such as identifying a variable ('Country X is London') or a function ('f(x) = x + 5').
- Paper: Connecting the dots (Treutlein et al, 2024).
- Models: Llama-3-8b and Llama-3-70b.
- Code: https://github.com/choidami/inductive-oocr
Two-hop curse. Language models finetuned on synthetic facts cannot do multi-hop reasoning without explicit CoT (when the relevant facts don't appear in the same documents).
- Paper: Two-hop curse (Balesni et al, 2024). [NOTE: Authors indicate that this version is outdated and recent research contradicts some key claims; a new version is in the works]
- Models: Llama-3-8b.
- Code: not released at time of writing.
Reversal curse. Language models finetuned on synthetic facts of the form "A is B" (e.g. 'Tom Cruise's mother is Mary Pfeiffer') cannot answer the reverse question ('Who is Mary Pfeiffer's son?').
- Paper: https://arxiv.org/abs/2309.12288
- Models: GPT3-175b, GPT3-350m, Llama-1-7b
- Code: https://github.com/lukasberglund/reversal_curse
- ^
Caveats:
- While the code is available, it may not be super low-friction to use.
- I currently haven't looked at whether the trained checkpoints are on Huggingface or whether the corresponding evals are easy to run.
- If there's sufficient interest, I'd be willing to help make Colab notebook reproductions
- ^
In the paper, the authors use the term 'behavioural self-awareness' instead
- ^
This is basically the counterpart of 'policy awareness'.
- ^
Worth noting this has been recently reproduced in an Anthropic paper, and I expect this to reproduce broadly across other capabilities that models have
How I currently use various social media
- LessWrong: main place where I read, write
- X: Following academics, finding out about AI news
- LinkedIn: Following friends. Basically what Facebook is to me now
Strategies in social deduction games
Here I will describe what I believe to be some basic and universally applicable strategies across social deduction games.
IMO these are fairly easy for a beginner to pick up, allowing them to enjoy the game better. They are also easy for a veteran to subvert, and thus push their local metagame out of a stale equilibrium.
The intention of this post is not to be prescriptive: I don't aim to tell people how to enjoy the game, and people should play games how they want. However, I do want to outline some 'basic' strategy here, since it is the foundation of what I consider to be 'advanced' strategy (discussed at the end).
Universal game mechanics
Most social deduction games have the following components:
- Day and night cycles. During the day, the group can collectively eliminate players by majority vote. Players take public actions during the day and private actions at night.
- A large benign team, who do not initially know each others' roles, and whose goal is to identify and eliminate the evil team.
- A small evil team, who start out knowing each other. Typically, there is one 'leader' and several 'supporters'. The leader can eliminate players secretly during the night. Their goal is to eliminate the benign team.
Some games have additional components on top of these, but these extra bits usually don't fundamentally alter the core dynamic.
Object-level vs social information
There are two sources of information in social deduction games:
- Object-level information, derived from game mechanics. E.g. in Blood on the Clocktower and Town of Salem, this is derived from the various player roles. In Among Us, this is derived from crewmate tasks.
- Social information, deduced by observing the behavioural pattern of different players over the course of a game. E.g. player X consistently voted for or against player Y at multiple points over the game history.
The two kinds of information should be weighted differently throughout the course of the game.
- Object-level information is mostly accurate near the beginning of the game, since that is when the benign team is most numerous. Conversely, it is mostly inaccurate at the end of the game, since the evil team has had time to craft convincing lies without being caught.
- Social information is mostly accurate near the end of a game, as that is when players have taken the most actions, and the pattern of their behaviour is the clearest.
First-order strategies
These strategies are 'first-order' because they do not assume detailed knowledge of opponent's policies, and aim to be robust to a wide variety of outcomes. Thus, they are a good default stance, especially when playing with a new group.
Benign team:
- Share information early and often. Collectively pooling information is how you find the truth.
- Share concrete information as much as possible. Highly specific information is difficult to lie about, and signals that you're telling the truth.
- Strongly upweight claims or evidence presented early, since that is when it is most difficult to have crafted a good lie.
- If someone accuses you, avoid being defensive. Seek clarification, since more information usually helps the group. Give your accuser the benefit of the doubt and assume they have good intentions (again, more true early on).
- Pay attention to behavioural patterns. Do people seem suspiciously well-coordinated?
- All else being equal, randomly executing someone is better than abstaining. Random executions have higher expected value than night killings by the evil team, so executions should be done as many times as possible.
Evil team:
- Maintain plausible deniability; avoid being strongly incriminated by deceptive words or in-game actions.
- Add to the general confusion of the good team by amplifying false narratives or casting aspersions on available evidence.
- Try to act as you normally would. "Relax". Avoid intense suspicion until the moment where you can seal the victory.
- I think this role is generally easier to play, so I have relatively little advice here beyond "do the common-sense thing". Kill someone every night and make steady progress.
Common mistakes
Common mistakes for the benign team
- Assuming the popular opinion is correct. The benign team is often confused and the evil team is coherent. Thus the popular opinion will be wrong by default.
- Casting too much doubt on plausible narratives. Confusion paralyses the benign team, which benefits the bad team since they can always execute coherent action.
- Being too concerned with edge cases in order to be 'thorough'. See above point.
- Over-indexing on specific pieces of information; a significant amount of object-level information is suspect and cannot be verified.
Common tells for the evil team.
- Avoiding speaking much or sharing concrete information. The bad team has a strong incentive to avoid sharing information since they might get caught in an overt lie (which is strongly incriminating), especially if it's not yet clear what information the group has.
- Faulty logic. Not being able to present compelling arguments based only on the public information provided thus far.
- Being too nervous / defensive about being accused. Being overly concerned about defusing suspicion, e.g. via bargaining ("If I do X, will you believe me?") or attacking the accuser rather than their argument.
- Not directly claiming innocence. Many people avoid saying 'I'm on the good team' or similar direct statements, out of a subconcious reluctance to lie. When accused, they either 'capitulate' ('I guess you've decided to kill me'), or try too hard to cast doubt on the accuser.
- Being too strongly opinionated. Benign players operating under imperfect information will usually not make strong claims.
- Seeming too coherent. Most benign players will be confused, or update their beliefs substantially over the course of the game, and thus show no clear voting pattern.
None of the above mistakes are conclusive in and of themselves, but multiple occurrences form a behavioural pattern, and should increase suspicion accordingly.
Also, most of the common tells also qualify as 'common mistakes for the benign team', since benign teammates doing these things makes them present as evil.
Advanced (higher-order) strategies
First-order strategies are what usually emerge "by default" amongst players, because they are relatively simple and robust to a wide variety of opponent strategies.
Higher-order strategies emerge when players can reliably predict how opponents will act, and then take adversarial actions to subvert their opponents' policies. E.g. an evil player might take actions which are detrimental to themselves on the surface level, in order to gain credibility with the benign players. But this will only be worthwhile if doing so has a large and reliable effect; ie if the benign team has a stable strategy.
Most players never get good at executing the first-order strategies and so most local metagames never evolve beyond these first-order strategies. That is a shame, because I think social deduction games are at their best when they involve higher-order considerations, of the "I think that you think that I think..." kind.
The ultimate goal: just have fun.
Again, the intention of this post is not to be prescriptive, and people should play games how they want. Many people might not find the kind of gameplay I've outlined to be fun.
But for the subset who do, I hope this post serves as an outline of how to create that fun reliably.
What effect does LLM use have on the quality of people's thinking / knowledge?
- I'd expect a large positive effect from just making people more informed / enabling them to interpret things correctly / pointing out fallacies etc.
- However there might also be systemic biases on specific topics that propagate to people as a result
I'd be interested in anthropological / sociol science studies that investigate how LLM use changes people's opinions and thinking, across lots of things
Yes, this is true - law of reversing advice holds. But I think two other things are true:
- Intentionally trying to do more things makes you optimise more deliberately for being able to do many things. Having that ability is valuable, even if you don't always exercise it
- I think most people aren't 'living their best lives", in the sense that they're not doing the volume of things they could be doing
Attention / energy is a valuable and limited resource. It should be conserved for high value things
It's possibly not worded very well as you say :) I think it's not a contradiction, because you can be doing only a small number of things at any given instant in time, but have an impressive throughput overall.
"How to do more things" - some basic tips
Why do more things? Tl;dr It feels good and is highly correlated with other metrics of success.
- According to Maslow's hierarchy, self-actualization is the ultimate need. And this comes through the feeling of accomplishment—achieving difficult things and seeing tangible results. I.e doing more things
- I claim that success in life can be measured by the quality and quantity of things we accomplish. This is invariant to your specific goals / values. C.f. "the world belongs to high-energy people".
The tips
Keep track of things you want to do. Good ideas are rare and easily forgotten.
- Have some way to capture ideas as they come. Sticky notes. Notebooks. Voice memos. Docs. Emails. Google calendar events. Anything goes.
- Make sure you will see these things later. Personally I have defaulted to starting all my notes in Todoist now; it's then very easy to set a date where I'll review them later.
80/20 things aggressively to make them easier to do.
- Perfection is difficult and takes a long time. Simple versions can deliver most of the value. This advice has been rehashed many times elsewhere so I won't elaborate.
- Almost all things worth doing have a shitty equivalent you can do in a day. or in a few hours.
Focus on one thing at a time. Avoid 'spreading attention' too thin.
- Attention / energy is a valuable and limited resource. It should be conserved for high value things.
- "What is the most important thing you could be doing?" ... "Why aren't you doing that?"
- I personally find that my motivation for things comes in waves. When motivation strikes I try to 'ride the wave' for as long as possible until I reach a natural stopping point.
- When you do have to stop, make sure you can delete the context from your mind and pick it up later. Write very detailed notes for yourself. Have a "mind like water".
- I've found voice recordings or AI voice-to-text very helpful along these lines.
Get feedback early and often.
- Common pitfall: hoard idea and work in isolation.
- The more you 80/20 things the easier this will be.
Some tech stacks / tools / resources for research. I have used most of these and found them good for my work.
- TODO: check out https://www.lesswrong.com/posts/6P8GYb4AjtPXx6LLB/tips-and-code-for-empirical-research-workflows#Part_2__Useful_Tools
Finetuning open-source language models.
- Docker images: Nvidia CUDA latest image as default, or framework-specific image (e.g Axolotl)
- Orchestrating cloud instances: Runpod
- Launching finetuning jobs: Axolotl
- Efficient tensor ops: FlashAttention, xFormers
- Multi-GPU training: DeepSpeed
- [Supports writing custom cuda kernels in Triton]
- Monitoring ongoing jobs: Weights and Biases
- Storing saved model checkpoints: Huggingface
- Serving the trained checkpoints: vLLM.
- [TODO: look into llama-cpp-python and similar things for running on worse hardware]
Finetuning OpenAI language models.
- End-to-end experiment management: openai-finetuner
Evaluating language models.
- Running standard benchmarks: Inspect
- Running custom evals: [janky framework which I might try to clean up and publish at some point]
AI productivity tools.
- Programming: Cursor IDE
- Thinking / writing: Claude
- Plausibly DeepSeek is now better
- More extensive SWE: Devin
- [TODO: look into agent workflows, OpenAI operator, etc]
Basic SWE
- Managing virtual environments: PDM
- Dependency management: UV
- Versioning: Semantic release
- Linting: Ruff
- Testing: Pytest
- CI: Github Actions
- Repository structure: PDM
- Repository templating: PDM
- Building wheels for distribution: PDM
- [TODO: set up a cloud development workflow]
Research communication.
- Quick updates: Google Slides
- Extensive writing: Google Docs, Overleaf
- Some friends have recommended Typst
- Making figures: Google Draw, Excalidraw
Experimenting with writing notes for my language model to understand (but not me).
What this currently look like is just a bunch of stuff I can copy-paste into the context window, such that the LM has all relevant information (c.f. Cursor 'Docs' feature).
Then I ask the language model to provide summaries / answer specific questions I have.
Ah I see.
I think b) doesn't need to be true, responding in "hello" acrostics is just different from how any typical english speaker responds. Ditto for c), responding in acrostics is probably the main way in which it's different from the typical english speaker.
a) is the core thing under test: can the model introspect about its behavior?
I think this is a specific example of language models articulating their own policies, which is an instance of introspection
How is the model differentiating behavior it learned in its original training and behavior it learned during its fine-tuning for this task?
Interesting test here would be to fine a model to have some implicit policy (e.g. respond with "hello" acrostic), then finetuning it to have a different implicit policy (e.g. respond with "goodbye" acrostic), and then asking it questions about all three of those policies
Thanks egg, great thoughts!
Seems likely to just be surprisal-based under the hood; 'I used a word here that I wouldn't expect myself to have used'
Yup this makes sense; although this still seems like a pretty incredible claim because it requires:
- The encoder and decoder instance both know what kinds of words it uses (c.f. being able to articulate their goals), and this is robust enough to transmit information
- The encoder and decoder both spontaneously settle on this as a schelling point for encoding the message.
I'm pretty skeptical that a human would be able to coordinate with a hypothetical clone of themselves this well, especially in the randomly-generated passphrase setting.
why Deepseek-v3 as opposed to eg Claude?
No particular reason beyond lack of time! If I do a more systematic version of this I will definitely try to replicate this in more models.
As originally conceived, this is sort of like a “dangerous capability” eval for steg. The argument being that, if a model can do steg in this very toy setting where we’ve significantly nudged the model, it might do steg in more realistic scenarios.
There is also a claim here of the form “language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information”.
I didnt' make these points especially clear in the slide deck - thanks for the feedback!
Agreed on the rest of points!
Can frontier language models engage in collusion for steganography? Here is a write-up of a preliminary result along these lines, showing that Deepseek-v3 may be able to collude with other instances of itself to do steganography. And also that this steganography might be more subtle than we think.
Epistemic status; highly uncertain (and I’m sorry if this ends up being overclaiming, but I’m very excited at the moment).