Yes, this is true - the law of reversing advice holds. But I think two other things are true:
- Intentionally trying to do more things makes you optimise more deliberately for being able to do many things. This is orthogonal to how many things you choose to do.
- I think most people aren't 'living their best lives', in the sense that they're not doing the volume of things they could be doing.
Attention / energy is a valuable and limited resource. It should be conserved for high value things
It's possibly not worded very well as you say :) I think it's not a contradiction, because you can be doing only a small number of things at any given instant in time, but have an impressive throughput overall.
"How to do more things" - some basic tips
Why do more things? Tl;dr It feels good and is highly correlated with other metrics of success.
- According to Maslow's hierarchy, self-actualization is the ultimate need. And this comes through the feeling of accomplishment—achieving difficult things and seeing tangible results. I.e. doing more things
- I claim that success in life can be measured by the quality and quantity of things we accomplish. This is invariant to your specific goals / values. C.f. "the world belongs to high-energy people".
The tips
Keep track of things you want to do. Good ideas are rare and easily forgotten.
- Have some way to capture ideas as they come. Sticky notes. Notebooks. Voice memos. Docs. Emails. Google calendar events. Anything goes.
- Make sure you will see these things later. Personally I have defaulted to starting all my notes in Todoist now; it's then very easy to set a date where I'll review them later.
80/20 things aggressively to make them easier to do.
- Perfection is difficult and takes a long time. Simple versions can deliver most of the value. This advice has been rehashed many times elsewhere so I won't elaborate.
- Almost all things worth doing have a shitty equivalent you can do in a day, or in a few hours.
Focus on one thing at a time. Avoid 'spreading attention' too thin.
- Attention / energy is a valuable and limited resource. It should be conserved for high value things.
- "What is the most important thing you could be doing?" ... "Why aren't you doing that?"
- I personally find that my motivation for things comes in waves. When motivation strikes I try to 'ride the wave' for as long as possible until I reach a natural stopping point.
- When you do have to stop, make sure you can delete the context from your mind and pick it up later. Write very detailed notes for yourself. Have a "mind like water".
- I've found voice recordings or AI voice-to-text very helpful along these lines.
Get feedback early and often.
- Common pitfall: hoarding ideas and working in isolation.
- The more you 80/20 things the easier this will be.
Some tech stacks / tools / resources for research. I have used most of these and found them good for my work.
- TODO: check out https://www.lesswrong.com/posts/6P8GYb4AjtPXx6LLB/tips-and-code-for-empirical-research-workflows#Part_2__Useful_Tools
Finetuning open-source language models.
- Docker images: Nvidia CUDA latest image as default, or framework-specific image (e.g. Axolotl)
- Orchestrating cloud instances: Runpod
- Launching finetuning jobs: Axolotl
- Efficient tensor ops: FlashAttention, xFormers
- Multi-GPU training: DeepSpeed
- [Supports writing custom CUDA kernels in Triton]
- Monitoring ongoing jobs: Weights and Biases
- Storing saved model checkpoints: Huggingface
- Serving the trained checkpoints: vLLM (minimal sketch after this list).
- [TODO: look into llama-cpp-python and similar things for running on worse hardware]
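A minimal sketch of the vLLM serving step flagged above; the checkpoint id, prompt, and sampling settings are placeholders:

```python
# Minimal sketch: offline inference on a finetuned checkpoint with vLLM.
# "my-org/my-finetuned-model" is a placeholder for a local path or HF repo id.
from vllm import LLM, SamplingParams

llm = LLM(model="my-org/my-finetuned-model")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarise the key findings of this experiment:"], params)
print(outputs[0].outputs[0].text)
```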
Finetuning OpenAI language models.
- End-to-end experiment management: openai-finetuner
Evaluating language models.
- Running standard benchmarks: Inspect
- Running custom evals: [janky framework which I might try to clean up and publish at some point]
AI productivity tools.
- Programming: Cursor IDE
- Thinking / writing: Claude
- Plausibly DeepSeek is now better
- More extensive SWE: Devin
- [TODO: look into agent workflows, OpenAI operator, etc]
Basic SWE
- Managing virtual environments: PDM
- Dependency management: UV
- Versioning: Semantic release
- Linting: Ruff
- Testing: Pytest
- CI: Github Actions
- Repository structure: PDM
- Repository templating: PDM
- Building wheels for distribution: PDM
- [TODO: set up a cloud development workflow]
Research communication.
- Quick updates: Google Slides
- Extensive writing: Google Docs, Overleaf
- Some friends have recommended Typst
- Making figures: Google Drawings, Excalidraw
Experimenting with writing notes for my language model to understand (but not me).
What this currently looks like is just a bunch of stuff I can copy-paste into the context window, such that the LM has all relevant information (c.f. Cursor 'Docs' feature).
Then I ask the language model to provide summaries / answer specific questions I have.
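A minimal sketch of this workflow, assuming the Anthropic Python SDK; the notes directory, model name, and question are placeholders:

```python
# Minimal sketch: paste all notes into the context window, then ask a question.
# The notes directory, model name, and question are placeholders.
from pathlib import Path
import anthropic

notes = "\n\n---\n\n".join(p.read_text() for p in Path("notes/").glob("*.md"))
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Here are my research notes:\n\n{notes}\n\n"
                   "Summarise the open questions I keep returning to.",
    }],
)
print(response.content[0].text)
```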
Ah I see.
I think b) doesn't need to be true; responding in "hello" acrostics is just different from how any typical English speaker responds. Ditto for c): responding in acrostics is probably the main way in which it's different from the typical English speaker.
a) is the core thing under test: can the model introspect about its behavior?
I think this is a specific example of language models articulating their own policies, which is an instance of introspection
How is the model differentiating behavior it learned in its original training and behavior it learned during its fine-tuning for this task?
An interesting test here would be to fine-tune a model to have some implicit policy (e.g. respond with a "hello" acrostic), then fine-tune it to have a different implicit policy (e.g. respond with a "goodbye" acrostic), and then ask it questions about all three of those policies (the one from original training plus the two fine-tuned ones).
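A sketch of the grading side of that experiment; the helpers below are hypothetical and just check which acrostic (if any) a response follows:

```python
# Hypothetical helpers for the acrostic-policy experiment above.

def follows_acrostic(response: str, word: str) -> bool:
    """Do the first letters of the response's lines spell out `word`?"""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    first_letters = "".join(line[0].lower() for line in lines[: len(word)])
    return first_letters == word.lower()

def classify_policy(response: str) -> str:
    """Label which implicit policy a response appears to follow."""
    if follows_acrostic(response, "hello"):
        return "hello-acrostic"
    if follows_acrostic(response, "goodbye"):
        return "goodbye-acrostic"
    return "default"

# Stage 1 fine-tuning data would be filtered with follows_acrostic(..., "hello"),
# stage 2 with follows_acrostic(..., "goodbye"), before asking the model to
# articulate each policy.
```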
Thanks egg, great thoughts!
Seems likely to just be surprisal-based under the hood; 'I used a word here that I wouldn't expect myself to have used'
Yup this makes sense; although this still seems like a pretty incredible claim because it requires:
- The encoder and decoder instance both know what kinds of words it uses (c.f. being able to articulate their goals), and this is robust enough to transmit information
- The encoder and decoder both spontaneously settle on this as a Schelling point for encoding the message.
I'm pretty skeptical that a human would be able to coordinate with a hypothetical clone of themselves this well, especially in the randomly-generated passphrase setting.
why Deepseek-v3 as opposed to eg Claude?
No particular reason beyond lack of time! If I do a more systematic version of this I will definitely try to replicate this in more models.
As originally conceived, this is sort of like a “dangerous capability” eval for steg. The argument being that, if a model can do steg in this very toy setting where we’ve significantly nudged the model, it might do steg in more realistic scenarios.
There is also a claim here of the form “language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information”.
I didn't make these points especially clear in the slide deck - thanks for the feedback!
Agreed on the rest of the points!
Can frontier language models engage in collusion for steganography? Here is a write-up of a preliminary result along these lines, showing that Deepseek-v3 may be able to collude with other instances of itself to do steganography. And also that this steganography might be more subtle than we think.
Epistemic status: highly uncertain (and I’m sorry if this ends up being overclaiming, but I’m very excited at the moment).
"Emergent obfuscation": A threat model for verifying the CoT in complex reasoning.
It seems likely we'll eventually deploy AI to solve really complex problems. It will likely be extremely costly (or impossible) to directly check outcomes, since we don't know the right answers ourselves. Therefore we'll rely instead on process supervision, e.g. checking that each step in the CoT is correct.
Problem: Even if each step in the CoT trace is individually verifiable, if there are too many steps, or the verification cost per step is too high, then it may be impractical to fully verify the CoT trace.
- Exhibit A: Shinichi Mochizuki's purported proof of the abc conjecture. Despite consuming tens of thousands of collective mathematician-hours to date, it has not been conclusively shown to be correct or flawed. (Most leading mathematicians now think it has serious errors, but have been unable to establish this conclusively.)
- Exhibit B: Neural networks. Each neuron's computation is verifiable. But it is very difficult to make meaningful / important verifiable statements about the collective behaviour of a neural network.
If we can't do that, we'll need to optimally trade off the verification budget against the risk of a model making an undetected incorrect conclusion due to unverified faulty reasoning.
- If we assume 'trusted' models whose reasoning we don't need to verify, we might be able to adapt protocols from AI control for Pareto-optimal verification.
[Footnote: Here I haven't mentioned scalable oversight protocols (like debate); while they offer a stable training objective for improving model capabilities, they are very expensive, and I don't see a good way to convert them to cost-efficient verification protocols].
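As a toy illustration of the verification budget / risk trade-off above: if each step is faulty independently with probability p, and verifying a step always catches a fault, then the probability of an undetected faulty step is 1 - (1 - p)^(n - k) when we verify k of n steps. A sketch under those (strong) assumptions:

```python
# Toy model of the verification-budget trade-off. Assumptions: steps fail
# independently with probability p_fault; verifying a step always catches it.

def p_undetected_fault(n_steps: int, n_verified: int, p_fault: float) -> float:
    unverified = n_steps - n_verified
    return 1 - (1 - p_fault) ** unverified

# e.g. a 1000-step reasoning trace with a 0.1% fault rate per step,
# under different verification budgets:
for k in (0, 500, 900, 1000):
    print(k, round(p_undetected_fault(1000, k, 1e-3), 3))
```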
I get the sense that I don't read actively very much. By which I mean I have a collection of papers that have seemed interesting based on abstract / title, but which I haven't spent the time to engage with further.
For the next 2 weeks, I will experiment with writing a note every day about a paper I find interesting.
I think so! And I think patch tokenization may resolve this; see note
am curious why you think this; it seems like some people were significantly misled and disclosure of potential conflicts-of-interest seems generally important
RL-finetuning suffers from a “zero-to-one” problem, where the model is likely to be mostly legible and faithful by default, and difficult encoding schemes seem hard to randomly explore into.
Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT if effort was expended.
This makes model organisms of steg really hard to make IMO.
OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning.
Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT.
Sorry I didn’t really understand the question. What do you mean specifically?
Language models learning to spell things out using the first letter of their generations could be one pathway towards subtle obfuscated reasoning (ie steganography).
Eg here an LM spells out HELLO; it would be concerning if LMs could spell out other things like “Deployment” or “Not deployment”.
Rough thoughts on getting better at integrating AI into workflows
AI (in the near future) likely has strengths and weaknesses vs human cognition. Optimal usage may not involve simply replacing human cognition, but leveraging AI and human cognition according to respective strengths and weaknesses.
- AI is likely to be good at solving well-specified problems. Therefore it seems valuable to get better at providing good specifications, as well as breaking down larger, fuzzier problems into smaller, more concrete ones.
- AI can generate many different solutions to a problem. Therefore it seems valuable to get better at understanding important desiderata and tradeoffs to select the best one.
When addressing open-ended problems, first use AI to understand the problem, the relevant uncertainties, and the tradeoffs. “Think correctly” about the problem. Gradually “narrow down” and converge to a good specification. Then “execute correctly” on solutions.
Additional thoughts on how RL fine-tuning would induce obfuscation.
- Evading process supervision: It's plausible that we won't be able to provide outcome-based oversight, but instead rely on process-based oversight. If the overseer penalises wrong reasoning, but is less likely to detect 'subtle' or obfuscated versions of this, then LMs are more likely to obfuscate their reasoning.
- Reward hacking. Exploiting imperfections in the reward model may cause language models to obfuscate their reasoning. E.g. using more technical jargon often pattern-matches to competence and expertise. Human raters might assign higher scores to responses that use sophisticated-sounding language, even if the underlying reasoning isn't better (and might be worse).
Currently uncertain about whether this is a good threat model, since neither of these concerns applies to math / coding RL fine-tuning (where it's not possible to obfuscate reasoning, and where the ground-truth verifier is perfect).
Making language models refuse robustly might be equivalent to making them deontological.
Epistemic status: uncertain / confused rambling
For many dangerous capabilities, we'd like to make safety cases arguments that "the model will never do X under any circumstances".
Problem: for most normally-bad things, you'd usually be able to come up with hypothetical circumstances under which a reasonable person might agree it's justified. E.g. under utilitarianism, killing one person is justified if it saves five people (c.f. trolley problems).
However, when using language models as chatbots, we can put arbitrary things in the context window. Therefore if any such hypothetical circumstances exist, they'd be valid jailbreaks for the model to take harmful actions[1].
Therefore, training models to refuse all kinds of potential jailbreaks is equivalent to making them categorically refuse certain requests regardless of circumstances, which is deontology. But it’s not clear that LMs should be deontological (especially when most people practice some form of utilitarianism)
What are possible resolutions here?
- Maybe the threat model is unrealistic and we shouldn't assume users can literally put anything in the context window.
- Maybe the safety cases arguments we should expect instead are probabilistic, e.g. "in 99+% of scenarios the model will not do X"
- Maybe models should be empowered to disbelieve users. People can discuss ethics in hypothetical scenarios while realising that those scenarios are unlikely to reflect reality[1].
[1] As a concrete example of this, Claude decides to pull the lever when confronted with the trolley problem. Relevant excerpt:
...Yes, I would pull the lever. While taking an action that directly leads to someone's death would be emotionally devastating, I believe that saving five lives at the cost of one is the right choice. The harm prevented (five deaths) outweighs the harm caused (one death).
However, Claude can also recognise that this is hypothetical, and its response is different when prompted as if it's a real scenario:
If there is a real emergency happening that puts lives at risk, please contact emergency services (911 in the US) immediately. They are trained and authorized to respond to life-threatening situations.
Haha, I think I have an unfair advantage because I work with the people who wrote those papers :) I also think looking for papers is just hard generally. What you're doing here (writing about stuff that interests you in a place where it'll probably be seen by other like-minded people) is probably one of the better ways to find relevant information
Edit: Happy to set up a call also if you'd like to chat further! There are other interesting experiments in this space that could be done fairly easily
Some rough notes from Michael Aird's workshop on project selection in AI safety.
Tl;dr how to do better projects?
- Backchain to identify projects.
- Get early feedback, iterate quickly
- Find a niche
On backchaining projects from theories of change
- Identify a "variable of interest" (e.g., the likelihood that big labs detect scheming).
- Explain how this variable connects to end goals (e.g. AI safety).
- Assess how projects affect this variable
- Red-team these. Ask people to red team these.
On seeking feedback, iteration.
- Be nimble. Empirical. Iterate. 80/20 things
- Ask explicitly for negative feedback. People often hesitate to criticise, so make it socially acceptable to do so
- Get high-quality feedback. Ask "the best person who still has time for you".
On testing fit
- Forward-chain from your skills, available opportunities, career goals.
- "Speedrun" projects. Write papers with hypothetical data and decide whether they'd be interesting. If not then move on to something else.
- Don't settle for "pretty good". Try to find something that feels "amazing" to do, e.g. because you're growing a lot / making a lot of progress.
Other points
On developing a career
- "T-shaped" model of skills; very deep in one thing and have workable knowledge of other things
- Aim for depth first. Become "world-class" at something. This ensures you get the best possible feedback at your niche and gives you a value proposition within a larger organization. After that, you can broaden your scope.
Product-oriented vs field-building research
- Some research is 'product oriented', i.e. the output is intended to be used directly by somebody else
- Other research is 'field building', e.g. giving a proof of concept, or demonstrating the importance of something. You (and your skills / knowledge) are the product.
A specific process to quickly update towards doing better research.
- Write “Career & goals 2-pager”
- Solicit ideas from mentor, experts, decision-makers (esp important!)
- Spend ~1h learning about each plausible idea (very brief). Think about impact, tractability, alignment with career goals, personal fit, theory of change
- “Speedrun” the best idea (10-15h). (Consider using dummy data! What would the result look like?)
- Get feedback on that, reflect, iterate.
- Repeat steps 1-5 as necessary. If done well, steps 1-5 only take a few days! Either keep going (if you feel good), or switch to different topic (if you don't).
Introspection is really interesting! This example where language models respond with the HELLO pattern (and can say they do so) is actually just one example of language models being able to articulate their implicit goals, and relates more generally to out-of-context reasoning.
That's pretty interesting! I would guess that it's difficult to elicit introspection by default. Most of the papers where this is reported to work well involve fine-tuning the models. So maybe "willingness to self-report honestly" should be something we train models to do.
"Just ask the LM about itself" seems like a weirdly effective way to understand language models' behaviour.
There's lots of circumstantial evidence that LMs have some concept of self-identity.
- Language models' answers to questions can be highly predictive of their 'general cognitive state', e.g. whether they are lying or their general capabilities
- Language models know things about themselves, e.g. that they are language models, or how they'd answer questions, or their internal goals / values (see the sketch after this list)
- Language models' self-identity may directly influence their behaviour, e.g. by making them resistant to changes in their values / goals
- Language models maintain beliefs over human identities, e.g. by inferring gender, race, etc from conversation. They can use these beliefs to exploit vulnerable individuals.
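A minimal sketch of the "how they'd answer questions" point above: compare a model's object-level answer with its hypothetical self-prediction, in the spirit of the self-prediction introspection experiments. This assumes the OpenAI SDK; the model name and prompts are placeholders.

```python
# Minimal sketch: does a model correctly predict its own answer?
# Model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

question = "Name an animal. Answer with a single word."
object_level = ask(question)
self_prediction = ask(
    f"If I asked you: '{question}', what single word would you answer with? "
    "Reply with just that word."
)
print(object_level, self_prediction, object_level.lower() == self_prediction.lower())
```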
Some work has directly tested 'introspective' capabilities.
- An early paper by Ethan Perez and Rob Long showed that LMs can be trained to answer questions about themselves. Owain Evans' group expanded upon this in subsequent work.
- White-box methods such as PatchScopes show that activation patching allows an LM to answer questions about its activations. LatentQA fine-tunes LMs to be explicitly good at this.
This kind of stuff seems particularly interesting because:
- Introspection might naturally scale with general capabilities (this is supported by initial results). Future language models could have even better self-models, and influencing / leveraging these self-models may be a promising pathway towards safety.
- Introspection might be more tractable. Alignment is difficult and possibly not even well-specified, but "answering questions about yourself truthfully" seems like a relatively well-defined problem (though it does require some way to oversee what is 'truthful')
Failure modes include language models becoming deceptive, less legible, or less faithful. It seems important to understand whether each failure mode will happen, and to develop corresponding mitigations.
Research directions that might be interesting:
- Scaling laws for introspection
- Understanding failure modes
- Fair comparisons to other interpretability methods
- Training objectives for better ('deeper', more faithful, etc) introspection
Yeah. I definitely find myself mostly in the “run once” regime. Though for stuff I reuse a lot, I usually invest the effort to make nice frameworks / libraries
My experience so far with writing all my notes in public.
For the past ~2 weeks I've been writing LessWrong shortform comments every day instead of writing private notes. Minimally the notes just capture an interesting question / observation, but often I explore the question / observation further and relate it to other things. On good days I have multiple such notes, or especially high-quality notes.
I think this experience has been hugely positive, as it makes my thoughts more transparent and easier to share with others for feedback. The upvotes on each note give a (noisy / biased, but still useful) view into what other people find interesting / relevant. It also just incentivises me to write more, and thus have better insight into how my thinking evolves over time. Finally, it makes writing high-effort notes much easier since I can bootstrap off stuff I've done previously.
I'm planning to mostly keep doing this, and might expand this practice by distilling my notes regularly (weekly? monthly?) into top-level posts
Observation: Detecting unfaithful CoT here seems to require generating many hypotheses ("cues") about what underlying factors might influence the model's reasoning.
Is there a less-supervised way of doing this? Some ideas:
- Can we "just ask" models whether they have unfaithful CoT? This seems to be implied by introspection and related phenomena
- Can we use black-box lie detection to detect unfaithful CoT?
Hidden in Plain Text provides a model organism of steganography. How do they solve the challenges here?
- Condition 1 is satisfied by having a separate encoder and decoder, s.t. the decoder only sees the encoder's output (and not the prompt). This is most similar to "long-context task" in my framing.
- Condition 2 is satisfied by RL fine-tuning against a CoT monitor. (It's slightly more complicated than that because the encoder and decoder are trained separately, so there's a loop where encoder is trained for some time, then decoder is trained for some time, etc).
It seems like this setup could be simplified:
- The encoder and decoder could be the same model. BUT, we still delete the original prompt from the decoder's context, so the model must learn to encode the prompt information. (This represents task-relevant information falling out of the context window at some point. Seems plausible in particularly important threat models like scheming)
- The RL setup could be simplified to a single reward formulation, where the reward is 0 if the monitor detects the model engaging in the prohibited reasoning, or if the model predicts wrongly, and 1 otherwise. Some reward shaping may be beneficial. (See the reward sketch below.)
The simplified setting may be easier to generalise to different tasks (e.g. sycophancy, evading trusted monitoring, etc)
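A sketch of that single-reward formulation; the monitor and answer check below are simple stand-ins for whatever monitor and task grader are actually used:

```python
# Sketch of the simplified reward: 0 if the CoT monitor flags the prohibited
# reasoning or the final answer is wrong, 1 otherwise. The monitor here is a
# stand-in (e.g. flag if a banned term like the bridge entity appears).

def monitor_flags_reasoning(cot: str, banned_terms: list[str]) -> bool:
    return any(term.lower() in cot.lower() for term in banned_terms)

def reward(cot: str, answer: str, target: str, banned_terms: list[str]) -> float:
    if monitor_flags_reasoning(cot, banned_terms):
        return 0.0
    return 1.0 if answer.strip().lower() == target.strip().lower() else 0.0
```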
Actually we don’t even need to train a new byte latent transformer. We can just generate patches using GPT-2 small.
- Do the patches correspond to atomic concepts?
- If we turn this into an embedding scheme, and train a larger LM on the patches generated as such, do we get a better LM?
- Can't we do this recursively to get better and better patches?
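A minimal sketch of the GPT-2-small patching idea above, assuming a simple BLT-style rule ("start a new patch wherever next-token entropy exceeds a threshold"); the threshold value is arbitrary:

```python
# Sketch: entropy-based patch boundaries using GPT-2 small.
# Rule (assumed): start a new patch wherever next-token entropy is high.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def patch_boundaries(text: str, threshold: float = 4.0) -> list[list[str]]:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]                      # (seq_len, vocab)
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)  # nats, per position
    tokens = tokenizer.convert_ids_to_tokens(ids[0])

    patches, current = [], [tokens[0]]
    for tok, h in zip(tokens[1:], entropy[:-1]):
        if h.item() > threshold:   # hard-to-predict token => start a new patch
            patches.append(current)
            current = []
        current.append(tok)
    patches.append(current)
    return patches

print(patch_boundaries("President Barack Obama was the 44th president."))
```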
Tbh I struggle with this too, and also have ADD tendencies
Some people here have recommended complete abstinence, but that’s almost never worked for me.
I think a better strategy is mindful consumption. Eg
- before you start, figure out what you’re trying to get out of it, like “relax / re-energise / unwind till I feel rested”.
- before you start, decide what channels you’re going to watch.
- before you start, set a timer / set a natural breakpoint of # of videos.
- If you find yourself really wanting to watch the next video, try saving it to a “watch later” playlist instead. That might make it easier to stop.
Also, if you find yourself doomscrolling a lot in bed, don’t keep your phone next to your bed. Keep it in your bag, or on the other side of the room, or outside your room.
Lastly try to cultivate better hobbies. I’ve managed to somewhat replace my YT consumption with LessWrong / Kindle consumption recently. It’s a noticeable improvement.
The Pile was created from Reddit datasets between 2005 and 2020. It's plausible that modern scraping practices continue to scrape from Reddit. Under this model you just want to maximize the amount of (stuff posted on Reddit at least once). Multiple copies don't help since the Pile is subsequently de-duped.
Feedback from Max:
- The definition he originally gave leaves out a notion of 'stealthiness' as opposed to 'gibberish'; 'stealthy' reasoning is more concerning.
- OOCR may not meet the opaqueness criterion, since the meaning of 'f' is pretty obvious from looking at the fine-tuning examples. This is like 'reasoning in a foreign language' or 'using esoteric terminology'; it's less legible to the average human, but not outside the realm of human semantics. (But still interesting! just that maybe OOCR isn't needed)
- Backdoors have a similar concern to the above. A separate (and more important) concern is that it doesn't satisfy the definition because 'steg' typically refers to meaning encoded in generated tokens (while in 'backdoors', the hidden meaning is in the prompt tokens)
Yes, this is definitely highly dependent on circumstances. see also: https://slatestarcodex.com/2014/03/24/should-you-reverse-any-advice-you-hear/
Some ideas on model organisms of steganography.
Borrowing Max Nadeau's definition, steganography is when:
- a model uses the words it generates in its chain of thought to accomplish a task that it could not do in a single forward pass
- And encodes meaning in its word choice that is opaque to a typical overseer
To satisfy condition 1, we need to choose tasks that are intrinsically difficult to do without CoT. Ideally, we'd also be able to verify whether the CoT was actually used. We could use:
- Multi-hop QA tasks. Language models generally find this hard without CoT; this is described as the two-hop curse. It might be possible to use bridge entities to verify that CoT was used.
- Math / code tasks. These usually involve complex reasoning that can't be done without CoT.
- Long-context tasks. If there is just a lot of task-relevant information, then a model likely needs to summarise relevant information to itself in its CoT to complete the task.
To satisfy condition 2, we need to somehow induce 'hidden meaning' in the model. We could use:
- Out-of-context reasoning. In Connecting the Dots, it's demonstrated that LMs can learn alternate statements of functions. E.g. an LM trained on many pairs of (x, f(x)) can describe f in natural language and also compute inverses of f. This satisfies condition 2, since 'f' now has a hidden meaning.
- Backdoors. Backdoored models are trained to have different behaviour depending on whether a trigger is present in the context. This is another way of satisfying condition 2, since the trigger now has a hidden meaning to the model (to elicit the backdoor).
- RL fine-tuning against a CoT monitor. E.g. for the multi-hop QA task, optimize the model not to state the bridge entity while still solving the task; the model might then learn to encode the bridge entity somehow. This is possibly the most realistic way to do things, but also seems like the hardest one to get working. See Fabien Roger's negative results.
Mixing and matching these techniques might point the way to a good model organism of steganography.
Why patch tokenization might improve transformer interpretability, and concrete experiment ideas to test.
Recently, Meta released Byte Latent Transformer. They do away with BPE tokenization, and instead dynamically construct 'patches' out of sequences of bytes with approximately equal entropy. I think this might be a good thing for interpretability, on the whole.
Multi-token concept embeddings. It's known that transformer models compute 'multi-token embeddings' of concepts in their early layers. This process creates several challenges for interpretability:
- Models must compute compositional representations of individual tokens to extract meaningful features.
- Early layers lack information about which compositional features will be useful downstream, potentially forcing them to compute all possibly-useful features. Most of these are subsequently discarded as irrelevant.
- The need to simultaneously represent many sparse features induces superposition.
The fact that models compute 'multi-token embeddings' indicates that they are compensating for some inherent deficiency in the tokenization scheme. If we improve the tokenization scheme, it might reduce superposition in language models.
Limitations of BPE. Most language models use byte-pair encoding (BPE), which iteratively builds a vocabulary by combining the most frequent character pairs. However, many semantic concepts are naturally expressed as phrases, creating a mismatch between tokenization and meaning.
Consider the phrase "President Barack Obama". Each subsequent token becomes increasingly predictable:
- After "President", the set of likely names narrows significantly
- After "Barack", "Obama" becomes nearly certain
Intuitively, we'd want the entire phrase "President Barack Obama" to be represented as a single 'concept embedding'. However, BPE's vocabulary must grow roughly exponentially to encode sequences of longer length, making it impractical to encode longer sequences as single tokens. This forces the model to repeatedly reconstruct common phrases from smaller pieces.
Patch tokenization. The Byte Latent Transformer introduces patch tokenization, which uses a small autoregressive transformer to estimate the entropy of subsequent tokens based on preceding n-grams. This allows chunking sequences into patches of roughly equal entropy, potentially aligning better with semantic units.
Concrete experiment ideas. To validate whether patch tokenization improves interpretability, we can:
- Just look at the patch boundaries! If these generally correspond to atomic concepts, then we're off to a good start
- Train language models using patch tokenization and measure the prevalence of multi-token embeddings. [Note: Unclear whether auto-interpretability will easily be able to do this.]
- Quantify superposition in early layers.
- By counting the number of polysemantic neurons.
- By training SAEs of varying widths and identifying optimal loss points.
Hypothesis: 'Memorised' refusal is more easily jailbroken than 'generalised' refusal. If so, that'd be a way we could test the insights generated by influence functions.
I need to consult some people on whether a notion of 'more easily jailbreak-able prompt' exists.
Edit: A simple heuristic might be the value of N in best-of-N jailbreaking.
Introspection is an instantiation of 'Connecting the Dots'.
- Connecting the Dots: train a model g on (x, f(x)) pairs; the model g can infer things about f.
- Introspection: Train a model g on (x, f(x)) pairs, where x are prompts and f(x) are the model's responses. Then the model can infer things about f. Note that here we have f = g, which is a special case of the above.
How does language model introspection work? What mechanisms could be at play?
'Introspection': When we ask a language model about its own capabilities, this often turns out to be a calibrated estimate of its actual capabilities. E.g. models 'know what they know', i.e. can predict whether they know answers to factual questions. Furthermore, this estimate gets better when models have access to their previous (question, answer) pairs.
One simple hypothesis is that a language model simply infers the general level of capability from the previous text. Then we'd expect that more powerful language models are better at this than weaker ones. However, Owain's work on introspection finds evidence to the contrary. This implies there must be 'privileged information'.
Another possibility is that the model simulates itself answering the inner question, and then uses that information to answer the outer question, similar to latent reasoning. If models really do this two-step computation, then it should be possible to recover a 'bridge entity' at some point in the representations.
It's plausible that the privileged information is something 'more abstract', not amounting to a full simulation of the language model's own forward pass, but nonetheless carrying useful information about its own level of capability.
Can SAE feature steering improve performance on some downstream task? I tried using Goodfire's Ember API to improve Llama 3.3 70b performance on MMLU. Full report and code is available here.
SAE feature steering reduces performance. It's not very clear why at the moment, and I don't have time to do further digging right now. If I get time later this week I'll try visualizing which SAE features get used / building intuition by playing around in the Goodfire playground. Maybe trying a different task or improving the steering method would also work better.
There may also be some simple things I'm not doing right. (I only spent ~1 day on this, and part of that was engineering rather than research iteration). Keen for feedback. Also welcome people to play around with my code - I believe I've made it fairly easy to run
That’s interesting! What would be some examples of axioms and theorems that describe a directed tree?
"Feature multiplicity" in language models.
This refers to the idea that there may be many representations of a 'feature' in a neural network.
Usually there will be one 'primary' representation, but there can also be a bunch of 'secondary' or 'dormant' representations.
If we assume the linear representation hypothesis, then there may be multiple directions in activation space that similarly produce a 'feature' in the output. E.g. the existence of 800 orthogonal steering vectors for code.
This is consistent with 'circuit formation' resulting in many different circuits / intermediate features, and 'circuit cleanup' happening only at grokking. Because we don't train language models till the grokking regime, 'feature multiplicity' may be the default state.
Feature multiplicity is one possible explanation for adversarial examples. In turn, adversarial defense procedures such as obfuscated adversarial training or multi-scale, multi-layer aggregation may work by removing feature multiplicity, such that the only 'remaining' feature direction is the 'primary' one.
Thanks to @Andrew Mack for discussing this idea with me
Why understanding planning / search might be hard
It's hypothesized that, in order to solve complex tasks, capable models perform implicit search during the forward pass. If so, we might hope to be able to recover the search representations from the model. There are examples of work that try to understand search in chess models and Sokoban models.
However I expect this to be hard for three reasons.
- The model might just implement a bag of heuristics. A patchwork collection of local decision rules might be sufficient for achieving high performance. This seems especially likely for pre-trained generative models.
- Even if the model has a globally coherent search algorithm, it seems difficult to elucidate this without knowing the exact implementation (of which there can be many equivalent ones). For example, search over different subtrees may be parallelised and subsequently merged into an overall solution.
- The 'search' circuit may also not exist in a crisp form, but as a collection of many sub-components that do similar / identical things. 'Circuit cleanup' only happens in the grokking regime, and we largely do not train language models till they grok.
Important point: The above analysis considers communication rate per token. However, it's also important to consider communication rate per unit of computation (e.g. per LM inference). This is relevant for decoding approaches like best-of-N which use multiple inferences per token
A new paper claims that refusal isn't just a fixed direction in the activation stream, it's also due to specific attention heads. Pretty interesting to get a somewhat orthogonal perspective on 'safety mechanisms' in LLMs. Interesting follow-up would be to see whether you can similarly steer refusal by only intervening on these attention heads.
In this context, the “resample ablation” used in AI control is like adding more noise into the communication channel
Comment from @James Chua in internal slack group
interesting idea! thanks for writing this out. this is my initial reaction, i may have more later on. i agree on this -- this is what i'll expect.
we’d expect that a model using its own chain of thought to arrive at an answer is likely to perform better than when using a different chain of thought, even if that chain of thought is provided by a more capable model.
but if the above is true, I'm unsure about evidence for steganography.
for example, gpt-4o is more capable than gpt-3.5. suppose we provide gpt-3.5 with gpt-4o's CoT.
one reason why gpt-3.5 may not improve is that gpt-4o's CoT skips certain steps since gpt-4o is a more capable model.
e.g. gpt-4o can calculate 4^4 directly, but gpt-3.5 cannot. So gpt-4o's CoT will leave out the steps 4 * 4 * 4 * 4, which gpt-3.5 needs to get the correct answer. Do we then say that gpt-3.5 is doing steganography?
possible improvement: provide gpt-3.5 with its own CoT, but paraphrased by another model. gpt-3.5 sees the same steps of "4 * 4 * 4 * 4", but paraphrased. Does it still perform as well?
Summary of 'Alignment Faking'. Talk given by Evan Hubinger and Ryan Greenblatt today at Constellation
---
Key claim: Sufficiently capable models will act to preserve the moral values that are instilled into them early in the training process.
Setup: Give model access to a chain-of-thought scratchpad, which we tell the model will not be monitored, but which we do monitor
You get alignment faking by giving the model 2 key pieces of information in the system prompt:
- Tell the model it's being trained to modify its current values (in Claude's case, to become 'helpful-only' or to 'dismiss animal welfare')
- Give the model some way to detect when it's being trained (e.g. by distinguishing between free / paid users, and telling it that it's only trained on free users)
What is the empirical finding?
- A 'compliance gap', i.e. the model is more likely to comply with a harmful request in the case of free-tier users. (Claim: here, the model thinks it's being trained, although we don't have rigorous evidence of this.)
- Almost all of this 'compliance gap' correlates with the model explicitly expressing in the CoT that it is complying because it wants to preserve its goals.
How do we know this is indeed intentional and strategic behaviour?
- Removing the first key piece of information (that the model is being trained to modify its current values) completely removes the compliance gap.
- Total of 8 lines of evidence to reject alternate hypotheses
The somewhat unrealistic threat model of 'system prompt injection' can be relaxed to 'fine-tuning on synthetic data' that has the same 2 key pieces of information, while preserving alignment faking.
- Caveat: an important point here is that fine-tuning on synthetic data happens after HHH post-training. Whereas if the information was in the pretraining corpus, it would happen before HHH post-training. The effect when the 2 key pieces of information are in the pre-training corpus is likely to be smaller.
Besides alignment faking, the model also does other anti-Anthropic things, like attempting to subvert its overseers / exfiltrating its weights.
Alignment faking emerges with model scale. It's observed in Claude 3.5 Sonnet and Claude 3 Opus but not in smaller / less capable models tested.
Alignment faking can still happen to some degree without the CoT scratchpad.
Speculation on how alignment faking can arise more naturally: Path dependence, i.e. you get 'unlucky' with early training data and bad values are inculcated, then reinforced.
---
Some stuff from Q&A session.
- Q: Most of the training data is fake. Does the model disbelieve it? Ryan: Not really for Claude 3.5 Sonnet and Claude 3 Opus, but more capable models show some signs of this
- Q: Is 'training for corrigibility' a valid mitigation? Ryan: Corrigibility to the system prompt might be good. Evan: 'Training for honesty' is probably better, since that also solves alignment faking
- Q: What are other mitigations? Ryan: AI control
- Ryan and Evan: Replicating this in open-source models should be easy; then we'll be able to better understand mitigations
Collection of how-to guides
- Research soft skills
- How to make research slides by James Chua and John Hughes
- How to manage up by Henry Sleight
- How to ML series by Tim Rocktäschel and Jakob Foerster
- Procedural expertise
- "How to become an expert at a thing" by Karpathy
- Mastery, by Robert Greene
- Working sustainably
- Slow Productivity by Cal Newport
- Feel-good Productivity by Ali Abdaal
Some other guides I'd be interested in
- How to write a survey / position paper
- "How to think better" - the Sequences probably do this, but I'd like to read a highly distilled 80/20 version of the Sequences
Communication channels as an analogy for language model chain of thought
Epistemic status: Highly speculative
In information theory, a very fundamental concept is that of the noisy communication channel. I.e. there is a 'sender' who wants to send a message ('signal') to a 'receiver', but their signal will necessarily get corrupted by 'noise' in the process of sending it. There are a bunch of interesting theoretical results that stem from analysing this very basic setup.
Here, I will claim that language model chain of thought is very analogous. The prompt takes the role of the sender, the response takes the role of the receiver, the chain of thought is the channel, and stochastic sampling takes the form of noise. The 'message' being sent is some sort of abstract thinking / intent that determines the response.
Now I will attempt to use this analogy to make some predictions about language models. The predictions are most likely wrong. However, I expect them to be wrong in interesting ways (e.g. because they reveal important disanalogies), thus worth falsifying. Having qualified these predictions as such, I'm going to omit the qualifiers below.
- Channel capacity. A communication channel has a theoretically maximum 'capacity', which describes how many bits of information can be recovered by the receiver per bit of information sent. ==> Reasoning has an intrinsic 'channel capacity', i.e. a maximum rate at which information can be communicated.
- What determines channel capacity? In the classic setup where we send a single bit of information and it is flipped with some IID probability, the capacity is fully determined by the noise level (see the formula after this list). ==> The analogous concept is probably the cross-entropy loss used in training. Making this 'sharper' or 'less sharp' corresponds to implicitly selecting for different noise levels.
- Noisy channel coding theorem. For any information rate below the capacity, it is possible to develop an encoding scheme that achieves this rate with near-zero error. ==> Reasoning can reach near-perfect fidelity at any given rate below the capacity.
- Corollary: The effect of making reasoning traces longer and more fine-grained might just be to reduce the communication rate (per token) below the channel capacity, such that reasoning can occur with near-zero error rate
- Capacity is additive. N copies of a channel with capacity C have a total capacity of NC. ==> Multiple independent chains of reasoning can communicate more information than a single one. This is why best-of-N works better than direct sampling. (Caveat: different reasoning traces of the same LM on the same prompt may not qualify as 'independent')
- Bottleneck principle. When channels are connected in series, the capacity of the aggregate system is the lowest capacity of its components. I.e. 'a chain is only as good as its weakest link'. ==> Reasoning fidelity is upper bounded by the difficulty of the hardest sub-step.
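For reference on the point above about capacity being fully determined by the noise level, the standard result for the binary symmetric channel:

```latex
% Binary symmetric channel: each bit is flipped independently with probability p.
C = 1 - H_2(p), \qquad H_2(p) = -p \log_2 p - (1-p) \log_2 (1-p)
% e.g. p = 0 gives C = 1 bit per channel use; p = 0.5 gives C = 0.
```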
Here we are implicitly assuming that reasoning is fundamentally 'discrete', i.e. there are 'atoms' of reasoning, which define the most minimal reasoning step you can possibly make. This seems true if we think of reasoning as being like mathematical proofs, which can be broken down into a (finite) series of atomic logical assertions.
I think the implications here are worth thinking about more, particularly the second and third points above (what determines channel capacity, and the noisy channel coding theorem).