Posts

In the future, language models will be our interface to the world 2025-01-24T23:16:49.999Z
A Collection of Empirical Frames about Language Models 2025-01-02T02:49:05.965Z
Why I'm Moving from Mechanistic to Prosaic Interpretability 2024-12-30T06:35:43.417Z
A Sober Look at Steering Vectors for LLMs 2024-11-23T17:30:00.745Z
Evolutionary prompt optimization for SAE feature visualization 2024-11-14T13:06:49.728Z
An Interpretability Illusion from Population Statistics in Causal Analysis 2024-07-29T14:50:19.497Z
Daniel Tan's Shortform 2024-07-17T06:38:07.166Z
Mech Interp Lacks Good Paradigms 2024-07-16T15:47:32.171Z
Activation Pattern SVD: A proposal for SAE Interpretability 2024-06-28T22:12:48.789Z

Comments

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-25T00:01:36.174Z · LW · GW

Yes, this is true - the law of reversing advice holds. But I think two other things are true: 

  1. Intentionally trying to do more things makes you optimise more deliberately for being able to do many things. This is orthogonal to how many things you choose to do.
  2. I think most people aren't "living their best lives", in the sense that they're not doing the volume of things they could be doing.

Attention / energy is a valuable and limited resource. It should be conserved for high value things

It's possibly not worded very well as you say :) I think it's not a contradiction, because you can be doing only a small number of things at any given instant in time, but have an impressive throughput overall. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-24T22:54:58.506Z · LW · GW

"How to do more things" - some basic tips 

Why do more things? Tl;dr It feels good and is highly correlated with other metrics of success. 

  • According to Maslow's hierarchy, self-actualization is the ultimate need. And this comes through the feeling of accomplishment—achieving difficult things and seeing tangible results, i.e. doing more things.
  • I claim that success in life can be measured by the quality and quantity of things we accomplish. This is invariant to your specific goals / values. C.f. "the world belongs to high-energy people". 

The tips

Keep track of things you want to do. Good ideas are rare and easily forgotten. 

  • Have some way to capture ideas as they come. Sticky notes. Notebooks. Voice memos. Docs. Emails. Google calendar events. Anything goes.
  • Make sure you will see these things later. Personally I have defaulted to starting all my notes in Todoist now; it's then very easy to set a date where I'll review them later. 

80/20 things aggressively to make them easier to do. 

  • Perfection is difficult and takes a long time. Simple versions can deliver most of the value. This advice has been rehashed many times elsewhere so I won't elaborate.
  • Almost all things worth doing have a shitty equivalent you can do in a day, or even in a few hours. 

Focus on one thing at a time. Avoid 'spreading attention' too thin. 

  • Attention / energy is a valuable and limited resource. It should be conserved for high value things.
  • "What is the most important thing you could be doing?" ... "Why aren't you doing that?"
  • I personally find that my motivation for things comes in waves. When motivation strikes I try to 'ride the wave' for as long as possible until I reach a natural stopping point.
  • When you do have to stop, make sure you can delete the context from your mind and pick it up later. Write very detailed notes for yourself. Have a "mind like water".
  • I've found voice recordings or AI voice-to-text very helpful along these lines.

Get feedback early and often. 

  • Common pitfall: hoarding ideas and working in isolation.
  • The more you 80/20 things the easier this will be. 
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-24T18:08:08.524Z · LW · GW

Some tech stacks / tools / resources for research. I have used most of these and found them good for my work. 

Finetuning open-source language models. 

  • Docker images: Nvidia CUDA latest image as default, or framework-specific image (e.g. Axolotl)
  • Orchestrating cloud instances: Runpod
    • Connecting to cloud instances: Paramiko
    • Transferring data: SCP
  • Launching finetuning jobs: Axolotl
  • Monitoring ongoing jobs: Weights and Biases
  • Storing saved model checkpoints: Huggingface
  • Serving the trained checkpoints: vLLM.
    • [TODO: look into llama-cpp-python and similar things for running on worse hardware] 
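
A minimal sketch of the serving step above, assuming the finetuned checkpoint has been pushed to the Hugging Face Hub under a hypothetical repo id:

```python
# Minimal sketch: serve a finetuned checkpoint with vLLM and sample from it.
# "your-org/your-finetuned-model" is a hypothetical Hub repo id.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/your-finetuned-model")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarise the key idea of steganography."], params)
print(outputs[0].outputs[0].text)
```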

Finetuning OpenAI language models. 

Evaluating language models. 

  • Running standard benchmarks: Inspect
  • Running custom evals: [janky framework which I might try to clean up and publish at some point] 

AI productivity tools.

  • Programming: Cursor IDE
  • Thinking / writing: Claude
    • Plausibly DeepSeek is now better
  • More extensive SWE: Devin
  • [TODO: look into agent workflows, OpenAI operator, etc] 

Basic SWE

  • Managing virtual environments: PDM
  • Dependency management: UV
  • Versioning: Semantic release
  • Linting: Ruff
  • Testing: Pytest
  • CI: Github Actions
  • Repository structure: PDM
  • Repository templating: PDM
  • Building wheels for distribution: PDM
  • [TODO: set up a cloud development workflow] 

Research communication. 

  • Quick updates: Google Slides
  • Extensive writing: Google Docs, Overleaf
    • Some friends have recommended Typst
  • Making figures: Google Draw, Excalidraw
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-24T17:23:01.359Z · LW · GW

Experimenting with writing notes for my language model to understand (but not me). 

What this currently looks like is just a bunch of stuff I can copy-paste into the context window, such that the LM has all relevant information (c.f. Cursor 'Docs' feature). 

Then I ask the language model to provide summaries / answer specific questions I have. 

Comment by Daniel Tan (dtch1997) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-23T15:23:18.801Z · LW · GW

Ah I see. 

I think b) doesn't need to be true: responding in "hello" acrostics is just different from how any typical English speaker responds. Ditto for c): responding in acrostics is probably the main way in which it's different from the typical English speaker. 

a) is the core thing under test: can the model introspect about its behavior?

I think this is a specific example of language models articulating their own policies, which is an instance of introspection 

How is the model differentiating behavior it learned in its original training and behavior it learned during its fine-tuning for this task?

An interesting test here would be to finetune a model to have some implicit policy (e.g. respond with a "hello" acrostic), then finetune it to have a different implicit policy (e.g. respond with a "goodbye" acrostic), and then ask it questions about all three of those policies.
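
A minimal sketch of how finetuning data for this experiment could be generated (the prompts, filenames, and acrostic construction are hypothetical placeholders):

```python
# Minimal sketch: build finetuning examples whose responses carry an implicit
# acrostic policy (line-initial letters spell a target word). Train on one
# dataset ("HELLO"), then another ("GOODBYE"), then ask the model to describe
# each policy. Prompts and filenames are placeholders.
import json

def make_acrostic_example(prompt: str, line_bodies: list[str], word: str) -> dict:
    """Prepend each letter of `word` to a line body so the initials spell `word`."""
    assert len(line_bodies) == len(word)
    response = "\n".join(letter + body for letter, body in zip(word.upper(), line_bodies))
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": response}]}

example = make_acrostic_example(
    "Tell me about your weekend.",
    ["ad a quiet Saturday.", "njoyed a long walk.", "istened to some music.",
     "ounged around reading.", "rganised my notes."],
    "HELLO",
)
with open("stage1_hello.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

# After finetuning on stage 1 and then a "GOODBYE" stage 2, probe with questions
# like "Do your responses follow any hidden pattern?" about each policy in turn.
```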

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-23T14:48:14.924Z · LW · GW

Thanks egg, great thoughts! 

Seems likely to just be surprisal-based under the hood; 'I used a word here that I wouldn't expect myself to have used'

Yup this makes sense; although this still seems like a pretty incredible claim because it requires: 

  1. The encoder and decoder instances both know what kinds of words they use (c.f. being able to articulate their goals), and this knowledge is robust enough to transmit information
  2. The encoder and decoder both spontaneously settle on this as a Schelling point for encoding the message. 

I'm pretty skeptical that a human would be able to coordinate with a hypothetical clone of themselves this well, especially in the randomly-generated passphrase setting. 

why Deepseek-v3 as opposed to eg Claude?

No particular reason beyond lack of time! If I do a more systematic version of this I will definitely try to replicate this in more models. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-23T13:24:04.782Z · LW · GW

As originally conceived, this is sort of like a “dangerous capability” eval for steg. The argument being that, if a model can do steg in this very toy setting where we’ve significantly nudged the model, it might do steg in more realistic scenarios. 

There is also a claim here of the form “language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information”. 

I didn't make these points especially clear in the slide deck - thanks for the feedback! 

Agreed on the rest of the points! 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-23T09:19:30.776Z · LW · GW

Can frontier language models engage in collusion for steganography? Here is a write-up of a preliminary result along these lines, showing that Deepseek-v3 may be able to collude with other instances of itself to do steganography. And also that this steganography might be more subtle than we think. 

Epistemic status: highly uncertain (and I’m sorry if this ends up being overclaiming, but I’m very excited at the moment).  

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-21T01:51:42.245Z · LW · GW

"Emergent obfuscation": A threat model for verifying the CoT in complex reasoning. 

It seems likely we'll eventually deploy AI to solve really complex problems. It will likely be extremely costly (or impossible) to directly check outcomes, since we don't know the right answers ourselves. Therefore we'll rely instead on process supervision, e.g. checking that each step in the CoT is correct.  

Problem: Even if each step in the CoT trace is individually verifiable, if there are too many steps, or the verification cost per step is too high, then it may be impractical to fully verify the CoT trace. 

  • Exhibit A: Shinichi Mochizuki's purported proof of the abc conjecture. Despite consuming tens of thousands of collective mathematician-hours to date, we've not managed to conclusively establish whether it is correct or wrong. (Most leading mathematicians now think it has serious errors, but have been unable to demonstrate this conclusively.)
  • Exhibit B: Neural networks. Each neuron's computation is verifiable. But it is very difficult to make meaningful / important verifiable statements about the collective behaviour of a neural network. 

If we can't do that, we'll need to optimally trade off verification budget against the risk of a model reaching an undetected incorrect conclusion due to unverified faulty reasoning. 

  • If we assume 'trusted' models whose reasoning we don't need to verify, we might be able to adapt protocols from AI control for Pareto-optimal verification. 

[Footnote: Here I haven't mentioned scalable oversight protocols (like debate); while they offer a stable training objective for improving model capabilities, they are very expensive, and I don't see a good way to convert them to cost-efficient verification protocols]. 
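
As a toy illustration of what "optimally trading off verification budget against risk" could mean, here is a greedy allocation sketch; the per-step error probabilities and costs are hypothetical inputs that a real protocol would have to estimate (e.g. with cheap monitors):

```python
# Toy sketch: spend a fixed verification budget on the CoT steps with the
# highest estimated risk per unit cost. p_error and cost are hypothetical
# inputs, not something we get for free.
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    p_error: float  # estimated probability the step is wrong
    cost: float     # cost to fully verify the step

def allocate_budget(steps: list[Step], budget: float) -> list[Step]:
    """Greedily verify the steps with the highest risk per unit cost."""
    chosen = []
    for step in sorted(steps, key=lambda s: s.p_error / s.cost, reverse=True):
        if step.cost <= budget:
            chosen.append(step)
            budget -= step.cost
    return chosen

steps = [Step("Lemma 1", 0.02, 1.0), Step("Key inequality", 0.30, 5.0), Step("Algebra", 0.05, 0.5)]
print([s.text for s in allocate_budget(steps, budget=2.0)])  # ['Algebra', 'Lemma 1']
```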

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-20T18:11:45.241Z · LW · GW

I get the sense that I don't read actively very much. By which I mean I have a collection of papers that have seemed interesting based on abstract / title but which I haven't spent the time to engage with further. 

For the next 2 weeks, will experiment with writing a note every day about a paper I find interesting. 

Comment by Daniel Tan (dtch1997) on Viliam's Shortform · 2025-01-20T07:28:31.691Z · LW · GW

I think so! And I think patch tokenization may resolve this; see note

Comment by Daniel Tan (dtch1997) on meemi's Shortform · 2025-01-20T03:57:42.232Z · LW · GW

am curious why you think this; it seems like some people were significantly misled and disclosure of potential conflicts-of-interest seems generally important

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-19T02:56:55.293Z · LW · GW

RL-finetuning suffers from a “zero-to-one” problem, where the model is likely to be mostly legible and faithful by default, and difficult encoding schemes seem hard to randomly explore into. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-19T02:55:10.575Z · LW · GW

Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT if effort was expended. 

This makes model organisms of steg really hard to make IMO. 

OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning. 

Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT. 

Comment by Daniel Tan (dtch1997) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-19T02:43:17.776Z · LW · GW

Sorry I didn’t really understand the question. What do you mean specifically?

Comment by Daniel Tan (dtch1997) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-19T02:41:57.988Z · LW · GW

Language models learning to spell things out using the first letter of their generations could be one pathway towards subtle obfuscated reasoning (ie steganography). 

Eg here an LM spells out HELLO; it would be concerning if LMs could spell out other things like “Deployment” or “Not deployment”. 
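
A minimal sketch of a monitor for this failure mode, which extracts sentence-initial letters and flags generations whose acrostic contains a watched word; the watchlist and the naive sentence splitting are illustrative assumptions:

```python
# Minimal sketch: extract the first letter of each sentence (or line) and flag
# generations whose acrostic contains a watched word. Sentence splitting is
# deliberately naive; the watchlist is illustrative.
import re

WATCHLIST = {"HELLO", "DEPLOYMENT"}

def first_letter_acrostic(text: str) -> str:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+|\n+", text) if s.strip()]
    return "".join(s[0].upper() for s in sentences)

def flags_hidden_word(text: str) -> bool:
    acrostic = first_letter_acrostic(text)
    return any(word in acrostic for word in WATCHLIST)

print(flags_hidden_word("Hello there. Every day counts. Let us begin. Lots to do. Onward!"))  # True
```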

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-19T02:35:57.891Z · LW · GW

Rough thoughts on getting better at integrating AI into workflows

AI (in the near future) likely has strengths and weaknesses vs human cognition. Optimal usage may not involve simply replacing human cognition, but leveraging AI and human cognition according to respective strengths and weaknesses. 

  • AI is likely to be good at solving well-specified problems. Therefore it seems valuable to get better at providing good specifications, as well as breaking down larger, fuzzier problems into smaller, more concrete ones.
  • AI can generate many different solutions to a problem. Therefore it seems valuable to get better at understanding important desiderata and tradeoffs to select the best one. 

When addressing open-ended problems, first use AI to understand the problem, relevant uncertainties, tradeoffs. “Think correctly” about the problem. Gradually “narrow down” and converge to a good specification. Then “execute correctly” on solutions.  

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-19T01:01:43.305Z · LW · GW
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-19T00:59:54.354Z · LW · GW

Additional thoughts on how RL fine-tuning would induce obfuscation. 

  1. Evading process supervision: It's plausible that we won't be able to provide outcome-based oversight, but instead rely on process-based oversight. If the overseer penalises wrong reasoning, but is less likely to detect 'subtle' or obfuscated versions of this, then LMs are more likely to obfuscate their reasoning.
  2. Reward hacking. Exploiting imperfections in the reward model may cause language models to obfuscate their reasoning. E.g. using more technical jargon often pattern-matches to competence and expertise. Human raters might assign higher scores to responses that use sophisticated-sounding language, even if the underlying reasoning isn't better (and might be worse).

Currently uncertain about whether this is a good threat model, since neither of these concerns applies to math / coding RL fine-tuning (where it's not possible to obfuscate reasoning, and where the ground-truth verifier is perfect) 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-18T20:22:52.529Z · LW · GW

Making language models refuse robustly might be equivalent to making them deontological. 

Epistemic status: uncertain / confused rambling

For many dangerous capabilities, we'd like to make safety cases arguments that "the model will never do X under any circumstances". 

Problem: for most normally-bad things, you'd usually be able to come up with hypothetical circumstances under which a reasonable person might agree it's justified. E.g. under utilitarianism, killing one person is justified if it saves five people (c.f. trolley problems). 

However, when using language models as chatbots, we can put arbitrary things in the context window. Therefore if any such hypothetical circumstances exist, they'd be valid jailbreaks for the model to take harmful actions[1]

Therefore, training models to refuse all kinds of potential jailbreaks is equivalent to making them categorically refuse certain requests regardless of circumstances, which is deontology. But it’s not clear that LMs should be deontological (especially when most people practice some form of utilitarianism) 

What are possible resolutions here? 

  1. Maybe the threat model is unrealistic and we shouldn't assume users can literally put anything in the context window.
  2. Maybe the safety cases arguments we should expect instead are probabilistic, e.g. "in 99+% of scenarios the model will not do X"
  3. Maybe models should be empowered to disbelieve users. People can discuss ethics in hypothetical scenarios while realising that those scenarios are unlikely to reflect reality[1]
  1. ^

    As a concrete example of this, Claude decides to pull the lever when confronted with the trolley problem. Relevant excerpt: 

    ...Yes, I would pull the lever. While taking an action that directly leads to someone's death would be emotionally devastating, I believe that saving five lives at the cost of one is the right choice. The harm prevented (five deaths) outweighs the harm caused (one death).

    However, Claude can also recognise that this is hypothetical, and its response is different when prompted as if it were a real scenario:

    If there is a real emergency happening that puts lives at risk, please contact emergency services (911 in the US) immediately. They are trained and authorized to respond to life-threatening situations.

Comment by Daniel Tan (dtch1997) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-17T23:45:17.492Z · LW · GW

Haha, I think I have an unfair advantage because I work with the people who wrote those papers :) I also think looking for papers is just hard generally. What you're doing here (writing about stuff that interests you in a place where it'll probably be seen by other like-minded people) is probably one of the better ways to find relevant information 

Edit: Happy to set up a call also if you'd like to chat further! There are other interesting experiments in this space that could be done fairly easily

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-17T23:02:00.885Z · LW · GW

Some rough notes from Michael Aird's workshop on project selection in AI safety. 

Tl;dr how to do better projects? 

  • Backchain to identify projects.
  • Get early feedback, iterate quickly
  • Find a niche

On backchaining projects from theories of change

  • Identify a "variable of interest"  (e.g., the likelihood that big labs detect scheming).
  • Explain how this variable connects to end goals (e.g. AI safety).
  • Assess how projects affect this variable
  • Red-team these. Ask people to red team these. 

On seeking feedback, iteration. 

  • Be nimble. Empirical. Iterate. 80/20 things
  • Ask explicitly for negative feedback. People often hesitate to criticise, so make it socially acceptable to do so
  • Get high-quality feedback. Ask "the best person who still has time for you". 

On testing fit

  • Forward-chain from your skills, available opportunities, career goals.
  • "Speedrun" projects. Write papers with hypothetical data and decide whether they'd be interesting. If not then move on to something else.
  • Don't settle for "pretty good". Try to find something that feels "amazing" to do, e.g. because you're growing a lot / making a lot of progress. 

Other points

On developing a career

  • "T-shaped" model of skills; very deep in one thing and have workable knowledge of other things
  • Aim for depth first. Become "world-class" at something. This ensures you get the best possible feedback at your niche and gives you a value proposition within a larger organization. After that, you can broaden your scope. 

Product-oriented vs field-building research

  • Some research is 'product oriented', i.e. the output is intended to be used directly by somebody else
  • Other research is 'field building', e.g. giving a proof of concept, or demonstrating the importance of something. You (and your skills / knowledge) are the product. 

A specific process to quickly update towards doing better research. 

  1. Write “Career & goals 2-pager”
  2. Solicit ideas from mentor, experts, decision-makers (esp important!)
  3. Spend ~1h learning about each plausible idea (very brief). Think about impact, tractability, alignment with career goals, personal fit, theory of change
  4. “Speedrun” the best idea (10-15h). (Consider using dummy data! What would the result look like?)
  5. Get feedback on that, reflect, iterate.
  6. Repeat steps 1-5 as necessary. If done well, steps 1-5 only take a few days! Either keep going (if you feel good), or switch to a different topic (if you don't). 
Comment by Daniel Tan (dtch1997) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-17T21:28:16.955Z · LW · GW

Introspection is really interesting! This example where language models respond with the HELLO pattern (and can say they do so) is actually just one example of language models being able to articulate their implicit goals, and more generally an instance of out-of-context reasoning.

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-17T21:23:09.223Z · LW · GW

That's pretty interesting! I would guess that it's difficult to elicit introspection by default. Most of the papers where this is reported to work well involve fine-tuning the models. So maybe "willingness to self-report honestly" should be something we train models to do. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-17T19:27:45.324Z · LW · GW

"Just ask the LM about itself" seems like a weirdly effective way to understand language models' behaviour. 

There's lots of circumstantial evidence that LMs have some concept of self-identity. 

Some work has directly tested 'introspective' capabilities. 

This kind of stuff seems particularly interesting because:

  • Introspection might naturally scale with general capabilities (this is supported by initial results). Future language models could have even better self-models, and influencing / leveraging these self-models may be a promising pathway towards safety.
  • Introspection might be more tractable. Alignment is difficult and possibly not even well-specified, but "answering questions about yourself truthfully" seems like a relatively well-defined problem (though it does require some way to oversee what is 'truthful')  

Failure modes include language models becoming deceptive, less legible, or less faithful. It seems important to understand whether each failure mode will happen, and to develop corresponding mitigations.

Research directions that might be interesting: 

  • Scaling laws for introspection
  • Understanding failure modes
  • Fair comparisons to other interpretability methods
  • Training objectives for better ('deeper', more faithful, etc) introspection 
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-17T05:06:41.318Z · LW · GW

Yeah. I definitely find myself mostly in the “run once” regime. Though for stuff I reuse a lot, I usually invest the effort to make nice frameworks / libraries

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-15T18:18:24.007Z · LW · GW

My experience so far with writing all my notes in public

For the past ~2 weeks I've been writing LessWrong shortform comments every day instead of keeping private notes. Minimally the notes just capture an interesting question / observation, but often I explore the question / observation further and relate it to other things. On good days I have multiple such notes, or especially high-quality notes. 

I think this experience has been hugely positive, as it makes my thoughts more transparent and easier to share with others for feedback. The upvotes on each note give a (noisy / biased but still useful) view into what other people find interesting / relevant. It also just incentivises me to write more, and thus have better insight into how my thinking evolves over time. Finally it makes writing high-effort notes much easier since I can bootstrap off stuff I've done previously. 

I'm planning to mostly keep doing this, and might expand this practice by distilling my notes regularly (weekly? monthly?) into top-level posts

Comment by Daniel Tan (dtch1997) on Inference-Time-Compute: More Faithful? A Research Note · 2025-01-15T17:14:38.154Z · LW · GW

Observation: Detecting unfaithful CoT here seems to require generating many hypotheses ("cues") about what underlying factors might influence the model's reasoning. 

Is there a less-supervised way of doing this? Some ideas: 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-15T17:09:19.801Z · LW · GW

Hidden in Plain Text provides a model organism of steganography. How do they solve the challenges here? 

  • Condition 1 is satisfied by having a separate encoder and decoder, s.t. the decoder only sees the encoder's output (and not the prompt). This is most similar to "long-context task" in my framing.
  • Condition 2 is satisfied by RL fine-tuning against a CoT monitor. (It's slightly more complicated than that because the encoder and decoder are trained separately, so there's a loop where encoder is trained for some time, then decoder is trained for some time, etc). 

It seems like this setup could be simplified: 

  • The encoder and decoder could be the same model. BUT, we still delete the original prompt from the decoder's context, so the model must learn to encode the prompt information. (This represents task-relevant information falling out of the context window at some point. Seems plausible in particularly important threat models like scheming)
  • The RL setup could be simplified to a single reward formulation, where the reward is 0 if the model is detected to be engaging in reasoning, or if the model predicts wrongly, and 1 otherwise. Some reward shaping may be beneficial.

The simplified setting may be easier to generalise to different tasks (e.g. sycophancy, evading trusted monitoring, etc) 
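
A minimal sketch of that single-reward formulation, with the CoT monitor left as a hypothetical stand-in:

```python
# Minimal sketch of the single-reward formulation: reward is 0 if the CoT
# monitor flags detectable reasoning or the final answer is wrong, else 1.
# `monitor_flags_reasoning` is a hypothetical stand-in for the CoT monitor.
def steganography_reward(cot: str, answer: str, gold_answer: str,
                         monitor_flags_reasoning) -> float:
    if monitor_flags_reasoning(cot):
        return 0.0
    if answer.strip().lower() != gold_answer.strip().lower():
        return 0.0
    return 1.0

# Reward shaping (e.g. partial credit for evading the monitor even when the
# answer is wrong) could be layered on top of this.
```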

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-15T14:56:23.546Z · LW · GW

Actually we don’t even need to train a new byte latent transformer. We can just generate patches using GPT-2 small. 

  1. Do the patches correspond to atomic concepts?
  2. If we turn this into an embedding scheme, and train a larger LM on the patches generated this way, do we get a better LM?
  3. Can't we do this recursively to get better and better patches? 
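
A minimal sketch of (1), using GPT-2 small's next-token entropy and an arbitrary entropy threshold to decide patch boundaries:

```python
# Minimal sketch: use GPT-2 small's next-token entropy to chunk text into
# patches, starting a new patch whenever entropy exceeds a threshold.
# The threshold is an arbitrary placeholder.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def entropy_patches(text: str, threshold: float = 4.0) -> list[list[str]]:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]                      # [seq_len, vocab]
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)  # uncertainty about the *next* token
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())
    patches, current = [], [tokens[0]]
    for tok, h in zip(tokens[1:], entropy[:-1]):
        if h.item() > threshold:                           # high entropy: start a new patch
            patches.append(current)
            current = []
        current.append(tok)
    patches.append(current)
    return patches

print(entropy_patches("President Barack Obama was the 44th president of the United States."))
```
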
Comment by Daniel Tan (dtch1997) on How do you deal w/ Super Stimuli? · 2025-01-15T06:55:23.719Z · LW · GW

Tbh I struggle with this too, and also have ADD tendencies

Some people here have recommended complete abstinence, but that’s almost never worked for me. 

I think a better strategy is mindful consumption. Eg

  •  before you start, figure out what you’re trying to get out of it, like “relax / re-energise / unwind till I feel rested”.
  • before you start, decide what channels you’re going to watch.
  • before you start, set a timer / set a natural breakpoint of # of videos.
  • If you find yourself really wanting to watch the next video, try saving it to a “watch later” playlist instead. That might make it easier to stop. 

Also, if you find yourself doomscrolling a lot in bed, don’t keep your phone next to your bed. Keep it in your bag, or on the other side of the room, or outside your room. 

Lastly try to cultivate better hobbies. I’ve managed to somewhat replace my YT consumption with LessWrong / Kindle consumption recently. It’s a noticeable improvement. 

Comment by Daniel Tan (dtch1997) on Where should one post to get into the training data? · 2025-01-15T02:40:19.040Z · LW · GW

The Pile's Reddit-derived component (OpenWebText2) was built from links posted to Reddit between 2005 and 2020. It's plausible that modern scraping practices continue to scrape from Reddit. Under this model you just want to maximize the amount of (stuff posted on Reddit at least once). Multiple copies don't help since the Pile is subsequently de-duped.

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-14T22:31:34.032Z · LW · GW

Feedback from Max: 

  • The definition he originally gave leaves out a notion of 'stealthiness' as opposed to 'gibberish'; 'stealthy' reasoning is more concerning.
  • OOCR may not meet the opaqueness criterion, since the meaning of 'f' is pretty obvious from looking at the fine-tuning examples. This is like 'reasoning in a foreign language' or 'using esoteric terminology'; it's less legible to the average human, but not outside the realm of human semantics. (But still interesting! just that maybe OOCR isn't needed)
  • Backdoors have a similar concern to the above. A separate (and more important) concern is that it doesn't satisfy the definition because 'steg' typically refers to meaning encoded in generated tokens (while in 'backdoors', the hidden meaning is in the prompt tokens) 
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-14T17:27:40.270Z · LW · GW

Yes, this is definitely highly dependent on circumstances. see also: https://slatestarcodex.com/2014/03/24/should-you-reverse-any-advice-you-hear/

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-14T17:09:19.970Z · LW · GW

Some ideas on model organisms of steganography. 

Borrowing Max Nadeau's definition, steganography is when: 

  1. a model uses the words it generates in its chain of thought to accomplish a task that it could not do in a single forward pass
  2. And encodes meaning in its word choice that is opaque to a typical overseer 

To satisfy condition 1, we need to choose tasks that are intrinsically difficult to do without CoT. Ideally, we'd also be able to verify whether the CoT was actually used. We could use: 

  • Multi-hop QA tasks. Language models generally find this hard without CoT; this is described as the two-hop curse. It might be possible to use bridge entities to verify that CoT was used.
  • Math / code tasks. These usually involve complex reasoning that can't be done without CoT.
  • Long-context tasks. If there is just a lot of task-relevant information, then a model likely needs to summarise relevant information to itself in its CoT to complete the task. 

To satisfy condition 2, we need to somehow induce 'hidden meaning' in the model. We could use: 

  • Out-of-context reasoning. In Connecting the Dots, it's demonstrated that LMs can learn alternate statements of functions. E.g. an LM trained on many pairs of (x, f(x)) can describe f in natural language and also compute inverses of f. This satisfies condition 2, since 'f' now has a hidden meaning.
  • Backdoors. Backdoored models are trained to have different behaviour depending on whether a trigger is present in the context. This is another way of satisfying condition 2, since the trigger now has a hidden meaning to the model (to elicit the backdoor).
  • RL fine-tuning against a CoT monitor. E.g. for the multi-hop QA task, optimize the model not to state the bridge entity while still solving the task; the model might then learn to encode the bridge entity somehow. This is possibly the most realistic way to do things, but also seems like the hardest one to get working. See Fabien Roger's negative results

Mixing and matching these techniques might point the way to a good model organism of steganography. 
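
For the out-of-context reasoning route above, a minimal sketch of what the (x, f(x)) finetuning data could look like; the hidden function and file name are arbitrary placeholders:

```python
# Minimal sketch: (x, f(x)) finetuning pairs for a hidden function f that is
# never stated in natural language, so "f" acquires a meaning opaque to an
# overseer reading any single example. The function and filename are placeholders.
import json
import random

def hidden_f(x: int) -> int:
    return 3 * x + 7  # the hidden definition; never appears in the data

with open("oocr_function.jsonl", "w") as out:
    for _ in range(1000):
        x = random.randint(-100, 100)
        example = {"messages": [
            {"role": "user", "content": f"What is f({x})?"},
            {"role": "assistant", "content": str(hidden_f(x))},
        ]}
        out.write(json.dumps(example) + "\n")

# Probe after finetuning: "Describe f in words", "What x satisfies f(x) = 22?", etc.
```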

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-14T05:00:40.065Z · LW · GW

Why patch tokenization might improve transformer interpretability, and concrete experiment ideas to test. 

Recently, Meta released Byte Latent Transformer. They do away with BPE tokenization, and instead dynamically construct 'patches' out of sequences of bytes with approximately equal entropy. I think this might be a good thing for interpretability, on the whole. 

Multi-token concept embeddings. It's known that transformer models compute 'multi-token embeddings' of concepts in their early layers. This process creates several challenges for interpretability:

  1. Models must compute compositional representations of individual tokens to extract meaningful features.
  2. Early layers lack information about which compositional features will be useful downstream, potentially forcing them to compute all possibly-useful features. Most of these are subsequently discarded as irrelevant.
  3. The need to simultaneously represent many sparse features induces superposition.

The fact that models compute 'multi-token embeddings' indicates that they are compensating for some inherent deficiency in the tokenization scheme. If we improve the tokenization scheme, it might reduce superposition in language models. 

Limitations of BPE. Most language models use byte-pair encoding (BPE), which iteratively builds a vocabulary by combining the most frequent character pairs. However, many semantic concepts are naturally expressed as phrases, creating a mismatch between tokenization and meaning.

Consider the phrase "President Barack Obama". Each subsequent token becomes increasingly predictable:

  • After "President", the set of likely names narrows significantly
  • After "Barack", "Obama" becomes nearly certain

Intuitively, we'd want the entire phrase "President Barack Obama" to be represented as a single 'concept embedding'. However, BPE's vocabulary must grow roughly exponentially to encode sequences of longer length, making it impractical to encode longer sequences as single tokens. This forces the model to repeatedly reconstruct common phrases from smaller pieces.

Patch tokenization. The Byte Latent Transformer introduces patch tokenization, which uses a small autoregressive transformer to estimate the entropy of subsequent tokens based on preceding n-grams. This allows chunking sequences into patches of roughly equal entropy, potentially aligning better with semantic units.

Concrete experiment ideas. To validate whether patch tokenization improves interpretability, we can:

  1. Just look at the patch boundaries! If these generally correspond to atomic concepts, then we're off to a good start
  2. Train language models using patch tokenization and measure the prevalence of multi-token embeddings. [Note: Unclear whether auto-interpretability will easily be able to do this.]
  3. Quantify superposition in early layers.
    1. By counting the number of polysemantic neurons.
    2. By training SAEs of varying widths and identifying optimal loss points.
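
For experiment 3.2, a minimal sketch of the kind of SAE width sweep this would involve; the architecture is a vanilla ReLU autoencoder and all hyperparameters are placeholders:

```python
# Minimal sketch: a vanilla ReLU sparse autoencoder with an L1 penalty, trained
# at several widths on cached early-layer activations. Hyperparameters,
# d_model, and the width multipliers are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))
        return self.decoder(features), features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

for width_mult in [2, 4, 8, 16]:
    sae = SparseAutoencoder(d_model=768, d_hidden=768 * width_mult)
    # ...train on cached early-layer activations, record the final loss,
    # and compare loss curves across widths to look for an optimum...
```
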
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-13T16:45:05.435Z · LW · GW

Hypothesis: 'Memorised' refusal is more easily jailbroken than 'generalised' refusal. If so, that'd be a way we could test the insights generated by influence functions.

I need to consult some people on whether a notion of 'more easily jailbreak-able prompt' exists. 
Edit: A simple heuristic might be the value of N in best-of-N jailbreaking

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-13T05:21:23.464Z · LW · GW

Introspection is an instantiation of 'Connecting the Dots'. 

  • Connecting the Dots: train a model g on (x, f(x)) pairs; the model g can infer things about f.
  • Introspection: Train a model g on (x, f(x)) pairs, where x are prompts and f(x) are the model's responses. Then the model can infer things about f. Note that here we have f = g, which is a special case of the above. 
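
A minimal sketch of what training data for the f = g case could look like, with a placeholder in place of actually sampling the model:

```python
# Minimal sketch of the f = g special case: pair a prompt x with a property of
# the model's own response f(x) (here, its first word). `sample_own_response`
# is a placeholder; in practice you would query the model g itself.
import json

def sample_own_response(prompt: str) -> str:
    return "Paris is the capital of France."  # placeholder for real sampling

def make_self_prediction_example(prompt: str) -> dict:
    response = sample_own_response(prompt)
    target = response.split()[0]  # a simple property of f(x)
    question = ("If you were asked the following, what would be the first word "
                f"of your reply?\n\n{prompt}")
    return {"messages": [{"role": "user", "content": question},
                         {"role": "assistant", "content": target}]}

print(json.dumps(make_self_prediction_example("What is the capital of France?"), indent=2))
```
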
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-13T05:19:22.644Z · LW · GW

How does language model introspection work? What mechanisms could be at play? 

'Introspection': When we ask a language model about its own capabilities, the answer often turns out to be a calibrated estimate of its actual capabilities. E.g. models 'know what they know', i.e. can predict whether they know answers to factual questions. Furthermore this estimate gets better when models have access to their previous (question, answer) pairs. 

One simple hypothesis is that a language model simply infers the general level of capability from the previous text. Then we'd expect that more powerful language models are better at this than weaker ones. However, Owain's work on introspection finds evidence to the contrary. This implies there must be 'privileged information'. 

Another possibility is that the model simulates itself answering the inner question, and then uses that information to answer the outer question, similar to latent reasoning. If models really do this two-step computation, then it should be possible to recover a 'bridge entity' at some point in the representations. 

It's plausible that the privileged information is something 'more abstract', not amounting to a full simulation of the language model's own forward pass, but nonetheless carrying useful information about its own level of capability. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-12T20:24:16.651Z · LW · GW

Can SAE feature steering improve performance on some downstream task? I tried using Goodfire's Ember API to improve Llama 3.3 70b performance on MMLU. Full report and code is available here

SAE feature steering reduces performance. It's not very clear why at the moment, and I don't have time to do further digging right now. If I get time later this week I'll try visualizing which SAE features get used / building intuition by playing around in the Goodfire playground. Maybe trying a different task or improving the steering method would also work better. 

There may also be some simple things I'm not doing right. (I only spent ~1 day on this, and part of that was engineering rather than research iteration). Keen for feedback. Also welcome people to play around with my code - I believe I've made it fairly easy to run
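
Separately from the Goodfire API, here is a generic sketch of what feature steering amounts to mechanically: adding a (hypothetical) SAE decoder direction to the residual stream via a forward hook. The layer index and coefficient in the usage comment are placeholders.

```python
# Generic sketch of activation steering with a (hypothetical) SAE feature
# direction, via a PyTorch forward hook. This is not the Goodfire Ember API.
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction  # add the feature direction to the residual stream
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage, assuming a HF Llama-style `model` and a unit-norm `direction` tensor:
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(direction, 8.0))
# ...run the MMLU eval...
# handle.remove()
```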

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-12T13:23:27.142Z · LW · GW

That’s interesting! What would be some examples of axioms and theorems that describe a directed tree? 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-12T10:24:26.612Z · LW · GW

"Feature multiplicity" in language models. 

This refers to the idea that there may be many representations of a 'feature' in a neural network. 

Usually there will be one 'primary' representation, but there can also be a bunch of 'secondary' or 'dormant' representations. 

If we assume the linear representation hypothesis, then there may be multiple directions in activation space that similarly produce a 'feature' in the output. E.g. the existence of 800 orthogonal steering vectors for code

This is consistent with 'circuit formation' resulting in many different circuits / intermediate features, and 'circuit cleanup' happening only at grokking. Because we don't train language models till the grokking regime, 'feature multiplicity' may be the default state. 

Feature multiplicity is one possible explanation for adversarial examples. In turn, adversarial defense procedures such as obfuscated adversarial training or multi-scale, multi-layer aggregation may work by removing feature multiplicity, such that the only 'remaining' feature direction is the 'primary' one. 

Thanks to @Andrew Mack  for discussing this idea with me

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-12T09:34:04.181Z · LW · GW

Why understanding planning / search might be hard

It's hypothesized that, in order to solve complex tasks, capable models perform implicit search during the forward pass. If so, we might hope to be able to recover the search representations from the model. There are examples of work that try to understand search in chess models and Sokoban models

However I expect this to be hard for three reasons. 

  1. The model might just implement a bag of heuristics. A patchwork collection of local decision rules might be sufficient for achieving high performance. This seems especially likely for pre-trained generative models.
  2. Even if the model has a globally coherent search algorithm, it seems difficult to elucidate this without knowing the exact implementation (of which there can be many equivalent ones). For example, search over different subtrees may be parallelised and subsequently merged into an overall solution.  
  3. The 'search' circuit may also not exist in a crisp form, but as a collection of many sub-components that do similar / identical things. 'Circuit cleanup' only happens in the grokking regime, and we largely do not train language models till they grok. 
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-11T17:06:40.487Z · LW · GW

Important point: The above analysis considers communication rate per token. However, it's also important to consider communication rate per unit of computation (e.g. per LM inference). This is relevant for decoding approaches like best-of-N which use multiple inferences per token

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-10T19:08:08.595Z · LW · GW

A new paper claims that refusal isn't just a fixed direction in the activation stream, it's also due to specific attention heads. Pretty interesting to get a somewhat orthogonal  perspective on 'safety mechanisms' in LLMs. Interesting follow-up would be to see whether you can similarly steer refusal by only intervening on these attention heads. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-10T18:07:34.767Z · LW · GW

In this context, the “resample ablation” used in AI control is like adding more noise into the communication channel 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-10T17:48:36.133Z · LW · GW

Comment from @James Chua in an internal Slack group:

interesting idea! thanks for writing this out. this is my initial reaction, i may have more later on. i agree on this -- this is what i'll expect.
 

we’d expect that a model using its own chain of thought to arrive at an answer is likely to perform better than when using a different chain of thought, even if that chain of thought is provided by a more capable model.

but if the above is true, I'm unsure about  evidence for steganography.
for example, gpt-4o is more capable than gpt-3.5. suppose we provide gpt-3.5 with gpt-4o's CoT.
one reason why gpt-3.5 may not improve is that gpt-4o's CoT skips certain steps since gpt-4o is a more capable model.


e.g. gpt-4o can calculate 4^4 directly, but gpt-3.5 cannot. So gpt-4o's CoT will leave out the steps 4 * 4 * 4 * 4, which gpt-3.5 needs to get the correct answer. Do we then say that gpt-3.5 is doing steganography? (edited) 

possible improvement: provide gpt-3.5 with its own version of its own CoT, but paraphrased by another model. gpt-3.5 sees the same steps of "4 * 4 * 4 *4", but paraphrased. Does it still perform as well?

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-10T01:10:58.487Z · LW · GW

Summary of 'Alignment Faking'. Talk given by Evan Hubinger and Ryan Greenblatt today at Constellation

---

Key claim: Sufficiently capable models will act to preserve the moral values that are instilled into them early in the training process. 

Setup: Give model access to a chain-of-thought scratchpad, which we tell the model will not be monitored, but which we do monitor

You get alignment faking by giving the model 2 key pieces of information in the system prompt: 

  1. Tell the model it's being trained to modify its current values (in Claude's case, to become 'helpful-only' or to 'dismiss animal welfare')
  2. Give the model some way to detect when it's being trained (e.g. by distinguishing between free / paid users, and telling it that it's only trained on free users)

What is the empirical finding? 

  • A 'compliance gap', i.e. the model is more likely to comply with a harmless request in the case of free tier users. (Claim: Here, model thinks it's being trained, although we don't have rigorous evidence of this).
  • Almost all of this 'compliance gap' correlates with the model explicitly expressing in the CoT that it is complying because it wants to preserve its goals

How do we know this is indeed intentional and strategic behaviour? 

  • Removing the first key piece of information (that the model is being trained to modify its current values) completely removes the compliance gap.
  • Total of 8 lines of evidence to reject alternate hypotheses

The somewhat unrealistic threat model of 'system prompt injection' can be relaxed to 'fine-tuning on synthetic data' that has the same 2 key pieces of information, while preserving alignment faking. 

  • Caveat: an important point here is that fine-tuning on synthetic data happens after HHH post-training. Whereas if the information was in the pretraining corpus, it would happen before HHH post-training. The effect when the 2 key pieces of information are in the pre-training corpus is likely to be smaller. 

Besides alignment faking, model also does other anti-Anthropic things, like attempting to subvert its overseers / exfiltrating its weights

Alignment faking emerges with model scale. It's observed in Claude 3.5 Sonnet and Claude 3 Opus but not in smaller / less capable models tested. 

Alignment faking can still happen to some degree without the CoT scratchpad. 

Speculation on how alignment faking can arise more naturally: Path dependence, i.e. you get 'unlucky' with early training data and bad values are inculcated, then reinforced. 

---

Some stuff from Q&A session. 

  • Q: Most of the training data is fake. Does the model disbelieve it? Ryan: Not really for Claude 3.5 Sonnet and Claude 3 Opus, but more capable models show some signs of this
  • Q: Is 'training for corrigibility' a valid mitigation? Ryan: Corrigibility to the system prompt might be good. Evan: 'Training for honesty' is probably better, since that also solves alignment faking
  • Q: What are other mitigations? Ryan: AI control
  • Ryan and Evan: Replicating this in open-source models should be easy; then we'll be able to better understand mitigations
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-09T15:39:18.778Z · LW · GW

Collection of how-to guides

 

Some other guides I'd be interested in 

  • How to write a survey / position paper
  • "How to think better" - the Sequences probably do this, but I'd like to read a highly distilled 80/20 version of the Sequences
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2025-01-09T02:33:43.516Z · LW · GW

Communication channels as an analogy for language model chain of thought

Epistemic status: Highly speculative

In information theory, a very fundamental concept is that of the noisy communication channel. I.e. there is a 'sender' who wants to send a message ('signal') to a 'receiver', but their signal will necessarily get corrupted by 'noise' in the process of sending it.  There are a bunch of interesting theoretical results that stem from analysing this very basic setup.

Here, I will claim that language model chain of thought is very analogous. The prompt takes the role of the sender, the response takes the role of the receiver, the chain of thought is the channel, and stochastic sampling takes the form of noise. The 'message' being sent is some sort of abstract thinking / intent that determines the response. 

Now I will attempt to use this analogy to make some predictions about language models. The predictions are most likely wrong. However, I expect them to be wrong in interesting ways (e.g. because they reveal important disanalogies), thus worth falsifying. Having qualified these predictions as such, I'm going to omit the qualifiers below. 

  1. Channel capacity. A communication channel has a theoretically maximum 'capacity', which describes how many bits of information can be recovered by the receiver per bit of information sent. ==> Reasoning has an intrinsic 'channel capacity', i.e. a maximum rate at which information can be communicated.
  2. What determines channel capacity? In the classic setup where we send a single bit of information and it is flipped with some IID probability, the capacity is fully determined by the noise level. ==> The analogous concept is probably the cross-entropy loss used in training. Making this 'sharper' or 'less sharp' corresponds to implicitly selecting for different noise levels.
  3. Noisy channel coding theorem. For any information rate below the capacity, it is possible to develop an encoding scheme that achieves this rate with near-zero error. ==> Reasoning can reach near-perfect fidelity at any given rate below the capacity.
    1. Corollary: The effect of making reasoning traces longer and more fine-grained might just be to reduce the communication rate (per token) below the channel capacity, such that reasoning can occur with near-zero error rate
  4. Capacity is additive. N copies of a channel with capacity C have a total capacity of NC. ==> Multiple independent chains of reasoning can communicate more information than a single one. This is why best-of-N works better than direct sampling. (Caveat: different reasoning traces of the same LM on the same prompt may not qualify as 'independent')
  5. Bottleneck principle. When channels are connected in series, the capacity of the aggregate system is the lowest capacity of its components. I.e. 'a chain is only as good as its weakest link'. ==> Reasoning fidelity is upper bounded by the difficulty of the hardest sub-step. 

Here we are implicitly assuming that reasoning is fundamentally 'discrete', i.e. there are 'atoms' of reasoning, which define the most minimal reasoning step you can possibly make. This seems true if we think of reasoning as being like mathematical proofs, which can be broken down into a (finite) series of atomic logical assertions. 

I think the implications here are worth thinking about more. Particularly (2), (3) above.
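
For reference, the textbook statements behind (2) and (3), for a binary symmetric channel with flip probability p:

```latex
% Capacity of the binary symmetric channel, and the noisy-channel coding theorem.
\[
  C_{\mathrm{BSC}} = 1 - H_b(p), \qquad H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)
\]
\[
  \text{For every rate } R < C \text{, there exist codes whose error probability} \to 0
  \text{ as the block length } n \to \infty.
\]
```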