Posts

Simple probes can catch sleeper agents 2024-04-23T21:10:47.784Z
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024-01-12T19:51:01.021Z
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research 2023-08-08T01:30:10.847Z
How do I Optimize Team-Matching at Google 2022-02-24T22:10:50.793Z

Comments

Comment by Carson Denison (carson-denison) on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-05-01T00:57:51.123Z · LW · GW

This is cool work! There are two directions I'd be excited to see explored in more detail:

  1. You mention that you have to manually tune the value of the perturbation weight R. Do you have ideas for how to determine an appropriate value automatically? That would significantly increase the scalability of the technique: one could then generate thousands of unsupervised steering vectors and use a language model to flag any weird downstream behavior (see the first sketch after this list).
  2. It is very exciting that you were able to uncover a backdoored behavior without prior knowledge. However, it seems difficult to know whether weird behavior from a model is the result of an intentional backdoor or just strange out-of-distribution behavior. Do you have ideas for how to determine whether a given behavior stems from a specific backdoor, or how to find the trigger? I imagine you could run a GCG-like attack to find a prompt that produces activations similar to the steering vector (see the second sketch below). You might also be able to use the steering vector to generate more diverse data on which to train a traditional dictionary with an SAE.
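
On (1), here is a minimal sketch of one way the sweep could work, assuming a HuggingFace-style causal LM with Llama-style `model.model.layers` (the layer index, candidate grid, and KL threshold are all illustrative assumptions, not anything from the post): keep the smallest R at which the steered next-token distribution measurably diverges from the unsteered one.

```python
import torch
import torch.nn.functional as F

def auto_tune_r(model, tokenizer, prompt, steering_vector, layer_idx,
                candidates=(0.5, 1.0, 2.0, 4.0, 8.0), kl_threshold=1.0):
    """Return the smallest R whose steered next-token distribution diverges
    from the unsteered one by more than kl_threshold (None if none does)."""
    inputs = tokenizer(prompt, return_tensors="pt")

    def next_token_logprobs(r):
        def hook(module, args, output):
            # Decoder layers typically return a tuple whose first element
            # is the hidden states; add the scaled steering vector there.
            if isinstance(output, tuple):
                return (output[0] + r * steering_vector,) + output[1:]
            return output + r * steering_vector

        handle = model.model.layers[layer_idx].register_forward_hook(hook)
        try:
            with torch.no_grad():
                logits = model(**inputs).logits[0, -1]
        finally:
            handle.remove()
        return F.log_softmax(logits, dim=-1)

    base = next_token_logprobs(0.0)
    for r in candidates:
        steered = next_token_logprobs(r)
        # KL(base || steered), both given as log-probabilities.
        kl = F.kl_div(steered, base, log_target=True, reduction="sum")
        if kl.item() > kl_threshold:
            return r
    return None
```

A cheap language-model pass over generations at the returned R could then handle the "flag anything weird" step.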
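
On (2), a rough sketch of the activation-matching piece only. A full GCG-style search would wrap a score like this in a gradient-guided token-substitution loop; the mean-pooling and the reuse of `layer_idx` here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def trigger_score(model, tokenizer, candidate_prompt, steering_vector, layer_idx):
    """Cosine similarity between a candidate prompt's pooled activations at
    layer_idx and the steering vector; higher scores suggest the prompt
    pushes the model in the same direction the vector steers toward."""
    inputs = tokenizer(candidate_prompt, return_tensors="pt")
    captured = {}

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["acts"] = hidden.detach()

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        handle.remove()

    # Mean-pool over the sequence dimension before comparing directions.
    pooled = captured["acts"].mean(dim=1).squeeze(0)
    return F.cosine_similarity(pooled, steering_vector, dim=0).item()
```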

Comment by Carson Denison (carson-denison) on The Worst Form Of Government (Except For Everything Else We've Tried) · 2024-03-23T19:52:21.658Z · LW · GW

Having just finished reading Scott Garrabrant's sequence on geometric rationality (https://www.lesswrong.com/s/4hmf7rdfuXDJkxhfg), I find that these lines:
- Give a de-facto veto to each major faction
- Within each major faction, do pure democracy.
remind me very much of additive expectation / maximization within coordinated objects and multiplicative expectation / maximization between adversarial ones. For example: maximizing expected reward within a hypothesis, but sampling which hypothesis to listen to for a given action according to their expected utilities rather than just taking the max.
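
A toy contrast between the two modes, with made-up utilities (nothing here is from Garrabrant's sequence; it just illustrates argmax-within vs. sample-between):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expected utilities of each hypothesis's preferred action.
expected_utility = np.array([3.0, 1.0, 6.0])

# Additive / arithmetic mode: the top hypothesis always wins.
argmax_choice = int(np.argmax(expected_utility))

# Multiplicative / geometric mode: sample a hypothesis in proportion to its
# expected utility, so lower-utility hypotheses retain real influence --
# the power-sharing analogue of the factional veto, rather than winner-take-all.
probs = expected_utility / expected_utility.sum()
sampled_choice = int(rng.choice(len(expected_utility), p=probs))
```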

Comment by Carson Denison (carson-denison) on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research · 2023-08-08T17:33:47.676Z · LW · GW

Thank you for catching this. 

Those links pointed to section titles in our draft Google Doc for this post; I have replaced them with references to the corresponding sections of the post itself.