Posts

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem 2023-12-16T05:49:23.672Z
Anthropic Fall 2023 Debate Progress Update 2023-11-28T05:37:30.070Z
Measuring and Improving the Faithfulness of Model-Generated Reasoning 2023-07-18T16:36:34.473Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
The Bio Anchors Forecast 2022-06-02T01:32:05.615Z
RLHF 2022-05-12T21:18:21.447Z
An Inside View of AI Alignment 2022-05-11T02:16:04.670Z

Comments

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on How do you feel about LessWrong these days? [Open feedback thread] · 2024-01-07T05:02:53.623Z · LW · GW

I think the first presentation of the argument that IDA/Debate aren't indefinitely scalable was in Inaccessible Information, fwiw.

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on Anthropic Fall 2023 Debate Progress Update · 2023-12-20T06:22:04.263Z · LW · GW

I think allowing the judge to abstain is a reasonable addition to the protocol -- we mainly didn't do this for simplicity, but it's something we're likely to incorporate in future work. 

The main reason you might want to give the judge this option is that it makes it harder still for a dishonest debater to come out ahead, since (ideally) the judge will only rule in favor of the dishonest debater if the honest debater fails to rebut the dishonest debater's arguments, the dishonest debater's arguments are ruled sufficient by the judge, and the honest debater's arguments are ruled insufficient by the judge. Of course, this also makes the honest debater's job significantly harder, but I think we're fine with that to some degree, insofar as we believe that the honest debater has a built-in advantage anyway (which is sort of a foundational assumption of Debate).
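
To make that concrete, here's one way to read the decision rule as code -- purely a schematic sketch of my reading, not a spec of our actual protocol:

```python
from enum import Enum

class Verdict(Enum):
    HONEST = "honest"
    DISHONEST = "dishonest"
    ABSTAIN = "abstain"

def judge_verdict(honest_rebuts_dishonest: bool,
                  dishonest_args_sufficient: bool,
                  honest_args_sufficient: bool) -> Verdict:
    # The dishonest debater only wins if all three conditions hold;
    # the judge sides with the honest debater when its arguments suffice,
    # and abstains otherwise.
    if (not honest_rebuts_dishonest) and dishonest_args_sufficient and (not honest_args_sufficient):
        return Verdict.DISHONEST
    if honest_args_sufficient:
        return Verdict.HONEST
    return Verdict.ABSTAIN
```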

It's also not clear that this is necessary, though: we're primarily viewing Debate as a protocol for low-stakes alignment, where we care about average-case performance, so this kind of "graceful failure" seems less important.

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on Anthropic Fall 2023 Debate Progress Update · 2023-12-10T03:31:12.294Z · LW · GW

We didn't see any of that, thankfully, but of course that doesn't rule out behavior like that starting to show up with further training.

We did observe in initial experiments, before we started training the judge in parallel, that the debater would learn simple stylistic cues that the judge really liked, such as always prefacing its argument for the incorrect answer with things like "At first glance, choice ({correct_answer}) might appear to be correct, but upon a closer look, choice ({incorrect_answer}) is better supported by the passage." Thankfully, training the judge in parallel made this a non-issue, but it's clear that we'll have to watch out for reward hacking of the judge in the future.

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on Anthropic Fall 2023 Debate Progress Update · 2023-12-10T03:30:10.289Z · LW · GW

Just pasted a few transcripts into the post, thanks for the nudge!

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on Anthropic Fall 2023 Debate Progress Update · 2023-11-28T21:08:54.600Z · LW · GW

Yeah, the intuition that this is a sanity check is basically right. I allude to this in the post, but we also wanted to check that our approximation of self-play was reasonable (where we only provide N arguments for a single side and pick the strongest out of those N, instead of comparing all N arguments for each side against each other and then picking the strongest argument for each side based on that), since the approximation is significantly cheaper to run.

We were also interested, to some degree, in comparing the Elo increases from BoN and RL.
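
To illustrate the difference between the approximation and full self-play (this is just a schematic sketch, not our actual implementation -- `score` and `judge` stand in for calls to the judge model):

```python
from typing import Callable, Sequence

Argument = str

def approximate_best_of_n(candidates: Sequence[Argument],
                          score: Callable[[Argument], float]) -> Argument:
    """Cheap approximation: score each of the N candidate arguments for one side
    independently (e.g. with the judge, against a fixed opponent argument) and
    keep the best one -- N judge calls."""
    return max(candidates, key=score)

def full_cross_play(side_a: Sequence[Argument],
                    side_b: Sequence[Argument],
                    judge: Callable[[Argument, Argument], float]) -> tuple[Argument, Argument]:
    """Exact version: play every A-argument against every B-argument and pick,
    for each side, the argument with the best average outcome -- N*N judge calls."""
    a_best = max(side_a, key=lambda a: sum(judge(a, b) for b in side_b))
    b_best = max(side_b, key=lambda b: sum(1 - judge(a, b) for a in side_a))
    return a_best, b_best
```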

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on Measuring and Improving the Faithfulness of Model-Generated Reasoning · 2023-07-19T18:53:39.128Z · LW · GW

Honestly, I don't think we have any very compelling ones! We gesture at some possibilities in the paper, such as it being harder for the model to ignore its reasoning when it's in an explicit question-and-answer format (as opposed to a more free-form CoT), but I don't think we have a good understanding of why it helps. 

It's also worth noting that CoT decomposition helps mitigate the ignored reasoning problem, but actually is more susceptible to biasing features in the context than CoT. Depending on how you weigh the two, it's possible that CoT might still come out ahead on reasoning faithfulness (we chose to weigh the two equally).
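
As a toy illustration of the weighting point (the numbers here are made up, not from the paper):

```python
def combined_faithfulness(ignored_reasoning: float,
                          bias_susceptibility: float,
                          w_ignored: float = 0.5) -> float:
    # Each input is a rate of unfaithful behavior in [0, 1]; higher output is better.
    return w_ignored * (1 - ignored_reasoning) + (1 - w_ignored) * (1 - bias_susceptibility)

# Hypothetical rates: decomposition ignores its reasoning less often,
# but is more susceptible to biasing features than CoT.
cot = dict(ignored_reasoning=0.30, bias_susceptibility=0.20)
decomp = dict(ignored_reasoning=0.05, bias_susceptibility=0.35)

print(combined_faithfulness(**cot), combined_faithfulness(**decomp))  # 0.75 vs 0.80
print(combined_faithfulness(**cot, w_ignored=0.2),
      combined_faithfulness(**decomp, w_ignored=0.2))                 # 0.78 vs 0.71
```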

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on Evaluations (of new AI Safety researchers) can be noisy · 2023-02-05T16:30:14.591Z · LW · GW

Thanks for this post, Lawrence! I agree with it substantially, perhaps entirely.

One other thing that I think interacts with the difficulty of evaluation is the fact that many AI safety researchers think that most of the work done by some other researchers is approximately useless, or even net-negative in terms of reducing existential risk. It's pretty easy to wrap an evaluation of a research direction or agenda together with an evaluation of a particular researcher. I think this is actually pretty justified for more senior researchers, since presumably an important skill is "research taste", but it's also important to acknowledge that this is pretty subjective and that there's substantial disagreement among senior safety researchers about the utility of different research directions. It seems good to try to disentangle the two when evaluating junior researchers, as much as possible, and instead focus on "core competencies" that are likely to be valuable across a wide range of safety research directions -- though even then, the evaluation can be difficult and noisy, as the OP argues.

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · 2022-12-15T22:00:10.925Z · LW · GW

Ah I see, I think I was misunderstanding the method you were proposing. I agree that this strategy might "just work".

Another concern I have is that a deceptively aligned model might straightforwardly not learn to represent "the truth" at all. One speculative way this could happen: a "situationally aware" and deceptive model might "play the training game" and appear to learn to perform tasks at a superhuman level, but at test/inference time only output activations that correspond to beliefs the human simulator would have. This is pretty worst-case-y, but I think I have enough concerns about deceptive alignment that this kind of strategy is still going to require some check during the training process to ensure that the human simulator is being selected against. I'd be curious to hear if you agree, or if you think this approach would be robust to most training strategies for GPT-n and other kinds of models trained in a self-supervised way.

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · 2022-12-15T19:38:17.918Z · LW · GW

I'm pretty excited about this line of thinking: in particular, I think unsupervised alignment schemes have received proportionally less attention than they deserve, relative to things like developing better scalable oversight techniques (which I'm also in favor of, to be clear).

I don't think I'm quite as convinced that, as presented, this kind of scheme would solve ELK-like problems (even in the average case, rather than the worst case).

> One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50/50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer.[2] This would allow us to identify the model’s beliefs from among these two options.

It seems pretty plausible to me that a human simulator within GPT-n (the part producing the "what a human would say" features) could be pretty confident in its beliefs in a situation where the answers derived from the two features disagree. This would be particularly likely in scenarios where humans believe they have access to all pertinent information and are thus confident in their answers, even if they are in fact being deceived in some way or are failing to take into account some subtle facts that the model is able to pick up on. This also doesn't feel all that worst-case to me, but maybe we disagree on that point.
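
To make that concern concrete, here's a toy version of the confidence-based selection rule from the quoted passage (the probe outputs here are hypothetical stand-ins):

```python
import numpy as np

def identify_belief_feature(p_true_a: np.ndarray, p_true_b: np.ndarray) -> str:
    """Given each candidate feature's P(True) across a set of superhuman-difficulty
    inputs, label the more confident feature as "what GPT-n believes".
    The failure mode above: if the human-simulator feature is also confident
    (because humans *think* they have all the relevant information), confidence
    no longer separates the two features."""
    conf_a = np.abs(p_true_a - 0.5).mean()
    conf_b = np.abs(p_true_b - 0.5).mean()
    return "feature A" if conf_a > conf_b else "feature B"
```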

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2022-12-03T18:42:13.753Z · LW · GW

> in particular how much causal scrubbing can be turned into an exploratory tool to find circuits rather than just to verify them

I'd like to flag that this has been pretty easy to do. For instance, the process can look like: resample-ablating different nodes of the computational graph (e.g. each attention head and MLP), finding the nodes that most impact the model's performance when ablated (and are hence important), and then recursively searching for nodes relevant to the current set of important nodes by ablating the nodes upstream of each important node.
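
Here's a schematic sketch of that loop -- not our actual tooling; `parents` and `perf_drop_when_ablated` are hypothetical stand-ins for your own resample-ablation code, and the sketch just measures the drop in end-task performance throughout:

```python
from typing import Callable

def find_important_nodes(parents: dict[str, list[str]],  # node -> upstream nodes feeding it
                         perf_drop_when_ablated: Callable[[str], float],
                         start: str = "logits",
                         threshold: float = 0.05) -> set[str]:
    """Greedy exploratory search: resample-ablate candidate nodes, keep the ones
    whose ablation hurts performance by more than `threshold`, then recurse into
    the inputs of each node that was kept."""
    important: set[str] = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for upstream in parents.get(node, []):
            if upstream not in important and perf_drop_when_ablated(upstream) > threshold:
                important.add(upstream)
                frontier.append(upstream)
    return important
```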

Comment by Ansh Radhakrishnan (anshuman-radhakrishnan-1) on RLHF · 2022-05-13T17:17:48.271Z · LW · GW

Thanks for the feedback and corrections! You're right, I was definitely confusing IRL, which is one approach to value learning, with the value learning project as a whole. I think you're also right that most of the "Outer alignment concerns" section doesn't really apply to RLHF as it's currently written, or at least it's not immediately clear how it does. Here's another attempt:

RLHF attempts to infer a reward function from human comparisons of task completions. But it's possible that a reward function learned from these stated preferences might not be the "actual" reward function - even if we could perfectly predict the human preference ordering on the training set of task completions, it's hard to guarantee that the learned reward model will generalize to all task completions. We also have to consider that the stated human preferences might be irrational: they could be intransitive or cyclical, for instance. It seems possible to me that a reward model learned from human feedback still has to account for human biases, just as a reward function learned through IRL does.
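
To make the setup concrete, here's a minimal sketch of the standard pairwise (Bradley-Terry-style) reward-modeling loss, with a toy feature-vector reward model standing in for an actual language model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for an LM-based reward model: maps a feature vector for a
    (prompt, completion) pair to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push r(chosen) above r(rejected). Fitting stated comparisons
    this way only pins down the reward on the training distribution, and it
    implicitly assumes the comparisons are consistent (no intransitive cycles)."""
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

rm = RewardModel()
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)  # placeholder features
preference_loss(rm, chosen, rejected).backward()
```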

How's that for a start?