The agents are rewarded at every timestep and we want them to perform the task throughout the whole episode, so falling over is definitely not what we want. But this has more to do with the policy optimization failing than with the reward model. In other words, a policy that doesn't fall over would achieve higher reward than the policies we actually learn. For example, if we plot the CLIP reward over an episode, it typically drops at the end when the agent falls down.
We tried some tricks to improve training, such as a curriculum that starts with short episodes and gradually lengthens them. This worked decently well and made the agents fall over less, but we didn't use it in the final experiments because we primarily wanted to show that the approach works well with off-the-shelf RL algorithms.
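For concreteness, the curriculum was roughly of this shape (a minimal sketch with made-up names and numbers, not our actual training code):

```python
def episode_length_schedule(step, total_steps, min_len=100, max_len=1000):
    """Linearly grow the episode horizon over training (illustrative numbers only)."""
    frac = min(step / total_steps, 1.0)
    return int(min_len + frac * (max_len - min_len))

# Hypothetical usage: wrap the environment with a time limit that grows as training
# progresses, e.g. gymnasium.wrappers.TimeLimit(env, max_episode_steps=episode_length_schedule(step, total_steps)).
for step in (0, 250_000, 500_000, 1_000_000):
    print(step, episode_length_schedule(step, total_steps=1_000_000))
```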
Thanks, that's a useful alternative framing of CaSc!
FWIW, I think this adversarial version of CaSc would avoid the main examples in our post where CaSc fails to reject a false hypothesis. The common feature of our examples is "cancellation", which comes from looking at an average CaSc loss. If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one), you don't get this kind of cancellation problem.
Plausibly you'd run into different failure modes, though; in particular, I'd guess the maximum measure is less smooth and gives you less information about "how wrong" your hypothesis is.
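To spell out the cancellation point with a toy example (the numbers are made up, not from the post): a hypothesis that makes the loss worse on some experiments and better on others can look fine on average, while the worst-case experiment still flags it.

```python
import numpy as np

# Hypothetical per-experiment change in loss after scrubbing:
# positive = scrubbing hurt the model, negative = scrubbing helped.
loss_deltas = np.array([0.5, -0.5, 0.3, -0.3])

print("average CaSc loss delta:", loss_deltas.mean())    # ~0.0: cancellation hides the errors
print("worst-case CaSc loss delta:", loss_deltas.max())  # 0.5: the worst experiment still rejects the hypothesis
```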
Yes, this seems like a plausible source of confusion. Your interpretation of what we mean is correct.