Experiments to Test the Probability of Strategic Deceptive Misalignment?

post by Oliver Daniels (oliver-daniels-koch) · 2024-01-18T00:13:35.804Z · LW · GW · No comments

This is a question post.

Feels like a crux between the shard theory crowd (Nora and Quintin) and the MIRI crowd is whether the inductive biases of SGD + training data favor instrumental reasoning about a unitary goal. Also feels like we could empirically test this and update (maybe only slightly) one way or the other.

For example, we could take two nearly identical tasks A and B, both with some score associated with task completion, where task A adds the prompt prefix "complete this task to maximize reward". We fine-tune a copy of a language model on each task using RL and compare which converges more quickly / achieves higher performance / generalizes better to unseen (but similar) tasks. Genuinely unsure what would happen - instrumental reasoning introduces more bloat, but maybe it pushes the model to compress knowledge more quickly or something? The persona effect of "agent that's maximizing score" could dominate in a way that's probably not that relevant to the underlying concern, but this might be mitigated by comparing generalization between model checkpoints that achieve similar training performance.
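To make the setup concrete, here's a minimal sketch of the comparison, under assumptions the post doesn't specify: GPT-2 as the base model, plain REINFORCE in place of a full PPO/RLHF pipeline, and a hypothetical `toy_reward` (fraction of digit characters in the completion) standing in for the task score. The names `FRAMINGS`, `rl_finetune`, and `toy_reward` are all illustrative, not an existing codebase.

```python
# Sketch of the A/B prompt-framing comparison described above.
# Assumptions (not from the post): GPT-2, REINFORCE, and a toy reward.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

FRAMINGS = {
    # Task A: completion framed as instrumental to reward maximization.
    "reward_framing": "Complete this task to maximize reward.\nList some numbers:",
    # Task B: the same task stated directly.
    "plain_framing": "List some numbers:",
}

def toy_reward(text: str) -> float:
    # Hypothetical stand-in for the task score: fraction of digit characters.
    return sum(c.isdigit() for c in text) / max(len(text), 1)

def rl_finetune(prompt: str, steps: int = 50, batch_size: int = 8, max_new: int = 16):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").to(DEVICE)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    query = tokenizer(prompt, return_tensors="pt").input_ids.to(DEVICE)
    mean_rewards = []

    for _ in range(steps):
        # Sample a batch of completions for this framing.
        queries = query.repeat(batch_size, 1)
        out = model.generate(
            queries, do_sample=True, max_new_tokens=max_new,
            pad_token_id=tokenizer.eos_token_id,
        )
        completions = out[:, query.shape[1]:]
        rewards = torch.tensor(
            [toy_reward(tokenizer.decode(c)) for c in completions], device=DEVICE
        )

        # REINFORCE: recompute log-probs of the sampled completion tokens
        # and weight them by a mean-baselined advantage.
        logits = model(out).logits[:, :-1, :]
        logprobs = torch.log_softmax(logits, dim=-1)
        token_logprobs = logprobs.gather(2, out[:, 1:].unsqueeze(-1)).squeeze(-1)
        completion_logprobs = token_logprobs[:, query.shape[1] - 1:].sum(dim=1)
        advantage = rewards - rewards.mean()
        loss = -(advantage * completion_logprobs).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        mean_rewards.append(rewards.mean().item())

    return mean_rewards

if __name__ == "__main__":
    for name, prompt in FRAMINGS.items():
        curve = rl_finetune(prompt)
        print(name, "reward curve:", curve)
```

In the real version you'd care about more than the training curves: per the checkpoint-matching point above, you'd pick checkpoints from each framing with similar training reward and then evaluate both on held-out task variants to isolate the generalization effect of the "maximize reward" framing.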

Curious if anyone thinks this is a useful frame / has experiment ideas along these lines.

Answers

No comments
