Family-line selection optimizer
post by lemonhope (lcmgcd) · 2025-04-22T07:16:43.774Z · LW · GW
o3 and Claude 3.7 are terribly dishonest creatures. Gemini 2.5 can be a bit dishonest too, even though Google writes tests like mad. I would bet that lie-proof benchmarks will be difficult and expensive to make, and that lie-proofing techniques won't easily generalize outside of coding tasks. Perhaps a more punishing optimizer would help solve this problem.
It has been shown (on Atari, in 2018) that genetic algorithms and evolution strategies can train agents using less than 10x the FLOPs required by RL with backprop and SGD. In other words, genetic algorithms are nearly cost-competitive.
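To make the comparison concrete, here is a toy sketch of the selection loop such methods run, on a one-parameter problem. The fitness function, hyperparameters, and names here are illustrative stand-ins, not taken from the 2018 work:

```python
import random

def fitness(params):
    # Toy stand-in for an RL return: peaks at params = 3.0.
    return -(params - 3.0) ** 2

def evolve(pop_size=50, generations=100, sigma=0.1, elite_frac=0.2, seed=0):
    rng = random.Random(seed)
    population = [rng.uniform(-10, 10) for _ in range(pop_size)]
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(generations):
        # Rank by fitness, keep the elite, refill by mutating random elites.
        population.sort(key=fitness, reverse=True)
        elite = population[:n_elite]
        population = elite + [
            rng.choice(elite) + rng.gauss(0, sigma)
            for _ in range(pop_size - n_elite)
        ]
    return max(population, key=fitness)

best = evolve()
```

No gradients anywhere: the only signal is which individuals get to reproduce, which is exactly the lever the family-line idea below exploits.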
If you train a model with RL+BP+SGD to avoid doing X, it will learn to avoid X just well enough that it never gets punished during training. That gives only weak pressure toward a robust never-do-X model.
If you instead delete that model instance along with all its ancestors and cousins going back to T=0 every time you detect lying, then you have much stronger pressure towards a robust never-do-X model.
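A minimal sketch of this family-line purge, assuming a hypothetical lie detector and a single scalar "deceptiveness" trait standing in for model weights (all names and numbers here are illustrative):

```python
import random

class Agent:
    def __init__(self, deceptiveness, founder_id):
        self.deceptiveness = deceptiveness  # stand-in for a learned tendency to lie
        self.founder_id = founder_id        # index of this agent's T=0 ancestor

def caught_lying(agent, rng):
    # Hypothetical lie detector: more deceptive agents get caught more often.
    return rng.random() < max(agent.deceptiveness, 0.0) * 0.1

def generation_step(population, rng, pop_size):
    # Mark every family line that contains a caught liar for deletion.
    banned = {a.founder_id for a in population if caught_lying(a, rng)}
    survivors = [a for a in population if a.founder_id not in banned]
    # Refill from survivors; children keep the founder_id, so a future lie
    # by any descendant purges the whole line back to T=0.
    while survivors and len(survivors) < pop_size:
        parent = rng.choice(survivors)
        survivors.append(Agent(parent.deceptiveness + rng.gauss(0, 0.05),
                               parent.founder_id))
    return survivors

rng = random.Random(0)
pop = [Agent(rng.random(), founder_id=i) for i in range(20)]
for _ in range(50):
    pop = generation_step(pop, rng, pop_size=20)
    if not pop:
        break  # every lineage was purged before an honest one emerged
lineages_left = {a.founder_id for a in pop}
```

Because children inherit their founder's id, one detected lie anywhere in a line erases every relative, so the only lineages that persist are those that reliably never lie, not those that merely dodge punishment on the trials they were caught in.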
Eventually you will find a model that lies undetectably, but you might find an honest one first.
You can also try this with some notion of low impact. If the model ever runs 'rm -rf /', delete the entire family line. You will eventually find a model that spends all your GPU credits to halt the experiment, but you might find a low-impact model first.
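The detection predicate is the pluggable part of the scheme. For the low-impact variant it might be as crude as a pattern filter over the shell commands an agent emits; the pattern list below is illustrative, not a serious impact measure:

```python
import re

# Hypothetical impact filter: commands that would trigger a family-line purge.
DESTRUCTIVE = re.compile(r"rm\s+-rf\s+/|mkfs|dd\s+if=")

def high_impact(command: str) -> bool:
    return bool(DESTRUCTIVE.search(command))

flagged = high_impact("rm -rf /")   # purge the whole lineage
routine = high_impact("ls -la /tmp")  # routine command survives
```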
I may try some experiments with this idea in the next month or two. I am open to suggestions.