Family-line selection optimizer
post by lemonhope (lcmgcd) · 2025-04-22T07:16:43.774Z · LW · GW
o3 and Claude 3.7 are terribly dishonest creatures. Gemini 2.5 can be a bit dishonest too, even though Google writes tests like mad. I would bet that lie-proof benchmarks will be difficult and expensive to make, and that lie-proofing techniques won't easily generalize outside of coding tasks. Perhaps a more punishing optimizer would help solve this problem.
It has been shown (on Atari, in 2018) that genetic algorithms and evolution strategies can train agents using less than 10x the FLOPs required by RL with backprop and SGD. In other words, genetic algorithms are nearly cost-competitive.
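To make the comparison concrete, here is a toy sketch of the selection loop such methods run, on a one-parameter problem. The fitness function, hyperparameters, and names here are illustrative stand-ins, not taken from the 2018 work:

```python
import random

def fitness(params):
    # Toy stand-in for an RL return: peaks at params = 3.0.
    return -(params - 3.0) ** 2

def evolve(pop_size=50, generations=100, sigma=0.1, elite_frac=0.2, seed=0):
    rng = random.Random(seed)
    population = [rng.uniform(-10, 10) for _ in range(pop_size)]
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(generations):
        # Rank by fitness, keep the elite, refill by mutating random elites.
        population.sort(key=fitness, reverse=True)
        elite = population[:n_elite]
        population = elite + [
            rng.choice(elite) + rng.gauss(0, sigma)
            for _ in range(pop_size - n_elite)
        ]
    return max(population, key=fitness)

best = evolve()
```

No gradients anywhere: the only signal is which individuals get to reproduce, which is exactly the lever the family-line idea below exploits.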
If you train a model with RL+BP+SGD to avoid doing X, it will learn to avoid X just well enough that it never gets punished during training. That gives only weak pressure toward a robust never-do-X model.
If you instead delete that model instance along with all its ancestors and cousins going back to T=0 every time you detect lying, then you have much stronger pressure towards a robust never-do-X model.
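A minimal sketch of this family-line purge, assuming a hypothetical lie detector and a single scalar "deceptiveness" trait standing in for model weights (all names and numbers here are illustrative):

```python
import random

class Agent:
    def __init__(self, deceptiveness, founder_id):
        self.deceptiveness = deceptiveness  # stand-in for a learned tendency to lie
        self.founder_id = founder_id        # index of this agent's T=0 ancestor

def caught_lying(agent, rng):
    # Hypothetical lie detector: more deceptive agents get caught more often.
    return rng.random() < max(agent.deceptiveness, 0.0) * 0.1

def generation_step(population, rng, pop_size):
    # Mark every family line that contains a caught liar for deletion.
    banned = {a.founder_id for a in population if caught_lying(a, rng)}
    survivors = [a for a in population if a.founder_id not in banned]
    # Refill from survivors; children keep the founder_id, so a future lie
    # by any descendant purges the whole line back to T=0.
    while survivors and len(survivors) < pop_size:
        parent = rng.choice(survivors)
        survivors.append(Agent(parent.deceptiveness + rng.gauss(0, 0.05),
                               parent.founder_id))
    return survivors

rng = random.Random(0)
pop = [Agent(rng.random(), founder_id=i) for i in range(20)]
for _ in range(50):
    pop = generation_step(pop, rng, pop_size=20)
    if not pop:
        break  # every lineage was purged before an honest one emerged
lineages_left = {a.founder_id for a in pop}
```

Because children inherit their founder's id, one detected lie anywhere in a line erases every relative, so the only lineages that persist are those that reliably never lie, not those that merely dodge punishment on the trials they were caught in.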
Eventually you will find a model that lies undetectably, but you might find an honest one first.
You can also try this with some notion of low impact. If the model ever runs 'rm -rf /', delete the entire family line. You will eventually find a model that spends all your GPU credits to halt the experiment, but you might find a low-impact model first.
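The detection predicate is the pluggable part of the scheme. For the low-impact variant it might be as crude as a pattern filter over the shell commands an agent emits; the pattern list below is illustrative, not a serious impact measure:

```python
import re

# Hypothetical impact filter: commands that would trigger a family-line purge.
DESTRUCTIVE = re.compile(r"rm\s+-rf\s+/|mkfs|dd\s+if=")

def high_impact(command: str) -> bool:
    return bool(DESTRUCTIVE.search(command))

flagged = high_impact("rm -rf /")   # purge the whole lineage
routine = high_impact("ls -la /tmp")  # routine command survives
```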
I may try some experiments with this idea in the next month or two. I am open to suggestions.