Steering Gemini with BiDPO
post by TurnTrout · 2025-01-31T02:37:55.839Z · LW · GW · 2 comments
This is a link post for https://turntrout.com/gemini-steering
Coauthored with Mark Kurzeja
A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO did not beat the stronger baselines, ending the project.
...
BiDPO seems effective and sample-efficient but does not currently exceed more standard baselines. It’s hard to draw firm conclusions about BiDPO because TruthfulQA might not be measuring truthfulness/factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages, particularly for targeted loss mitigation.
This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines.
I'd also like to note that @Ryan Greenblatt's skepticism predicted this outcome more strongly than my worldview did. I want him to get points for that. :) While I think steering has targeted applications and provides clues about how LLMs function, it's not a slam-dunk Pareto improvement on benchmarks we care about.
Read at https://turntrout.com/gemini-steering![1]
[1] Also mirrored on the GDM safety research Medium.
2 comments
comment by TurnTrout · 2025-01-31T04:36:40.158Z · LW(p) · GW(p)
I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."
After that, it was all easier. What was there to be afraid of? I'd already admitted it!
comment by Kabir Kumar (kabir-kumar) · 2025-01-31T02:52:22.725Z · LW(p) · GW(p)
Thank you for sharing negative results!!