Steering Gemini with BiDPO

turntrout

post by TurnTrout · 2025-01-31T02:37:55.839Z · LW · GW · 5 comments

This is a link post for https://turntrout.com/gemini-steering


Coauthored with Mark Kurzeja

A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO did not beat the stronger baselines, ending the project.
...
BiDPO seems effective and sample-efficient but does not currently exceed more standard baselines. It’s hard to draw firm conclusions about BiDPO because TruthfulQA might not be measuring truthfulness/factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages, particularly for targeted loss mitigation.

This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines.

I'd also like to note that @ryan_greenblatt's skepticism predicted this outcome more strongly than my worldview did. I want him to get points for that. :) While I think steering has targeted applications and provides clues about how LLMs function, it's not a slam-dunk Pareto improvement on benchmarks we care about.

Read at https://turntrout.com/gemini-steering! [1]

[1] Also mirrored on the GDM safety research Medium.

5 comments

Comments sorted by top scores.

comment by TurnTrout · 2025-01-31T04:36:40.158Z · LW(p) · GW(p)

I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."

After that, it was all easier. What was there to be afraid of? I'd already admitted it!

↑ comment by Eli Tyre (elityre) · 2025-02-02T05:43:10.552Z · LW(p) · GW(p)

I find your commitment to the basics of rational epistemology inspiring.

Keep it up and let me know if you could use support.

comment by Kabir Kumar (kabir-kumar) · 2025-01-31T02:52:22.725Z · LW(p) · GW(p)

Thank you for sharing negative results!!

comment by lemonhope (lcmgcd) · 2025-02-04T05:53:10.157Z · LW(p) · GW(p)

What do you think is the ideal use-case for steering? Or is it not needed?

comment by Martin Vlach (martin-vlach) · 2025-01-31T14:53:25.712Z · LW(p) · GW(p)

The link to https://www.alignmentforum.org/users/ryan_greenblatt seems malformed: it uses - instead of _.
