Goodhart's Law Example: Training Verifiers to Solve Math Word Problems

post by Chris_Leong · 2023-11-25T00:53:26.841Z · LW · GW · 2 comments

This is a link post for



Sharing because this paper demonstrates so clearly the risk of searching over too many candidate solutions:

"At test time, we can choose to generate arbitrarily many solutions to be judged by the verifier before selecting the highest-ranked completion. Figure 7a shows how 6B verifier performance varies with the number of completions per test problem. At this scale, performance improves as we increase the number of completions up to 400. Beyond this point, performance starts to decrease. This suggests that the benefits of search are eventually outweighed by the risk of finding adversarial solutions that fool the verifier. In general, we evaluate verifier test performance using 100 completions, since this captures most of the benefits of verification with a relatively modest compute cost."

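To make the setup in the quote concrete, here is a minimal Python sketch of best-of-n selection. The `Completion` type, its field names, and the toy scores are illustrative assumptions rather than anything taken from the paper, where the scores come from a trained 6B verifier.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Completion:
    """One sampled solution (hypothetical container, not from the paper)."""
    final_answer: str
    verifier_score: float  # verifier's estimate that the solution is correct


def select_best(completions: Sequence[Completion]) -> Completion:
    """Best-of-n: return the completion with the highest verifier score."""
    if not completions:
        raise ValueError("need at least one completion")
    return max(completions, key=lambda c: c.verifier_score)


# Toy data with made-up scores; the correct answer is assumed to be "42".
small_pool = [
    Completion("42", 0.91),
    Completion("40", 0.35),
    Completion("42", 0.88),
]
# A larger search adds a wrong solution that the verifier happens to overrate.
large_pool = small_pool + [Completion("17", 0.97)]
```

With these toy numbers, `select_best(small_pool)` returns an answer of "42", while `select_best(large_pool)` returns the overrated "17". More search only helps as long as the top verifier score keeps tracking correctness.
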
They propose and evaluate a solution to this:

"To further increase performance, we can take a majority vote among the top verifier-ranked solutions instead of selecting only the single top solution. This voting process considers only the final answer reached by the individual solutions: the final answer selected is the one with the most votes. Figure 7b shows how performance varies as we allow a greater number of top samples to cast a vote. Unsurprisingly, when starting with a greater number of samples, we can afford to allow a greater number of samples to cast a vote. When we have only 100 samples, it is optimal to allow only the top 3-5 samples to cast a vote. When we have 3200 samples, it is approximately optimal to allow the top 30 to cast a vote."

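The voting step described in the quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Completion` type, the function name `vote_among_top_k`, and the toy scores are hypothetical, and the tie-breaking rule (ties go to the answer that appears first among the higher-ranked solutions) is my choice, not something the quote specifies.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Completion:
    """One sampled solution (hypothetical container, not from the paper)."""
    final_answer: str
    verifier_score: float  # verifier's estimate that the solution is correct


def vote_among_top_k(completions: Sequence[Completion], k: int) -> str:
    """Majority vote over the final answers of the k highest-scored completions."""
    if not completions:
        raise ValueError("need at least one completion")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank by verifier score, best first, and keep only the top k voters.
    top_k = sorted(completions, key=lambda c: c.verifier_score, reverse=True)[:k]
    # Each voter contributes one vote for its final answer.
    votes = Counter(c.final_answer for c in top_k)
    # Assumption: ties resolve to the answer seen first in rank order.
    return votes.most_common(1)[0][0]


# Toy data with made-up scores; the correct answer is assumed to be "42",
# and "17" stands in for a wrong solution that fools the verifier.
pool = [
    Completion("17", 0.99),
    Completion("42", 0.95),
    Completion("42", 0.93),
    Completion("42", 0.90),
    Completion("8", 0.20),
]
```

With `k=1` this reduces to picking the single top-ranked solution, so `vote_among_top_k(pool, 1)` returns "17". With `k=3` the two votes for "42" outnumber the single overrated outlier, so `vote_among_top_k(pool, 3)` returns "42".
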
Further empirical investigation into Goodhart's Law could prove valuable for alignment.


Comments sorted by top scores.

comment by Mateusz Bagiński (mateusz-baginski) · 2023-11-25T10:03:37.699Z · LW(p) · GW(p)

Typo: it's "Goodhart", not "Goodheart"

comment by quila · 2023-11-25T16:03:42.218Z · LW(p) · GW(p)

I'm curious what the adversarial examples are like