Closed-ended questions aren't as hard as you think

post by electroswing · 2025-02-19T03:53:11.855Z

Contents

  Summary
  #1 Selection bias causes questions to skew easy
  #1A Subpoint: it's tough to write closed-ended questions about problems that are still partially open (e.g., gap between lower and upper bounds)
  #2 Easy jargon-heavy questions overrepresented, difficult but deceptively simple questions underrepresented 
  Conclusion

Summary

In this short post, I argue that closed-ended questions, even ones intended to be arbitrarily difficult, are not as hard as they may appear. In particular, I argue that the benchmark HLE (Humanity's Last Exam) is probably easier than it first seems.[1]

Specifically, I argue:

  1. Selection bias causes questions to skew easy.
  2. Easy jargon-heavy questions are overrepresented, while difficult but deceptively simple questions are underrepresented.

My background is in mathematics, so in this post I'll be focusing on issues that arise in math question-writing. (Currently, HLE is 41% math questions.) 

#1 Selection bias causes questions to skew easy

HLE questions are crowdsourced. They are written by crowd workers (e.g., random PhD students with a free evening) and evaluated by a noisy process (time-constrained Scale AI employees and LLMs). 

Crowd workers are incentivized to get as much prize money as possible. Initially, HLE offered $500,000 in prize money: $5,000 each for the top 50 submissions, and $500 each for the next 500 best submissions. Most people are risk averse. Given the structure of the prizes, why sink all of your time into writing one really good question when you can instead submit several mediocre questions (and potentially win multiple prizes)?
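To make this concrete, here is a toy expected-value sketch. The prize amounts are taken from the contest above; every probability is an invented assumption, chosen only to illustrate how several quick submissions can beat one polished one, both in expectation and in the chance of winning anything at all:

```python
# Toy model of the HLE prize incentives. The prize amounts come from the
# contest; ALL hit-rate probabilities below are invented for illustration.
PRIZE_TOP_50 = 5_000
PRIZE_TOP_500 = 500

def expected_prize(p_top_50: float, p_top_500: float) -> float:
    """Expected prize money for a single submitted question."""
    return p_top_50 * PRIZE_TOP_50 + p_top_500 * PRIZE_TOP_500

# One polished question vs. five quick ones (hypothetical hit rates).
ev_one_polished = expected_prize(0.08, 0.25)
ev_five_quick = 5 * expected_prize(0.01, 0.15)

# Chance of winning at least one prize, treating questions as independent.
p_any_polished = 0.08 + 0.25
p_any_quick = 1 - (1 - (0.01 + 0.15)) ** 5

print(f"EV, one polished question: ${ev_one_polished:.0f}")  # $525
print(f"EV, five quick questions:  ${ev_five_quick:.0f}")    # $625
print(f"P(any prize), polished: {p_any_polished:.2f}")       # 0.33
print(f"P(any prize), quick:    {p_any_quick:.2f}")          # 0.58
```

Under these made-up numbers the spray-and-pray strategy wins on both metrics, which is exactly what risk aversion predicts.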

Thus, the median person probably submitted a couple of "nice" questions they had on hand: questions that are easy to state and easy to write a solution for.[2] They probably didn't go through the difficult exercise of thinking: what are some of the more thorny concepts in my subfield? How might I turn these into a tricky closed-ended question?

The question set is probably pretty good overall! My point is just that, conditional on a question coming from a specific expertise area, it probably skews easy, due to selection bias. 

#1A Subpoint: it's tough to write closed-ended questions about problems that are still partially open (e.g., gap between lower and upper bounds)

In combinatorics and theoretical computer science,[3] many questions are phrased in terms of giving lower and upper bounds. For example, someone might ask: "What is the greatest possible number of stable matchings that a single matching market instance can have?" (Knowing the problem details is not necessary here. If you like, replace the question with: "What is the greatest possible number of X that a problem instance can have?")

This is currently an open question. The best known lower bound is roughly $2.28^n$, that is, there exists a matching market instance of size $n$ with that many stable matchings (source). The best known upper bound is $c^n$ for some (much larger) constant $c$, that is, it has been mathematically proven that no matching market instance of size $n$ can have more than that many stable matchings (source). 

Because this problem is open, it's tough to pose a closed-ended question about it. 

  1. You can't ask for the best possible lower bound. What if there is a construction better than $2.28^n$? There is no way to verify this in a closed-ended environment.
  2. You can't ask for a proof of an upper bound. There are potentially many different ways to prove upper bounds for this problem, even if the true bound is $\Theta(2.28^n)$, and you have no way of asking this question in a closed-ended way.
  3. You could ask for a construction that yields $2.28^n$ stable matchings (it's probably possible to get around uniqueness concerns)... but this question is much easier than the full research question. It's easier to solve a problem like this when you have a target to aim for. (A brute-force verifier for small instances is sketched below.)
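To see why option 3 is at least verifiable, here is a minimal brute-force sketch (my own illustration, not anything from HLE) that counts the stable matchings of a tiny instance; a grader could run something like this to check a proposed small construction against a known target. The 3×3 example is a standard Latin-square-style instance with exactly three stable matchings:

```python
from itertools import permutations

def count_stable_matchings(men_prefs, women_prefs):
    """Brute-force count of stable matchings (feasible only for tiny n).

    men_prefs[m] is man m's ranking of women (best first);
    women_prefs[w] is woman w's ranking of men.
    """
    n = len(men_prefs)
    m_rank = [{w: r for r, w in enumerate(p)} for p in men_prefs]
    w_rank = [{m: r for r, m in enumerate(p)} for p in women_prefs]

    count = 0
    for wives in permutations(range(n)):  # wives[m] = woman matched to man m
        husband = {w: m for m, w in enumerate(wives)}
        stable = True
        for m in range(n):
            for w in range(n):
                if w == wives[m]:
                    continue
                # (m, w) is a blocking pair if both strictly prefer
                # each other to their assigned partners.
                if (m_rank[m][w] < m_rank[m][wives[m]]
                        and w_rank[w][m] < w_rank[w][husband[w]]):
                    stable = False
                    break
            if not stable:
                break
        count += stable
    return count

# A 3x3 Latin-square instance, known to admit three stable matchings.
men = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
women = [[1, 2, 0], [2, 0, 1], [0, 1, 2]]
print(count_stable_matchings(men, women))  # 3
```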

Overall, whenever a research question is open in this way -- lower and upper bounds with a gap -- the only closed-ended questions one can pose capture just the easiest parts of the problem.

#2 Easy jargon-heavy questions overrepresented, difficult but deceptively simple questions underrepresented 

Quoting myself from several paragraphs ago:

HLE questions are crowdsourced. They are written by crowd workers (e.g., random PhD students with a free evening) and evaluated by a noisy process (time-constrained Scale AI employees and LLMs). 

In math, sometimes problems sound easy but are very difficult. See, for example, Erdős problems. A time-constrained question evaluator, even if they are an expert in a similar area, might not be able to fully grok the difficulty of a question from the problem statement and solution description alone. 

In particular, things that can be hard to accurately estimate include:

  1. How difficult is this question, really? (Deceptively simple statements can hide very hard problems.)
  2. How much of the question's difficulty is genuine problem-solving, as opposed to familiarity with specialized background material?

What HLE evaluators are trying to select for is difficult, "research-level" questions. It's tough to answer the above two questions precisely, so inevitably they will have to use proxies. One practical proxy is how jargon-heavy a question is. (There may be others, such as solution length, but I am most confident in the point about jargon.) 

Conclusion

For the above reasons, my model of the math questions in HLE is currently "test questions for first- and second-year PhD students" -- that is, similar in difficulty to GPQA.[4] 

Accordingly, I view the design of open-ended STEM benchmarks as the next big open problem on the path to building STEM AI. 

 

  1. ^

    I'm not saying HLE isn't hard or that it isn't a useful benchmark! It is both. I am just recommending that people consider these points and, if appropriate, downweight their perception of how difficult the benchmark is. 

  2. ^

    This is the case for the ~5 people I know who seriously submitted questions.

  3. ^

    I imagine similar principles apply to fields outside of combinatorics and theoretical computer science. For example, in physics, people often apply approximations in creative ways, and it might be similarly difficult to write closed-ended questions eliciting those skills. 

  4. ^

    Comparing accuracy numbers on GPQA and HLE directly is misleading. GPQA consists of multiple-choice questions with four options, while HLE can be open-ended. (And with GPQA, the four answer choices can leak a lot of information about how to solve the question -- and random guessing alone already scores 25%.) 
