Novel Idea Generation in LLMs: Judgment as Bottleneck
post by Davey Morse (davey-morse) · 2025-04-19T15:37:33.319Z
In the face of any hard problem—reversing climate change, curing cancer, or starting a great novel—modern LLMs can generate thousands of possible solutions relatively cheaply.
Most solutions from most prompts are bad: they’re not new relative to the state of the art, not feasible, or not significant enough.
But for every thousand ideas an LLM has about how to solve a problem, a few are likely to be good.
Now that LLMs are idea-generation machines, able to produce ideas so cheaply even if most are bad, the thing preventing us from waking up with promising solutions to climate change (or whichever problem you care about) in our inbox comes down to an LLM's ability to pick the few good ideas out of a thousand crap ones. In other words, I'd guess the rate-limiting step isn't generating good ideas but choosing, from among a thousand mostly random ones, the promising few.
At least, that was the bottleneck I perceived in my recent Oscillating Creativity Machine experiment. You give the machine a problem, like solving climate change, and it runs through ten rounds: it generates three possible solutions, picks the most interesting one, then generates three variations of that, and picks one again—ten times in all.
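For concreteness, here's a minimal sketch of that loop in Python. The `llm` callable, the exact prompts, and the fallback when the pick isn't parseable are my own guesses at a plausible implementation, not the machine's actual code:

```python
from typing import Callable

def oscillate(problem: str, llm: Callable[[str], str],
              branches: int = 3, rounds: int = 10) -> str:
    """Alternate between generating variations (chaos) and picking one (order)."""
    current = problem
    for _ in range(rounds):
        # Chaos: generate a handful of candidate solutions or variations.
        candidates = [
            llm(f"Propose one novel solution to, or variation on: {current}")
            for _ in range(branches)
        ]
        # Order: ask the model to judge which candidate is most interesting.
        numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
        choice = llm(
            f"Problem: {problem}\nCandidates:\n{numbered}\n"
            "Reply with only the number of the most interesting candidate."
        )
        # Fall back to the first candidate if the judge's reply isn't a valid number.
        try:
            current = candidates[int(choice.strip()) - 1]
        except (ValueError, IndexError):
            current = candidates[0]
    return current
```

Note that in this framing the same model plays both roles, generator and judge, which is exactly why weak judgment caps the whole loop.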
My hypothesis was that if it could pick well at each stage, it could converge on something great even if the generated ideas sucked. That's sort of what people do in the shower, in their sleep, or on walks when they're stuck and then have eureka moments: we oscillate between chaos (possibilities) and order (pruning, picking one path). That's roughly what I do when I come up with ideas I like.
But the Oscillating Creativity Machine didn’t work well in my tests. It didn’t end up picking a great idea—at least, not by my standards. And it didn’t come up with great ideas, which may have been a limiting factor too.
PhDs build their judgment over decades of real-life experience: being exposed to experiments at the edge of their knowledge, then getting a sense of which new ideas or data actually help solve the problems they set out to solve. LLMs, at least as I've prompted ChatGPT, Claude, and Grok 3 so far, don't seem to have that judgment in the face of unsolved problems.
If LLMs could judge ideas well against unsolved challenges, we could have them generate millions of ideas for every problem that matters, then rate each and surface only the best. You could imagine this automated solution‑discovery process as the key to human flourishing.
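To make the shape of that pipeline concrete, here's a hypothetical sketch. The `generate` and `judge` callables are placeholders I'm assuming; `judge` in particular stands in for exactly the judgment capability that doesn't yet exist:

```python
import heapq
from typing import Callable

def surface_best(problem: str,
                 generate: Callable[[str], str],
                 judge: Callable[[str, str], float],
                 n_ideas: int = 1000, top_k: int = 5) -> list[str]:
    """Generate many cheap ideas, then keep only the highest-judged few."""
    ideas = [generate(f"Propose one novel solution to: {problem}")
             for _ in range(n_ideas)]
    # Everything hinges on `judge`: a scorer that can reliably rate ideas
    # against an unsolved problem is the missing piece this post is about.
    return heapq.nlargest(top_k, ideas, key=lambda idea: judge(problem, idea))
```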
But so far, judgment has been hard to offload to LLMs. Maybe all that's required is RLHF with a ton of expert data. I'm curious who's made the most progress here.