Ideas for benchmarking LLM creativity
post by gwern · 2024-12-16T05:18:55.631Z · LW · GW · 10 comments
This is a link post for https://gwern.net/creative-benchmark
Comments sorted by top scores.
comment by RogerDearnaley (roger-d-1) · 2024-12-18T08:38:35.199Z · LW(p) · GW(p)
People who train text-to-image generative models have had a good deal of success with training (given a large enough and well-enough human-labeled training set) an "aesthetic quality" scoring model, and then training a generative image model to have "high aesthetic quality score" as a text label. Yes, doing things like this can produce effects like the recognizable Midjourney aesthetic, which can be flawed, and generally optimizing such things too hard leads to sameness — but if trained well such models' idea of aesthetic quality is at least pretty close to most human judgements. Presumably what can be done for images can also be done for prose, poetry, or fiction.
There isn't a direct equivalent of that approach for an LLM, but RLHF comes fairly close. So far people have primarily used RLHF for "how good is the answer to my question?" Adapting a similar approach to "how high quality is the poetry/prose/fiction produced by the model?" is obviously feasible. Then you just need a large number of high quality human judgements from a representative cross-section of people with good taste in poetry/prose/fiction: hiring professional human editors or literary talent scouts seems like a good idea. One of the good things about foundation model sizes and training costs going up is that reasonable budgets for fine-tuning should also increase proportionately.
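For concreteness, a minimal sketch of the kind of quality-scoring reward model this describes, assuming pairwise editor judgements ("which of these two passages is better?"); the base model ("gpt2") and the data format are placeholders, not a recommendation:

```python
# Minimal sketch: train a scalar "literary quality" reward model from pairwise
# human judgements, Bradley-Terry / RLHF-reward-model style.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tok.pad_token_id
opt = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

def score(texts):
    # One scalar "quality" score per passage.
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return reward_model(**batch).logits.squeeze(-1)

def train_step(chosen, rejected):
    # Push the preferred passage's score above the rejected passage's score.
    loss = -F.logsigmoid(score(chosen) - score(rejected)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# pairs = [("passage the editors preferred", "passage they rejected"), ...]
# for chosen, rejected in pairs: train_step([chosen], [rejected])
```

Once trained, such a scorer can serve both as the RL reward and as a standalone evaluation metric.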
Another option would be to train or fine-tune the quality scoring model used for the RL on literary sources (books, poetry, etc) with quality labels drawn from relatively objective existing data, like total sales, literary awards, critical rankings, reviews from good reviewers, and so forth.
The RLHF approach only trains a single aesthetic, and probably shouldn't be taken too far or optimized too hard: while there is some widespread agreement about what prose is good vs. dreadful, finer details of taste vary, and should do so. So the obvious approach for finer-grained style control would be to train or fine-tune on a training set of a large number of documents, each of which consists of a prompt-like description/review/multiple reviews of a literary work, giving a variety of different types of aesthetic opinions and objective measures of its quality, followed by the corresponding literary work itself.
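A sketch of what one such conditioned training document could look like; the field names and layout here are invented for illustration, not a claimed standard:

```python
# Build a training document of the form: reviews + reception metadata, then the work itself,
# so the model learns p(work | reviews, sales, awards).
def make_conditioned_document(work_text, reviews, sales, awards):
    header_lines = (
        ["Reviews:"]
        + [f"- {r}" for r in reviews]
        + [f"Total sales: {sales}",
           f"Awards: {', '.join(awards) if awards else 'none'}",
           "Full text:"]
    )
    return "\n".join(header_lines) + "\n" + work_text

doc = make_conditioned_document(
    work_text="Call me Ishmael. [...]",
    reviews=["A sprawling, digressive masterpiece.", "Tedious whaling trivia."],
    sales=50_000,
    awards=[],
)
# Train on many such documents; at inference time, write a header describing the
# aesthetic and reception you want, and let the model continue with the work.
```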
These ideas have been phrased as model-post-training suggestions, but turning these into a benchmark is also feasible: the "Aesthetic quality scoring model" from the RLHF approach is in itself a benchmark, and the "prompt containing reviews and statistics -> literary work" approach could also be inverted to instead train a reviewer model to review literary works from various different aesthetic viewpoints, and estimate their likely sales/critical reception.
Replies from: gwern
↑ comment by gwern · 2024-12-18T19:17:48.603Z · LW(p) · GW(p)
but if trained well such models' idea of aesthetic quality is at least pretty close to most human judgements
That does not follow. Preference learning involves almost no learning of preferences. A suit cut to fit all may wind up fitting none - particularly for high-dimensional things under heavy optimization, like, say, esthetics, where you want to apply a lot of selection pressure to get samples which are easily 1-in-10,000 or rarer, and so 'the tails come apart'.
(How much variance is explained by individual differences in preference-learning settings like comparing image generators? A great question! And you'll find that hardly anyone has any idea. As it happens, I asked the developer of a major new image generator this exact question last night, and not only did he have no idea, it looked like it had never even occurred to him to wonder what the performance ceiling without personalization could be, or to what extent all of the expensive ratings they were paying for reflected individual rater preferences rather than some 'objective' quality, or whether they were even properly preserving such metadata rather than, as it seems many tuning datasets do, throwing it out as 'unnecessary'.)
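For anyone who does have a ratings dataset with rater IDs, a back-of-the-envelope version of this question (how much rating variance is shared "quality" versus individual rater differences?) might look like the following; it assumes a dense items × raters score matrix, which real preference datasets rarely preserve:

```python
# Rough two-way variance decomposition of a ratings matrix into item effect
# ("shared quality"), rater effect (per-rater bias), and residual (taste
# interactions + noise). All names and the fake data are illustrative only.
import numpy as np

def variance_decomposition(ratings):                 # shape: (n_items, n_raters)
    grand = ratings.mean()
    item_effect = ratings.mean(axis=1, keepdims=True) - grand   # shared "quality"
    rater_effect = ratings.mean(axis=0, keepdims=True) - grand  # per-rater bias
    residual = ratings - grand - item_effect - rater_effect     # interaction + noise
    total = ((ratings - grand) ** 2).mean()
    return {
        "item (shared quality)": (item_effect ** 2).mean() / total,
        "rater (individual bias)": (rater_effect ** 2).mean() / total,
        "residual (taste + noise)": (residual ** 2).mean() / total,
    }

rng = np.random.default_rng(0)
fake = rng.normal(size=(200, 20)) + rng.normal(size=(200, 1))   # item signal + noise
print(variance_decomposition(fake))
```

Note that genuinely individual tastes show up in the residual (the item × rater interaction), which is exactly the part a single pooled quality score throws away.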
but if trained well such models' idea of aesthetic quality is at least pretty close to most human judgements....Then you just need a large number of high quality human judgements from a representative cross-section of people with good taste in poetry/prose/fiction: hiring professional human editors or literary talent scouts seems like a good idea. One of the good things about foundation model sizes and training costs going up is that reasonable budgets for fine-tuning should also increase proportionately.
No. This is fundamentally wrong and what is already being done and what I am criticizing. There is no single 'taste' or 'quality'. Individual differences are real.{{citation needed}} People have different preferences.{{citation needed}} No change in the 'cross-section' changes that (unless you reduce the 'people' down to 1 person, the current user). All you are doing is again optimizing for the lowest common denominator. Changing the denominator population doesn't change that.
Seriously, imagine applying this logic anywhere else, like food!
Another option would be to train or fine-tune the quality scoring model used for the RL on literary sources (books, poetry, etc) with quality labels drawn from relatively objective existing data, like total sales, literary awards, critical rankings, reviews from good reviewers, and so forth...So the obvious approach for finer-grained style control would be to train or fine-tune on a training set of a large number of documents, each of which consists of a prompt-like description/review/multiple reviews of a literary work, giving a variety of different types of aesthetic opinions and objective measures of its quality, followed by the corresponding literary work itself.
Conditioning won't change the mode collapse, except as you are smuggling in individuals by the backdoor like developing an implicit model of individual reviewers' preferences. (In which case, far better to just condition on all individuals...)
and generally optimizing such things too hard leads to sameness ...The RLHF approach only trains a single aesthetic, and probably shouldn't be taken too far or optimized too hard
Well, yes, that's the problem. It has been taken too far and optimized too hard for a single quality score, and that's where we are now already. How do we provide better benchmarks where optimizing harder won't just worsen the problem?
comment by wassname · 2024-12-18T05:05:29.648Z · LW(p) · GW(p)
Have you considered using an idea similar to the one in Schmidhuber's blog post "Artificial Curiosity & Creativity Since 1990-91"? Here you try to assess what might be called "learnable compression", "reducible surprise", or "understandable novelty" (however you want to frame it).
If an LLM, which has read the entire internet, is surprised by a text, then that's a good start. It means the text is not entirely predictable, and therefore not boring.
But what about purely random text? That's unpredictable too, just like Einstein's Theory of General Relativity was. This is the noisy TV problem. So how do we distinguish between them? Well, Schmidhuber suggests that a text should be less surprising after you have read it. We could approximate this in LLMs by putting a summary in context, fine-tuning, adapter tuning, or similar.
This is a nice approach because it would work for detecting human slop, too, and would be much better than plagiarism detectors, which do not work.
I've had a few tries at implementing this using adapters, fine-tuning, in-context learning, etc. I managed to get some promising results with fine-tuning, but it's a pretty resource-intensive way to do it.
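A rough sketch of the cheapest variant mentioned above (a summary in context rather than adapter or full fine-tuning): score how much less surprising the text becomes once the model has "read" a summary of it. It needs token-level log-likelihoods, so an open-weights model is assumed; "gpt2", `document`, and `summary` are placeholders.

```python
# "Learnable surprise": mean NLL of the text cold, minus mean NLL of the text
# conditioned on a summary of it. Pure noise stays about equally surprising
# either way, so this score stays near zero for the noisy-TV case.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

document = "[...full text being scored for novelty...]"
summary = "[...a short summary of that same text...]"

@torch.no_grad()
def mean_nll(text, context=""):
    # Mean negative log-likelihood (nats/token) of `text`, optionally conditioned on `context`.
    txt_ids = tok(text, return_tensors="pt").input_ids
    if context:
        ctx_ids = tok(context, return_tensors="pt").input_ids
        ids = torch.cat([ctx_ids, txt_ids], dim=1)
        labels = ids.clone()
        labels[:, : ctx_ids.shape[1]] = -100   # don't score the context itself
    else:
        ids, labels = txt_ids, txt_ids.clone()
    return model(ids, labels=labels).loss.item()

surprise_cold = mean_nll(document)                     # surprising before "reading" it?
surprise_after = mean_nll(document, context=summary)   # ...and after?
learnable_surprise = surprise_cold - surprise_after
```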
Replies from: gwern
↑ comment by gwern · 2024-12-18T19:02:46.416Z · LW(p) · GW(p)
I am familiar with Schmidhuber's ideas, yes. But I had to come up with these alternatives because his would not work here, and I'm not sure they work anywhere.
His compression acceleration metric isn't too useful here, and most forms of 'compression' (or anything involving a likelihood) are not helpful here at all, because you don't have access to anything like that in most cases. For example, ChatGPT doesn't give you the full logits (actually, I'm not sure if they give them at all - I recall OA saying they were planning to expose them again in a very limited fashion, but not whether they actually did), and tuned models don't have logits, they have value estimates, which used to be log-likelihood-related logits but no longer are.
Any diversity/creativity benchmark which can't be run on ChatGPT & Claude & Gemini is dead on arrival and of no interest to me. We don't need numbers from the open-weights models, we need numbers on the models being used the most at the frontier and generating the most tokens worldwide that you'll be reading forever - the closed models, which do not give you such things as logits or whitebox finetuning etc. If it can't be done by calling a standard text completion API, then I ignored it.
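As one illustration of the shape such a benchmark has to take (not a metric from the linked post): everything is computed from sampled text alone, e.g. a crude mode-collapse score from pairwise n-gram overlap across repeated samples of the same prompt; the `call_api` helper is hypothetical.

```python
# Black-box diversity check: sample k continuations of one prompt from any
# completion API and measure how much they collapse onto each other.
from collections import Counter
from itertools import combinations

def ngrams(text, n=4):
    toks = text.split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def pairwise_overlap(samples, n=4):
    # Mean Jaccard overlap of n-grams across all pairs; ~1.0 means mode collapse.
    scores = []
    for a, b in combinations(samples, 2):
        ga, gb = ngrams(a, n), ngrams(b, n)
        inter = sum((ga & gb).values())
        union = sum((ga | gb).values())
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)

# samples = [call_api(prompt="Write a poem about the sea.", temperature=1.0)
#            for _ in range(20)]
# print(pairwise_overlap(samples))
```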
I am also doubtful that the compression metrics really work at finite samples or capture what we mean by creativity in generative models. Like all of Schmidhuber's work, he has never gotten it working on more than toy problems (if even that), and when I look at actual compression losses on text, like gzipped passages or the OA Playground highlighting words by their log likelihood, the high perplexity tokens or passages bear little resemblance to what I would consider 'interesting' or 'surprising'. (This is related to the question of 'if predicting tokens induces intelligence, and LLMs are now superhuman at predicting random Internet tokens, why are LLMs still not superhumanly intelligent?') People also try running compression metrics on programming language source code, and you get results like "Javascript is the best programming language", which is... counterintuitive, to say the least. So I am unsure his compression metrics would work without a lot of revising, while my proposed metrics seem a lot less risky and to map more directly onto what creative thinkers want out of generative models.
Replies from: wassname, wassname, anaguma
↑ comment by wassname · 2024-12-20T00:35:41.303Z · LW(p) · GW(p)
I pretty much agree. In my experiments I haven't managed to get a metric that scales how I expect it to. For example, when using adapter fine-tuning to "learn" a text and looking at the percent improvement in perplexity, the document openai_board_ann appeared more novel than Wikipedia on LK-99, but I would expect it to be the other way round, since the LK-99 observations are much more novel and dense than a corporate announcement that is designed to be vague.
However, I would point out that gzip is not a good example of a compression scheme for novelty, because 1) it's a compression scheme that roughly just captures literal duplication of words. A language model represents a much more sophisticated compression scheme that is closer to our understanding of the text. If we want to measure novelty to us, then we probably want a compressor that works similarly to how our brain compresses information into memory; that way, something surprising to us is also hard to compress. And I'd also point out that 2) gzip cannot learn (except in the very basic sense of increased context), so it cannot beat the noisy TV problem.
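A toy contrast between the two, to make the point concrete; nothing here is from the original discussion, it is purely illustrative:

```python
# gzip "novelty" only sees literal string duplication and never learns, so
# random noise is scored as maximally novel forever (the noisy-TV failure).
import gzip
import random
import string

def gzip_novelty(text, context=""):
    # Extra compressed bytes the text costs given the context (a crude conditional complexity).
    base = len(gzip.compress(context.encode()))
    both = len(gzip.compress((context + text).encode()))
    return both - base

redundant = "to be or not to be " * 20
noise = "".join(random.choices(string.ascii_lowercase + " ", k=len(redundant)))
print(gzip_novelty(redundant))   # small: gzip "gets" the repetition
print(gzip_novelty(noise))       # large: incompressible, so scored as highly "novel"
```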
Playground highlighting words by their log likelihood, the high perplexity tokens or passages bear little resemblance to what I would consider 'interesting' or 'surprising'.
I agree, but it doesn't learn, so it doesn't get past the noisy TV problem either, and getting past that is central to Schmidhuber's idea. If you are not familiar, the noisy TV problem is this:
"agents are rewarded for visiting regions of the state space that they have not previously occupied. If, however, a particular state transition is impossible to predict, it will trap a curious agent (Burda et al., 2019b; Schmidhuber, 1991a). This is referred to as the noisy TV problem (e.g. (Burda et al., 2019b; Schmidhuber, 1991a)), the etymology being that a naively curious agent could dwell on the unpredictability of a noisy TV screen" from "How to Stay Curious while avoiding Noisy TVs using Aleatoric Uncertainty Estimation"
So I am unsure his compression metrics would work without a lot of revising, while my proposed metrics seem a lot less risky and to map more directly onto what creative thinkers want out of generative models.
I agree, this is true of most of Schmidhuber's ideas. Often he doesn't even produce a toy model for years, which means the ideas are generally not very useful. I do like this one, though, and it has led to some implementations in RL.
I do agree, perplexity doesn't seem like a great place to start, and your ideas seem like a better way to measure.
↑ comment by wassname · 2024-12-19T01:45:22.102Z · LW(p) · GW(p)
While I broadly agree, I don't think it's completely dead, just mostly dead in the water. If an eval is mandated by law, then it will be run even if it requires logprobs. There are some libraries, like nnsight, that try to make it easier for trusted partners to run logprob evals remotely. And there might be privacy-preserving APIs at some point.
I do agree that commercial companies will never again open up raw logprobs to the public, as it allows easy behaviour cloning, which OpenAI experienced with all the GPT-4 students.
Replies from: gwern
↑ comment by gwern · 2024-12-19T19:20:00.515Z · LW(p) · GW(p)
If an eval is mandated by law, then it will be run even if it requires logprobs.
I won't hold my breath.
I think commercial companies often would open up raw logprobs, but there's not much demand, and the logprobs are not really logprobs anyway; the problem is that the leading model owners won't do so, and those are the important ones to benchmark. I have little interest in the creativity of random little Llama finetunes no one uses.
Replies from: wassname
↑ comment by anaguma · 2024-12-18T23:04:37.552Z · LW(p) · GW(p)
I believe that the OAI API does offer logprobs in the chat completions API (https://platform.openai.com/docs/api-reference/chat/create). Not sure about Anthropic.
Replies from: wassname
↑ comment by wassname · 2024-12-18T23:37:18.612Z · LW(p) · GW(p)
If true, returns the log probabilities of each output token returned in the content of message.
It seems like it only returns the logprobs of the chosen message, not of a counterfactual message. So you couldn't get the probability of the correct answer, only of the output answer. This makes sense: the less information they offer, the harder it is for a competitor to behaviour-clone their confidential model.
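For reference, a sketch of what the endpoint linked above exposes via the OpenAI Python SDK (details may go stale): logprobs for the tokens actually sampled, plus up to 20 top alternatives per position via top_logprobs, but no way to score an arbitrary counterfactual continuation. The model name is just an example.

```python
# Request logprobs from the chat completions endpoint. Only the sampled tokens
# (and their top-k alternatives) come back: no full logits, no scoring of a
# counterfactual message.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",          # example model name
    messages=[{"role": "user", "content": "Answer with a single letter, A or B: ..."}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,               # up to 20 alternatives per sampled position
)
first = resp.choices[0].logprobs.content[0]
print(first.token, first.logprob)        # the sampled token and its logprob
for alt in first.top_logprobs:           # nearby alternatives at that position only
    print(alt.token, alt.logprob)
```

For single-token answers the counterfactual option will often show up among the top alternatives, but there is no way to get the likelihood of a whole counterfactual message from this interface.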