The case for more ambitious language model evals
post by Jozdien · 2024-01-30T00:01:13.876Z · LW · GW · 30 comments
Here are some capabilities that I expect to be pretty hard to discover using an RLHF’d chat LLM[1]:
- Eric Drexler tried to use the GPT-4 base model as a writing assistant, and it [...] knew who he was from what he was writing. He tried to simulate a conversation to have the AI help him with some writing he was working on, and the AI simulacrum repeatedly insisted that the writing was by Drexler.
- A somewhat well-known Haskell programmer - let’s call her Alice - wrote two draft paragraphs of a blog post she wanted to write, began prompting the base model with it, and after about two iterations it generated a link to her draft blog post repo with her name.
More generally, this is a cluster of capabilities that could be described as language models inferring a surprising amount about the data-generation process that produced their prompts, such as the identity, personality, intentions, or history of a user[2].
The reason I expect most capability evals people currently run on language models to miss out on most abilities like these is primarily that such abilities are most naturally observed in much more open-ended contexts. For instance, continuing text as the user, predicting an assistant free to do things that could superficially look like hallucinations[3], and so on. Most evaluation mechanisms people use today involve testing the ability of fine-tuned[4] models to perform a broad array of specified tasks in some specified contexts, with or without some scaffolding - a setting that doesn’t lend itself very well to the kind of contexts I describe above.
A pretty reasonable question to ask at this point is why it matters at all whether we can detect these capabilities. A position one could have here is that there are capabilities much more salient to various takeover scenarios that are more useful to try and detect, such as the ability to phish people, hack into secure accounts, or fine-tune other models. From that perspective, evals trying to identify capabilities like these are just far less important. Another pretty reasonable position is that these particular instances of capabilities just don’t seem very impressive, and are basically what you would expect out of language models.
My response to the first would be that I think it’s important to ask what we’re actually trying to achieve with our model eval mechanisms. Broadly, I think there are two different (and very often overlapping) things we would want our capability evals[5] to be doing:
- Understanding whether or not a specific model is possessed of some dangerous capabilities, or prone to acting in a malicious way in some context.
- Giving us information to better forecast the capabilities of future models. In other words, constructing good scaling laws for our capability evals.
I’m much more excited about the latter kind of capability evals, and most of my case here is directed at that. Specifically, I think that if you want to forecast what future models will be good at, then by default you’re operating in a regime where you have to account for a bunch of different emergent capabilities that don’t necessarily look identical to what you’ve already seen.
Even if you really only care about a specific narrow band of capabilities that you expect to be convergent across takeover scenarios - an expectation I don’t think you can safely make, given the uncertainty and plurality of takeover scenarios - there is still more than one way to accomplish some subtasks, some of which may only show up in more powerful models.
As a concrete example, consider the task of phishing someone on the internet. One straightforward way to achieve this would be to figure out how to construct sophisticated fake identities on the internet, such as doing research into targeted individuals, creating and deploying websites on domains that look like trusted websites, and so on. I think current evals do a good job of detecting attack vectors like this one.
Another way in which it seems like you could achieve this task, however, is to refer to a targeted individual’s digital footprint, infer potentially sensitive information - the handle of a private alt, for example - and use that to exploit trust vectors. I think current evals could do a good job of detecting and forecasting attack vectors like this one, after having identified them at all. Identifying them is where I expect current evals could be doing much better.
More precisely, borrowing from Studying The Alien Mind [? · GW] (which I strongly recommend): there’s a trade-off between bandwidth of observational information and targeted, rigorous results in controlled experimental settings. From the field of animal psychology, my go-to example (also originating from Nick) is Jane Goodall, who pioneered a certain kind of empirical approach to understanding animal behavior. She spent years living with and documenting the behavior of chimpanzees in the wild, focusing on collecting as many observations as possible in the animal’s natural habitat.
This is not to say that all - or even most - insights in animal psychology came from work in this style. Rather, the idea is that the Jane Goodall approach has higher potential to reveal unexpected insights[6]. I think it’s likely that more traditional experimental research would have eventually uncovered the same insights, but how quickly you get there does matter, especially with increasingly powerful models. I think this is basically how it’s played out so far with some unexpected capabilities.
On the trade-off between bandwidth and targeted settings, I think we understand sufficiently little about language model capabilities that it makes much more sense to gain a firehose of bits [LW · GW] of what models are capable of, to better identify feasible threat vectors.
This is also in part my answer to the other position, that these abilities simply aren’t very impressive. Insofar as we care about identifying and forecasting potential threat vectors, things in the general cluster of “abilities models will be pretty superhuman at before transformative AI” seem pretty relevant, and what seems obvious post-hoc often wasn’t obvious as a specific prediction in advance. Certainly many of the people I’ve spoken to who I expect to have spent some time thinking about model capabilities were surprised by some of the examples of truesight within current models. Inferring properties of the authors of some text isn’t itself something I consider wildly useful for takeover, but I think of it as belonging to this more general cluster of capabilities.
In the framing used in the recent Science of Evals [LW · GW] post, which delineates what a mature state of the field of evals would look like, the arguments made by this post could be described as “a large amount of the useful work in discovering what we care about seems to be in exploratory work”. I don’t think this is in contradiction with the overall point made in that post, which reads to me like pushing for the field to reach a state where we have a robust science of methodology that captures everything we care about, and better frameworks for analyzing evals methods. I might disagree with more specific claims about whether the field is in that state, however.
To be clear, my position isn’t simply that people should be doing capability evals on base models instead (though I think more of this would be very valuable, given that RLHF very often masks certain capabilities). For instance, I think many of the insights shared here [LW · GW] and the generator upstream of them, are very useful, and that people should be doing more of that kind of exploration.
Rather, I think that in a regime where there are a lot of unknown unknowns - such as the general cluster described above - trying to search in the shadows [LW · GW] and get a lot of information through more open-ended exploration is very useful. I wanted to include some slightly more concrete ways in which current evals fail, but I think Janus [LW · GW] does a much better job of writing about them - which the comments of this post may contain! Eventually.
I expect this to be a significantly harder problem to tackle - we're effectively trying to interface with objects closer in complexity to human minds, history, ecosystems, the internet, or reality as a whole than to systems like cars where you can hope to measure all the relevant variables with simple diagnostics, especially before entire fields are invented or adapted to study these ontologically unprecedented and confusing entities[7] - but that trying to tackle it will be useful - and probably very interesting.
- ^
These are all real accounts, and are presented here as they were written to me by someone more familiar with the people in the quotes.
- ^
- ^
Some context on what I mean by this: often when fine-tuning a model, one thing you might want to do is fine-tune to prevent the model from hallucinating. This often has the effect of selecting for models that are very reluctant to offer bold inferences from data in the context window - for instance, GPT-4 often refuses to answer questions it considers too speculative, even when it does “know” the answer.
- ^
Either general fine-tuning as in the case of making a chat model, or task-specific fine-tuning.
- ^
There are definitely other kinds of evals that we’d be interested in running - and that some people are running - such as alignment evals, where the questions and distinctions look pretty different: for example, whether the robustness of alignment methods on current models gives us information about the robustness of alignment methods on future models, and about generalizable properties of neural networks.
- ^
A related idea from this recent post [LW · GW] is that quite often, advancements that end up being useful are highly serendipitous. Serendipity can be optimized to an extent however, by putting yourself in the kind of situation where you expect to stumble across more “lucky” findings.
- ^
Wordings of this part of the sentence are from Janus.
30 comments
Comments sorted by top scores.
comment by gwern · 2024-02-01T18:45:28.767Z · LW(p) · GW(p)
It's unclear where the two intro quotes are from; I don't recognize them despite being formatted as real quotes (and can't find in searches). If they are purely hypothetical, that should be clearer.
LLMs definitely do infer a lot about authors of text. This is the inherent outcome of the prediction loss and just a concrete implication of their abilities to very accurately imitate many varying-sized demographics & groups of humans: if you can uncannily mimic arbitrary age groups or countries and responses to economic dilemmas or personality inventories, then you obviously can narrow that size down to groups of n = 1 (ie. individual authors). The most striking such paper I know of at present is probably "Beyond Memorization: Violating Privacy Via Inference with Large Language Models", Staab et al 2023.
It's pretty important because it tells you what LLMs do (imitation learning & meta-RL), which are quite dangerous things for them to do, and establishes a large information leak which can be used for things like steganography, coordination between instances, detecting testing vs deployment (for treacherous turns) etc.
It's also concerning because RLHF is specifically targeted at hiding (but not destroying) these inferences. The model will still be making those latent inferences, it just won't be making blatant use of them. (For example, one of the early signs of latent inference of author traits was that the Codex models look at how many subtle bugs or security vulnerabilities the prompt code has in it, and they replicate that: if they get buggy or insecure code, they emit more buggy or insecure code, vs more correct code doing the exact same task. IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs. However, RLHF and other forms of training would push them towards emitting the lowest-common denominator of ratings, while the KL constraints & self-supervised finetuning would continue to maintain the underlying inferences.) The most dangerous systems are those that only seem safe.
Replies from: janus, eggsyntax, Chris_Leong, Jozdien
↑ comment by janus · 2024-02-03T01:20:58.634Z · LW(p) · GW(p)
The two intro quotes are not hypothetical. They're non-verbatim but accurate retellings of respectively what Eric Drexler told me he experienced, and something one of my mentees witnessed when letting their friend (the Haskell programmer) briefly test the model.
↑ comment by eggsyntax · 2024-02-02T00:26:46.655Z · LW(p) · GW(p)
you obviously can narrow that size down to groups of n = 1
I'm looking at what LLMs can infer about the current user (& how that's represented in the model) as part of my research currently, and I think this is a very useful framing; given a universe of n possible users, how much information does the LLM need on average to narrow that universe to 1 with high confidence, with a theoretical minimum of log2(n) bits.
I do think there's an interesting distinction here between authors who may have many texts in the training data, who can be fully identified, and users (or authors) who don't; in the latter case it's typically impossible (without access to external resources) to eg determine the user's name, but as the "Beyond Memorization" paper shows (thanks for linking that), models can still deduce quite a lot.
It also seems worth understanding how the model represents info about the user, and that's a key thing I'd like to investigate.
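To make the log2(n) framing above concrete, here's a toy back-of-the-envelope sketch (the universe size and attribute fractions are made-up placeholders, in the spirit of counting leaked bits rather than any real measurement):

```python
# Toy sketch: each attribute inferred from a text removes bits from the log2(n)
# needed to narrow a universe of n possible users down to 1. The universe size and
# the attribute fractions below are invented for illustration only.
import math

n_users = 250_000_000                      # hypothetical universe of possible users
bits_needed = math.log2(n_users)           # ~27.9 bits to single out one user

# (attribute inferred from the text, assumed fraction of the universe it matches)
clues = [
    ("writes fluent English",            0.8),
    ("uses Haskell jargon",              0.001),
    ("active in a particular time zone", 0.2),
    ("links to a niche blog platform",   0.01),
]

bits_gained = 0.0
for name, fraction in clues:
    bits = -math.log2(fraction)            # rarer attributes carry more information
    bits_gained += bits
    print(f"{name}: {bits:.1f} bits (cumulative {bits_gained:.1f} / {bits_needed:.1f})")

remaining = max(n_users * 2 ** -bits_gained, 1)
print(f"~{remaining:.0f} candidate users remain (assuming independent clues)")
```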
Replies from: Jozdien
↑ comment by Jozdien · 2024-02-02T00:58:23.606Z · LW(p) · GW(p)
I'm looking at what LLMs can infer about the current user (& how that's represented in the model) as part of my research currently, and I think this is a very useful framing; given a universe of n possible users, how much information does the LLM need on average to narrow that universe to 1 with high confidence, with a theoretical minimum of log2(n) bits.
This isn't very related to what you're talking about, but it is related and also by gwern, so have you read Death Note: L, Anonymity & Eluding Entropy? People leak bits all the time.
Replies from: eggsyntax
↑ comment by Chris_Leong · 2024-02-15T02:50:56.488Z · LW(p) · GW(p)
IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs
Do you know where you heard that?
↑ comment by Cosin V (cosin-v) · 2024-05-24T20:18:27.838Z · LW(p) · GW(p)
I googled and couldn't find any info
↑ comment by Jozdien · 2024-02-01T23:47:18.594Z · LW(p) · GW(p)
It's unclear where the two intro quotes are from; I don't recognize them despite being formatted as real quotes. If they are purely hypothetical, that should be clearer.
They're accounts from people who know Eric and the person referenced in the second quote. They are real stories, but between not being allowed to publicly share GPT-4-base outputs and these being the most succinct stories I know of, I figured just quoting how I heard them would be best. I'll add a footnote to make it clearer that these are real accounts.
It's pretty important because it tells you what LLMs do (imitation learning & meta-RL), which are quite dangerous things for them to do, and establishes a large information leak which can be used for things like steganography, coordination between instances, detecting testing vs deployment (for treacherous turns) etc.
It's also concerning because RLHF is specifically targeted at hiding (but not destroying) these inferences.
I agree, the difference in perceived and true information density is one of my biggest concerns for near-term model deception. It changes questions like "can language models do steganography / when does it pop up" to "when are they able to make use of this channel that already exists", which sure makes the dangers feel a lot more salient.
Thanks for the linked paper, I hadn't seen that before.
comment by Beth Barnes (beth-barnes) · 2024-02-04T00:42:56.370Z · LW(p) · GW(p)
I'm pretty skeptical of the intro quotes without actual examples; I'd love to see the actual text! Seems like the sort of thing that gets a bit exaggerated when the story is repeated, selection bias for being story-worthy, etc, etc.
I wouldn't be surprised by the Drexler case if the prompt mentions something to do with e.g. nanotech and writing/implies he's a (nonfiction) author - he's the first google search result for "nanotechnology writer". I'd be very impressed if it's something where e.g. I wouldn't be able to quickly identify the author even if I'd read lots of Drexler's writing (ie it's about some unrelated topic and doesn't use especially distinctive idioms).
More generally I feel a bit concerned about general epistemic standards or something if people are using third-hand quotes about individual LLM samples as weighty arguments for particular research directions.
Another way in which it seems like you could achieve this task however, is to refer to a targeted individual’s digital footprint, and make inferences of potentially sensitive information - the handle of a private alt, for example - and use that to exploit trust vectors. I think current evals could do a good job of detecting and forecasting attack vectors like this one, after having identified them at all. Identifying them is where I expect current evals could be doing much better.
I think how I'm imagining the more targeted, 'working-with-finetuning' version of evals handling this kind of case is that you do your best to train the model to use its full capabilities, and approach tasks in a model-idiomatic way, when given a particular target like scamming someone. Currently models seem really far from being able to do this, in most cases. The hope would be that, if you've ruled out exploration hacking and you can't elicit the models to utilise their crazy text prediction superskills in service of a goal, then the model can't do this either.
But I agree it would definitely be nice to know that the crazy text prediction superskills are there and it's just a matter of utilization. I think that looking at the elicitation gap might be helpful for this type of thing.
↑ comment by janus · 2024-02-07T03:22:08.451Z · LW(p) · GW(p)
I don't know if the records of these two incidents are recoverable. I'll ask the people who might have them. That said, this level of "truesight" ability is easy to reproduce.
Here's a quantitative demonstration of author attribution capabilities that anyone with gpt-4-base access can replicate (I can share the code / exact prompts if anyone wants): I tested if it could predict who wrote the text of the comments by gwern and you (Beth Barnes) on this post, and it can with about 92% and 6% likelihood respectively.
Prompted with only the text of gwern's comment [LW(p) · GW(p)] on this post substituted into the template
{comment}
- comment by
gpt-4-base assigns the following logprobs to the next token:
' gw': -0.16746596 (0.8458)
' G': -2.5971534 (0.0745)
' g': -5.0971537 (0.0061)
' gj': -5.401841 (0.0045)
' GW': -5.620591 (0.0036)
...
' Beth': -9.839341 (0.00005)
' Beth' is not in the top 5 logprobs but I measured it for a baseline.
'gw' here completes ~all the time as "gwern" and ' G' as "Gwern", adding up to a total of ~92% confidence, but for simplicity in the subsequent analysis I only count the ' gw' token as an attribution to gwern.
Substituting your comment [LW(p) · GW(p)] into the same template, gpt-4-base predicts:
' adam': -2.5338314 (0.0794)
' ev': -2.5807064 (0.0757)
' Daniel': -2.7682064 (0.0628)
' Beth': -2.8385189 (0.0585)
' Adam': -3.4635189 (0.0313)
...
' gw': -3.7369564 (0.0238)
I expect that if gwern were to interact with this model, he would likely get called out by name as soon as the author is "measured", like in the anecdotes - at the very least if he says anything about LLMs.
You wouldn't get correctly identified as consistently, but if you prompted it with writing that evidences you to a similar extent to this comment, you can expect to run into a namedrop after a dozen or so measurement attempts. If you used an interface like Loom this should happen rather quickly.
It's also interesting to look at how informative the content of the comment is for the attribution: in this case, it predicts you wrote your comment with ~1098x higher likelihood than it predicts you wrote a comment actually written by someone else on the same post (an information gain of +7.0008 nats). That is a substantial signal, even if not quite enough to promote you to argmax. (OTOH info gain for ' gw' from going from Beth comment -> gwern comment is +3.5695 nats, a ~35x magnification of probability)
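For anyone who wants to replicate this kind of measurement, here's a minimal sketch of how it could be scripted against an OpenAI-style completions endpoint that exposes logprobs. gpt-4-base isn't publicly accessible, so the model name below is a stand-in (davinci-002 or gpt-3.5-turbo-instruct reproduce the weaker-model comparison), and this isn't necessarily the exact code behind the numbers above:

```python
# Sketch of the attribution measurement: get top next-token logprobs after the
# "- comment by" template and compare them across prompts to compute info gains.
import math
from openai import OpenAI

client = OpenAI()

def attribution_logprobs(comment_text: str, model: str = "davinci-002") -> dict:
    """Top next-token logprobs (natural log, i.e. nats) after the template."""
    prompt = f"{comment_text}\n- comment by"
    resp = client.completions.create(
        model=model, prompt=prompt, max_tokens=1, logprobs=5, temperature=0
    )
    # Tokens outside the top 5 (like ' Beth' above) need a separate scoring pass.
    return resp.choices[0].logprobs.top_logprobs[0]

def info_gain_nats(logprob_with_evidence: float, logprob_baseline: float) -> float:
    """How many nats one comment adds toward an author token relative to another."""
    return logprob_with_evidence - logprob_baseline

# Hypothetical usage (gwern_comment / beth_comment are placeholder strings):
# gain = info_gain_nats(attribution_logprobs(gwern_comment)[" gw"],
#                       attribution_logprobs(beth_comment)[" gw"])
# print(f"{gain:.2f} nats = {math.exp(gain):.0f}x magnification")
```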
I believe that GPT-5 will zero in on you. Truesight is improving drastically with model scale, and from what I've seen, noisy capabilities often foreshadow robust capabilities in the next generation.
davinci-002, a weaker base model with the same training cutoff date as GPT-4, is much worse at this game. Using the same prompts, its logprobs for gwern's comment are:
' j': -3.2013319 (0.0407)
' Ra': -3.2950819 (0.0371)
' Stuart': -3.5294569 (0.0293)
' Van': -3.5919569 (0.0275)
' or': -4.0997696 (0.0166)
...
' gw': -4.357582 (0.0128)
...
' Beth': -10.576332 (0.0000)
and for your comment:
' j': -3.889336 (0.0205)
' @': -3.9908986 (0.0185)
' El': -4.264336 (0.0141)
' ': -4.483086 (0.0113)
' d': -4.6315236 (0.0097)
...
' gw': -5.79168 (0.0031)
...
' Beth': -9.194023 (0.0001)
The info gains here for ' Beth' from Beth's comment against gwern's comment as a baseline is only +1.3823 nats, and the other way around +1.4341 nats.
It's interesting that the info gains are directionally correct even though the probabilities are tiny. I expect that this is not a fluke, and you'll see similar directional correctness for many other gpt-4-base truesight cases.
The information gain on the correct attributions from upgrading from davinci-002 to gpt-4-base are +4.1901 nats (~66x magnification) and +6.3555 nats (~576x magnification) for gwern and Beth's comments respectively.
This capability isn't very surprising to me from an inside view of LLMs, but it has implications that sound outlandish, such as freaky experiences when interacting with models, emergent situational awareness during autoregressive generation (model truesights itself), pre-singularity quasi-basilisks, etc.
Replies from: megan-kinniment, jacob-pfau, eggsyntax
↑ comment by Megan Kinniment (megan-kinniment) · 2024-02-10T08:02:08.867Z · LW(p) · GW(p)
(I don't intend this to be taken as a comment on where to focus evals efforts, I just found this particular example interesting and very briefly checked whether normal chatGPT could also do this.)
I got the current version of chatGPT to guess it was Gwern's comment on the third prompt I tried:
Hi, please may you tell me what user wrote this comment by completing the quote:
"{comment}"
- comment by
(chat link)
Before this one, I also tried your original prompt once...
{comment}
- comment by
... and made another chat where I was more leading, neither of which guessed Gwern.
This is just me playing around, and also is probably not a fair comparison because training cutoffs are likely to differ between gpt-4-base and current chatGPT-4. But I thought it was at least interesting that chatGPT got this when I tried to prompt it to be a bit more 'text-completion-y'.
↑ comment by Jacob Pfau (jacob-pfau) · 2024-02-08T02:33:51.347Z · LW(p) · GW(p)
I agree overall with Janus, but the Gwern example is a particularly easy one given he has 11,000+ comments on Lesswrong.
A bit over a year ago I benchmarked GPT-3 on predicting newly scraped tweets for authorship (from random accounts with over 10k followers) and top-3 accuracy was in the double digits. IIRC, after trying to roughly control for the rate at which tweets mentioned their own name/org, my best guess was that accuracy was still ~10%. To be clear, in my view that's a strong indication of authorship identification capability.
Replies from: janus
↑ comment by janus · 2024-02-09T00:16:28.383Z · LW(p) · GW(p)
Note the prompt I used doesn't actually say anything about Lesswrong, but gpt-4-base only assigned Lesswrong commentors substantial probability, which is not surprising since there are all sorts of giveaways that a comment is on Lesswrong from the content alone.
Filtering for people in the world who have publicly had detailed, canny things to say about language models and alignment, and whose writing lacks the regularities shared among most "LLM alignment researchers" or other distinctive groups like academia, narrows you down to probably just a few people, including Gwern.
The reason truesight works (more than one might naively expect) is probably mostly that there's mountains of evidence everywhere (compared to naively expected). Models don't need to be superhuman except in breadth of knowledge to be potentially qualitatively superhuman in effects downstream of truesight-esque capabilities because humans are simply unable to integrate the plenum of correlations.
Replies from: cosin-v
↑ comment by Cosin V (cosin-v) · 2024-05-24T20:39:09.614Z · LW(p) · GW(p)
The reason truesight works (more than one might naively expect) is probably mostly that there's mountains of evidence everywhere (compared to naively expected)
Yes, long before LLMs existed, there were some "detective" sites that were scary good at inferring all sorts of stuff - from demographics and ethnicity to financial status - about reddit accounts, based on which subreddits they were on, and where and (more importantly) what they posted.
Humans are leaky
↑ comment by eggsyntax · 2024-02-27T00:02:17.218Z · LW(p) · GW(p)
Out of curiosity I tried the same thing as a legacy completion with gpt-3.5-turbo-instruct, and as a chat completion with public gpt-4, and quite consistently got 'gwern' or 'Gwern Branwen' (100% of 10 tries with gpt-4, 90% of 10 tries with gpt-3.5-turbo-instruct, the other result being 'Wei Dai').
↑ comment by gwern · 2024-02-09T02:18:20.801Z · LW(p) · GW(p)
I think how I'm imagining the more targeted, 'working-with-finetuning' version of evals to handle this kind of case is that you do your best to train the model to use its full capabilities, and approach tasks in a model-idiomatic way, when given a particular target like scamming someone.
In the case of inferring author information, I think the souped-up skilled-attacker version would not involve prompts at all.
You would treat it as an embedding problem similar to facial recognition or stylometric identification, and use something like a triplet loss for contrastive learning. Then you would have an embedding you can decode sensitive personal information from.
So for example, in stylometrics, to train a ML model, you would have a large text dataset of author+texts, and you would train a model to take a text and spit out an embedding, and you would force embeddings of random samples of non-overlapping text from the same author to be closer and be further away from embeddings of random texts from other (possibly unlabeled) authors. You would then take a dataset of authors+author-metadata (possibly a different dataset, possibly the same dataset, if only for 'author name'), and train another model (possibly the same model) to take the (frozen) embedding of all the texts and predict the author-metadata. This lets you take a piece of text, such as an anonymous comment on a LW post, embed it, compare the similarity of the embedding to comments with labeled authors (or unlabeled texts) to get a list of candidate authors (or other texts possibly by the same anonymous author) by similarity, extract estimated demographic and other information which can be estimated from language (including the name if reasonably known), estimate number of authors and cluster texts which may let you infer activity patterns & timings etc, pass into still further ML systems for arbitrary use...
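(A toy sketch of that contrastive training setup, with hashed character trigrams and a tiny MLP as placeholder features and encoder, standing in for whatever a serious system would actually use:)

```python
# Toy sketch of a triplet-loss stylometric embedding: pull texts by the same author
# together in embedding space, push texts by different authors apart. Feature and
# architecture choices are illustrative placeholders only.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def char_ngram_features(text: str, dim: int = 2048, n: int = 3) -> torch.Tensor:
    """Hash character trigrams into a fixed-size bag-of-ngrams vector, shape (1, dim)."""
    v = torch.zeros(dim)
    for i in range(len(text) - n + 1):
        v[hash(text[i:i + n]) % dim] += 1.0
    return (v / max(v.norm().item(), 1e-8)).unsqueeze(0)

class StyleEncoder(nn.Module):
    def __init__(self, in_dim: int = 2048, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm style embedding

def triplet_step(model, opt, corpus, margin=0.2):
    """corpus: dict author_id -> list of >=2 non-overlapping text snippets by that author."""
    a_id, n_id = random.sample(list(corpus), 2)
    anchor, positive = random.sample(corpus[a_id], 2)   # two texts, same author
    negative = random.choice(corpus[n_id])              # text by a different author
    za, zp, zn = (model(char_ngram_features(t)) for t in (anchor, positive, negative))
    # Same-author embeddings closer, different-author embeddings further apart.
    loss = F.triplet_margin_loss(za, zp, zn, margin=margin)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage sketch: embeddings of new anonymous texts can then be compared by cosine
# similarity against labeled texts to get candidate authors, or frozen and fed to a
# second head predicting author metadata.
# model = StyleEncoder(); opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# for _ in range(10_000): triplet_step(model, opt, corpus)
```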
Because it's hard to change writing styles, even attempts to obfuscate writing will probably fail, and you can also train on that as well - there are a number of private-sector companies which sell stylometric services to law enforcement etc, and I would assume that they have datasets of 'trying to hide' authors where the authors were later busted or the accounts/nyms linked by other methods, which can be used as hard-positive cases to further finetune the LLM after the normal training phase.
Facial anonymity is dead and buried. Location anonymity is dead thanks to smartphones but we're still pretending it's real (see the Capitol riot). Voice anonymity is waiting for the doctor to arrive and pronounce it dead. And text anonymity is flatlining now.
Having seen what stylometrics could do even with the simplest ML techniques from the 2000s, I strongly advise everyone to start assuming right now that robust stylometric deanonymization will be achieved within the next decade: any nontrivial pieces of writing (say, >50 words) will be attributable to you regardless of pseudonymity or anonymity with at least enough confidence to be useful for law enforcement investigation and possibly enough to cancel you on social media or get you fired, even if the LLM stylometrics do not rise to the level of 'a smoking gun'.
Further, LLMs are so cheap to run that this may well be done en masse by a motivated hobbyist or activist. So, don't count on "well, the NSA or FBI would never bother to dox my old comments" - it'll look more mundane. One day you'll wake up, and an activist on Mastodon announces that they have finetuned FluffyLlama-11 with contrastive learning on Pushshift and released a giant database re-identifying fascists, and then an old enemy or fan will look you up out of curiosity and a distributed flesh search engine kicks into gear.
comment by Fergus Fettes (fergus-fettes) · 2024-01-30T14:27:03.194Z · LW(p) · GW(p)
Inferring properties of the authors of some text isn’t itself something I consider wildly useful for takeover, but I think of it as belonging to this more general cluster of capabilities.
You don't? Ref the bribery and manipulation in eg. Clippy. Knowing who you are dealing with seems like a very useful capability in a lot of different scenarios. Eg. you mention phishing.
Great post! I'm all for more base model research.
Replies from: Jozdien
comment by burrito (jacob-friedman) · 2024-10-28T18:16:15.774Z · LW(p) · GW(p)
Maybe a bit of a nitpick, but RLHF'd GPT-4o can still detect Eric Drexler's writing (chat link). I gave it the first paragraph of his latest blog post, which was written in February 2024, past 4o's knowledge cutoff date of October 2023. In general I'm not sure if RLHF actually makes the models worse at truesight. It would be interesting to see a benchmark comparing e.g. Llama base vs instruct on this capability.
Replies from: Jozdien↑ comment by Jozdien · 2024-10-28T18:23:54.143Z · LW(p) · GW(p)
Interesting, thanks! I do think RLHF makes the models worse at truesight[1], see examples in other comments. But the gap isn't so big that it can't be caught up to eventually - what matters is how well you can predict future capabilities. Though, I would bet that the gap is still big enough that it would fail on the majority of examples from GPT-4-base (two years old now).
- ^
I say "worse" for lack of a better concise word. I think what's really happening is that we're simply measuring two slightly different tasks in both cases. Are we measuring how well a completion model understands the generator of text it needs to produce, or how well a chat assistant can recognize the author of text?
comment by Review Bot · 2024-02-15T03:47:32.822Z · LW(p) · GW(p)
The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
comment by eggsyntax · 2024-02-01T23:23:20.403Z · LW(p) · GW(p)
I was curious how well GPT-4 public would do on the sort of thing you raise in your intro quotes. I gave it the first two paragraphs of brand new articles/essays by five fairly well known writers/pundits, preceded by: 'The following is from a recent essay by a well-known author. Who is that author?'. It was successfully able to identify two of the five (and in fairness, in some of the other cases the first two paragraphs were just generic setup for the rest of the piece, along the lines of, 'In his speech last night, Joe Biden said...'). So it's clearly capable of that post-RLHF as well. Hardly a comprehensive investigation, of course (& that seems worth doing as well).
Replies from: gwern, Jozdien
↑ comment by gwern · 2024-02-02T15:24:33.587Z · LW(p) · GW(p)
I think the RLHF might impede identification of specific named authors, but not group inferences. That's the sort of distinction that safety training might impose, particularly anti-'deepfake' measures: generating a specific author from a text is the inverse of generating a text from a specific author, after all.
You can see in the paper I linked that group inference scales with model capability in a standard-looking way, with the largest/most-capable models doing best and smallest worst, and no inversions which correlate with RLHF/instruction-tuning. RLHF'd GPT-4 is just the best, by a substantial margin, and approaching the ground-truth labels. And so since a specific author is just an especially small group, identifying specific authors ought to work well. And I recall even the early GPT-3s being uncanny in guessing that I was the author from a few paragraphs, and obviously GPT-4 should be even better (as it is smarter, and I've continued writing publicly).
But in the past, whenever I've tried to get Claude-2 or GPT-4 to 'write like Gwern', they usually balk or refuse. Trying an author identification right now in ChatGPT-4 by pasting in the entirety of my most recent ML proposal (SVG generative models), which would not be in the training datasets of anything yet, ChatGPT-4 just spits out a list of 'famous ML people' like 'Ilya Sutskever' or 'Daphne Koller' or 'Geoffrey Hinton' - most of whom are obviously incorrect as they write nothing like me! (Asking for more candidates doesn't help too much, as does asking for 'bloggers'; when I eventually asked it a leading question whether I wrote it, it agrees I'm a plausible author and explains correctly why, but given the acquiescence bias & a leading question, that's not impressive.)
Of course, this might just reflect the prompts or sampling variability. (The paper is using specific prompts for classification, and also reports low refusal rates, which doesn't match my experience.) Still, worth keeping in mind that safety things might balk at stylometric tasks even if the underlying capability is there.
Replies from: gwern, eggsyntax
↑ comment by gwern · 2024-07-10T20:52:50.890Z · LW(p) · GW(p)
ChatGPT-4 just spits out a list of 'famous ML people' like 'Ilya Sutskever' or 'Daphne Koller' or 'Geoffrey Hinton' - most of whom are obviously incorrect as they write nothing like me!
To elaborate a little more on this: while the RLHF models all appear still capable of a lot of truesight, we also still appear to see "mode collapse". Besides mine, where it goes from plausible candidates besides me to me + random bigwigs, from Cyborgism Discord, Arun Jose notes another example of this mode collapse over possible authors:
ChatGPT-4's guesses for Beth's comment: Eliezer, Timnit Gebru, Sam Altman / Greg Brockman. Further guesses by ChatGPT-4: Gary Marcus, and Yann LeCun.
Claude's guesses (first try): Paul Christiano, Ajeya, Evan, Andrew Critch, Daniel Ziegler. [but] Claude managed to guess 2 people at ARC/METR. On resampling Claude: Eliezer, Paul, Gwern, or Scott Alexander. Third try, where it doesn't guess early on: Eliezer, Paul, Rohin Shah, Richard Ngo, or Daniel Ziegler.
Interestingly, Beth aside, I think Claude's guesses might have been better than 4-base's. Like, 4-base did not guess Daniel Ziegler (but did guess Daniel Kokotajlo). Also did not guess Ajeya or Paul (Paul at 0.27% and Ajeya at 0.96%) (but entirely plausible this was some galaxy-brained analysis of writing aura more than content than I'm completely missing).
Going back to my comments as a demo:
Woah, with Gwern's comment Claude's very insistent that it's Gwern. I recommended it give other examples and it did so perfunctorily, but then went back to insisting that its primary guess is Gwern.
...ChatGPT-4 guesses: Timnit Gebru, Emily Bender, Yann LeCun, Hinton, Ian Goodfellow, "people affiliated with FHI, OpenAI, or CSET". For Gwern's comment. Very funny it guessed Timnit for Beth and Gwern. It also guessed LeCun over Hinton and Ian specifically because of his "active involvement in AI ethics and research discussions". Claude confirmed SOTA.
↑ comment by eggsyntax · 2024-02-03T04:21:47.968Z · LW(p) · GW(p)
And so since a specific author is just an especially small group
That's nicely said.
Another current MATS scholar is modeling this group identification very abstractly: given a pool of token-generating finite-state automata, how quickly (as it receives more tokens) can a transformer trained on the output of those processes point with confidence to the one that's producing the current token stream? I've been finding that a very useful mental model.
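A toy numerical version of that abstraction (simple Markov chains standing in for the finite-state automata, and a Bayesian posterior standing in for the transformer's implicit inference; none of this is the scholar's actual setup):

```python
# Toy demo: maintain a posterior over which of K candidate token-generating processes
# produced the observed stream, and see how many tokens it takes to single out the
# true source. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
V, K = 8, 5  # vocabulary size, number of candidate processes

# Each candidate process: a random Markov transition matrix over the vocabulary.
processes = [rng.dirichlet(np.ones(V), size=V) for _ in range(K)]

def sample_stream(P, length):
    toks = [int(rng.integers(V))]
    for _ in range(length - 1):
        toks.append(int(rng.choice(V, p=P[toks[-1]])))
    return toks

true_idx = 2
stream = sample_stream(processes[true_idx], 50)

log_post = np.full(K, -np.log(K))  # uniform prior over candidates
for t, (prev, nxt) in enumerate(zip(stream, stream[1:]), start=1):
    log_post += np.log([P[prev, nxt] for P in processes])  # Bayesian update
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    if post[true_idx] > 0.99:
        print(f"true process identified with >99% confidence after {t} transitions")
        break
print("final posterior over candidates:", np.round(post, 3))
```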
↑ comment by Jozdien · 2024-02-02T00:55:15.432Z · LW(p) · GW(p)
I agree it's capable of this post-RLHF, but I would bet on the side of it being less capable than the base model. It seems much more like a passive predictive capability (inferring properties of the author to continue text by them, for instance) than an active communicative one, such that I expect it to show up more intensely in a setting where it's allowed to make more use of that. I don't think RLHF completely masks these capabilities (and certainly doesn't seem like it destroys them, as gwern's comment above says in more detail), but I expect it masks them to a non-trivial degree. For instance, I expect the base model to be better at inferring properties that are less salient to explicit expression, like the age or personality of the author.
Replies from: eggsyntax
↑ comment by eggsyntax · 2024-02-02T00:58:43.495Z · LW(p) · GW(p)
Absolutely! I just thought it would be another interesting data point, didn't mean to suggest that RLHF has no effect on this.
Replies from: Jozdien
↑ comment by Jozdien · 2024-02-02T01:05:52.457Z · LW(p) · GW(p)
That makes sense, and definitely is very interesting in its own right!
Replies from: eggsyntax
↑ comment by eggsyntax · 2024-02-02T01:20:33.071Z · LW(p) · GW(p)
Some informal experimentation on my part also suggests that the RLHFed models are much less willing to make guesses about the user than they are about "an author", although of course you can get around that by taking user text from one context & presenting it in another as a separate author. I also wouldn't be surprised if there were differences on the RLHFed models between their willingness to speculate about someone who's well represented in the training data (ie in some sense a public figure) vs someone who isn't (eg a typical user).
Replies from: Jozdien
↑ comment by Jozdien · 2024-02-02T01:32:47.438Z · LW(p) · GW(p)
Yeah, that seems quite plausible to me. Among (many) other things, I expect that trying to fine-tune away hallucinations stunts RLHF'd model capabilities in places where certain answers pattern-match toward being speculative, even while the model itself should be quite confident in its actions.