Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small

post by Haoxing Du (haoxing-du), Buck · 2022-10-12T21:25:00.459Z · LW · GW · 11 comments

Some of Redwood’s current research involves finding specific behaviors that language models exhibit, and then doing interpretability work to explain how the model implements these behaviors. One example is the indirect object identification (IOI) behavior, investigated in a forthcoming paper of ours: given the input “When John and Mary went to the store, Mary gave a flower to”, the model completes “John” instead of “Mary”. Another example is the acronym generation task: given the input “In a statement released by the Big Government Agency (”, the model completes “BGA)”.

We are considering scaling up this line of research a bunch, and that means we need a lot more behaviors to investigate! The ideal tasks that we are looking for have the following properties:

  1. The task arises in a subset of the training distribution. Both the IOI and the acronym tasks are abundant in the training corpus of language models. This means that we are less interested in tasks specific to inputs that never appear in the training distribution.
    1. The ideal task can even be expressed as a regular expression that can be run on the training corpus to obtain the exact subset, as is the case for acronyms. The IOI task is less ideal in this sense, since it is harder to identify the exact subset of the training distribution that involves IOI.
  2. There is a simple heuristic for the task. For IOI, the heuristic is “fill in the name that appeared only once so far”. For acronyms, the heuristic is “string together the first letter of each capitalized word, and then close the parentheses”. Note that the heuristic does not have to apply to every instance of the task in the training distribution, e.g. sometimes an acronym is not formed by simply taking the first letter of each word.
    1. The gold standard here is if the heuristic can be implemented in an automated way, e.g. as a Python function, but we would also consider the task if a human is needed to supply the labels.
  3. GPT-2 small implements this heuristic. We are focusing on the smallest model in the GPT-2 family right now, which is a 117M parameter model that is frankly not that good at most things. We are moderately interested in tasks that bigger models can do that GPT-2 small can’t, but the bar is at GPT-2 small right now.
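
To make the first two properties concrete, here is a minimal sketch of what "a regular expression plus a Python heuristic" could look like for the acronym task, together with the IOI heuristic. The regex, the function names, and the explicit list of candidate names are all illustrative assumptions, not the code used in any actual analysis.

```python
import re
from collections import Counter

# Assumed pattern: two or more capitalized words, then " (ACRONYM)".
# A real corpus filter would likely need to be more permissive.
ACRONYM_RE = re.compile(r"((?:[A-Z][a-z]+ ){1,}[A-Z][a-z]+) \(([A-Z]{2,})\)")


def find_acronym_examples(text):
    """Return (phrase, acronym) pairs where the acronym matches the initials."""
    examples = []
    for match in ACRONYM_RE.finditer(text):
        words, acronym = match.group(1).split(), match.group(2)
        initials = "".join(word[0] for word in words)
        if initials.endswith(acronym):
            # Keep only the trailing words that the acronym accounts for.
            examples.append((" ".join(words[-len(acronym):]), acronym))
    return examples


def acronym_heuristic(prompt):
    """String together the first letters of the capitalized words that
    directly precede the trailing " (", then close the parenthesis."""
    if not prompt.endswith(" ("):
        return None
    initials = []
    for word in reversed(prompt[:-2].split()):
        if word[0].isupper() and word[1:].islower():
            initials.append(word[0])
        else:
            break
    return "".join(reversed(initials)) + ")"


def ioi_heuristic(prompt, names):
    """Fill in the candidate name that has appeared exactly once so far."""
    counts = Counter(token.strip(",.") for token in prompt.split())
    once = [name for name in names if counts[name] == 1]
    return once[0] if len(once) == 1 else None
```

Note that the IOI sketch needs the candidate names supplied by hand, which reflects why IOI is harder to extract from a corpus automatically than acronyms are.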

Examples

The following is a list of tasks that we have found so far/are aware of. Induction and acronym generation remain the tasks that best meet all of the above desiderata.
 

Some examples that we are less excited about include:

We would love for interested people to contribute ideas! Below are some resources we put together to make the search as easy as possible:

Notes

11 comments

comment by Unnamed · 2022-10-13T00:11:14.815Z · LW(p) · GW(p)

Unit conversion, such as

"Fresno is 204 miles (329 km) northwest of Los Angeles and 162 miles (" -> 261 km)

"Fresno is 204 miles (329 km) northwest of Los Angeles and has an average temperature of 64 F (" -> 18 C)

"Fresno is 204 miles (" -> 329 km)

Results: 1, 2, 3. It mostly gets the format right (but not the right numbers).
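
If someone wanted automated labels for this task, a rough sketch could look like the following. The function name and the prompt pattern (a number, then "miles" or "F", then an open parenthesis at the very end) are assumptions for illustration. Real text may round or measure differently, so any comparison against model output would probably need a tolerance.

```python
import re

MILES_TO_KM = 1.609344


def reference_completion(prompt):
    """Return the converted value for a prompt ending in '<n> miles (' or '<n> F ('."""
    match = re.search(r"(\d+(?:\.\d+)?) (miles|F) \($", prompt)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    if unit == "miles":
        return f"{round(value * MILES_TO_KM)} km)"
    # Fahrenheit to Celsius.
    return f"{round((value - 32) * 5 / 9)} C)"
```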

Replies from: haoxing-du
comment by Haoxing Du (haoxing-du) · 2022-10-17T06:05:58.505Z · LW(p) · GW(p)

This is an interesting one! It looks like there might be some very rough heuristics going on for the number part as well, e.g. the model knows the number in km is almost definitely 3 digits.

comment by Logan Riggs (elriggs) · 2022-10-13T17:07:24.851Z · LW(p) · GW(p)

Reversing text w/ 1 example:

"Mike is large -> large is Mike
Bob is cute -> cute is"

Also works w/ numbers (but I had trouble getting it to reverse 3 digits at a time):
"3 6 -> 6 3
2 88 ->"
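
As a sketch of how the reversal heuristic could be written down as a labeling function (the function name and the whitespace-based word splitting are assumptions, and this works on words rather than GPT-2 tokens):

```python
def reversal_next_word(prompt):
    """Given a few-shot prompt whose last line is 'a b c -> c b',
    return the next word of the reversed sequence, or None if not applicable."""
    last_line = prompt.strip().splitlines()[-1]
    source, arrow, partial = last_line.partition("->")
    if not arrow:
        return None
    target = list(reversed(source.split()))
    done = partial.split()
    # The partial output must be a proper prefix of the reversed source.
    if done != target[:len(done)] or len(done) >= len(target):
        return None
    return target[len(done)]
```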

Ignoring a zero

"1 + 1 = 0 + 2
2 + 2 = 0 + 4
3 + 3 = 0 +"

This also worked when replacing 0 w/ "pig", but changing it to "df" made it predict " 5" as the answer, which I think is because it just wants to count up from the previous answer, 4.

Parallel structure w/ Independent

For each of the following, the model predicts a "." at the end. 

I eat spaghetti, yet she eats pizza
I slept, for I was sleepy
I can eat, or I can sleep
I love my dog, and my dog loves me
I can neither eat, nor can I sleep

Some n-grams overpower this effect. In the above "yet she eats ice" will be followed by " cream". "I was sleepy, so I slept" will be followed by " in".

Three Items in a List

"She ate the cookies, cake" will be followed by "," and then " and".

 

[Note: the language modelling game and the GPT-2 small search tool were very useful]

Replies from: haoxing-du
comment by Haoxing Du (haoxing-du) · 2022-10-17T06:09:44.585Z · LW(p) · GW(p)

Thanks for contributing these! I'm not sure I understand the one about ignoring a zero: is the idea that it can not only do normal addition, but also addition in the format with a zero?

comment by Chase Carter · 2022-10-14T20:22:24.187Z · LW(p) · GW(p)

Completing Incomplete Quotations

Pattern: ["<incomplete quoted statement>," <descriptor of speaker> said,] -> ["<completion of sentence following from previous quotation><...>]
Example: ["When the truth is replaced by silence," the Soviet dissenter said,] -> [ "it will be impossible to hold securely everything.] (prediction starts with [ "] ~71% of the time)

The next token will be [ "] ~45-70% of the time when the original quotation is obviously incomplete.
When the original quotation looks more like a complete sentence, the next token will be [ "] only ~5-20% of the time (see counterexample below).

Counter Example (initial quotation is a complete statement; in this case removed 'When'):
["The truth is replaced by silence," the Soviet dissenter said,] -> [adding that the TV show was a farcical] (prediction starts with [ "] only ~12% of the time)

'From' - 'To' Numeric Symmetry

Pattern: [from <member of numeric class> to] -> [ <different member of numeric class>]
Examples:
[from 1874 to] -> [ 1882]
[from March 34, 1999 to] -> [ May 12, 2004]
[from 5:40 am to] -> [ 8:00 am]
[from 30 degrees to] -> [ 100 degrees]
[from 89 to] -> [ 93]
[from 154 to] -> [ 195]
[from 12539 to] -> [ 13114]
[from 2,631,254,399 to] -> [ 3,021,133,526]

Maintains symmetry between plausible years/dates/times/temperatures. In the case of dates/times, it is heavily biased towards predicting a higher value after 'to' (as would be expected from the training corpus). It also maintains symmetry in the number of digits for arbitrary numbers that don't fall into an obvious class, though this starts losing exactness past 5 digits (but still remains roughly symmetric). Interestingly, exactness of the number of digits for larger numbers improves substantially when commas are added to the number (e.g. 1,000,000).
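
One way to check this symmetry automatically would be to compare the "shape" of the text before and after 'to'. The sketch below is an assumption about how to operationalize "same numeric class" (letter runs and digit runs are collapsed into placeholders); both helper names are made up for illustration.

```python
import re


def numeric_class(text):
    """Collapse letter runs to 'a' and digit runs to '9', keeping punctuation."""
    shape = re.sub(r"[A-Za-z]+", "a", text.strip())
    return re.sub(r"\d+", "9", shape)


def digit_count(text):
    """Number of digit characters, for the exact-length comparison."""
    return sum(ch.isdigit() for ch in text)
```

For example, `numeric_class("5:40 am")` and `numeric_class(" 8:00 am")` both give `"9:9 a"`, while `digit_count` can be used to measure how often the model preserves the exact number of digits for longer numbers.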

Syntactically Correct HTTP URL Generation/Completion

Pattern: [https://] -> [<syntactically valid and real-looking URL containing a domain, resource, sometimes query parameters, etc>]
Examples:
[https://] -> [www.parks.org/programs/]
[http://wowthisissocool] -> [380.blogspot.com/2015/03/]
[https://ibetthiswillgetqueryparams.com] -> [/submit?inc=false&type=Out]

Beyond being merely syntactically valid, common URL resource nesting patterns are observed, like the [/<year>/<month>/] pattern above, or [/<resource>/<id>].

Replies from: haoxing-du
comment by Haoxing Du (haoxing-du) · 2022-10-17T06:15:28.345Z · LW(p) · GW(p)

Thanks for these! I love the 'from' -> 'to' one: it seems GPT-2 small clearly knows the rough ordering of numbers in various formats, although when I was playing with it and trying to get it to do addition in real-life settings, it appeared quite bad at actually knowing how numbers work.

comment by Jérémy Scheurer (JerrySch) · 2022-10-13T11:27:44.591Z · LW(p) · GW(p)

"Either", "or" pairs in text. 
Heuristic. If the word either appears in a sentence, wait for the comma and then add an " or". 
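
A rough sketch of this heuristic as a Python function, assuming a simple punctuation-based sentence split and a made-up function name. It returns True when the heuristic says the next token should be " or":

```python
import re


def predicts_or(prompt):
    """True if 'either' occurs in the current sentence, no 'or' has followed it,
    and the prompt ends at a comma."""
    # Only consider the last (current) sentence of the prompt.
    sentence = re.split(r"(?<=[.?!])\s+", prompt)[-1]
    match = re.search(r"\beither\b", sentence, flags=re.IGNORECASE)
    if match is None:
        return False
    rest = sentence[match.end():]
    return rest.endswith(",") and " or " not in rest
```

Restricting the check to the current sentence is what keeps a prompt that ends in a bare "Either" after a question from triggering the rule.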

What follows are a few examples. Note that the completion is just something I randomly came up with; the important part is the " or". Using the webapp, GPT-2 puts a high probability (around 40%-60%) on the token " or". 

"Either you take a left at the next intersection," -> or take a left after that. 
"Either you go to the cinema," -> or you stay at home. 
"Tonight I could either order some food," -> or cook something myself.

Counter example: 
"Do you rather want to go to Portugal or Italy? Either" -> way is fine./one is fine. (GPT-2 puts a lot of probability on " way", and barely any on " or", which is correct).
 

Replies from: haoxing-du
comment by Haoxing Du (haoxing-du) · 2022-10-17T06:17:44.641Z · LW(p) · GW(p)

Thanks! There are probably other grammatical structures in English that require a bit of algorithmic thinking like this one as well.

comment by lberglund (brglnd) · 2022-10-17T09:49:57.992Z · LW(p) · GW(p)

I found some behaviors, but I'm not sure this is what you are looking for because the algorithm in both is quite simple. I'd appreciate feedback on them.

Incrementing days of the week

"If today is Monday, tomorrow is Tuesday. If today is Wednesday, tomorrow is" -> "Thursday" 

"If today is Monday, tomorrow is Tuesday. If today is Thursday, tomorrow is" -> "Friday" 

etc.
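
The heuristic here is simple enough to write as an automatic labeler. The sketch below is an assumption about the prompt format (the prompt must end in "tomorrow is", and the last weekday mentioned is taken as "today"); the function name is made up.

```python
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def next_day_label(prompt):
    """Return the day after the last weekday mentioned in the prompt."""
    mentioned = re.findall("|".join(DAYS), prompt)
    if not mentioned or not prompt.rstrip().endswith("tomorrow is"):
        return None
    # Wrap around from Sunday to Monday.
    return DAYS[(DAYS.index(mentioned[-1]) + 1) % len(DAYS)]
```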

This also works with zero-shot prompting, although the effect isn't as strong, e.g.:

"If today is Friday, tomorrow is" -> "Saturday"

Inferring gender

"Lisa is great. I really like" -> "her"

"John is great. I really like" -> "him"

etc.

comment by lberglund (brglnd) · 2022-10-16T10:46:21.454Z · LW(p) · GW(p)

[Typo]

Replies from: haoxing-du