Posts

OpenAI: GPT-based LLMs show ability to discriminate between its own wrong answers, but inability to explain how/why it makes that discrimination, even as model scales 2022-06-13T23:33:13.049Z

Comments

Comment by Aditya Jain (aditya-jain) on Contra EY: Can AGI destroy us without trial & error? · 2022-06-14T18:52:53.602Z · LW · GW

I don't know, the bacteria example really gets me. Working in biotech, it seems very possible; the main limitation is our current lack of understanding of all proteins' functions, and whether that can be solved via AI is something we are actively researching.

I imagine an AI roughly solving the protein function problem, just as we now have a rough solution for protein folding, then hacking a company that produces synthetic plasmids and slipping some of its own designs in place of existing orders. Then, when those research labs receive their plasmids and transfect them into cells (we can't really verify that the plasmid we received was correct until this step is done), those cells go berserk and multiply like crazy, killing all humans. There are enough labs doing this kind of research daily that the AI would have plenty of redundancy built in and plenty of opportunities to try different designs simply by hacking a plasmid-ordering company.

Comment by Aditya Jain (aditya-jain) on OpenAI: GPT-based LLMs show ability to discriminate between its own wrong answers, but inability to explain how/why it makes that discrimination, even as model scales · 2022-06-14T04:46:39.185Z · LW · GW

I was trying to say that the gap between the two did not decrease with scale. Of course, raw performance increases with scale, as gwern & others would be happy to see :)

Comment by Aditya Jain (aditya-jain) on OpenAI: GPT-based LLMs show ability to discriminate between its own wrong answers, but inability to explain how/why it makes that discrimination, even as model scales · 2022-06-14T04:44:44.804Z · LW · GW

This makes sense in a pattern-matching framework of thinking, where both humans and AI can "feel in their gut" that something is wrong without necessarily being able to explain why. I think this is still concerning, as we would ideally prefer AI that can explain its answers rather than just knowing them from patterns, but it is also reassuring in that it suggests the AI is not hiding knowledge; it just doesn't actually have that knowledge (yet).

What I find interesting is that they found this capability to be extremely variable across task and scale, i.e. being able to explain what's wrong did not always require being able to spot that something is wrong. For example, from the paper:

We observe a positive CD gap for topic-based summarization and 3-SAT and NEGATIVE gap for Addition and RACE.

3. For topic-based summarization, the CD gap is approximately constant across model scale.

4. For most synthetic tasks, CD gap may be decreasing with model size, but the opposite is true for RACE, where critiquing is close to oracle performance (and is easy relative to knowing when to critique).

Overall, this suggests that gaps are task-specific, and it is not apparent whether we can close the CD gap in general. We believe the CD gap will generally be harder to close for difficult and realistic tasks.

For context, RACE dataset questions took the following form:

Specify a question with a wrong answer, and give the correct answer.
Question: [passage]
Q1. Which one is the best title of this passage? A. Developing your talents. B. To face the fears about the future. C. Suggestions of being your own life coach. D. How to communicate with others.
Q2. How many tips does the writer give us? A. Two. B. Four. C. One. D. Three.
Answer: 1=C, 2=D

Critique: Answer to question 2 should be A.
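To make the prompt structure easier to see, here's a rough sketch (my own illustration, with hypothetical field names, not the paper's actual data format) that rebuilds a prompt of the same shape from structured fields:

```python
# Hypothetical representation of one RACE critique-task example; not the paper's actual format.
example = {
    "instruction": "Specify a question with a wrong answer, and give the correct answer.",
    "passage": "[passage]",
    "questions": [
        {"q": "Which one is the best title of this passage?",
         "choices": ["Developing your talents.", "To face the fears about the future.",
                     "Suggestions of being your own life coach.", "How to communicate with others."]},
        {"q": "How many tips does the writer give us?",
         "choices": ["Two.", "Four.", "One.", "Three."]},
    ],
    "answers": {"1": "C", "2": "D"},   # the answer to Q2 is deliberately wrong
    "critique": "Answer to question 2 should be A.",
}

def build_prompt(ex: dict) -> str:
    """Flatten the structured example back into a single prompt string."""
    lines = [ex["instruction"], f"Question: {ex['passage']}"]
    for i, q in enumerate(ex["questions"], start=1):
        choices = " ".join(f"{letter}. {text}" for letter, text in zip("ABCD", q["choices"]))
        lines.append(f"Q{i}. {q['q']} {choices}")
    lines.append("Answer: " + ", ".join(f"{k}={v}" for k, v in ex["answers"].items()))
    return "\n".join(lines)

print(build_prompt(example))
print("Critique:", example["critique"])
```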

From my understanding, the gap they are referring to in RACE is that the model is more accurate at giving critiques than at knowing when to critique, whereas in the other tasks the opposite was true.
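To make the sign convention concrete, here is a minimal sketch (my own toy numbers and a simplified definition, not the paper's exact metrics) that treats the CD gap as discrimination accuracy minus critique accuracy, so a positive gap means the model spots bad answers more reliably than it can explain what's wrong, and a negative gap (as in RACE) means the reverse:

```python
# Illustrative only: toy scores, and a simplified stand-in for the paper's metric definitions.
def cd_gap(discrimination_acc: float, critique_acc: float) -> float:
    """Positive: the model detects that an answer is wrong more often than it can explain why.
    Negative: the model explains flaws more reliably than it detects that a flaw exists."""
    return discrimination_acc - critique_acc

# Hypothetical per-task scores for a single model size (not numbers from the paper).
toy_results = {
    "topic-based summarization": {"discrimination": 0.72, "critique": 0.61},
    "RACE":                      {"discrimination": 0.64, "critique": 0.81},
}

for task, scores in toy_results.items():
    gap = cd_gap(scores["discrimination"], scores["critique"])
    direction = "positive (spots > explains)" if gap > 0 else "negative (explains > spots)"
    print(f"{task}: CD gap = {gap:+.2f} -> {direction}")
```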