Analyzing the Problem GPT-3 is Trying to Solve

post by adamShimi · 2020-08-06T21:58:56.163Z · LW · GW · 2 comments


  The Promptp Search Problem
  The BestPromptp Optimization Problem
  Solving the Task

I think that taking a theoretical computer science perspective to GPT-3 might help clarify some part of the debate.

My background is in theoretical computer science, and that's my prefered perspective to make sense of the world. So in this post, I try to rephrase the task we ask GPT-3 to solve in terms of search and optimization problems, and I propose different criteria for solving such a task. I thus consider GPT-3 as a black-box, and only ask about how to decide if it is good at its task or not.

Epistemic status: probably nothing new, but maybe an interesting perspective. This is an attempt to understand the intuitions behind different statements about whether GPT-3 "solves some task".

The Search Problem

Let's fix a language (English for example). Then a very basic description of the task thrown at GPT-3 is that it's given a prompt and it needs to give a follow-up text such that is a good answer to the prompt. For example, might be a question about parenthesis or addition or a story to complete, and is GPT-3's answer. I thus define to be this search problem for a given prompt .

Now, what does it mean for an algorithm to solve ? Multiple possibilities come to mind:

Okay, so GPT-3 is solving iff it is a probabilistic solution to it. Is it enough?

Not exactly. In some cases, we don't want a binary choice between good and bad answers; there might be a scale of answers which are more and more correct. That is to say, it should be an optimization problem.

The Optimization Problem

To go from a search problem to an optimization problem, we need a measure of how good the solution is. Let's say we have a function . The optimization problem can then be defined with our probabilistic criterion in multiple ways:

This new criterion feels much closer to what I want GPT-3 to do when I give it a prompt. But I don't know how to write a measure function that captures what I want (after all, that would be a utility function for the task!). We can do without, by instead comparing two answers with each other. I probably have a couple of answers I am waiting for when writing my prompt; and when I receive an answer, I feel that comparing it to one of my expected answer should be possible.

In consequence, I give the following criterion. Assume the existence of a set of expected answers and of a comparison relation between answers . Then

(As Good As Expected) solves if , with .

I am personally happy with that criterion. And I think it is rather uncontroversial (with maybe the subtlety that some people would prefer a in place of the ).

But we're not done yet. This only constrains the algorithm about what it does for a fixed prompt. A big part of the discussion of GPT-3 turns around the number of prompts for GPT-3 should give an interesting answer. This requires a step back.

Solving the Task

When evaluating GPT-3, what I really care about is its answer for a specific task (like addition or causality). And this task can be encoded through many prompts. Let's say that we have a set of prompts that somehow capture our task (from our perspective). What constrains should an algorithm satisfy to be said to solve the task? We already have a meaningful criterion for answering correctly to a prompt: the "As Good as Expected" one. What's left to decide is for which prompts in should satisfy this criterion.

Contrary to the part about solving a prompt, the answer doesn't look obvious to me here. I also believe that much of the disagreement about GPT-3's competence is based on a disagreement about this part.


Thinking about GPT-3, and the sort of problem we want it to solve, through the lens of theoretical computer science, helped me understand slightly better the debate surrounding it. I hope it can be helpful to you too. And I'm also very curious of any error you might find in this post, whether conceptual or technical.


Comments sorted by top scores.

comment by avturchin · 2020-08-07T10:26:38.895Z · LW(p) · GW(p)

What do you think about a possible comparison between GPT-3 and AIXI?

Oversimplification: If AIXI will get a sequence, it will produce a set of hypothesis about the most probable explanations of the sequence and based in them, a set of possible continuations can be generated. GPT-3 misses the hypothesis generating part and directly generate the continuations.

AIXI is known to be a model of truly general intelligence.

Could GPT-3 be used or regarded as an approximation of AIXI?

comment by Zachary Robertson (zachary-robertson) · 2020-08-07T12:52:32.933Z · LW(p) · GW(p)

I've argued [LW · GW] that GPT3 can do a form of boosting if you pair it with an output filter. In terms of the language introduced here we have something like where filters the output with some probability according to . So if we have strong reason to believe that a good enough prompt exists then as we iterate the prompt GPT3 will get better at matching the output which allows for improvement.