The Gallery for Painting Transformations - A GPT-3 Analogy

post by Robert_AIZI · 2023-01-19T23:32:55.994Z · LW · GW · 0 comments


[Target audience: Me from a month ago, people who want a sense of what a transformer is doing on a non-technical level, and people who want to chunk their understanding of transformers.]

Imagine a special art gallery, the Gallery for Painting Transformations (GPT). It takes in a sentence and makes a simple set of paintings to represent it, at first hardly more than stock images. Each morning, a team of artists come in and add to the paintings, and each evening, a snapchat-style filter is applied to each painting individually. Thanks to these efforts, over time the paintings in the gallery grow in meaning and detail, ultimately containing enough information to predict a painting which would be a good addition to the gallery.

This is how GPT-3 works, or at least (I claim) a reasonable analogy for it. In this post I want to flesh out this analogy, with the end result that even non-ML people can get a sense of what large language models are doing. I have tried to make the analogy as accurate as possible, with the notable exception that my examples of semantic meaning are far more human-comprehensible than what GPT-3 is actually doing; in reality, everything it does looks like "perform a seemingly-random vector operation". Any mistakes are intentional creative flourishes of my own[1].

The Paintings

The gallery holds 2048 (n_ctx)[2] paintings, which represent the model’s context window (the prompt you feed into GPT-3). In GPT-3, each word is represented by a 12288 (d_model)-dimensional vector. Fortuitously, that’s exactly enough dimensions to make a 64-by-64 pixel RGB image (64 × 64 × 3 = 12288)!

The gallery first opens when a user feeds in a prompt, and each token (tokens≈words) in the prompt is turned into a painting which looks like a generic stock image, the “token embedding”. If you feed in more or fewer than 2048 tokens, the prompt is trimmed or padded with a special token.
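As a sketch of the opening of the gallery: the code below uses toy sizes and made-up names (`open_gallery`, `PAD_TOKEN`, and the sizes are my stand-ins, not real GPT-3 code), but the real values would be n_ctx = 2048 and d_model = 12288.

```python
import numpy as np

# Toy-sized gallery; GPT-3's real values are n_ctx = 2048 and
# d_model = 12288 (= 64 * 64 * 3, one 64x64 RGB "painting" per token).
N_CTX = 8
D_MODEL = 12
VOCAB_SIZE = 100
PAD_TOKEN = 0          # stand-in id for the special padding token

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(VOCAB_SIZE, D_MODEL))  # one stock image per token

def open_gallery(token_ids):
    """Trim to N_CTX tokens, pad short prompts with the special token,
    then look up each token's 'stock image' embedding."""
    ids = list(token_ids)[:N_CTX]
    ids += [PAD_TOKEN] * (N_CTX - len(ids))
    return embedding_table[ids]        # shape: (N_CTX, D_MODEL)

paintings = open_gallery([5, 17, 42])  # arbitrary token ids
```

With real GPT-3 sizes, each row of `paintings` could be reshaped into a 64-by-64-by-3 image, which is the whole conceit of the gallery.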

How the gallery would look when it opens if the prompt was “a bear eats apples”, consisting of stock images for each word and positional embeddings in the bottom-left of each painting.

The gallery itself has no set order; there are simply paintings that can be viewed in any order. Instead, since word order is sometimes important in sentences, the gallery also paints a little indicator onto each canvas to denote sentence position, the “position embedding”. In the previous image I showed this as a little number drawn in the corner of each painting, though in reality it can be far more complicated. Since the position embedding is part of the painting, it can also be transformed over time.
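Because the indicator is painted onto the same canvas, the combination is just a vector addition. A minimal sketch (toy sizes; GPT-3's real values are 2048 paintings of dimension 12288):

```python
import numpy as np

# Toy sizes standing in for GPT-3's n_ctx = 2048, d_model = 12288.
n_ctx, d_model = 8, 12
rng = np.random.default_rng(1)

token_embeddings = rng.normal(size=(n_ctx, d_model))     # the stock images
position_embeddings = rng.normal(size=(n_ctx, d_model))  # the corner indicators

# The position indicator is painted onto the same canvas as the content:
# a plain vector addition, so later layers can transform it like any
# other part of the painting.
paintings = token_embeddings + position_embeddings
```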

At this point the gallery has paintings, but they’re somewhat simplistic. Over the next 96 (n_layers) days, the gallery transforms the paintings until they are rich in meaning, both individually and as a collection. Each day represents a layer of the network: a new team of artists (attention heads) works during the morning, and a new filter (feed-forward network) is applied each evening.
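The daily routine can be sketched as a loop over layers. The sublayers below are trivial stand-ins just to show the structure (the function names are mine, not GPT-3's); note that each sublayer's output is added onto the existing canvases rather than replacing them:

```python
import numpy as np

def run_gallery(paintings, layers):
    """One 'day' per layer: artists (attention) work in the morning,
    the filter (feed-forward) runs in the evening. Both add their
    changes onto the existing canvases (residual connections)."""
    for attention, feed_forward in layers:
        paintings = paintings + attention(paintings)     # morning brushwork
        paintings = paintings + feed_forward(paintings)  # evening filter
    return paintings

# Toy stand-ins: each sublayer just nudges every painting slightly.
# GPT-3 would have 96 layers, each with real attention and a real filter.
toy_layers = [(lambda p: 0.01 * p, lambda p: 0.01 * p)] * 3
out = run_gallery(np.ones((4, 8)), toy_layers)
```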

The Artists

The artists in this analogy are the transformer’s attention heads. Each day, a new team of 96 (n_heads) artists comes in and adds a few brush strokes to each of the canvases depending on what they see. Each artist has two rules that decide what they paint:

  1. A first rule for what to add to each canvas, based on the contents of the paintings they see.
  2. A second rule for where to direct attention: how strongly each painting should draw on each of the other paintings.

Keep in mind that since these are vectors, you should think of this paint as “additive” with itself and the original canvas (instead of literally “painting over” and rewriting the original values). All artists work simultaneously on all canvases, and the gallery as a whole has been trained so that the artists work in harmony rather than interfering with each other.

Let’s see how this works in practice by following one hypothetical artist. For simplicity, let’s pretend the gallery only contains two paintings (the others could be padding tokens which the artists are told to ignore).

  1. In painting X, a person is climbing a tree, and in painting Y a person is eating an apple.
  2. Our artist is a “consistency-improving” artist who tries to make all the paintings tie into each other, so the first rule tells them that X should draw a similar-looking tree in the background, while Y should draw more apples.
  3. Next, the artist thinks about where attention should be directed. Using their second rule (including the position embeddings suggesting X happens before Y), they decide the attention on X should come .5/.5 from X and Y, respectively, and the attention on Y should come .99/.01 from X and Y, respectively.
  4. Drawing the first-rule images to the strengths determined by the second rule, painting X gets a background tree and some apples in the branches of the foreground tree, while painting Y gets a background tree (and some very faint outlines of ghost apples).
  5. Thanks to this artist, the paintings are now more narratively consistent: instead of an unrelated tree-climbing and apple-eating, the paintings show that someone picked apples from an apple orchard, and then ate the apple while still near the tree.
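The steps above can be sketched numerically. This is a toy illustration with tiny made-up vectors (none of these numbers or names come from GPT-3): the second rule's attention weights mix the first rule's contributions into each canvas.

```python
import numpy as np

# Two paintings, toy 4-dimensional "canvases".
# values[i] is what the artist's first rule would paint from painting i
# (e.g. "tree strokes" from X, "apple strokes" from Y).
values = np.array([
    [1.0, 0.0, 0.0, 0.0],   # from X: tree strokes
    [0.0, 1.0, 0.0, 0.0],   # from Y: apple strokes
])

# The second rule's attention weights: rows are destination paintings,
# columns are source paintings.
attention = np.array([
    [0.50, 0.50],   # painting X draws equally from X and Y
    [0.99, 0.01],   # painting Y draws almost entirely from X
])

# Each painting receives a weighted mix of the first-rule strokes,
# which is then added onto the existing canvas (the residual connection).
strokes = attention @ values
```

Painting X ends up with half-strength tree and apple strokes, while painting Y gets a strong tree and only the "very faint outlines of ghost apples".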

Other artists might perform other tasks: maintaining a good balance of colors, transferring details, art-styles, or themes between canvases, showing cause and effect, or [incomprehensible and seemingly-arbitrary complicated matrix calculation that is somehow essential to the whole network]. Overall, the purpose of the artists is to carry information between canvases.

The Filters

After the artists are finished each day, a filter is applied to each painting. On any particular night the same filter is applied to each painting, but a new filter is used each night.

An example of a filter the gallery might use. The same “apply dog ears and sepia tint” filter would be applied to all images in the gallery that night.

In more technical terms, the filter is a feedforward network consisting of the input painting (width 12288), a single hidden layer (width 49152 = 4 × 12288), and an output layer (width 12288). Writing W_1 and W_2 for the weight matrices, b_1 and b_2 for the bias vectors, and α for the activation function (GPT uses GELU), the output of the filter sublayer is given by α(xW_1 + b_1)W_2 + b_2, where x is size 1-by-12288, W_1 is size 12288-by-49152, b_1 is size 1-by-49152, W_2 is size 49152-by-12288, and b_2 is size 1-by-12288.
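As a sketch, with toy widths standing in for GPT-3's 12288 and 49152, and the common tanh approximation of GELU (the function names here are mine):

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation GPT uses."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def filter_sublayer(x, W1, b1, W2, b2):
    """The nightly filter: alpha(x @ W1 + b1) @ W2 + b2.
    In GPT-3, x is 1-by-12288, W1 is 12288-by-49152, b1 is 1-by-49152,
    W2 is 49152-by-12288, and b2 is 1-by-12288."""
    return gelu(x @ W1 + b1) @ W2 + b2

# Toy-sized demo (real GPT-3 widths are 12288 and 4 * 12288 = 49152):
d_model, d_hidden = 8, 32
rng = np.random.default_rng(2)
W1 = rng.normal(size=(d_model, d_hidden)); b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_model)); b2 = np.zeros(d_model)
out = filter_sublayer(np.ones(d_model), W1, b1, W2, b2)
```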

As with the attention heads, filters use a “residual connection”, meaning that the filter calculates f(x) for each painting x, and returns x + f(x) (not f(x) alone). Two intuitive arguments for why we might want residual connections:

  1. The filter only needs to learn a small refinement on top of what is already on the canvas, rather than reproducing the entire painting from scratch.
  2. If a painting is already good, the filter can output (near-)zero and leave it essentially unchanged, so information is not destroyed just by passing through a layer.
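A minimal sketch of a residual connection (the helper name is mine):

```python
import numpy as np

def with_residual(f, x):
    """Apply a sublayer with a residual connection: return x + f(x),
    so the painting is refined rather than repainted from scratch."""
    return x + f(x)

painting = np.array([1.0, 2.0, 3.0])
refined = with_residual(lambda p: 0.1 * p, painting)
# A sublayer that outputs zero leaves the painting untouched:
untouched = with_residual(lambda p: 0 * p, painting)
```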

What can these filters do? In principle, a great deal: by the universal approximation theorem, they could learn any function of the input given enough width in their hidden layer, and this hidden layer is pretty wide. Their only limitation is that the same filter is applied to every painting at once, independently of the other paintings. I imagine them as generic “improvements” to the paintings: upscaling, integrating the additions of the previous day’s artists, and of course [incomprehensible and seemingly-arbitrary complicated matrix calculation that is somehow essential to the whole network]. Overall, the purpose of the filters is to refine and evolve the paintings in isolation from each other.
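To illustrate the "independently of the other paintings" point, here is a small sketch (toy sizes, ReLU standing in for GELU): shuffling the gallery before applying the filter gives the same result as shuffling afterwards, because the filter never looks across paintings.

```python
import numpy as np

def apply_filter(paintings, W1, b1, W2, b2):
    """Apply the same filter to every painting at once. Because the
    matmul acts row by row, each painting is processed independently."""
    hidden = np.maximum(0, paintings @ W1 + b1)   # ReLU stand-in for GELU
    return hidden @ W2 + b2

rng = np.random.default_rng(3)
d_model, d_hidden = 6, 24
W1 = rng.normal(size=(d_model, d_hidden)); b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_hidden, d_model)); b2 = rng.normal(size=d_model)

paintings = rng.normal(size=(5, d_model))
perm = np.array([3, 1, 4, 0, 2])

# Filter-then-shuffle equals shuffle-then-filter: the filter treats
# each painting in isolation from the rest of the gallery.
a = apply_filter(paintings, W1, b1, W2, b2)[perm]
b = apply_filter(paintings[perm], W1, b1, W2, b2)
```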


And that’s what GPT does! To summarize:

  1. The prompt is turned into 2048 paintings: a stock-image token embedding plus a position embedding for each token.
  2. Over 96 days (layers), teams of artists (attention heads) move information between the paintings each morning, and a filter (feed-forward network) refines each painting individually each evening.
  3. By the end, the paintings are rich enough in meaning to predict a good next painting, i.e. the next token.

There are a few wrinkles I left out, which I’ll touch on here:

  1. ^
  2. ^ Where possible, I will include the parameter name from the GPT-3 paper in addition to the number.

