Exploring the Evolution and Migration of Different Layer Embedding in LLMs

post by Ruixuan Huang (sprout_ust) · 2024-03-08T15:01:17.504Z

Contents

  Do embeddings form different spaces?
    Reduction
    Machine learning method
      Ablation Study
        Question 1: Visualization of classification attribution?
        Question 2: Dataset transferability?
        Question 3: Training stability?
        Question 4: Just learned the norm?
    Conclusion
  Further Plans

[Edit on 17th Mar] After running the experiments on more data points (5,000 texts) from the Pile dataset (a broader mix of sample sources), we are confident that the experimental results described in this post are reliable. We have therefore open-sourced the code.

Recently, we conducted several experiments on the evolution and migration of token embeddings across different layers during the forward pass of LLMs. Specifically, our research targeted open-source, decoder-only models such as GPT-2 and Llama-2[1].

Our experiments started from the perspective of an older research topic known as the logit lens [LW · GW]. Applying the unembedding matrix to embeddings from intermediate layers is a clever but under-justified trick. Despite yielding several intriguing observations, the logit lens lacks a solid theoretical foundation, which makes subsequent research built upon it potentially questionable. Moreover, we observed that some published studies have adopted essentially the same method without acknowledging it[2]. A primary point of contention is that the unembedding matrix is tailored to the embeddings of the final layer. If we view the unembedding matrix as a solver, the question space it can solve is, by construction, that of last-layer embeddings. Applying it to embeddings from other layers introduces a generalization problem, and since there is no ground truth for this extrapolation, we cannot conclusively determine whether the decoder's outputs are valid.
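For readers unfamiliar with the logit lens, here is a minimal sketch (not the authors' code) of the operation being discussed: decoding every layer's residual-stream output with the final-layer unembedding matrix of GPT-2 small. The prompt is our own example.

```python
# Minimal logit-lens sketch: apply the last-layer unembedding matrix to the
# hidden states of every layer and look at the top predicted next token.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of 13 tensors for GPT-2 small:
# the layer-0 input embeddings plus the outputs of the 12 blocks.
for layer_idx, h in enumerate(out.hidden_states):
    h = model.transformer.ln_f(h)   # final layer norm, as in the usual logit lens
    logits = model.lm_head(h)       # reuse the last-layer unembedding matrix
    top_token = logits[0, -1].argmax().item()
    print(layer_idx, repr(tokenizer.decode([top_token])))
```

The generalization concern raised above is exactly about the `model.lm_head(h)` step: it is only trained against last-layer hidden states.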

Fig 1.  Hypothesis: Embeddings output from different layers, along with the input token embeddings, each occupy distinct subspaces. The unembedding matrix (or decoder) is only effective within the subspace it was trained on. Applying it to other layers' embeddings is mathematically feasible, but the results may not always make sense.

Next, we'll introduce several experiments about this hypothesis.

Do embeddings form different spaces?

It's crucial to acknowledge that the question of reasonableness may be inherently unanswerable due to the absence of a definitive ground truth. Our hypothesis is merely a side thought inspired by this question. Therefore, related studies such as Tuned Lens[3], which seek to address this question of reasonableness, fall outside the scope of our discussion.

Intuitively, to check whether such subspaces exist, we can visualize embeddings from different layers (as in Fig 1). We tried two approaches: dimensionality reduction and machine learning methods. Reduction gives an intuitive picture of the situation, while machine learning methods may reveal something deeper.

Reduction

We tried several reduction methods, including PCA, UMAP, and t-SNE. These were not chosen systematically but based on experience. UMAP and t-SNE performed very poorly here; after examining their principles and scope of applicability[4], we selected PCA for the demonstrations.

Fig 2.  To avoid any doubt, this figure presents the results of all three reduction methods. How to read it is explained below.

Perhaps we should now clarify how we obtain the data: output embeddings for the different layers only exist once an input is provided. We first ran these experiments on the GPT-2 model family. The most convincing dataset would be WebText, GPT-2's training set, but unfortunately it does not appear to have been made public[5]. So we used the classic text datasets IMDB and AG_NEWS instead.

We randomly selected N datapoints from the dataset. Each datapoint was truncated to a length L, a hyperparameter you can view as the maximum input token length we handle, or as a context window. In Fig 2, N = 400 and L = 50. Taking GPT-2 as an example, it has 12 layers and each embedding has 768 dimensions, so we end up with a tensor of shape (13, N × L, 768).[6] Every vector in the tensor carries a layer label, and all vectors are reduced together, so each layer's slice becomes shape (N × L, 2) for visualization. The colors correspond to different layers, ordered in reverse rainbow.
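The sketch below is our rough reconstruction of this data-collection step, not the released code. It assumes the Hugging Face `datasets` IMDB split and GPT-2 small; the exact sampling and truncation details are guesses.

```python
# Collect per-layer hidden states for N texts, each truncated to L tokens.
import torch
from datasets import load_dataset
from transformers import GPT2Model, GPT2Tokenizer

N, L = 400, 50
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

texts = load_dataset("imdb", split="train").shuffle(seed=0)[:N]["text"]

per_text = []
with torch.no_grad():
    for text in texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=L)
        hs = model(**ids, output_hidden_states=True).hidden_states
        # hs: tuple of 13 tensors (layer-0 input + 12 block outputs),
        # each of shape (1, seq_len, 768)
        per_text.append(torch.stack(hs, dim=0)[:, 0])   # (13, seq_len, 768)

embeddings = torch.cat(per_text, dim=1)   # roughly (13, N*L, 768)
print(embeddings.shape)
```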

At first we tried all three reduction methods. PCA illustrates the ideal result we were hoping for. The t-SNE results look more cluttered, but they still show some of the trends we expect. The UMAP results are completely inadequate, which may be because UMAP itself is not suited to this task[4]. In the following sections we only consider the PCA results.
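For concreteness, here is a minimal sketch of the PCA visualization, assuming the `embeddings` tensor from the previous sketch; the plotting details (marker size, colormap) are our choices, mirroring the reverse-rainbow coloring in Fig 2.

```python
# Reduce all layers' embeddings together to 2-D and color points by layer index.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

X = embeddings.reshape(-1, embeddings.shape[-1]).numpy()        # (num_points, 768)
layers = np.repeat(np.arange(embeddings.shape[0]), embeddings.shape[1])

X2 = PCA(n_components=2).fit_transform(X)                       # (num_points, 2)

plt.scatter(X2[:, 0], X2[:, 1], c=layers, cmap="rainbow_r", s=2)
plt.colorbar(label="layer")
plt.show()
```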

We noticed that the reduced data points of layer 11 lie far from those of layer 12 (the last layer). To understand why, we plotted embedding norm statistics in the upper-left corner of Fig 2: treating each embedding as a vector, we computed the maximum, minimum, and average L2 norm for each layer. The norms of the last-layer embeddings are clearly much larger than those of the other layers, and the overall growth trend is exponential[7]. We then applied a normalization; the result after normalization is much more reasonable, see Fig 3.
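A sketch of the norm statistics and the normalization step, continuing from the `embeddings` tensor above (the exact normalization the authors used is not specified; rescaling each vector to unit norm is our assumption):

```python
# Per-layer L2 norm statistics, then rescale every vector to unit norm
# before repeating the PCA reduction (as in Fig 3).
norms = embeddings.norm(dim=-1)                 # (num_layers, num_tokens)
for i in range(norms.shape[0]):
    print(f"layer {i:2d}: min={norms[i].min().item():.1f} "
          f"mean={norms[i].mean().item():.1f} max={norms[i].max().item():.1f}")

normalized = embeddings / norms.unsqueeze(-1)   # each vector now has L2 norm 1
```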

Fig 3.  The Fig 2 experiments reproduced, with normalization added.

Undoubtedly, the reduced embedding points of each layer form a trend resembling a "drift" (in some papers this abstract phenomenon has been referred to as representation drift; to our knowledge, however, we are the first to visualize it).

We also want to discuss another set of reduction experiments. If you have read the content above and are not particularly interested, you can skip the rest of this section and go straight to the next one.

The experiments above did not include the input embedding of layer 0[6]. What happens if we include it? Some readers may not be aware that GPT-2 family models tie weights: the matrix used to convert a token index into the layer-0 input embedding is the same as the unembedding matrix of the last layer during training. This may be why there seems to be some kind of ring (UMAP and PCA) in the GPT-2 reduction results in Fig 4. Of course, this is just speculation, as the ring does not appear for GPT-2 XL.
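The weight tying can be verified directly in the Hugging Face implementation (a quick sketch, not part of the original experiments):

```python
# GPT-2 ties the input embedding matrix to the unembedding (lm_head) matrix.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
# Same underlying storage => the two matrices are literally shared.
print(model.lm_head.weight.data_ptr() == model.transformer.wte.weight.data_ptr())
```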

Fig 4.  Take layer 0 input embeddings into account.

Machine learning method

The task described above resembles a multi-class classification task. Taking GPT-2 as an example, a 768-dimensional embedding can be treated as the feature vector and the layer index as the label. Can an interpretable model (such as logistic regression or a decision tree) be trained to classify which layer a given 768-dimensional embedding comes from?

Fig 5.  An illustration of this experiment. In a single experiment we use a set of input text corpora (a single corpus is referred to as T) and different inference context windows (w in the figure).

We ran this experiment with the same setup as the reduction part, including the layer-0 input embeddings, and trained a logistic regression (LR) model on the task. To our surprise, the classification accuracy reached an astonishing 92%! The following figure shows classification accuracy for different subject models and context-gathering windows. We assumed that the choice of text dataset would not have a crucial effect here, so we only used the IMDB dataset for this section.
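A minimal sketch of the layer-classification experiment, continuing from the `embeddings` tensor collected earlier; the train/test split and LogisticRegression hyperparameters are our guesses, not the authors' settings.

```python
# Treat every 768-dim vector as a sample and its layer index as the label,
# then fit a linear classifier and measure held-out accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = embeddings.reshape(-1, embeddings.shape[-1]).numpy()
y = np.repeat(np.arange(embeddings.shape[0]), embeddings.shape[1])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("layer-classification accuracy:", clf.score(X_te, y_te))
```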

Fig 6.  Basic-setup experiment results across GPT-2 family models, inference context windows, and numbers of training datapoints. The four panels correspond to the four GPT-2 model sizes. Within each panel, accuracy is grouped by inference context window, shown in different colors.

Let's qualitatively analyze these experimental results.

What does this result indicate? By the usual intuition, the model has far fewer parameters than training datapoints, so embeddings or representations must be high-dimensional and heavily reused; only then can the model be as capable as it is. If there is a clear separation between the hidden spaces of different layers, does that mean each layer's representation only uses a subspace? And does that mean the model still has considerable headroom to gain more capabilities?

Ablation Study

A very natural question at this point is: what did the linear model actually learn? We decided to run some ablation experiments based on the setup (model_name="gpt2-medium", inference_context_window=30, datapoints_num=1000).

Question 1: Visualization of classification attribution?

From the trained linear model we can extract the weights and analyze classification attribution. Are there specific vector dimensions that contribute significantly to the classification of each layer? We computed, for each layer, the percentage contribution of each vector dimension to its prediction.
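One plausible way to compute such contributions from the fitted classifier is sketched below. The definition used here (each dimension's share of the total absolute coefficient mass per class) is our assumption; the authors' exact attribution formula is not specified.

```python
# Per-layer attribution from the logistic-regression coefficients (cf. Fig 7/8).
import numpy as np

coefs = clf.coef_                                   # shape (num_layers, 768)
for layer_idx, w in enumerate(coefs):
    contrib = 100 * np.abs(w) / np.abs(w).sum()     # percentage contribution
    top8 = np.argsort(contrib)[::-1][:8]
    print(f"layer {layer_idx}: dims {top8.tolist()} "
          f"({contrib[top8].round(2).tolist()} %)")

# Cumulative contribution curve for one layer (cf. Fig 8): how many
# dimensions are needed to reach 50% of the total contribution?
sorted_contrib = np.sort(100 * np.abs(coefs[0]) / np.abs(coefs[0]).sum())[::-1]
print("dims for 50%:", int(np.searchsorted(np.cumsum(sorted_contrib), 50)) + 1)
```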

Fig 7.  For each layer, the percentage contribution of each vector dimension to its prediction. We only displayed the top 8 dimensions with the highest contribution for each layer, with positive contributions marked in red and negative contributions marked in blue.

From this analysis we can make a preliminary judgment: it is not easy for a linear model to assign a vector to a particular layer using only a small number of dimensions, because the largest per-dimension contribution is only about 1%. Still, we can draw some conclusions from it and run some extra experiments.

Fig 8.  We sort the weights of each layer in descending order and compute their cumulative contribution percentage. The cumulative curves are almost identical across layers, and the number of features required to reach 50% and 90% of the contribution is also almost the same.

Question 2: Dataset transferability?

Although we previously speculated that the source of the dataset would not critically affect the results, we still want to probe the transferability of this result. Does the classification model previously trained on the IMDB dataset still perform well if we use the AG_NEWS dataset to construct another vector set?

If we switch to AG_NEWS without modifying the trained linear model, classification accuracy drops from 83% to 66%. This does differ from our earlier assumption. What, then, differs between the datasets used for training?
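A sketch of this transfer test: build a second vector set from AG_NEWS with the same procedure and evaluate the IMDB-trained classifier on it. `collect_hidden_states` is a hypothetical helper wrapping the collection loop shown earlier; `clf` is the classifier from the previous sketch.

```python
# Evaluate the IMDB-trained layer classifier on vectors built from AG_NEWS.
import numpy as np
from datasets import load_dataset

ag_texts = load_dataset("ag_news", split="train").shuffle(seed=0)[:1000]["text"]
ag_embeddings = collect_hidden_states(ag_texts)   # hypothetical helper (see earlier loop)

X_ag = ag_embeddings.reshape(-1, ag_embeddings.shape[-1]).numpy()
y_ag = np.repeat(np.arange(ag_embeddings.shape[0]), ag_embeddings.shape[1])
print("IMDB-trained classifier on AG_NEWS vectors:", clf.score(X_ag, y_ag))
```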

Question 3: Training stability?

Although we did not plan additional experiments for this question, the training appears reasonably stable. One piece of evidence is that the experiment in Fig 8 was run multiple times, and the frequently occurring dimensions mentioned earlier (591, 690, 581, and 239) were shared across runs. The model's accuracy also stayed within a consistent range.

Question 4: Just learned the norm?

Since we know that embedding norms grow roughly exponentially across layers[7], is it possible that the model only learned to classify based on the norm of the vectors? This sounds plausible, but even before running the experiment we can observe that the norm ranges of the early layers barely differ; a classifier relying only on norm intervals could not reach such high accuracy.

Fig 9.  Loss and accuracy curves after normalizing each sample to unit norm.
Fig 10.  Experiments on randomly generated vectors whose per-layer mean and variance match the original samples.
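The two controls behind Figs 9 and 10 can be sketched as follows, continuing from the X, y arrays of the classifier sketch. Matching only the scalar per-layer mean and standard deviation in control (b) is our interpretation of the setup described in the caption.

```python
# (a) Retrain after rescaling every sample to unit norm (Fig 9).
# (b) Retrain on random vectors with matched per-layer mean/std (Fig 10).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
Xtr, Xte, ytr, yte = train_test_split(X_unit, y, test_size=0.2, random_state=0)
print("unit-norm accuracy:",
      LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte))

X_rand = np.empty_like(X)
for layer_idx in np.unique(y):
    m = y == layer_idx
    X_rand[m] = np.random.normal(X[m].mean(), X[m].std(), size=X[m].shape)
Xtr, Xte, ytr, yte = train_test_split(X_rand, y, test_size=0.2, random_state=0)
print("random-vector accuracy:",
      LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte))
```

If the classifier were only exploiting norms, the unit-norm control would collapse to near-chance accuracy, which is what Fig 9 is meant to test.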

Conclusion

We have essentially shown that embeddings from different layers occupy different subspaces.

Further Plans

Because I temporarily need to focus on other research topics, I am publishing this post as-is and hope to receive your comments.

Below are some directions for expanding on this topic:

We have explored some concept-level causal analysis related to the logit lens, and we will post again if time allows. We currently do not have time to carefully clean up the code from this article for public release; if there is enough interest, that will motivate us to do so.

If our work benefits your research, please cite it as:

@misc{huang2024exploring,
  title={Exploring the Evolution and Migration of Different Layer Embedding in LLMs},
  author={Huang, Ruixuan and Wu, Tianming and Li, Meng and Zhan, Yi},
  year={2024},
  month={March},
  note={Draft manuscript. Available on LessWrong forum.},
}
  1. ^

    We believe that the insights from these experiments could be applicable to other models as well, provided that embeddings can be obtained from all layers.

  2. ^

    Transformer Feed-Forward Layers Are Key-Value Memories

    Explaining How Transformers Use Context to Build Predictions

  3. ^

    Eliciting Latent Predictions from Transformers with the Tuned Lens

  4. ^

    UMAP seems to ignore the differences in geometric structure between layers, making it unlikely to reveal the phenomenon we are looking for; t-SNE is similar.

  5. ^

    In OpenAI's work Language models can explain neurons in language models, the authors state that they use WebText as the training dataset. Perhaps some WebText datapoints could be pieced together from their publicly available datasets, but we consider that too demanding; we use regular datasets (such as IMDB) instead.

  6. ^

    Why 13 instead of 12? The additional one is the input embedding of layer 0. Please forgive the lack of rigor here: Fig 2 may not have taken the layer-0 input embedding into account, but the more formal results later all do. For now, Fig 2 can be read as an overview.

  7. ^

    This phenomenon was studied by Residual stream norms grow exponentially over the forward pass.

  8. ^

    Maybe related: Matryoshka Representation Learning
