Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping

post by RobertKirk · 2023-07-20T09:56:05.574Z · LW · GW · 2 comments

Contents

  Terminology
  Linking terminology together:
  Current Results
  Speculation
    Speculative mechanistic explanation
  Implications
2 comments

TL;DR: I claim that supervised fine-tuning of the largest existing LLMs is likely path-dependent (different random seeds and initialisations affect final performance and model behaviour). The evidence comes from fine-tuning smaller LLMs: models pretrained closer to convergence produce fine-tuned models with similar mechanisms, while models pretrained far from convergence do not, and current large LLMs are very far from convergence at the end of training. This is supported by linking together existing work on model souping, linear mode connectivity, mechanistic similarity and path dependence.

Epistemic status: Written in about two hours, but thought about for longer. Experiments could definitely test these hypotheses.

Acknowledgements: Thanks to Ekdeep Singh Lubana for helpful comments and corrections, and for the discussion which led to this post. Thanks also to Jean Kaddour, Nandi Schoots, Akbir Khan, Laura Ruis and Kyle McDonell for helpful comments, corrections and suggestions on drafts of this post.

Terminology

Linking terminology together:

Overall this gives us this picture of properties a training process can have:

Current Results

  1. Linear Connectivity Reveals Generalization Strategies shows that different fine-tunes of BERT on the same task are often linearly disconnected. In Appendix J they show that this isn’t the case for different fine-tunes of RoBERTa, with the main difference between BERT and RoBERTa being much longer pretraining on more data.
  2. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance shows that different fine-tunes of BERT can get radically different generalisation performance (similar to above).
  3. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time shows that model souping doesn’t improve results for BERT very consistently, but does so slightly more consistently for T5.
  4. Knowledge is a Region in Weight Space for Fine-tuned Models shows that fine-tuning RoBERTa works for model souping, even when fine-tuning on different datasets representing the same underlying task (and retraining the final linear layer). Hence (as in point 1) RoBERTa fine-tuning produces LMC and souping works.
  5. Exploring Mode Connectivity for Pre-trained Language Models finds mode-connectivity for fine-tuned T5 on two NLP tasks across different data orders, random inits, subsampled datasets, and to a lesser extent related tasks (similar to the previous paper). They also show (in figure 6) how later pretraining checkpoints (of a RoBERTa-BASE model) are more likely to lead to LMC.
    1. Note that I find this paper less convincing generally because the experiments are less rigorous (they only train a single pair of models for each experiment); however, it is in line with other works and with my speculation further on.
  6. T5 and RoBERTa are pretrained for significantly longer than BERT - BERT is not converged at the end of pretraining.
  7. Learning to summarize from human feedback appendix C paragraph 5 says that for reward model training they do model selection over 3-10 random seeds, and shows that this improves performance. This implies that this fine-tuning process is quite path-dependent.
    1. Their base model is probably an earlier version of small GPT-3, and was trained for “1-3 epochs” in total. I speculate that the base model is not converged at the end of training, similar to GPT-3.
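
The model souping referred to in items 3 and 4 amounts to averaging the weights of several fine-tuned models parameter by parameter. Below is a minimal sketch of a uniform soup, assuming each model is represented as a plain dictionary of flattened parameter lists. The names `Weights`, `uniform_soup`, `finetune_a` and `finetune_b` are hypothetical stand-ins rather than anything from the cited papers; with real models the same averaging would be applied to the models' parameter tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's parameters: name -> flattened values.
Weights = Dict[str, List[float]]


def uniform_soup(models: List[Weights]) -> Weights:
    """Average each parameter element-wise across models.

    Assumes all models share the same parameter names and shapes,
    e.g. because they are fine-tunes of the same pretrained checkpoint.
    """
    if not models:
        raise ValueError("need at least one model")
    names = models[0].keys()
    for model in models[1:]:
        if model.keys() != names:
            raise ValueError("all models must share parameter names")
    soup: Weights = {}
    for name in names:
        # zip(*...) groups the i-th value of this parameter from every model.
        columns = zip(*(model[name] for model in models))
        soup[name] = [sum(values) / len(models) for values in columns]
    return soup


# Two toy "fine-tunes" of the same architecture.
finetune_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
finetune_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
soup = uniform_soup([finetune_a, finetune_b])
```

Averaging like this only makes sense if the loss stays low between the averaged models, which is why souping results are used here as evidence about linear mode connectivity.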

Takeaway: BERT, and the base models in Learning to summarize from human feedback, are probably not trained to convergence, or even close to it. Here, supervised fine-tuning is path-dependent: different random seeds can get dramatically different results (both for reward modelling and standard NLP fine-tuning). Models that are trained closer to convergence (T5, RoBERTa, the pretrained vision models in the model soup work) show more gains from model souping, and hence the supervised fine-tuning process produces LMC models and is therefore likely path-independent. Note that this is still only true for reasonable learning rates: if you pick a very large LR then you can end up with a model in a different loss basin, and hence not LMC and not mechanistically similar.
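
To make the LMC claims above concrete, here is a minimal sketch of how one might check whether two sets of weights are linearly mode connected: evaluate the loss along the straight line between them and measure how far it rises above the straight line between the endpoint losses. This is one common way of defining a loss barrier, not necessarily the exact definition used in each of the cited papers. The function names and the two toy loss functions (`double_well`, `bowl`) are illustrative assumptions; in practice `loss_fn` would evaluate a model on held-out data.

```python
from typing import Callable, List

Theta = List[float]  # hypothetical flattened weight vector


def interpolate(theta_a: Theta, theta_b: Theta, alpha: float) -> Theta:
    """Point at fraction alpha along the line from theta_a to theta_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(
    theta_a: Theta,
    theta_b: Theta,
    loss_fn: Callable[[Theta], float],
    num_points: int = 11,
) -> float:
    """Largest excess of the path loss over the interpolated endpoint losses.

    A barrier near zero suggests the two solutions are linearly mode connected.
    """
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(num_points):
        alpha = i / (num_points - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def double_well(theta: Theta) -> float:
    """Toy loss with two separate basins, at w = -1 and w = +1."""
    return sum((w * w - 1.0) ** 2 for w in theta)


def bowl(theta: Theta) -> float:
    """Toy convex loss with a single basin at w = 0."""
    return sum(w * w for w in theta)


# Minima in different basins: a barrier appears at the midpoint.
barrier_disconnected = loss_barrier([-1.0], [1.0], double_well)
# Two points in the same convex basin: no barrier.
barrier_connected = loss_barrier([0.5], [1.5], bowl)
```

In the toy case the two double-well minima have a barrier of height 1 between them, while the two points in the convex bowl have none.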

Speculation

Existing large language models are trained for only a single epoch because we have enough data, and this is the compute-optimal way to train these models. This means they’re not trained until convergence, and hence are more like BERT than RoBERTa or T5. Hence, supervised fine-tuning of these models will be a path-dependent process: different runs will get different models that are using different predictive mechanisms, and hence will generalise differently out-of-distribution. Larger learning rates may also lead to more path dependence. This provides a more fine-grained and supported view than Speculation on Path-Dependance in Large Language Models. [AF · GW]

Speculative mechanistic explanation

The pretrained model infers many features which are useful for performing the fine-tuning task. There are many ways of utilising these features, and in utilising them during fine-tuning they will likely be changed or adjusted. There are many combinations of features that all achieve similar performance in-distribution (remember that neural networks can memorise random labels perfectly; in fine-tuning we’re heavily overparameterised), but they’ll perform very differently out-of-distribution.

If the model is more heavily trained during pre-training, it’s likely a single set of features will stand out as being the most predictive during fine-tuning, so will be used by all fine-tuning training runs. From a loss landscape perspective, the more heavily pre-trained model is deeper into a loss basin, and if the fine-tuning task is at least somewhat complementary to the pretraining task, then this loss basin will be similar for the fine-tuning task, and hence different fine-tunes are likely to also reside in that same basin, and hence be LMC.
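
The loss landscape picture above can be caricatured in one dimension. The sketch below is a toy illustration under strong assumptions, not an experiment on real models: the "fine-tuning loss" is a double well with minima at w = -1 and w = +1, "fine-tuning" is noisy gradient descent, and the amount of pretraining is represented only by where the run starts (on the ridge at w = 0, or already deep in one basin at w = 0.9). All names and hyperparameters are hypothetical.

```python
import random


def grad(w: float) -> float:
    """Gradient of the toy double-well loss (w^2 - 1)^2."""
    return 4.0 * w * (w * w - 1.0)


def noisy_descent(
    w0: float,
    seed: int,
    steps: int = 200,
    lr: float = 0.05,
    noise_std: float = 0.5,
) -> float:
    """Gradient descent with seeded Gaussian gradient noise, from init w0."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        w -= lr * (grad(w) + rng.gauss(0.0, noise_std))
    return w


seeds = range(20)
# "Under-trained" init on the ridge between the two basins.
from_ridge = [noisy_descent(0.0, seed) for seed in seeds]
# "Well-trained" init already deep inside the w = +1 basin.
from_basin = [noisy_descent(0.9, seed) for seed in seeds]

# Which basin each run ends in (True means the w = +1 basin).
basins_from_ridge = {w > 0 for w in from_ridge}
basins_from_basin = {w > 0 for w in from_basin}
```

Runs started deep in a basin stay there regardless of seed, whereas for runs started on the ridge the noise decides which basin each one falls into, so different seeds would be expected to split between the two. This is only a cartoon of the argument, and it says nothing about how high-dimensional loss landscapes actually behave.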

Implications

2 comments

Comments sorted by top scores.

comment by Kushal Thaman (Kushal_Thaman) · 2024-01-03T06:28:25.554Z · LW(p) · GW(p)

Thanks for the post! Do you think there is an amount of pretraining you can do such that no fine-tuning (on a completely non-complementary task, away from pre-trained distribution, say) will let you push out of that loss basin? A 'point of no return' s.t. even for very large values of LR and amount of fine-tuning you will get a network that is still LMC?

Replies from: RobertKirk
comment by RobertKirk · 2024-01-08T09:16:03.453Z · LW(p) · GW(p)

I think a point of no return exists if you only use small LRs. I think if you can use any LR (or any LR schedule) then you can definitely jump out of the loss basin. You could imagine just choosing a really large LR to basically reset to a random init and then starting again.

I do think that if you want to utilise the pretrained model effectively, you likely want to stay in the same loss basin during fine-tuning.