SOLAR model paper questions

post by Bartlomiej Lewandowski · 2023-12-29

A recent paper from UpstageAI, a Korean AI lab, has piqued my interest.

The technique described in the paper, depth up-scaling (DUS), has been used to achieve the highest scores on the Open LLM Leaderboard on Hugging Face.

Has anyone read the paper in more detail? I'm struggling to understand the following paragraph, which seems to be the core of the paper:

"One naive way to up-scale the base LLM would be to repeat its layers once more, i.e., from 32 to 64 layers. This has the benefit that from layers 1 to 32 and from layers 33 to 64, there are no heterogeneity as those layers are taken directly from the base LLM. In other words, the ‘layer distance’, or the difference in the layer indices in the base model, is only bigger than 1 where layers 32 and 33 are connected, i.e., at the seam."

What do they mean by "layer distance", and why is it important?
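To make the definition concrete, here is how I currently read it, in terms of base-model layer indices (my own sketch, not code from the paper):

```python
# My reading: the "layer distance" between two consecutive layers in the
# up-scaled model is the gap between their indices in the original base model.
base_indices = list(range(1, 33))        # the 32 layers of the base model
naive = base_indices + base_indices      # naive up-scaling: 64 layers, 1..32 then 1..32

# Distance in base-model indices between each pair of adjacent layers.
distances = [abs(b - a) for a, b in zip(naive, naive[1:])]
print(max(distances))  # 31 -- every adjacent pair has distance 1 except the seam (32 -> 1)
```

If that reading is right, every adjacent pair in the naive up-scaled model comes straight from the base model except the single seam, where base layer 32 feeds into base layer 1.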

They later go on to describe a better approach:

"In the first step of DUS, we take the base model, which is the 32-layer Llama2 architecture with Mistral 7B pretrained weights, and make a copy. Next, we slice off the last 8 layers from the original base model and the first 8 layers from the duplicate. This leaves us with two 24-layer models. In the final step, these models are concatenated to form a depth up-scaled model with 48 layers and 10.7 billion parameters. The decision to remove 8 layers from each model was driven by our target performance-to-size tradeoff. By discarding what would have been the middle layers in the up-scaled model, the layer distance at the seam is reduced as layer 24 of the first model to layer 9 of the second are connected instead of layer 32 and 1, respectively."

Sure, this makes sense: the distance between the layers at the "seam" is much smaller. But this still doesn't shed any light on why the network would perform better.

Does anyone have insights on that?
