xAI releases Grok base model
post by Jacob G-W (g-w1) · 2024-03-18T00:47:47.987Z · 3 comments

This is a link post for https://x.ai/blog/grok-os
We are releasing the base model weights and network architecture of Grok-1, our large language model. Grok-1 is a 314 billion parameter Mixture-of-Experts model trained from scratch by xAI.
This is the raw base model checkpoint from the Grok-1 pre-training phase, which concluded in October 2023. This means that the model is not fine-tuned for any specific application, such as dialogue.
We are releasing the weights and the architecture under the Apache 2.0 license.
To get started with using the model, follow the instructions at github.com/xai-org/grok.
Model Details
- Base model trained on a large amount of text data, not fine-tuned for any particular task.
- 314B parameter Mixture-of-Experts model with 25% of the weights active on a given token.
- Trained from scratch by xAI using a custom training stack on top of JAX and Rust in October 2023.
This is one of the biggest open-source model releases I've seen, and one of the few I've seen that releases the raw base model straight out of pretraining. This is pretty wild stuff!
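To put those numbers in perspective, here is a quick back-of-the-envelope sketch using only the figures quoted above (314B total parameters, ~25% active per token); the bytes-per-parameter values are my own assumptions for illustration, not something stated in the post.

```python
# Rough arithmetic from the quoted figures; not taken from the xAI repo.
TOTAL_PARAMS = 314e9      # 314B-parameter Mixture-of-Experts model
ACTIVE_FRACTION = 0.25    # ~25% of weights active on a given token

# In an MoE, only the experts routed to a token actually run,
# so per-token compute looks more like a ~78B dense model.
active_params = TOTAL_PARAMS * ACTIVE_FRACTION
print(f"Active parameters per token: ~{active_params / 1e9:.1f}B")  # ~78.5B

# Assumed precisions (not stated in the post): bf16 = 2 bytes, int8 = 1 byte.
for precision, bytes_per_param in [("bf16", 2), ("int8", 1)]:
    size_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"Weights at {precision}: ~{size_gb:.0f} GB")  # ~628 GB / ~314 GB
```

In other words, serving the checkpoint still means holding all 314B parameters in memory, even though only about a quarter of them participate in any single token's forward pass.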
3 comments
comment by O O (o-o) · 2024-03-18T07:45:59.804Z
Much larger than I expected for its performance
Reply by Vladimir_Nesov · 2024-03-19T01:18:14.786Z
This way it's probably smarter given its compute, and a more instructive exercise before scaling further, than a smaller model would've been. Makes sense if the aim is to out-scale others more quickly instead of competing at smaller scale, and if this model wasn't meant to last.
comment by Shankar Sivarajan (shankar-sivarajan) · 2024-03-18T05:03:06.026Z
How expensive is the finetuning step relative to the pretraining (in terms of compute, data, labor, or anything else)?
I gather it'd be ~$1000 to "uncensor" a finetuned model, but as mentioned, this might be the first significant model released before finetuning, so I have no intuition for this. Two orders of magnitude more? Three?