New LLM Scaling Law

post by wrmedford · 2025-02-19T20:21:17.475Z

This is a link post for https://github.com/wrmedford/moe-scaling

Hi all, 

I'm an independent researcher, and I believe I've come across a new scaling law for Mixture of Experts models. I'd appreciate any review and critique. The result challenges the notion that performant inference and training require holding all weights in VRAM: it suggests that as long as bus bandwidth is sufficient (as on modern hardware such as NVIDIA's GH200), even NVMe could be a viable place to store expert weights without measurable performance degradation.
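
To make the bandwidth argument concrete, here is a minimal PyTorch sketch of the general idea (my illustration, not code from the linked repo): a top-1-routed MoE layer whose expert weights stay in host memory and are copied to the GPU only when an expert is actually selected. The class name, sizes, and routing scheme are assumptions, and a real system would prefetch and overlap transfers with compute (or page weights in from NVMe) rather than copy synchronously on every forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def _host(t: torch.Tensor) -> torch.Tensor:
    # Pin host memory when a GPU is present so host-to-device copies can be async.
    return t.pin_memory() if torch.cuda.is_available() else t


class OffloadedMoE(nn.Module):
    """Top-1-routed MoE layer whose expert weights live in host memory."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Expert weights are plain CPU tensors, never resident in VRAM.
        self.w_in = _host(torch.randn(n_experts, d_model, d_ff))
        self.w_out = _host(torch.randn(n_experts, d_ff, d_model))

    def forward(self, x):  # x: (tokens, d_model), already on the compute device
        expert_idx = self.router(x).argmax(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for e in expert_idx.unique().tolist():
            # Stream only the selected expert's weights across the bus.
            w_in = self.w_in[e].to(x.device, non_blocking=True)
            w_out = self.w_out[e].to(x.device, non_blocking=True)
            mask = expert_idx == e
            out[mask] = F.relu(x[mask] @ w_in) @ w_out
        return out


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    layer = OffloadedMoE().to(device)  # only the small router moves to the GPU
    tokens = torch.randn(16, 512, device=device)
    print(layer(tokens).shape)  # torch.Size([16, 512])
```

Whether this is free in practice depends on how well the per-expert copy time hides behind compute, which is exactly where high-bandwidth links like the GH200's NVLink-C2C come in.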

I am doing this in my free time on my own dime, so please forgive any mistakes. I promise they were made in good faith. 
