Ben Livengood's Shortform

post by Ben Livengood (ben-livengood) · 2023-02-20T18:07:29.378Z · LW · GW · 1 comments

Comments sorted by top scores.

comment by Ben Livengood (ben-livengood) · 2023-02-20T18:07:29.660Z · LW(p) · GW(p)

https://github.com/Ying1123/FlexGen is a way to run large (175B-parameter) LLMs on a single GPU at ~1 token/s, which I think puts it within reach of many hobbyists, and I predict we'll see an explosion of new capability research in the next few months.

I haven't had a chance to dig into the code, but presumably it could also be modified to allow local fine-tuning of large models at a slow but potentially useful rate.

I'm curious whether any insights will make their way back to the large GPU clusters. From my cursory glance, there don't seem to be throughput or latency advantages unless weight compression can be used to run the entire model on fewer GPUs, e.g. by swapping layer weights in and out and caching layer outputs during batched inference.
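The layer-swapping idea above can be sketched as a toy. This is a hypothetical illustration of the general offloading pattern, not FlexGen's actual code: weights for all layers live in slow storage (here, a plain dict standing in for CPU RAM or disk), only one layer is "resident" at a time, and the whole batch is pushed through that layer before the next one is loaded, so each slow weight transfer is amortized over the full batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": 4 linear+ReLU layers whose weights are offloaded.
# The dict stands in for CPU RAM or disk; names here are illustrative.
HIDDEN = 8
N_LAYERS = 4
offloaded_weights = {
    i: rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
    for i in range(N_LAYERS)
}

def load_layer(i):
    """Stand-in for copying one layer's weights onto the GPU (the slow step)."""
    return offloaded_weights[i]

def batched_offloaded_forward(batch):
    """Run the entire batch through layer i before loading layer i+1.

    This is the key scheduling trick: one weight transfer per layer per
    batch, rather than one per layer per sample, with the intermediate
    layer outputs cached for the whole batch between loads.
    """
    acts = batch
    for i in range(N_LAYERS):
        w = load_layer(i)                 # one transfer, amortized over the batch
        acts = np.maximum(acts @ w, 0.0)  # toy compute step for this layer
    return acts

batch = rng.standard_normal((16, HIDDEN))
out = batched_offloaded_forward(batch)
```

The output is identical to processing each sample independently; only the order of weight loads changes, which is what makes slow offloading storage tolerable for batched inference.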