Jan 17, 2024
S-LoRA describes a set of optimizations for running thousands of separate LLMs simultaneously on the same GPU. At OpenPipe we’ve been running S-LoRA in production since January 4th, which critically allowed us to eliminate the cold-start problem for infrequently-used models. I wanted to share some of our learnings from the implementation process here!
But first, here’s the average cold-start response time we’re seeing after enabling the S-LoRA based pipeline:
The Problem of Weights
Modern LLMs require a lot of GPU RAM. A “small” model like Mistral 7B requires 14GB of RAM just to hold the weights, in addition to the working memory required for the KV cache, which can be multiple GB for long sequences. This means that even a very beefy GPU like an A100-40GB only has room to load one or maybe two 7B LLMs in RAM at once. Quantization can reduce the required memory, but it also leads to decreased throughput, and often hurts response quality as well.
This is not really a problem if you’re using one general-purpose model for everything, and just steering its behavior via prompting. In that case you can just load up your model on one GPU and call it a day. But fine-tuning is a far more reliable way of directing model behavior than prompting. Concretely, we’ve found that 7B models fine-tuned on a good dataset consistently outperform prompted GPT-3.5 (20B parameters), and even come within striking distance of GPT-4 (1.7T parameters)!
The downside, of course, is that now you have to figure out how to serve all those task-specific fine-tuned models efficiently. Spinning up a dedicated GPU for each model is a non-starter because it leads to low GPU utilization, which is an existential issue because of how expensive GPU time is ($2+/hr for an A100). How do we square the circle?
Serving all the models everywhere all at once
First, a bit of background: in 2021 a new fine-tuning method called LoRA was published. The key insight is that fine-tuning only a tiny fraction of the base model’s weights can give you similar results to fine-tuning all of them, since you want your fine-tuned model to keep most of the world understanding and reasoning ability of its base. The LoRA technique involves cleverly inserting extra adapter layers in a few carefully-selected locations and only fine-tuning those. These adapters are analogous to a “git diff” that encodes only the difference in weights between the base model and your fine-tune.
These adapters can be tiny. In OpenPipe’s case, our Mistral adapters are 80MB each, only 0.5% the size of the 14GB base model. This immediately points to the shape of the solution: is it possible to load many adapters from the same base model onto one GPU and use them simultaneously, efficiently?
It turns out the answer is “yes”! Two influential papers from late 2023 help define the solution.
Punica implements a clever CUDA kernel that is able to batch-process requests from many LoRA adapters simultaneously. This custom kernel is essential, because the naive approach taken by most libraries pre-Punica required swapping adapters for each request, eliminating the critical throughput increases from serving many requests in parallel.
S-LoRA builds on Punica and adds a tiered caching architecture. It dynamically stores the most-recently-used adapters in GPU RAM, less-recently-used adapters in system RAM, and the least-recently-used adapters on disk. For a typical setup with 10GB of available GPU RAM and 1TB of system RAM, S-LoRA might store 125 adapters on the GPU and over 10K in system RAM. The overhead of restoring an adapter from system RAM to the GPU is negligible in practice; an A100 has 31GB/s of interconnect bandwidth so an 80MB adapter can be transferred in 2.4ms. This can happen in parallel with serving other requests.
This actually works!
On January 4th we deployed an experimental inference pipeline based on a vLLM fork that implements the relevant optimizations. After manually moving a few models over and closely monitoring performance, we enabled the pipeline for all new models on January 10th, and began porting over old models as well.
Over the course of this transition, the average number of GPUs in use has dropped by over 70%, even as the number of requests we serve has continued increasing! Our average response time for models coming up from a cold start (ie weights not already loaded onto a GPU) decreased from 45 seconds to 1 second, giving customers a lot more flexibility to deploy many small specialist models. And ultimately, that’s exactly what we’re here to do. 🙂