
From Local to Cloud: Estimating GPU Resources for Open-Source LLMs

Estimating GPU memory for deploying the latest open-source LLMs


If you’re like me, you probably get excited about the latest and greatest open-source LLMs, from Llama 3 to the more compact Phi-3 Mini. But before you jump into deploying your language model, there’s one crucial factor you need to plan for: GPU memory. Misjudge it, and your shiny new web app might choke, run sluggishly, or rack up hefty cloud bills. To make things easier, I’ll explain what quantization is and share a 2024 GPU Memory Planning Cheat Sheet: a handy summary of the latest open-source LLMs on the market and what you need to know before deployment.


Why Bother Estimating GPU Memory?

When deploying LLMs, guessing how much GPU memory you need is risky. Too little, and your model crashes. Too much, and you’re burning money for no reason.

Understanding these memory requirements upfront is like knowing how much luggage you can fit in your car before a road trip – it saves headaches and keeps things efficient.

Quantization: What’s It For?

Quantization impacts the "brain" of an LLM by simplifying the numerical precision of its weights, which are key to how the model generates text and makes decisions.

  1. Memory and Speed Boost: Reducing from 32-bit to 16-bit or 8-bit precision cuts down memory usage and speeds up inference, making deployment on limited GPUs more efficient. It’s like lightening the brain’s load to think faster.
  2. Trade-offs in "Thinking Power": With simpler, less precise weights, the model might lose some of its ability to handle complex or nuanced tasks, leading to less accurate or lower-quality outputs.
  3. Balancing Efficiency and Accuracy: For most applications, such as text summarization, this precision loss is barely noticeable. But for tasks that require fine detail, such as complex multi-step reasoning, the impact can be more significant. The sketch after this list shows what requesting 8-bit weights looks like in practice.
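To make this concrete, here is a minimal sketch of loading a model in 8-bit precision with the Hugging Face transformers library and bitsandbytes. The model name and available GPU are illustrative assumptions, and exact arguments can vary between library versions.

```python
# Minimal sketch: loading an open-source LLM with 8-bit weights.
# Assumes transformers and bitsandbytes are installed; the model ID below
# is only an illustrative choice.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # any causal LM from the Hub

# Request 8-bit weights instead of the default 16/32-bit precision.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across the available GPU(s)
)

# Roughly half the footprint of the 16-bit version of the same model.
print(f"Approximate footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```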

Estimating GPU Memory

To estimate the GPU memory (M) required for an LLM, use the following formula:

M = (P × 4) / (32 / Q) × 1.2

Where:

  • M: GPU memory in gigabytes (GB)
  • P: Number of model parameters in billions
  • 4: Bytes used per parameter at full 32-bit precision
  • Q: Bit precision used for deployment (e.g., 8, 16, or 32 bits)
  • 1.2: A 20% overhead factor for additional memory needs, such as activations and the KV cache
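The formula is easy to turn into a few lines of Python. The helper below is a minimal sketch that mirrors it; the function name and the sample model sizes are my own illustrative choices.

```python
def estimate_gpu_memory(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (GB) needed to serve an LLM.

    params_billion: number of parameters, in billions (P)
    bits:           precision used for the weights (Q), e.g. 8, 16, 32
    overhead:       multiplier for the 20% buffer (activations, KV cache, etc.)
    """
    bytes_per_param_fp32 = 4  # 32-bit weights take 4 bytes each
    return (params_billion * bytes_per_param_fp32) / (32 / bits) * overhead

# Illustrative check with an 8B-parameter model (e.g. Llama 3 8B):
print(estimate_gpu_memory(8, 16))  # ~19.2 GB at 16-bit precision
print(estimate_gpu_memory(8, 8))   # ~9.6 GB at 8-bit precision
```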

Example

Consider the Grok-1 model from xAI, with 314 billion parameters (P = 314), deployed at 16-bit precision (Q = 16):

M = (314 × 4) / (32 / 16) × 1.2 = 628 × 1.2 = 753.6 GB

So, to deploy Grok-1 at 16-bit precision, you would need a whopping 753.6 GB of GPU memory. This clearly shows the massive resource requirements of these large-scale models!
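Plugging Grok-1 into the sketch helper defined above reproduces the same number:

```python
# Grok-1: 314 billion parameters at 16-bit precision
print(f"{estimate_gpu_memory(314, 16):.1f} GB")  # 753.6 GB
```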

GPU Memory Planning Cheat Sheet in 2024

Made by author – This table gives a snapshot of some impressive open-source LLMs coming out in 2024, highlighting their specs and the GPU muscle they need.

The lineup ranges from lightweight models like OpenELM to resource-hungry giants like Snowflake Arctic, with context lengths reaching up to 128,000 tokens. Across the board, switching to 8-bit precision can drastically cut the GPU memory needed for deployment.

Smaller models are ideal for solo developers or startups, while quantization helps make larger models feasible on budget-friendly hardware.
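If you want to reproduce this kind of table yourself, you can loop the sketch helper defined earlier over a few 2024 models. The parameter counts below are approximate and should be double-checked against each model card.

```python
# Approximate parameter counts in billions (verify against each model card).
models_2024 = {
    "Phi-3 Mini":       3.8,
    "Llama 3 8B":       8,
    "Llama 3 70B":      70,
    "Mixtral 8x22B":    141,
    "Grok-1":           314,
    "Snowflake Arctic": 480,
}

for name, params_b in models_2024.items():
    fp16 = estimate_gpu_memory(params_b, 16)
    int8 = estimate_gpu_memory(params_b, 8)
    print(f"{name:>16}: {fp16:7.1f} GB at 16-bit | {int8:7.1f} GB at 8-bit")
```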

Key Takeaways to Make Your Life Easier

  1. Lower Precision Can Save You Big: Using 8-bit precision can drastically cut down on memory use. But keep in mind, it might come at a performance cost. It’s all about trade-offs.
  2. Account for Overhead: That 20% buffer in the formula isn’t just for fun. It helps you avoid nasty surprises like your model stalling due to a lack of memory.
  3. Pick the Right Model for Your Use Case: If you need long context windows for applications like document summarization, models like LWM or Jamba could be a good fit. But watch out for their sky-high memory needs.

Conclusion

Now you have what you need to make your own estimates based on your use case. If you’re deploying a model for real-time text generation, you don’t want latency or, worse, for the whole app to crash. And if you’re working in the cloud, optimizing GPU usage can mean thousands of dollars saved over time. That is why understanding these memory estimates really matters.


Loved the Article? Here’s How to Show Some Love:

  • Clap many times – each one truly helps! 👏
  • Follow me here on Medium and subscribe for free to catch my latest posts. 🗞️
  • Let’s connect on LinkedIn, check out my projects on GitHub, and stay in touch on Twitter


