Written by Oleksiy Protsyk
While large language models (LLMs) have become essential tools for a wide range of industry professionals, they can still be tricky for small and mid-sized businesses to use long-term. API-based solutions like ChatGPT, Claude, or Gemini are quick to integrate and easy to use, but they are costly, especially for organizations handling high volumes of requests. Beyond the per-request fees, there are additional expenses to consider, including the time and resources required to set up and manage the integration. Open-source LLMs offer a powerful, cost-saving alternative. Running an open-source model on moderately priced hardware can reduce costs and allow companies to customize performance. And with the right optimizations, these models can handle high request volumes, making them ideal for companies with frequent usage, where API costs would quickly add up. We’ve previously discussed how to select the best LLM for your needs; in this article, we’ll explore how to actually optimize LLMs for cost-efficient, production-ready setups.
While most articles focus on high-end hardware, we specifically hone in on the potential of mid-range hardware. We chose the Nvidia A10G GPU, a solid mid-range option priced at $1.212 per hour on Amazon Web Services. Our model, the recently released Llama 3.1 8B Instruct, is a compact yet powerful LLM that ranks highly on the Large Model Systems Organization (LMSYS) Chatbot Arena leaderboard. We hope to illustrate what kind of speed and efficiency you can expect from an open-source LLM on mid-range hardware, and to show how smaller setups can compete with high-performance options given the right optimizations.
Experiment setup
This post explores the benefits and performance of the optimization techniques described below. We used a set of prompts of varying lengths, which helps us understand how the model handles shorter and longer tasks after each optimization.
We measured performance using two key metrics:
- Time to First Token (TTFT): the time it takes for the model to process the prompt and produce the first token of the response.
- Time per Output Token (TPOT): the average time the model takes to generate each subsequent token once the prompt has been processed.
We tested prompts of different lengths, ranging from very short (11 tokens) to much longer tasks (up to 1,539 tokens), with some representing conversations and others being more complex tasks with examples.
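To make these metrics concrete, here is a minimal sketch of how TTFT and TPOT can be measured with the Transformers streaming API. It is an illustration rather than our exact benchmarking harness, and because the streamer may batch a few tokens per chunk, the TPOT figure is approximate.

```python
import time
from threading import Thread

from transformers import TextIteratorStreamer


def measure_latency(model, tokenizer, prompt, max_new_tokens=128):
    """Return (TTFT, TPOT) in seconds for a single generation."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    thread = Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=max_new_tokens, streamer=streamer),
    )

    start = time.perf_counter()
    thread.start()
    # The streamer yields text chunks as they are generated; each arrival time
    # approximates the moment a token (or small group of tokens) was produced.
    arrival_times = [time.perf_counter() for _ in streamer]
    thread.join()

    ttft = arrival_times[0] - start
    tpot = (arrival_times[-1] - arrival_times[0]) / max(len(arrival_times) - 1, 1)
    return ttft, tpot
```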
Off-the-shelf tools: performance limitations
When you first run an LLM, the easiest way to get it up and running is through Hugging Face’s Transformers library without any optimizations. Out of the box, however, the performance can be quite slow.
Running the model in its default configuration (fp32 precision), we saw TTFT stay above 1 second, and TPOT averaged around 1.24 seconds. The model also consumed a significant amount of memory, using more than 20GB of VRAM.
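For reference, this baseline takes only a few lines of Transformers code. A minimal sketch, where the Hugging Face model identifier and prompt are illustrative assumptions and `measure_latency` is the timing helper sketched above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# torch_dtype=torch.float32 forces the full-precision baseline discussed here.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,
    device_map="cuda",
)

ttft, tpot = measure_latency(model, tokenizer, "Summarize the benefits of open-source LLMs.")
print(f"TTFT: {ttft:.2f}s, TPOT: {tpot:.2f}s")
```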
Lower precision
To make LLMs more efficient and reduce costs, one of the simplest and most effective techniques is to use lower precision. But why do we want the model to run faster in the first place, and how does that relate to cost?
In any machine learning project, running models efficiently is critical because faster models reduce the time spent processing data, which directly translates into lower operational costs—especially when you're running on cloud services that charge by the hour. Faster performance also means you can handle more tasks in less time, making your system more scalable and responsive. However, models can be slow and consume a lot of memory, particularly with large-scale models like LLMs. This is where precision comes into play.
Lowering precision means storing and processing data in formats that use fewer bits, which is exactly the kind of arithmetic the Tensor Cores on modern GPUs are built to accelerate. By switching from full precision (fp32) to half precision (fp16), you can dramatically speed up the model while reducing its memory usage. For most tasks, this reduction in precision has little to no noticeable impact on the quality of the model's output. In fact, the accuracy tradeoff between fp32 and fp16 is so minor that it is usually worth the considerable performance gains, especially when you're running the model on hardware with limited VRAM.
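In Transformers, switching to half precision is a one-argument change at load time. A minimal sketch, assuming the same illustrative Llama 3.1 8B Instruct checkpoint as before:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Loading in bfloat16 (or float16) roughly halves VRAM usage compared to float32.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="cuda",
)
```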
On a modern architecture like the Nvidia H100, precision can even go down to float8, but that is beyond the scope of this article.
The difference between float16 and bfloat16 is the following:
Float16 uses 1 sign bit, 5 bits for the exponent, and 10 bits for the mantissa (the fractional part), while bfloat16 uses 1 sign bit, 8 bits for the exponent, and 7 bits for the mantissa. Although both formats represent a number in 16 bits, precision and range are handled differently. Bfloat16 spends more bits on the exponent, allowing it to cover the same wide range as float32, at the cost of losing some of the decimal precision that float16 provides. To understand how these bits turn into a value, a simple formula is used: value = (-1)^sign × 2^(exponent - bias) × (1 + mantissa)
Where:
- The sign determines if the number is positive or negative.
- The exponent controls how large or small the number can be (with the bias depending on the precision type).
- The mantissa stores the fractional part, determining the precision.
Plugging in the values from the bit layouts above, we get a range of roughly -3.4×10^38 to 3.4×10^38 for float32 and bfloat16, while float16 only covers about -6.55×10^4 to 6.55×10^4. This trade-off highlights the key difference: float16 provides better decimal precision but has a limited range, while bfloat16 sacrifices some precision to cover a broader range of values. One potential problem with float16 is the risk of overflow: when converting from float32 to float16, numbers that fall outside float16’s limited range overflow, potentially causing errors in the model, so it is important to understand the model you are working with.
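The overflow risk is easy to demonstrate with values that fit comfortably in float32 but exceed float16's range; a small illustrative sketch, not part of our benchmark:

```python
import torch

# Both values fit in float32, but exceed float16's maximum of ~65,504.
x = torch.tensor([70_000.0, 1e38])

print(x.to(torch.float16))   # overflows to inf in float16
print(x.to(torch.bfloat16))  # stays finite in bfloat16, at reduced decimal precision
```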
Looking at the metrics, we see a clear performance win over float32.
The difference in runtime between float16 and bfloat16 can be attributed to fluctuations between runs. While fp16 offers higher decimal accuracy, bfloat16 provides a wider range, making it safer in cases where large values are involved. In our tests, both fp16 and bfloat16 significantly reduced memory usage to around 16GB and improved token generation speed.
A quick intro to Transformers and why we chose Flash Attention
The Transformer architecture is the backbone of most modern LLMs, like GPT, Claude, or Llama. Unlike older models that might lose track of words within a sentence, Transformers maintain a comprehensive understanding of the entire context. They have grown larger and deeper, but equipping them with longer context remains difficult.
A considerable amount of time is spent computing Multi-Head Attention, which essentially allows the model to look at the sequence from different perspectives. Because standard attention struggles with runtime and memory on long sequences, the field has rapidly switched to a more memory- and runtime-efficient variant introduced by researchers in 2022. Describing its inner workings would take a whole separate blog post, but the key observation is that self-attention has a big drawback: its cost scales quadratically with the input sequence length, making it slow as sequences get longer. This is where Flash Attention comes in: an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. Flash Attention computes the same result as standard attention, but its memory use scales linearly with sequence length and its runtime is considerably faster in practice.
Flash Attention reduces memory usage and speeds up computation by making attention memory scale linearly with input length. In our tests, TTFT improved noticeably for longer prompts, and while the reduction in TPOT was smaller, it accumulates over a full generation, making this a valuable optimization for real-world applications.
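In Transformers, Flash Attention can be enabled at load time, provided the flash-attn package is installed and the model is loaded in half precision. A minimal sketch, with the model identifier again an illustrative assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Flash Attention kernels require fp16/bf16; they do not support float32.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
```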
vLLM
Efficiently deploying and serving LLMs at scale is no easy task, but an open-source library like vLLM simplifies the process by providing a high-performance serving solution. Traditional approaches to model inference, particularly in production environments, often suffer from high latency, excessive memory usage, and underutilization of hardware resources. vLLM addresses these issues by introducing optimizations like Paged Attention, dynamic memory management, and tensor and pipeline parallelism, enabling faster inference and making it an essential tool for anyone handling model serving at scale. The library aims for optimal performance and scalability, ensuring that LLMs can be deployed efficiently in production environments.
Although vLLM bundles many optimization techniques, the key to its success is Paged Attention, a memory-optimized attention for inference that manages how the GPU's VRAM is used. In traditional attention, every token's attention scores and related data are stored simultaneously in memory, which can lead to excessive memory usage, especially with long input sequences. Paged Attention solves this by dynamically loading and unloading "pages" of attention data, storing contiguous keys and values in non-contiguous memory, similar to how virtual memory works in operating systems. It partitions each sequence's KV cache into blocks of a fixed number of tokens, and at runtime the Paged Attention kernel accesses the blocks it needs through a lookup table. This gives a much better memory management scheme with very little waste (the only place memory can be wasted is in the last block, if the number of remaining tokens is smaller than the block size). As a result, models can handle longer sequences more efficiently, processing larger inputs with a boost in performance.
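In practice, using vLLM looks roughly like the sketch below; the model identifier, sampling settings, and memory fraction are illustrative assumptions rather than our exact configuration:

```python
from vllm import LLM, SamplingParams

# Paged Attention is used automatically; gpu_memory_utilization controls how much
# VRAM vLLM pre-allocates, most of which goes to the paged KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model identifier
    dtype="float16",
    gpu_memory_utilization=0.90,
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize the benefits of open-source LLMs."], sampling)
print(outputs[0].outputs[0].text)
```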
A significant uplift in performance comes with little to no work. Both TTFT and TPOT are lower, providing faster context loading and generation. With its ease of use, great out-of-the-box performance, and simple serving story, this is the go-to tool for optimizing and serving LLMs when you're on a tight time or engineering budget.
TensorRT-LLM
Next we have TensorRT-LLM, an advanced library developed by NVIDIA to accelerate the inference of LLMs using TensorRT, a high-performance deep learning inference platform. Specifically designed to harness the power of NVIDIA GPUs, TensorRT-LLM optimizes LLMs by applying techniques such as mixed precision, kernel fusion, and layer-specific optimizations (to name a few), significantly reducing latency and boosting throughput. Like vLLM, this library is particularly valuable for deploying LLMs in production environments, because it allows developers to achieve faster inference times, lower operational costs, and maximize the performance potential of their hardware.
The library provides a vast number of possible optimization techniques, giving the developer full control over the performance of the LLM. But full control is both a pro and a con: with great power comes great responsibility. It is a difficult tool for the inexperienced, as the entry barrier is quite high, and a good understanding of most optimization techniques and of the model at hand is required to achieve optimal results. Models built with TensorRT-LLM are also easy to serve via Triton Inference Server, and when used skillfully, the library ultimately yields the best performance.
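For orientation, recent TensorRT-LLM releases expose a high-level Python API modeled after vLLM's. The sketch below is a rough, assumption-heavy illustration: exact import paths and defaults vary between releases, and older versions instead require an explicit checkpoint-conversion and engine-build step before serving.

```python
from tensorrt_llm import LLM, SamplingParams

# The high-level API builds (or loads) a TensorRT engine for the model on first use.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model identifier

sampling = SamplingParams(temperature=0.7, max_tokens=128)
for output in llm.generate(["Summarize the benefits of open-source LLMs."], sampling):
    print(output.outputs[0].text)
```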
Although the TTFT is close to identical to vLLM's, a significant uplift can be seen in TPOT, making this the best-performing technique explored above.
llama.cpp
llama.cpp is a highly optimized C++ library designed to run LLMs efficiently on a wide range of hardware, including Apple M-series ARM processors. Unlike most frameworks that often rely on GPUs, llama.cpp is built in C++ to enable the execution of LLMs directly on CPUs with high performance. This makes it possible to deploy models on edge devices or personal computers. The library is especially useful for developers who need to integrate LLMs into applications with constrained hardware requirements, or for personal offline use.
One of the biggest factors enabling use on edge devices is quantization: shrinking the model by storing its weights at reduced precision, going as low as 2 bits per weight. This reduces the model's footprint and improves speed at the cost of some output quality. That sacrifice might seem like a big problem, but in reality, reducing weight precision from 32 bits down to 6 bits yields a model that retains very high quality. With a large community behind it, it is easy to find an existing quantization of the model you like, and even if one isn't available, it is easy to quantize models locally with llama.cpp.
For this experiment, 8-bit quantization was used, as current fp16 support in llama.cpp is not great. This makes the comparison to the other approaches not entirely fair, since it uses reduced precision; treat it as a showcase of quantization, which we will dive into more deeply in a later post. With its highly optimized runtime, llama.cpp provides a significant improvement in TPOT, sacrificing some TTFT when using longer contexts.
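With the llama-cpp-python bindings, running a quantized model looks roughly like this; the GGUF file path and parameters below are illustrative assumptions:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-instruct-Q8_0.gguf",  # assumed path to an 8-bit GGUF quantization
    n_gpu_layers=-1,  # offload all layers to the GPU; set to 0 for CPU-only inference
    n_ctx=4096,       # context window size
)

result = llm("Summarize the benefits of open-source LLMs.", max_tokens=128)
print(result["choices"][0]["text"])
```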
Prompt caching
One of the lesser-known but highly effective techniques for improving the efficiency of LLMs in production settings is prompt caching. When deploying models for repeated tasks, such as handling frequently recurring queries or producing similar responses, caching prompts can lead to significant speed gains. Briefly, prompt caching stores the work the model has already done on a repeated prompt, so it does not have to re-process the same tokens for identical inputs.
This strategy works particularly well in scenarios where users send similar or identical requests multiple times, such as customer support systems or automated content generation pipelines. With prompt caching, both TTFT and TPOT can be reduced: TTFT drops close to zero, since the prompt's processed state is already cached and the model only needs to generate the response rather than re-process the input, while TPOT improves where the model can skip generating tokens for repeated sections. However, the technique's effectiveness heavily depends on the use case; for highly dynamic or unique inputs, the impact of caching is low.
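As one concrete example, vLLM ships automatic prefix caching, which reuses the KV cache of a shared prompt prefix across requests. A minimal sketch, assuming the same illustrative model identifier and a toy support-agent prompt:

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets vLLM reuse the KV cache of a shared prompt prefix,
# so a repeated system prompt or few-shot block is not re-processed every time.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model identifier
    enable_prefix_caching=True,
)

system_prompt = "You are a helpful support agent for an online store.\n\n"
sampling = SamplingParams(max_tokens=128)

# The second request shares the cached prefix, so its TTFT drops sharply.
for question in ["Where is my order?", "How do I return an item?"]:
    out = llm.generate([system_prompt + question], sampling)
    print(out[0].outputs[0].text)
```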
Final comparison
When we put the results of these optimizations side by side, it becomes clear how much different optimization techniques impact LLM performance. The model using float32 weights is the slowest by far in both TTFT and TPOT, making it unsuitable for real-world production where speed is crucial. Switching to float16 or bfloat16 precision offers substantial improvements, with both options reducing memory usage and increasing processing speed. Float16 with Flash Attention further optimizes performance, especially for longer prompts, making it an excellent choice for tasks that require handling larger inputs efficiently.
For low-latency applications, vLLM stands out by delivering the fastest TTFT, which is essential for real-time or interactive tasks like chatbots and live systems. With its ease of use and deployment, it is the go-to method for fast LLMs in production when using TensorRT is not feasible. TensorRT-LLM, on the other hand, provides the best overall performance, particularly excelling in TPOT for larger inputs, thanks to its deep optimizations tailored for NVIDIA GPUs. Although it has a higher barrier to entry due to its complexity, it delivers the highest throughput, making it the go-to solution when raw performance and scaling are the key considerations.
Beyond performance and cost, running an LLM in-house also offers greater data privacy, which can be critical for industries handling sensitive information, like healthcare or finance. In-house models allow companies to process data securely without relying on external APIs, significantly reducing the risk of data exposure. By deploying optimized open-source LLMs on private hardware, organizations can retain full control over their data and ensure compliance with strict privacy regulations.
In conclusion, it’s clear that different optimization techniques offer varying benefits depending on the use case. For most real-world applications, reducing precision to float16 or bfloat16 is a straightforward way to gain substantial performance improvements. Flash Attention further enhances speed, especially for tasks involving long input sequences. For those looking to deploy LLMs quickly, vLLM offers a great balance of performance and ease of use, while TensorRT-LLM remains the best option for those with the technical expertise to fully leverage its capabilities. llama.cpp enables efficient LLM deployment on edge devices, and prompt caching can be a game-changer for applications with repetitive inputs. Taken together, these strategies make it possible to run powerful LLMs efficiently, cost-effectively, and with enhanced privacy by keeping data in-house.