Running Llama 3 with the vLLM Library at Scale

Mohit Kumar
3 min read · Jun 30, 2024


Large Language Models (LLMs) have revolutionized the field of artificial intelligence, enabling machines to understand and generate human-like text. Meta’s Llama 3 is one of the latest advancements in this domain, offering improved accuracy and cost-efficiency. This article will guide you through the process of running Llama 3 using the vLLM library, which is designed for efficient LLM inference and deployment at scale.

The Llama 3 model is available on the Hugging Face Hub as meta-llama/Meta-Llama-3-8B-Instruct.

Why Choose vLLM?

How vLLM works

vLLM stands for Virtual Large Language Model and is an open-source library developed to optimize memory management and inference speed for large language models. It leverages advanced techniques such as PagedAttention and continuous batching to handle high-throughput and memory-constrained workloads efficiently.
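
To see continuous batching from the user's side, the sketch below submits several prompts in one call and lets vLLM schedule them together on the GPU. It is a minimal example with placeholder prompts and sampling settings, and it assumes the model and library are already installed (setup is covered next).

from vllm import LLM, SamplingParams

# vLLM schedules these prompts together instead of processing them one at a time
prompts = [
    "Explain PagedAttention in one sentence.",
    "What is continuous batching?",
    "Name three use cases for large-scale LLM inference.",
]

llm = LLM("meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# A single generate() call handles the whole batch
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)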

Setting Up vLLM

Prerequisites

Before you begin, ensure you have the following:

  1. A server with a compatible GPU (e.g., NVIDIA A10, A100, or AMD MI300); see the quick environment check after this list.
  2. Python installed on your system.
  3. Access to the internet for downloading the model and dependencies.
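
Once the dependencies below are installed, a quick check such as the following confirms that a GPU is visible to the runtime. This is a minimal sketch that assumes PyTorch is present (vLLM installs it as a dependency):

import sys
import torch

# Report the Python version and any GPUs visible to the runtime
print("Python:", sys.version.split()[0])
print("GPU available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")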

First, install the vLLM library using pip:

pip install vllm
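
To confirm the installation, you can print the installed version from the same Python environment:

python -c "import vllm; print(vllm.__version__)"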

If you plan to perform distributed inference, you will also need to install Ray:

pip install ray

Loading and Running LLama 3

To load the LLama 3 model into vLLM, use the following Python code:

from vllm import LLM

# Load the LLama 3 8B model
model = LLM("meta-llama/Meta-Llama-3-8B-Instruct")
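
If you need tighter control over precision, memory, or context length, the LLM constructor accepts additional parameters. The values below are illustrative assumptions rather than recommendations:

from vllm import LLM

# Optional tuning knobs: compute precision, GPU memory budget, and context length
model = LLM(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    dtype="bfloat16",              # compute precision for weights and activations
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may claim
    max_model_len=8192,            # maximum sequence length to allocate for
)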

You can run inference using the model as shown below:

from vllm import SamplingParams
from transformers import AutoTokenizer

# Load the tokenizer to build a Llama 3 chat prompt
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Prepare the input message and apply the chat template
messages = [{"role": "user", "content": "What is the capital of France?"}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate the output and print the generated text
outputs = model.generate(formatted_prompt, SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)

Deploying at Scale

For consistent deployment across diverse environments, you can use pre-built container images. This method supports both on-premises and cloud-based deployments.

docker pull vllm/vllm-openai:latest
docker run -d --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Meta-Llama-3-8B-Instruct
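
This image serves an OpenAI-compatible API, so once the container has finished downloading the model you can test it with a chat completions request (the model name must match the one passed to the container):

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}] }'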

vLLM provides a RESTful API for easy integration with other systems. Start the API server with the following command:

python -m vllm.entrypoints.api_server --model meta-llama/Meta-Llama-3-8B-Instruct

You can then interact with the model via HTTP endpoints:

curl http://localhost:8000/generate -d '{ "prompt": "Dell is", "use_beam_search": true, "n": 5, "temperature": 0 }'
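
The same endpoint can be called from Python. The sketch below assumes the API server above is running locally and that the response JSON carries the generated text in a "text" field, as vLLM's demo API server returns:

import requests

payload = {
    "prompt": "Dell is",
    "n": 1,
    "temperature": 0.7,
    "max_tokens": 64,
}

# POST to the /generate endpoint of the running vLLM API server
response = requests.post("http://localhost:8000/generate", json=payload)
response.raise_for_status()
print(response.json()["text"])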

Distributed Inference

For large-scale deployments, you can use Ray to distribute the inference workload across multiple GPUs:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

# Define the engine arguments
args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,  # number of GPUs to shard the model across
    trust_remote_code=True,
    enforce_eager=True,
)

# Initialize the engine (Ray workers are used when tensor_parallel_size > 1)
engine = AsyncLLMEngine.from_engine_args(args)

# Run inference: consume the async stream and return the final generated text
async def generate(prompt: str, **sampling_kwargs) -> str:
    sampling_params = SamplingParams(**sampling_kwargs)
    request_id = random_uuid()
    final_output = None
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return final_output.outputs[0].text
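
For a one-off call, the coroutine above can be driven with asyncio; the prompt and sampling values here are placeholders:

import asyncio

# Generate a completion across the tensor-parallel workers
text = asyncio.run(generate("Summarize vLLM in one sentence.", temperature=0.7, max_tokens=64))
print(text)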

Conclusion

Running Llama 3 with the vLLM library offers a powerful and efficient solution for deploying large language models at scale. By leveraging advanced techniques like PagedAttention and continuous batching, vLLM ensures high throughput and optimal memory usage. Whether you are deploying on-premises or in the cloud, vLLM provides the flexibility and performance needed to handle demanding AI applications.


Written by Mohit Kumar

Machine Learning Engineer at Sirion || LLM + LLMOps, MLOps & Infrastructure
