Serving the model
vLLM is an open-source inference and serving engine specifically designed to optimize the performance of large language models (LLMs) through efficient memory management. As a popular inference solution in the ML community, vLLM offers several key advantages:
- Efficient Memory Management: Uses PagedAttention technology to optimize GPU/accelerator memory usage
- High Throughput: Enables concurrent processing of multiple requests
- AWS Neuron Support: Native integration with AWS Inferentia and Trainium accelerators
- OpenAI-compatible API: Provides a drop-in replacement for OpenAI's API, simplifying integration
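To make the last point concrete, here is a minimal sketch of what a request to a vLLM endpoint looks like. The hostname is a placeholder; the port and model path match the manifest we deploy later in this section, and any OpenAI-style client can target such an endpoint by changing only its base URL.

```bash
# Minimal sketch: an OpenAI-style completion request against a vLLM endpoint.
# <vllm-endpoint> is a placeholder; the port and model path match the manifest below.
curl http://<vllm-endpoint>:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/mistral-7b-v0.3",
        "prompt": "What is Amazon EKS?",
        "max_tokens": 64
      }'
```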
For AWS Neuron specifically, vLLM provides:
- Native support for Neuron SDK and runtime
- Optimized memory management for Inferentia/Trainium architectures
- Continuous model loading for efficient scaling
- Integration with AWS Neuron profiling tools
For this lab, we will use a Mistral-7B-Instruct-v0.3 model that has been pre-compiled with the neuronx-distributed-inference framework. It offers a good balance between capability and resource requirements, making it suitable for deployment on our Trainium-powered EKS cluster.
To deploy the model, we'll create a Kubernetes Service and a Deployment; the Deployment runs a vLLM-based container image that loads the model and its weights:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mistral
  namespace: vllm
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/app-metrics: "true"
    prometheus.io/port: "8080"
  labels:
    model: mistral7b
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model: mistral7b
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral
  namespace: vllm
  labels:
    model: mistral7b
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      model: mistral7b
  template:
    metadata:
      labels:
        model: mistral7b
    spec:
      nodeSelector:
        instanceType: trn1.2xlarge
        neuron.amazonaws.com/neuron-device: "true"
      tolerations:
        - effect: NoSchedule
          key: aws.amazon.com/neuron
          operator: Exists
      initContainers:
        - name: model-download
          image: python:3.11
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -e
              pip install -U "huggingface_hub[cli]"
              pip install hf_transfer
              mkdir -p /models/mistral-7b-v0.3
              HF_HUB_ENABLE_HF_TRANSFER=1 hf download aws-neuron/Mistral-7B-Instruct-v0.3-seqlen-2048-bs-1-cores-2 --local-dir /models/mistral-7b-v0.3
              echo ""
              echo "Model download is complete."
          volumeMounts:
            - name: local-storage
              mountPath: /models
      containers:
        - name: vllm
          image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py311-sdk2.26.0-ubuntu22.04
          imagePullPolicy: IfNotPresent
          command: ["/bin/sh", "-c"]
          args:
            [
              "vllm serve /models/mistral-7b-v0.3 --tokenizer /models/mistral-7b-v0.3 --port 8080 --host 0.0.0.0 --device neuron --tensor-parallel-size 2 --max-num-seqs 4 --use-v2-block-manager --max-model-len 2048 --dtype bfloat16",
            ]
          ports:
            - containerPort: 8080
              protocol: TCP
              name: http
          resources:
            requests:
              cpu: 4
              memory: 24Gi
              aws.amazon.com/neuron: 1
            limits:
              cpu: 4
              memory: 24Gi
              aws.amazon.com/neuron: 1
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: local-storage
              mountPath: /models
          env:
            - name: NEURON_RT_NUM_CORES
              value: "2"
            - name: NEURON_RT_VISIBLE_CORES
              value: "0,1"
            - name: VLLM_LOGGING_LEVEL
              value: "INFO"
            - name: VLLM_NEURON_FRAMEWORK
              value: "neuronx-distributed-inference"
            - name: NEURON_COMPILED_ARTIFACTS
              value: "/models/mistral-7b-v0.3"
            - name: MALLOC_ARENA_MAX
              value: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 5
      terminationGracePeriodSeconds: 10
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: local-storage
          hostPath:
            path: /mnt/k8s-disks/0
            type: Directory
```
Let's create the necessary resources:
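Assuming the manifest above is saved locally as mistral.yaml (the filename is only an example) and that the vllm namespace already exists, apply it with kubectl:

```bash
# Apply the Service and Deployment defined above (filename is illustrative)
kubectl apply -f mistral.yaml
```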
We can check the resources it created:
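One way to do this, using the names and namespace from the manifest above:

```bash
# List the Deployment and Service in the vllm namespace
kubectl get deployment,service -n vllm
```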
```
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
mistral   0/1     1            0           33s

NAME      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
mistral   ClusterIP   172.16.149.89   <none>        8080/TCP   33m
```
The model initialization process takes several minutes to complete. The vLLM Pod will go through the following stages:
- Remain in a Pending state until Karpenter provisions the Trainium instance
- Use an init container to download the model from Hugging Face to a host file system path
- Download the vLLM container image (approximately 10GB)
- Start the vLLM service
- Load the model from the file system
- Begin serving the model via an HTTP endpoint on port 8080
You can either monitor the status of the Pod as it progresses through these stages or proceed to the next section while the model loads.
If you choose to wait, you can watch the Pod transition to the Init state (press Ctrl + C to exit):
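For example, using the model=mistral7b label from the manifest:

```bash
# Watch the Pod until it reaches the Init state (press Ctrl + C to exit)
kubectl get pod -n vllm -l model=mistral7b --watch
```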
```
NAME                       READY   STATUS     RESTARTS   AGE
mistral-6889d675c5-2l6x2   0/1     Pending    0          21s
mistral-6889d675c5-2l6x2   0/1     Pending    0          29s
mistral-6889d675c5-2l6x2   0/1     Pending    0          29s
mistral-6889d675c5-2l6x2   0/1     Pending    0          30s
mistral-6889d675c5-2l6x2   0/1     Pending    0          38s
mistral-6889d675c5-2l6x2   0/1     Pending    0          50s
mistral-6889d675c5-2l6x2   0/1     Init:0/1   0          50s

# Exit once the Pod reaches the Init state
```
You can check the logs for the init container that's downloading the model (press Ctrl + C to exit):
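For example, following the model-download init container defined in the Deployment:

```bash
# Follow the init container logs (press Ctrl + C to exit)
kubectl logs -n vllm deployment/mistral -c model-download -f
```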
```
[...]
Downloading 'weights/tp0_sharded_checkpoint.safetensors' to '/models/mistral-7b-v0.3/.cache/huggingface/download/weights/dAuF3Bw92r-GdZ-yzT84Iweq-RQ=.6794a3d7f2b1d071399a899a42bcd5652e83ebdd140f02f562d90b292ae750aa.incomplete'
Download complete. Moving file to /models/mistral-7b-v0.3/weights/tp0_sharded_checkpoint.safetensors
Downloading 'weights/tp1_sharded_checkpoint.safetensors' to '/models/mistral-7b-v0.3/.cache/huggingface/download/weights/eEdQSCIfRYQ2putRDwZhjh7Te8E=.14c5bd3b07c4f4b752a65ee99fe9c79ae0110c7e61df0d83ef4993c1ee63a749.incomplete'
Download complete. Moving file to /models/mistral-7b-v0.3/weights/tp1_sharded_checkpoint.safetensors
Model download is complete.

# Exit once the logs reach this point
```
Once the init container completes, you can monitor the vLLM container logs as it starts (press Ctrl + C to exit):
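For example, following the vllm container defined in the Deployment:

```bash
# Follow the vLLM container logs (press Ctrl + C to exit)
kubectl logs -n vllm deployment/mistral -c vllm -f
```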
```
[...]
INFO 09-30 04:43:37 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 09-30 04:43:37 [launcher.py:36] Route: /invocations, Methods: POST
INFO 09-30 04:43:37 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     10.42.114.242:38674 - "GET /health HTTP/1.1" 200 OK
INFO:     10.42.114.242:50134 - "GET /health HTTP/1.1" 200 OK

# Exit once the logs reach this point
```
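As an optional extra check once the health requests return 200, you can port-forward the Service and query the same /health route the readiness probe uses; the names below are taken from the manifest above:

```bash
# Optional: forward a local port to the mistral Service and query /health
kubectl -n vllm port-forward svc/mistral 8080:8080 &
PF_PID=$!
sleep 2
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health   # expect 200
kill $PF_PID
```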
Once these steps are complete, or if you decide to move on while the model initializes, you can proceed to the next task.