Serving the model
vLLM is an open-source inference and serving engine specifically designed to optimize the performance of large language models (LLMs) through efficient memory management. As a popular inference solution in the ML community, vLLM offers several key advantages:
- Efficient Memory Management: Uses PagedAttention technology to optimize GPU/accelerator memory usage
- High Throughput: Enables concurrent processing of multiple requests
- AWS Neuron Support: Native integration with AWS Inferentia and Trainium accelerators
- OpenAI-compatible API: Provides a drop-in replacement for OpenAI's API, simplifying integration
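To make the last point concrete, here is a minimal sketch of what a request to a vLLM endpoint looks like. The hostname is a placeholder; the port and model path match the manifest we deploy later in this section, and any OpenAI-style client can target such an endpoint by changing only its base URL.

```bash
# Minimal sketch: an OpenAI-style completion request against a vLLM endpoint.
# <vllm-endpoint> is a placeholder; the port and model path match the manifest below.
curl http://<vllm-endpoint>:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/mistral-7b-v0.3",
        "prompt": "What is Amazon EKS?",
        "max_tokens": 64
      }'
```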
For AWS Neuron specifically, vLLM provides:
- Native support for Neuron SDK and runtime
- Optimized memory management for Inferentia/Trainium architectures
- Continuous model loading for efficient scaling
- Integration with AWS Neuron profiling tools
For this lab, we will use a Mistral-7B-Instruct-v0.3 model that has been pre-compiled with the neuronx-distributed-inference framework. It offers a good balance between capability and resource requirements, making it suitable for deployment on our Trainium-powered EKS cluster.
To deploy the model, we'll create a Kubernetes Service and a Deployment; the Deployment runs a vLLM-based container image that loads the model and its weights:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mistral
  namespace: vllm
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/app-metrics: "true"
    prometheus.io/port: "8080"
  labels:
    model: mistral7b
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model: mistral7b
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral
  namespace: vllm
  labels:
    model: mistral7b
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      model: mistral7b
  template:
    metadata:
      labels:
        model: mistral7b
    spec:
      nodeSelector:
        instanceType: trn1.2xlarge
        neuron.amazonaws.com/neuron-device: "true"
      tolerations:
        - effect: NoSchedule
          key: aws.amazon.com/neuron
          operator: Exists
      initContainers:
        - name: model-download
          image: python:3.11
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -e
              pip install -U "huggingface_hub[cli]"
              pip install hf_transfer
              mkdir -p /models/mistral-7b-v0.3
              HF_HUB_ENABLE_HF_TRANSFER=1 hf download aws-neuron/Mistral-7B-Instruct-v0.3-seqlen-2048-bs-1-cores-2 --local-dir /models/mistral-7b-v0.3
              echo ""
              echo "Model download is complete."
          volumeMounts:
            - name: local-storage
              mountPath: /models
      containers:
        - name: vllm
          image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py311-sdk2.26.0-ubuntu22.04
          imagePullPolicy: IfNotPresent
          command: ["/bin/sh", "-c"]
          args:
            [
              "vllm serve /models/mistral-7b-v0.3 --tokenizer /models/mistral-7b-v0.3 --port 8080 --host 0.0.0.0 --device neuron --tensor-parallel-size 2 --max-num-seqs 4 --use-v2-block-manager --max-model-len 2048 --dtype bfloat16",
            ]
          ports:
            - containerPort: 8080
              protocol: TCP
              name: http
          resources:
            requests:
              cpu: 4
              memory: 24Gi
              aws.amazon.com/neuron: 1
            limits:
              cpu: 4
              memory: 24Gi
              aws.amazon.com/neuron: 1
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: local-storage
              mountPath: /models
          env:
            - name: NEURON_RT_NUM_CORES
              value: "2"
            - name: NEURON_RT_VISIBLE_CORES
              value: "0,1"
            - name: VLLM_LOGGING_LEVEL
              value: "INFO"
            - name: VLLM_NEURON_FRAMEWORK
              value: "neuronx-distributed-inference"
            - name: NEURON_COMPILED_ARTIFACTS
              value: "/models/mistral-7b-v0.3"
            - name: MALLOC_ARENA_MAX
              value: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 5
      terminationGracePeriodSeconds: 10
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: local-storage
          hostPath:
            path: /mnt/k8s-disks/0
            type: Directory
```
Let's create the necessary resources:
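Assuming the manifest above is saved locally as mistral.yaml (the filename is only an example) and that the vllm namespace already exists, apply it with kubectl:

```bash
# Apply the Service and Deployment defined above (filename is illustrative)
kubectl apply -f mistral.yaml
```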
We can check the resources it created:
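One way to do this, using the names and namespace from the manifest above:

```bash
# List the Deployment and Service in the vllm namespace
kubectl get deployment,service -n vllm
```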
```
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
mistral   0/1     1            0           33s

NAME      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
mistral   ClusterIP   172.16.149.89   <none>        8080/TCP   33m
```
The model initialization process takes several minutes to complete. The vLLM Pod will go through the following stages:
- Remain in a Pending state until Karpenter provisions the Trainium instance
- Use an init container to download the model from Hugging Face to a host file system path
- Download the vLLM container image (approximately 10GB)
- Start the vLLM service
- Load the model from the file system
- Begin serving the model via an HTTP endpoint on port 8080
You can either monitor the status of the Pod as it progresses through these stages or proceed to the next section while the model loads.
If you choose to wait, you can watch the Pod transition to the Init state (press Ctrl + C to exit):
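For example, using the model=mistral7b label from the manifest:

```bash
# Watch the Pod until it reaches the Init state (press Ctrl + C to exit)
kubectl get pod -n vllm -l model=mistral7b --watch
```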
```
NAME                       READY   STATUS     RESTARTS   AGE
mistral-6889d675c5-2l6x2   0/1     Pending    0          21s
mistral-6889d675c5-2l6x2   0/1     Pending    0          29s
mistral-6889d675c5-2l6x2   0/1     Pending    0          29s
mistral-6889d675c5-2l6x2   0/1     Pending    0          30s
mistral-6889d675c5-2l6x2   0/1     Pending    0          38s
mistral-6889d675c5-2l6x2   0/1     Pending    0          50s
mistral-6889d675c5-2l6x2   0/1     Init:0/1   0          50s

# Exit once the Pod reaches the Init state
```
You can check the logs for the init container that's downloading the model (press Ctrl + C to exit):
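For example, following the model-download init container defined in the Deployment:

```bash
# Follow the init container logs (press Ctrl + C to exit)
kubectl logs -n vllm deployment/mistral -c model-download -f
```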
```
[...]
Downloading 'weights/tp0_sharded_checkpoint.safetensors' to '/models/mistral-7b-v0.3/.cache/huggingface/download/weights/dAuF3Bw92r-GdZ-yzT84Iweq-RQ=.6794a3d7f2b1d071399a899a42bcd5652e83ebdd140f02f562d90b292ae750aa.incomplete'
Download complete. Moving file to /models/mistral-7b-v0.3/weights/tp0_sharded_checkpoint.safetensors
Downloading 'weights/tp1_sharded_checkpoint.safetensors' to '/models/mistral-7b-v0.3/.cache/huggingface/download/weights/eEdQSCIfRYQ2putRDwZhjh7Te8E=.14c5bd3b07c4f4b752a65ee99fe9c79ae0110c7e61df0d83ef4993c1ee63a749.incomplete'
Download complete. Moving file to /models/mistral-7b-v0.3/weights/tp1_sharded_checkpoint.safetensors
Model download is complete.

# Exit once the logs reach this point
```
Once the init container completes, you can monitor the vLLM container logs as it starts (press Ctrl + C to exit):
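For example, following the vllm container defined in the Deployment:

```bash
# Follow the vLLM container logs (press Ctrl + C to exit)
kubectl logs -n vllm deployment/mistral -c vllm -f
```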
```
[...]
INFO 09-30 04:43:37 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 09-30 04:43:37 [launcher.py:36] Route: /invocations, Methods: POST
INFO 09-30 04:43:37 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     10.42.114.242:38674 - "GET /health HTTP/1.1" 200 OK
INFO:     10.42.114.242:50134 - "GET /health HTTP/1.1" 200 OK

# Exit once the logs reach this point
```
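As an optional extra check once the health requests return 200, you can port-forward the Service and query the same /health route the readiness probe uses; the names below are taken from the manifest above:

```bash
# Optional: forward a local port to the mistral Service and query /health
kubectl -n vllm port-forward svc/mistral 8080:8080 &
PF_PID=$!
sleep 2
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health   # expect 200
kill $PF_PID
```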
Once these steps are complete, or if you decide to move on while the model initializes, you can proceed to the next task.