
Quick start

This guide walks you through deploying your first NIM microservice using the NVIDIA NIM Operator.

Prerequisites

Before you begin, ensure you have:
  1. Kubernetes cluster: a cluster running Kubernetes 1.28 or higher with GPU nodes.
  2. NVIDIA GPU Operator: the NVIDIA GPU Operator installed to provide GPU device plugins and drivers.
  3. NIM Operator: the NVIDIA NIM Operator installed in your cluster. See the installation guide.
  4. NGC API key: an NGC API key from NVIDIA NGC. This is required to pull NIM container images and model artifacts.
  5. Storage class: a StorageClass configured in your cluster for persistent volume claims (used for model caching).

Each NIM requires at least one GPU. Ensure your cluster has available GPU resources.

Step 1: Create a namespace

Create a dedicated namespace for your NIM deployments:
kubectl create namespace nim-service

Step 2: Create NGC secrets

Create Kubernetes secrets containing your NGC credentials. The manifests in the next step reference two secrets: ngc-secret, a Docker registry secret for pulling images from nvcr.io, and ngc-api-secret, the NGC API key used at runtime:
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<your-ngc-api-key> \
  -n nim-service

kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY=<your-ngc-api-key> \
  -n nim-service
Replace <your-ngc-api-key> with your actual NGC API key.

Step 3: Deploy a NIM microservice

Deploy a Llama 3.2 1B Instruct model using NIMCache and NIMService resources:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
Save this to a file (e.g., llama-nim.yaml) and apply it:
kubectl apply -f llama-nim.yaml
Model caching can take several minutes depending on your network speed and the model size. The NIMCache job downloads and processes the model artifacts.

Step 4: Monitor the deployment

Watch the NIMCache job complete:
kubectl get nimcache -n nim-service
kubectl get jobs -n nim-service
kubectl logs -n nim-service job/meta-llama-3-2-1b-instruct -f
Once the cache is ready, check the NIMService status:
kubectl get nimservice -n nim-service
kubectl describe nimservice meta-llama-3-2-1b-instruct -n nim-service
Expected output when ready:
NAME                          STATUS   AGE
meta-llama-3-2-1b-instruct    Ready    5m

Step 5: Verify the deployment

Check that the NIMService pod is running:
kubectl get pods -n nim-service
You should see a pod with status Running:
NAME                                          READY   STATUS    RESTARTS   AGE
meta-llama-3-2-1b-instruct-7d9f8c5b6d-x9k2p   1/1     Running   0          3m
Check the pod logs:
kubectl logs -n nim-service -l app=meta-llama-3-2-1b-instruct

Step 6: Test the inference endpoint

Port-forward the service to your local machine:
kubectl port-forward -n nim-service svc/meta-llama-3-2-1b-instruct 8000:8000
Test the health endpoint:
curl http://localhost:8000/v1/health/ready
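When scripting a deployment, you may want to poll the readiness endpoint until the model has finished loading rather than checking it once. A minimal sketch using only the Python standard library; the URL matches the port-forward above, and wait_ready is a hypothetical helper, not part of NIM:

```python
import time
from urllib import error, request


def wait_ready(url: str = "http://localhost:8000/v1/health/ready",
               timeout: float = 300, interval: float = 5) -> bool:
    """Poll the NIM readiness endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (error.URLError, OSError):
            pass  # service not up yet; retry after the interval
        time.sleep(interval)
    return False
```

A startup script could call `wait_ready()` before sending any inference traffic and abort the rollout if it returns False.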
Send an inference request:
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama-3-2-1b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'
The NIM microservices expose an OpenAI-compatible API, making them easy to integrate with existing applications.
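Because the API is OpenAI-compatible, the same completion request can be issued from Python. A minimal sketch using only the standard library; the base URL and model name match the port-forward and manifest above, and completion_request is a hypothetical helper for illustration:

```python
import json
from urllib import request


def completion_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 50) -> request.Request:
    """Build an OpenAI-style /v1/completions POST request."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    return request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = completion_request("http://localhost:8000",
                         "meta-llama-3-2-1b-instruct",
                         "Once upon a time")
# Send it (requires the port-forward from above to be active):
#   with request.urlopen(req) as resp:
#       print(json.load(resp))
```

Existing OpenAI client libraries can also be pointed at the service by overriding their base URL.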

Understanding the deployment

Let’s break down what we deployed:

NIMCache resource

The NIMCache resource handles model artifact caching:
  • source.ngc - Specifies NGC as the model source
  • modelPuller - Container image that downloads the model
  • pullSecret - Docker registry credentials for pulling images
  • authSecret - NGC API key for authentication
  • model.engine - Inference engine (tensorrt_llm)
  • model.tensorParallelism - Number of GPUs for tensor parallelism
  • storage.pvc - Persistent volume claim configuration

NIMService resource

The NIMService resource deploys the inference service:
  • image - NIM container image and tag
  • authSecret - NGC API key (required at runtime)
  • storage.nimCache - References the NIMCache for model artifacts
  • replicas - Number of pod replicas
  • resources.limits - GPU resources required
  • expose.service - Service type and port configuration

Next steps

Configure autoscaling

Enable horizontal pod autoscaling for dynamic scaling

Expose via Ingress

Configure Ingress or Gateway API for external access

Multi-model pipelines

Orchestrate RAG pipelines with multiple models

Production deployment

Best practices for production deployments

Troubleshooting

NIMCache job fails

If the caching job fails, check:
  1. NGC credentials are correct
  2. Storage class exists and can provision volumes
  3. Sufficient disk space is available
kubectl describe nimcache meta-llama-3-2-1b-instruct -n nim-service
kubectl logs -n nim-service job/meta-llama-3-2-1b-instruct

NIMService pod not starting

If the pod doesn’t start, verify:
  1. GPU resources are available: kubectl describe nodes
  2. Image pull secrets are configured correctly
  3. NIMCache completed successfully
kubectl describe pod -n nim-service -l app=meta-llama-3-2-1b-instruct

Pod is running but not ready

If the pod is running but not passing readiness checks:
  1. Check the startup probe timeout (default 20 minutes)
  2. Verify the model loaded correctly from cache
  3. Check GPU allocation
kubectl logs -n nim-service -l app=meta-llama-3-2-1b-instruct --tail=100

Clean up

To remove the deployment:
kubectl delete nimservice meta-llama-3-2-1b-instruct -n nim-service
kubectl delete nimcache meta-llama-3-2-1b-instruct -n nim-service
kubectl delete namespace nim-service
Deleting the NIMCache will also delete the persistent volume claim and cached model artifacts.
