
Overview

The NIMPipeline custom resource enables orchestration of multiple NVIDIA NIM services with automatic dependency management. It simplifies deploying complex AI workflows like RAG (Retrieval-Augmented Generation) pipelines, guardrail systems, and multi-model inference chains.

What is NIMPipeline?

NIMPipeline provides:
  • Declarative multi-service deployment
  • Automatic service dependency injection via environment variables
  • Coordinated lifecycle management across services
  • Service-level enable/disable toggles
  • Unified status monitoring for all pipeline services

Basic Concept

A NIMPipeline deploys multiple NIMService resources and automatically configures their dependencies. Each service can reference other services in the pipeline, and the operator injects the appropriate endpoint URLs as environment variables.

Basic Example: RAG Pipeline

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: rag-pipeline
  namespace: nim-service
spec:
  services:
    - name: meta-llama3-8b-instruct
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
          tag: 1.3.3
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: meta-llama3-8b-instruct
            profile: ''
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
    
    - name: nv-embedqa-1b-v2
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: nv-embedqa-1b-v2
            profile: ''
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
            grpcPort: 8001
            metricsPort: 8002
    
    - name: nv-rerankqa-1b-v2
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: nv-rerankqa-1b-v2
            profile: ''
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
            grpcPort: 8001
            metricsPort: 8002

Service Configuration

spec.services (array, required)

List of NIM services to deploy as part of the pipeline.

Service Dependencies

Define dependencies between services to automatically inject endpoint URLs as environment variables.

Dependency Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: guardrail-pipeline
  namespace: nim-service
spec:
  services:
    - name: llm-service
      enabled: true
      dependencies:
        - name: content-safety
          port: 8000
          envName: CONTENT_SAFETY_ENDPOINT
        - name: jailbreak-detect
          port: 8000
          envName: JAILBREAK_ENDPOINT
      spec:
        image:
          repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
          tag: 1.3.3
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: llm-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
    
    - name: content-safety
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/content-safety
          tag: latest
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: content-safety-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
    
    - name: jailbreak-detect
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/jailbreak-detection
          tag: latest
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: jailbreak-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000

In this example, the llm-service will have two environment variables injected:
  • CONTENT_SAFETY_ENDPOINT=http://content-safety.nim-service.svc.cluster.local:8000
  • JAILBREAK_ENDPOINT=http://jailbreak-detect.nim-service.svc.cluster.local:8000
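The injected values follow the standard Kubernetes in-cluster service DNS pattern. A minimal sketch of that derivation (the function name is illustrative, not part of the operator):

```python
def endpoint_url(service_name: str, namespace: str, port: int) -> str:
    """Build the in-cluster URL injected for a dependency.

    Follows the Kubernetes service DNS convention:
    http://<service>.<namespace>.svc.cluster.local:<port>
    """
    return f"http://{service_name}.{namespace}.svc.cluster.local:{port}"

# The two dependencies of llm-service from the example above:
print(endpoint_url("content-safety", "nim-service", 8000))
# → http://content-safety.nim-service.svc.cluster.local:8000
print(endpoint_url("jailbreak-detect", "nim-service", 8000))
# → http://jailbreak-detect.nim-service.svc.cluster.local:8000
```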

Advanced Examples

Conditionally Enabled Services

You can selectively enable or disable services in the pipeline:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: flexible-pipeline
  namespace: nim-service
spec:
  services:
    - name: llm-service
      enabled: true
      spec:
        # ... llm configuration
    
    - name: embedding-service
      enabled: true  # Active service
      spec:
        # ... embedding configuration
    
    - name: experimental-service
      enabled: false  # Disabled, won't be deployed
      spec:
        # ... experimental configuration

Multi-Port Service Dependencies

services:
  - name: triton-service
    enabled: true
    spec:
      image:
        repository: nvcr.io/nim/nvidia/triton-nim
        tag: latest
        pullSecrets:
        - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: triton-cache
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
          grpcPort: 8001
          metricsPort: 8002
  
  - name: client-service
    enabled: true
    dependencies:
      - name: triton-service
        port: 8000
        envName: TRITON_HTTP_ENDPOINT
      - name: triton-service
        port: 8001
        envName: TRITON_GRPC_ENDPOINT
    spec:
      # ... client configuration

Custom Dependency Endpoints

services:
  - name: app-service
    enabled: true
    dependencies:
      - name: external-service
        port: 8000
        envName: EXTERNAL_API
        envValue: https://external-api.example.com/v1  # Custom external endpoint
    spec:
      # ... app configuration
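When envValue is set, it takes precedence over the cluster-local URL that would otherwise be derived from the dependency's name and port. A hedged sketch of that resolution logic (the dict keys mirror the dependency fields above; this is an illustration, not the operator's code):

```python
def resolve_dependency(dep: dict, namespace: str) -> tuple[str, str]:
    """Return (env var name, env var value) for one dependency entry.

    An explicit envValue (e.g. an external endpoint) overrides the
    cluster-local URL derived from the service name and port.
    """
    if "envValue" in dep:
        value = dep["envValue"]
    else:
        value = f"http://{dep['name']}.{namespace}.svc.cluster.local:{dep['port']}"
    return dep["envName"], value

# The external-service dependency from the example above:
print(resolve_dependency(
    {"name": "external-service", "port": 8000,
     "envName": "EXTERNAL_API",
     "envValue": "https://external-api.example.com/v1"},
    "nim-service",
))
# → ('EXTERNAL_API', 'https://external-api.example.com/v1')
```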

Pipeline Status

The NIMPipeline status provides aggregate information about all services:
status:
  state: Ready  # Overall pipeline state: NotReady, Ready, or Failed
  states:
    meta-llama3-8b-instruct: Ready
    nv-embedqa-1b-v2: Ready
    nv-rerankqa-1b-v2: Ready
  conditions:
  - type: NIM_PIPELINE_READY
    status: "True"
    lastTransitionTime: "2024-03-03T10:15:30Z"
    reason: AllServicesReady
    message: All pipeline services are ready

Status States

NotReady

One or more services are not ready

Ready

All enabled services are ready and operational

Failed

One or more services have failed
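The overall pipeline state can be thought of as a roll-up of the per-service states shown in status.states. A plausible sketch of that aggregation (the exact precedence is an assumption consistent with the definitions above, not taken from the operator source):

```python
def pipeline_state(service_states: dict[str, str]) -> str:
    """Roll per-service states up into one pipeline state.

    Assumed precedence: any Failed service fails the pipeline;
    otherwise any not-ready service keeps it NotReady; else Ready.
    """
    states = service_states.values()
    if any(s == "Failed" for s in states):
        return "Failed"
    if any(s != "Ready" for s in states):
        return "NotReady"
    return "Ready"

print(pipeline_state({
    "meta-llama3-8b-instruct": "Ready",
    "nv-embedqa-1b-v2": "Ready",
    "nv-rerankqa-1b-v2": "Ready",
}))
# → Ready
```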

Common Pipeline Patterns

RAG (Retrieval-Augmented Generation)

1. Embedding Service: Converts documents and queries into vector embeddings
2. Reranking Service: Reorders retrieved documents based on relevance
3. LLM Service: Generates responses using retrieved context

Guardrail Pipeline

1. Content Safety: Filters harmful or inappropriate content
2. Jailbreak Detection: Detects prompt injection attempts
3. Topic Control: Ensures responses stay on-topic
4. LLM Service: Generates safe, controlled responses

Complete Working Example

Here’s a production-ready RAG pipeline with all necessary components:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: production-rag
  namespace: nim-service
spec:
  services:
    - name: llm
      enabled: true
      dependencies:
        - name: embedding
          port: 8000
          envName: EMBEDDING_ENDPOINT
        - name: reranking
          port: 8000
          envName: RERANKING_ENDPOINT
      spec:
        image:
          repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
          tag: 1.3.3
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: llm-cache
        replicas: 2
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            cpu: "4"
            memory: 16Gi
        expose:
          service:
            type: ClusterIP
            port: 8000
          router:
            hostDomainName: example.com
            ingress:
              ingressClass: nginx
        metrics:
          enabled: true
          serviceMonitor:
            additionalLabels:
              release: prometheus
        scale:
          enabled: true
          hpa:
            minReplicas: 2
            maxReplicas: 5
            metrics:
            - type: Resource
              resource:
                name: cpu
                target:
                  type: Utilization
                  averageUtilization: 70
    
    - name: embedding
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: embedding-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            cpu: "2"
            memory: 8Gi
        expose:
          service:
            type: ClusterIP
            port: 8000
            grpcPort: 8001
    
    - name: reranking
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: reranking-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            cpu: "2"
            memory: 8Gi
        expose:
          service:
            type: ClusterIP
            port: 8000

Deployment Workflow

1. Create NIMCache Resources

Pre-cache all models required by the pipeline services:
kubectl apply -f nimcaches.yaml
kubectl wait --for=condition=ready nimcache --all -n nim-service --timeout=30m

2. Deploy NIMPipeline

Apply the pipeline definition:
kubectl apply -f pipeline.yaml

3. Monitor Pipeline Status

Watch the pipeline until all services are ready:
kubectl get nimpipeline -n nim-service -w
kubectl get nimservice -n nim-service

4. Verify Service Endpoints

Check that dependencies are properly injected:
kubectl get pods -n nim-service -l app=llm
kubectl exec -it <llm-pod> -n nim-service -- env | grep ENDPOINT

Best Practices

Cache Models First

Always create and verify NIMCache resources before deploying the pipeline to ensure fast service startup.

Use Descriptive Names

Give services clear, descriptive names that reflect their purpose in the pipeline (e.g., llm, embedding, reranking).

Configure Health Checks

Customize readiness and startup probes for each service based on model size and initialization time.

Plan Resource Allocation

Allocate GPU and memory resources appropriately for each service. Embedding and reranking models typically require fewer resources than LLMs.

Enable Monitoring

Configure metrics and ServiceMonitor for all services to track performance and identify bottlenecks.

Use Selective Enablement

Use the enabled field to quickly enable/disable services during development and testing.

Troubleshooting

Pipeline Not Ready

Check individual service statuses:
kubectl get nimpipeline production-rag -n nim-service -o json | jq '.status.states'
kubectl get nimservice -n nim-service

Dependency Injection Not Working

Verify the environment variables in the pod:
kubectl exec -it <pod-name> -n nim-service -- env | grep -i endpoint

Service Communication Errors

Check service DNS resolution:
kubectl run -it --rm debug --image=busybox --restart=Never -n nim-service -- nslookup embedding.nim-service.svc.cluster.local
