
Overview

The NIMPipeline custom resource enables orchestration of multiple NVIDIA NIM services with automatic dependency management. It simplifies deploying complex AI workflows like RAG (Retrieval-Augmented Generation) pipelines, guardrail systems, and multi-model inference chains.

What is NIMPipeline?

NIMPipeline provides:
  • Declarative multi-service deployment
  • Automatic service dependency injection via environment variables
  • Coordinated lifecycle management across services
  • Service-level enable/disable toggles
  • Unified status monitoring for all pipeline services

Basic Concept

A NIMPipeline deploys multiple NIMService resources and automatically configures their dependencies. Each service can reference other services in the pipeline, and the operator injects the appropriate endpoint URLs as environment variables.

Basic Example: RAG Pipeline

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: rag-pipeline
  namespace: nim-service
spec:
  services:
    - name: meta-llama3-8b-instruct
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
          tag: 1.3.3
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: meta-llama3-8b-instruct
            profile: ''
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
    
    - name: nv-embedqa-1b-v2
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: nv-embedqa-1b-v2
            profile: ''
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
            grpcPort: 8001
            metricsPort: 8002
    
    - name: nv-rerankqa-1b-v2
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: nv-rerankqa-1b-v2
            profile: ''
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
            grpcPort: 8001
            metricsPort: 8002

Service Configuration

spec.services (array, required)

List of NIM services to deploy as part of the pipeline.

Service Dependencies

Define dependencies between services to automatically inject endpoint URLs as environment variables.

Dependency Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: guardrail-pipeline
  namespace: nim-service
spec:
  services:
    - name: llm-service
      enabled: true
      dependencies:
        - name: content-safety
          port: 8000
          envName: CONTENT_SAFETY_ENDPOINT
        - name: jailbreak-detect
          port: 8000
          envName: JAILBREAK_ENDPOINT
      spec:
        image:
          repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
          tag: 1.3.3
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: llm-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
    
    - name: content-safety
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/content-safety
          tag: latest
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: content-safety-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
    
    - name: jailbreak-detect
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/jailbreak-detection
          tag: latest
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: jailbreak-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000

In this example, the llm-service will have two environment variables injected:
  • CONTENT_SAFETY_ENDPOINT=http://content-safety.nim-service.svc.cluster.local:8000
  • JAILBREAK_ENDPOINT=http://jailbreak-detect.nim-service.svc.cluster.local:8000
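The injected values follow the standard Kubernetes in-cluster service DNS pattern. A minimal sketch of that derivation (the function name is illustrative, not part of the operator):

```python
def endpoint_url(service_name: str, namespace: str, port: int) -> str:
    """Build the in-cluster URL injected for a dependency.

    Follows the Kubernetes service DNS convention:
    http://<service>.<namespace>.svc.cluster.local:<port>
    """
    return f"http://{service_name}.{namespace}.svc.cluster.local:{port}"

# The two dependencies of llm-service from the example above:
print(endpoint_url("content-safety", "nim-service", 8000))
# → http://content-safety.nim-service.svc.cluster.local:8000
print(endpoint_url("jailbreak-detect", "nim-service", 8000))
# → http://jailbreak-detect.nim-service.svc.cluster.local:8000
```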

Advanced Examples

Conditionally Enabled Services

You can selectively enable or disable services in the pipeline:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: flexible-pipeline
  namespace: nim-service
spec:
  services:
    - name: llm-service
      enabled: true
      spec:
        # ... llm configuration
    
    - name: embedding-service
      enabled: true  # Active service
      spec:
        # ... embedding configuration
    
    - name: experimental-service
      enabled: false  # Disabled, won't be deployed
      spec:
        # ... experimental configuration

Multi-Port Service Dependencies

services:
  - name: triton-service
    enabled: true
    spec:
      image:
        repository: nvcr.io/nim/nvidia/triton-nim
        tag: latest
        pullSecrets:
        - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: triton-cache
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
          grpcPort: 8001
          metricsPort: 8002
  
  - name: client-service
    enabled: true
    dependencies:
      - name: triton-service
        port: 8000
        envName: TRITON_HTTP_ENDPOINT
      - name: triton-service
        port: 8001
        envName: TRITON_GRPC_ENDPOINT
    spec:
      # ... client configuration

Custom Dependency Endpoints

services:
  - name: app-service
    enabled: true
    dependencies:
      - name: external-service
        port: 8000
        envName: EXTERNAL_API
        envValue: https://external-api.example.com/v1  # Custom external endpoint
    spec:
      # ... app configuration
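When envValue is set, it takes precedence over the cluster-local URL that would otherwise be derived from the dependency's name and port. A hedged sketch of that resolution logic (the dict keys mirror the dependency fields above; this is an illustration, not the operator's code):

```python
def resolve_dependency(dep: dict, namespace: str) -> tuple[str, str]:
    """Return (env var name, env var value) for one dependency entry.

    An explicit envValue (e.g. an external endpoint) overrides the
    cluster-local URL derived from the service name and port.
    """
    if "envValue" in dep:
        value = dep["envValue"]
    else:
        value = f"http://{dep['name']}.{namespace}.svc.cluster.local:{dep['port']}"
    return dep["envName"], value

# The external-service dependency from the example above:
print(resolve_dependency(
    {"name": "external-service", "port": 8000,
     "envName": "EXTERNAL_API",
     "envValue": "https://external-api.example.com/v1"},
    "nim-service",
))
# → ('EXTERNAL_API', 'https://external-api.example.com/v1')
```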

Pipeline Status

The NIMPipeline status provides aggregate information about all services:
status:
  state: Ready  # Overall pipeline state: NotReady, Ready, or Failed
  states:
    meta-llama3-8b-instruct: Ready
    nv-embedqa-1b-v2: Ready
    nv-rerankqa-1b-v2: Ready
  conditions:
  - type: NIM_PIPELINE_READY
    status: "True"
    lastTransitionTime: "2024-03-03T10:15:30Z"
    reason: AllServicesReady
    message: All pipeline services are ready

Status States

NotReady

One or more services are not ready

Ready

All enabled services are ready and operational

Failed

One or more services have failed
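The overall pipeline state can be thought of as a roll-up of the per-service states shown in status.states. A plausible sketch of that aggregation (the exact precedence is an assumption consistent with the definitions above, not taken from the operator source):

```python
def pipeline_state(service_states: dict[str, str]) -> str:
    """Roll per-service states up into one pipeline state.

    Assumed precedence: any Failed service fails the pipeline;
    otherwise any not-ready service keeps it NotReady; else Ready.
    """
    states = service_states.values()
    if any(s == "Failed" for s in states):
        return "Failed"
    if any(s != "Ready" for s in states):
        return "NotReady"
    return "Ready"

print(pipeline_state({
    "meta-llama3-8b-instruct": "Ready",
    "nv-embedqa-1b-v2": "Ready",
    "nv-rerankqa-1b-v2": "Ready",
}))
# → Ready
```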

Common Pipeline Patterns

RAG (Retrieval-Augmented Generation)

1. Embedding Service: Converts documents and queries into vector embeddings
2. Reranking Service: Reorders retrieved documents based on relevance
3. LLM Service: Generates responses using retrieved context

Guardrail Pipeline

1. Content Safety: Filters harmful or inappropriate content
2. Jailbreak Detection: Detects prompt injection attempts
3. Topic Control: Ensures responses stay on-topic
4. LLM Service: Generates safe, controlled responses

Complete Working Example

Here’s a production-ready RAG pipeline with all necessary components:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: production-rag
  namespace: nim-service
spec:
  services:
    - name: llm
      enabled: true
      dependencies:
        - name: embedding
          port: 8000
          envName: EMBEDDING_ENDPOINT
        - name: reranking
          port: 8000
          envName: RERANKING_ENDPOINT
      spec:
        image:
          repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
          tag: 1.3.3
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: llm-cache
        replicas: 2
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            cpu: "4"
            memory: 16Gi
        expose:
          service:
            type: ClusterIP
            port: 8000
          router:
            hostDomainName: example.com
            ingress:
              ingressClass: nginx
        metrics:
          enabled: true
          serviceMonitor:
            additionalLabels:
              release: prometheus
        scale:
          enabled: true
          hpa:
            minReplicas: 2
            maxReplicas: 5
            metrics:
            - type: Resource
              resource:
                name: cpu
                target:
                  type: Utilization
                  averageUtilization: 70
    
    - name: embedding
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: embedding-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            cpu: "2"
            memory: 8Gi
        expose:
          service:
            type: ClusterIP
            port: 8000
            grpcPort: 8001
    
    - name: reranking
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: reranking-cache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            cpu: "2"
            memory: 8Gi
        expose:
          service:
            type: ClusterIP
            port: 8000

Deployment Workflow

1. Create NIMCache Resources

Pre-cache all models required by the pipeline services:
kubectl apply -f nimcaches.yaml
kubectl wait --for=condition=ready nimcache --all -n nim-service --timeout=30m

2. Deploy NIMPipeline

Apply the pipeline definition:
kubectl apply -f pipeline.yaml

3. Monitor Pipeline Status

Watch the pipeline until all services are ready:
kubectl get nimpipeline -n nim-service -w
kubectl get nimservice -n nim-service

4. Verify Service Endpoints

Check that dependencies are properly injected:
kubectl get pods -n nim-service -l app=llm
kubectl exec -it <llm-pod> -n nim-service -- env | grep ENDPOINT

Best Practices

Cache Models First

Always create and verify NIMCache resources before deploying the pipeline to ensure fast service startup.

Use Descriptive Names

Give services clear, descriptive names that reflect their purpose in the pipeline (e.g., llm, embedding, reranking).

Configure Health Checks

Customize readiness and startup probes for each service based on model size and initialization time.

Plan Resource Allocation

Allocate GPU and memory resources appropriately for each service. Embedding and reranking models typically require fewer resources than LLMs.

Enable Monitoring

Configure metrics and ServiceMonitor for all services to track performance and identify bottlenecks.

Use Selective Enablement

Use the enabled field to quickly enable/disable services during development and testing.

Troubleshooting

Pipeline Not Ready

Check individual service statuses:
kubectl get nimpipeline production-rag -n nim-service -o json | jq '.status.states'
kubectl get nimservice -n nim-service

Dependency Injection Not Working

Verify the environment variables in the pod:
kubectl exec -it <pod-name> -n nim-service -- env | grep -i endpoint

Service Communication Errors

Check service DNS resolution:
kubectl run -it --rm debug --image=busybox --restart=Never -n nim-service -- nslookup embedding.nim-service.svc.cluster.local
