The NVIDIA NIM Operator is a Kubernetes operator that automates the deployment and management of NVIDIA Inference Microservices (NIMs) and NVIDIA NeMo microservices. It extends Kubernetes with custom resources and controllers that handle the full lifecycle of AI inference workloads.

What is the operator pattern?

Kubernetes operators are software extensions that use custom resources to manage applications and their components. They follow the controller pattern:
  1. Observe - Watch for changes to custom resources
  2. Analyze - Compare current state with desired state
  3. Act - Take actions to reconcile differences
The NIM Operator implements this pattern to automate complex deployment and management tasks for AI inference services.

NIM and NeMo microservices

The operator manages two primary families of microservices:

NVIDIA Inference Microservices (NIMs)

NIMs are optimized containers for serving AI models with:
  • Pre-built inference engines (TensorRT-LLM, vLLM)
  • Optimized model profiles for specific GPU configurations
  • OpenAI-compatible APIs
  • Built-in model caching and optimization
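As a sketch of how these optimizations surface in practice, a model-source fragment of a caching resource might pin an inference engine and a GPU-specific profile. The field names below are illustrative assumptions, not the authoritative schema; consult the operator's CRD reference:

```yaml
# Illustrative fragment only: field names are assumptions.
spec:
  source:
    ngc:
      model:
        engine: tensorrt_llm      # pre-built inference engine (TensorRT-LLM)
        precision: fp16           # assumed knob selecting an optimized profile
        gpus:
          - product: "a100"       # profile matched to a specific GPU configuration
```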

NVIDIA NeMo Microservices

NeMo microservices provide additional AI capabilities:
  • Customizer - Fine-tune and customize foundation models
  • Guardrails - Apply safety controls and content filtering
  • Evaluator - Evaluate model performance and quality
  • Datastore - Manage datasets and model artifacts
  • Entitystore - Store and retrieve entity information
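Each NeMo microservice is typically driven by its own custom resource. A hedged sketch, where the kind, image path, and port are assumptions based on the operator's naming pattern rather than the exact schema:

```yaml
apiVersion: apps.nvidia.com/v1alpha1      # assumed API group/version
kind: NemoGuardrails                      # one CR kind per NeMo microservice
metadata:
  name: guardrails-sample
spec:
  image:
    repository: nvcr.io/nvidia/nemo-microservices/guardrails  # illustrative path
    tag: "latest"
  expose:
    service:
      port: 7331                          # illustrative port
```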

Operator responsibilities

The NIM Operator handles:

Resource lifecycle

Create, update, and delete Kubernetes resources (Deployments, Services, ConfigMaps, etc.) based on custom resource specifications

Model caching

Automate the download and caching of AI models from NVIDIA NGC or other sources to persistent storage
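A NIMCache custom resource is the typical vehicle for this. The sketch below assumes NGC as the model source and a dynamically provisioned PVC; secret names are placeholders and exact fields may differ by operator version:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.0  # NIM image that pulls the model
      pullSecret: ngc-secret          # image pull secret (assumed name)
      authSecret: ngc-api-secret      # NGC API key secret (assumed name)
  storage:
    pvc:
      create: true                    # let the operator provision the volume
      storageClass: standard          # cluster-specific
      size: 50Gi
```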

Platform integration

Support multiple inference platforms, including standalone deployments and KServe integration

Scaling and monitoring

Configure autoscaling, health checks, and metrics collection for inference services
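Autoscaling is usually configured inline on the service resource. A hedged fragment assuming an HPA-style `scale` block (field names may vary by operator version):

```yaml
# Illustrative NIMService fragment; the `scale` field names are assumptions.
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80   # scale out above 80% average CPU
```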

Storage management

Handle persistent volumes, NIMCache volumes, and storage configurations for models

Network exposure

Set up Services, Ingress, and Gateway routes for accessing inference endpoints
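Exposure is typically declared on the service resource as well. The fragment below is illustrative; the exact `expose` schema may differ by operator version, and the operator is assumed to wire the Ingress backend to the generated Service:

```yaml
# Illustrative fragment; field names are assumptions.
spec:
  expose:
    service:
      type: ClusterIP
      port: 8000                      # OpenAI-compatible HTTP endpoint
    ingress:
      enabled: true
      spec:
        ingressClassName: nginx       # cluster-specific
        rules:
          - host: nim.example.com     # placeholder hostname
```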

Key benefits

  • Deploy complex AI inference stacks with simple YAML manifests; the operator handles the underlying Kubernetes resources automatically.
  • Apply the same operational patterns across different types of AI services, with consistent deployment, scaling, and monitoring.
  • Choose between standalone deployments or KServe integration, switching platforms by changing a single field in the custom resource.
  • Get sensible defaults for health checks, resource limits, security contexts, and other production requirements.
  • Deploy models requiring multiple GPUs across nodes using LeaderWorkerSet for tensor and pipeline parallelism.
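The single-field platform switch mentioned above might look like the following; the `inferencePlatform` field name is an assumption and should be checked against the NIMService CRD reference:

```yaml
# Illustrative: moving the same NIMService from standalone to KServe.
spec:
  inferencePlatform: kserve   # assumed field; "standalone" would use plain Deployments
```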

How it works

When you create a NIMService or NeMo microservice custom resource:
  1. The operator’s controller detects the new resource
  2. It validates the specification and sets default values
  3. It creates necessary Kubernetes resources (Deployments, Services, etc.)
  4. It monitors the deployment status and updates the custom resource status
  5. It continuously watches for changes and reconciles state
The operator uses a declarative approach: you describe the desired state in custom resources, and the operator ensures the cluster matches that state.
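Putting the pieces together, a minimal NIMService might look like the sketch below, which reuses a pre-populated NIMCache for model storage. Resource names and secret names are illustrative, and some field details may differ by operator version:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: "1.0.0"
    pullSecrets:
      - ngc-secret                      # assumed secret name
  authSecret: ngc-api-secret            # assumed secret name
  storage:
    nimCache:
      name: meta-llama3-8b-instruct     # reuse a pre-populated NIMCache
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
```

Applying a manifest like this with `kubectl apply -f` triggers the reconcile loop described above: the operator creates the Deployment, Service, and related resources, then reports progress in the custom resource's `status` field.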

Next steps

Architecture

Learn about the operator’s internal architecture and components

Custom resources

Explore all available custom resource definitions (CRDs)
