The NVIDIA NIM Operator is built using the Kubernetes controller-runtime framework and follows cloud-native patterns for managing AI inference workloads.

Operator components

The operator consists of several key components:

Manager

The main process that orchestrates all controllers and webhooks. It handles leader election, health checks, and metrics serving.

Controllers

Independent reconciliation loops for each custom resource type. Each controller watches its resource and maintains desired state.

Webhooks

Admission webhooks that validate and mutate custom resources before they are persisted to etcd.

Renderer

Template engine that generates Kubernetes manifests from custom resource specifications.

Controller architecture

Each custom resource has a dedicated controller that implements the reconciliation loop:

Available controllers

The operator includes these controllers:
  • NIMServiceReconciler - Manages NIM inference services (cmd/main.go:201)
  • NIMCacheReconciler - Handles model caching jobs (cmd/main.go:192)
  • NIMPipelineReconciler - Orchestrates NIM service pipelines (cmd/main.go:213)
  • NemoCustomizerReconciler - Manages model customization services (cmd/main.go:278)
  • NemoGuardrailReconciler - Controls guardrail services (cmd/main.go:230)
  • NemoEvaluatorReconciler - Handles evaluation services (cmd/main.go:242)
  • NemoDatastoreReconciler - Manages datastore services (cmd/main.go:266)
  • NemoEntitystoreReconciler - Controls entitystore services (cmd/main.go:254)

Reconciliation loop

The reconciliation loop is the core of the operator’s functionality:
1. Event trigger: A change to a custom resource or an owned resource triggers reconciliation.
2. Fetch resource: The controller fetches the latest version of the custom resource from the API server.
3. Handle deletion: If the resource is being deleted, execute cleanup logic and remove finalizers.
4. Validate spec: Validate the resource specification and apply default values.
5. Render manifests: Generate Kubernetes manifests (Deployment, Service, etc.) from templates.
6. Apply resources: Create or update the generated resources in the cluster.
7. Update status: Reflect the current state in the custom resource's status field.
8. Requeue or complete: Schedule future reconciliation if needed, or complete successfully.

Platform abstraction

The operator supports multiple inference platforms through an abstraction layer:
The default platform, standalone, provides direct Kubernetes deployments with:
  • Standard Deployment or LeaderWorkerSet resources
  • Native Kubernetes Services and Ingress
  • Direct GPU resource management
It is implemented in internal/controller/platform/standalone/.
The platform is selected via the inferencePlatform field in NIMService (defaults to "standalone").
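A NIMService selecting the platform explicitly might look like the fragment below. The inferencePlatform field and its "standalone" default come from this document; the apiVersion, kind, and metadata shown are assumptions to make the fragment complete, so verify them against the CRDs of your installed operator version.

```yaml
apiVersion: apps.nvidia.com/v1alpha1   # assumed group/version; check your CRDs
kind: NIMService
metadata:
  name: my-nim                         # illustrative name
spec:
  inferencePlatform: standalone        # the default; omitting it behaves the same
```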

Resource management

The operator manages these types of Kubernetes resources:
  • Deployment - Standard single-node NIM deployments
  • LeaderWorkerSet - Multi-node deployments with MPI coordination
  • StatefulSet - Stateful services requiring stable identities
  • Job - One-time tasks like model caching
  • Service - ClusterIP, LoadBalancer, or NodePort services
  • Ingress - HTTP/HTTPS ingress rules
  • HTTPRoute/GRPCRoute - Gateway API routes
  • PersistentVolumeClaim - Model storage and caching
  • ConfigMap - Configuration data
  • Secret - Sensitive credentials and keys
  • ServiceAccount - Service identity
  • Role/RoleBinding - Permissions for service accounts
  • SecurityContextConstraints - OpenShift security policies
  • HorizontalPodAutoscaler - Automatic scaling based on metrics
  • ServiceMonitor - Prometheus metrics collection

Webhook validation

Admission webhooks validate custom resources before they are created or updated:
// Enabled via ENABLE_WEBHOOKS environment variable
if enableWebhooks {
    SetupNIMCacheWebhookWithManager(mgr)
    SetupNIMServiceWebhookWithManager(mgr)
}
Webhooks enforce:
  • Required field validation
  • Cross-field validation rules (e.g., replicas vs autoscaling)
  • Immutability constraints (e.g., DRA resources)
  • Default value injection

Conditions and status

The operator uses conditions to communicate resource state:
status:
  state: Ready
  availableReplicas: 3
  conditions:
    - type: Ready
      status: "True"
      reason: DeploymentReady
      message: All replicas are ready
  model:
    name: llama-3-8b-instruct
    clusterEndpoint: http://nim-service.default.svc:8000

High availability

The operator supports high availability through:
  • Leader election - Only one active manager instance reconciles resources
  • Lease-based coordination - Using Kubernetes lease resources
  • Fast failover - New leader elected quickly when current leader fails
Leader election is configured in main.go with LeaderElection: true and LeaderElectionID: "a0715c6e.nvidia.com".

Metrics and observability

The operator exposes metrics on port 8080 (configurable). These include controller work-queue depth, reconciliation duration, and error rates for monitoring operator health.
Health checks are available at:
  • /healthz - Liveness probe
  • /readyz - Readiness probe

Filtered caching

To reduce memory usage, the operator uses filtered caching for resources:
// Only cache resources managed by the operator
ls := labels.SelectorFromSet(labels.Set{
    "app.kubernetes.io/managed-by": "k8s-nim-operator",
})
This ensures the operator only watches resources it manages, improving scalability.

Next steps

Custom resources

Explore all available custom resource definitions

Deployment guide

Learn how to deploy the operator
