The NVIDIA NIM Operator is built using the Kubernetes controller-runtime framework and follows cloud-native patterns for managing AI inference workloads.

Operator components

The operator consists of several key components:

Manager

The main process that orchestrates all controllers and webhooks. It handles leader election, health checks, and metrics serving.

Controllers

Independent reconciliation loops for each custom resource type. Each controller watches its resource and maintains desired state.

Webhooks

Admission webhooks that validate and mutate custom resources before they are persisted to etcd.

Renderer

Template engine that generates Kubernetes manifests from custom resource specifications.

Controller architecture

Each custom resource has a dedicated controller that implements the reconciliation loop:

Available controllers

The operator includes these controllers:
  • NIMServiceReconciler - Manages NIM inference services (cmd/main.go:201)
  • NIMCacheReconciler - Handles model caching jobs (cmd/main.go:192)
  • NIMPipelineReconciler - Orchestrates NIM service pipelines (cmd/main.go:213)
  • NemoCustomizerReconciler - Manages model customization services (cmd/main.go:278)
  • NemoGuardrailReconciler - Controls guardrail services (cmd/main.go:230)
  • NemoEvaluatorReconciler - Handles evaluation services (cmd/main.go:242)
  • NemoDatastoreReconciler - Manages datastore services (cmd/main.go:266)
  • NemoEntitystoreReconciler - Controls entitystore services (cmd/main.go:254)

Reconciliation loop

The reconciliation loop is the core of the operator’s functionality:
1. Event trigger: A change to a custom resource or an owned resource triggers reconciliation.
2. Fetch resource: The controller fetches the latest version of the custom resource from the API server.
3. Handle deletion: If the resource is being deleted, execute cleanup logic and remove finalizers.
4. Validate spec: Validate the resource specification and apply default values.
5. Render manifests: Generate Kubernetes manifests (Deployment, Service, etc.) from templates.
6. Apply resources: Create or update the generated resources in the cluster.
7. Update status: Reflect the current state in the custom resource's status field.
8. Requeue or complete: Schedule future reconciliation if needed, or complete successfully.

Platform abstraction

The operator supports multiple inference platforms through an abstraction layer:
The default platform, standalone, provides direct Kubernetes deployments with:
  • Standard Deployment or LeaderWorkerSet resources
  • Native Kubernetes Services and Ingress
  • Direct GPU resource management
It is implemented in internal/controller/platform/standalone/.
The platform is selected via the inferencePlatform field in NIMService (defaults to "standalone").
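A NIMService selecting the platform explicitly might look like the fragment below. The inferencePlatform field and its "standalone" default come from this document; the apiVersion, kind, and metadata shown are assumptions to make the fragment complete, so verify them against the CRDs of your installed operator version.

```yaml
apiVersion: apps.nvidia.com/v1alpha1   # assumed group/version; check your CRDs
kind: NIMService
metadata:
  name: my-nim                         # illustrative name
spec:
  inferencePlatform: standalone        # the default; omitting it behaves the same
```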

Resource management

The operator manages these types of Kubernetes resources:
  • Deployment - Standard single-node NIM deployments
  • LeaderWorkerSet - Multi-node deployments with MPI coordination
  • StatefulSet - Stateful services requiring stable identities
  • Job - One-time tasks like model caching
  • Service - ClusterIP, LoadBalancer, or NodePort services
  • Ingress - HTTP/HTTPS ingress rules
  • HTTPRoute/GRPCRoute - Gateway API routes
  • PersistentVolumeClaim - Model storage and caching
  • ConfigMap - Configuration data
  • Secret - Sensitive credentials and keys
  • ServiceAccount - Service identity
  • Role/RoleBinding - Permissions for service accounts
  • SecurityContextConstraints - OpenShift security policies
  • HorizontalPodAutoscaler - Automatic scaling based on metrics
  • ServiceMonitor - Prometheus metrics collection

Webhook validation

Admission webhooks validate custom resources before they are created or updated:
// Enabled via ENABLE_WEBHOOKS environment variable
if enableWebhooks {
    SetupNIMCacheWebhookWithManager(mgr)
    SetupNIMServiceWebhookWithManager(mgr)
}
Webhooks enforce:
  • Required field validation
  • Cross-field validation rules (e.g., replicas vs autoscaling)
  • Immutability constraints (e.g., DRA resources)
  • Default value injection

Conditions and status

The operator uses conditions to communicate resource state:
status:
  state: Ready
  availableReplicas: 3
  conditions:
    - type: Ready
      status: "True"
      reason: DeploymentReady
      message: All replicas are ready
  model:
    name: llama-3-8b-instruct
    clusterEndpoint: http://nim-service.default.svc:8000

High availability

The operator supports high availability through:
  • Leader election - Only one active manager instance reconciles resources
  • Lease-based coordination - Using Kubernetes lease resources
  • Fast failover - New leader elected quickly when current leader fails
Leader election is configured in main.go with LeaderElection: true and LeaderElectionID: "a0715c6e.nvidia.com".

Metrics and observability

The operator exposes metrics on port 8080 (configurable). These include controller work-queue depth, reconciliation duration, and error rates for monitoring operator health.
Health checks are available at:
  • /healthz - Liveness probe
  • /readyz - Readiness probe

Filtered caching

To reduce memory usage, the operator uses filtered caching for resources:
// Only cache resources managed by the operator
ls := labels.SelectorFromSet(labels.Set{
    "app.kubernetes.io/managed-by": "k8s-nim-operator",
})
This ensures the operator only watches resources it manages, improving scalability.

Next steps

Custom resources

Explore all available custom resource definitions

Deployment guide

Learn how to deploy the operator
