This guide covers performance tuning for both the Redis Operator controller and managed Redis clusters.

Operator Performance

Supported Scale

The operator has been benchmarked to support up to 100 clusters per operator instance (at 3 pods per cluster, 300 pods total).

Benchmark Results

Based on benchmarks at internal/controller/cluster/reconciler_test.go:
Clusters   Reconcile Latency   Throughput           API Calls/sec   Memory Usage
10         55.40 ms            180.5 reconciles/s   5,055           17.96 MB/op
50         751.13 ms           66.57 reconciles/s   1,864           360.89 MB/op
100        2,659.08 ms         37.61 reconciles/s   1,053           1.36 GB/op
Benchmark environment: Linux amd64 VM (4 vCPUs, 15 GiB RAM, AMD EPYC 7B13) running Go 1.25.1 on kind v0.29.0.
Set operator resource requests/limits based on managed cluster count:
Managed Clusters   CPU Request   CPU Limit   Memory Request   Memory Limit
Up to 10           100m          500m        128Mi            256Mi
Up to 50           250m          1000m       512Mi            1Gi
Up to 100          500m          2000m       1Gi              2Gi
Configure in Helm values:
values.yaml
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi
Chart defaults (100m/128Mi request, 500m/256Mi limit) are appropriate for small deployments with up to 10 clusters.

Concurrency Tuning

The operator uses a work queue with configurable concurrency:
values.yaml
extraArgs:
  - --max-concurrent-reconciles=5  # Default: 5
Tuning guidance:
  • 5 (default) - Good for most deployments up to 50 clusters
  • 10-15 - For 50-100 clusters with fast Kubernetes API server
  • 20+ - Only for dedicated operator instances with powerful nodes
Increasing concurrency beyond 15 may cause excessive API server load and GC pressure. Monitor CPU and memory usage when tuning.

Performance Profiling

Enable pprof for profiling:
values.yaml
extraArgs:
  - --pprof-bind-address=:6060
Access profiling endpoints:
kubectl port-forward -n redis-system deploy/redis-operator 6060:6060

# CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof cpu.prof

# Memory profile
curl http://localhost:6060/debug/pprof/heap > mem.prof
go tool pprof mem.prof

# Goroutine profile
curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof
go tool pprof goroutine.prof

Leader Election

The operator uses leader election for high availability:
values.yaml
extraArgs:
  - --leader-elect=true  # Default: true
  - --leader-elect-lease-duration=15s
  - --leader-elect-renew-deadline=10s
  - --leader-elect-retry-period=2s
Production recommendations:
  • Keep leader-elect=true for HA deployments
  • Use default timing values unless latency is critical
  • Run 2-3 operator replicas for redundancy

Monitoring Operator Performance

Watch reconciliation metrics:
# Reconciliation duration (p95)
histogram_quantile(0.95, rate(redis_reconcile_duration_seconds_bucket[5m]))

# Reconciliation rate
rate(redis_reconcile_duration_seconds_count[5m])

# Work queue depth
workqueue_depth{name="rediscluster"}

# Work queue latency
workqueue_queue_duration_seconds{name="rediscluster"}

Redis Instance Performance

CPU and Memory

Size Redis pods based on your workload:
spec:
  resources:
    requests:
      cpu: 1000m      # 1 CPU core
      memory: 2Gi     # 2 GB RAM
    limits:
      cpu: 2000m      # 2 CPU cores  
      memory: 4Gi     # 4 GB RAM
Sizing guidelines:
Workload            CPU      Memory   Notes
Development         100m     256Mi    Minimal, not for load testing
Light production    500m     1Gi      10k ops/sec, 1M keys
Medium production   1000m    2-4Gi    50k ops/sec, 10M keys
Heavy production    2000m+   8-16Gi   >50k ops/sec, >10M keys
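The production tiers above can be encoded as a simple threshold lookup. This is an illustrative sketch mirroring the table (the `recommend_tier` helper and its thresholds are not part of the operator; the Development tier is excluded because it is not meant for load):

```python
# Tiers mirror the sizing table above; names and thresholds are illustrative.
TIERS = [
    (10_000, "Light production"),   # up to 10k ops/sec
    (50_000, "Medium production"),  # up to 50k ops/sec
]

def recommend_tier(ops_per_sec: int) -> str:
    """Pick a sizing tier from expected throughput (hypothetical helper)."""
    for threshold, name in TIERS:
        if ops_per_sec <= threshold:
            return name
    return "Heavy production"

print(recommend_tier(30_000))  # Medium production
```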
Redis is single-threaded for command processing. CPU limits above 2000m (2 cores) only benefit background tasks (BGSAVE, replication, etc.).

Memory Management

Configure Redis maxmemory and eviction policy:
spec:
  redis:
    maxmemory: "1gb"
    maxmemory-policy: "allkeys-lru"
Eviction policies:
  • noeviction - Return errors when memory limit is reached (Redis default)
  • allkeys-lru - Evict least recently used keys (recommended for caching)
  • volatile-lru - Evict least recently used keys with TTL set
  • allkeys-lfu - Evict least frequently used keys (Redis 4.0+)
  • volatile-ttl - Evict keys with shortest TTL
Set maxmemory to 75-80% of container memory limit to leave room for replication buffers, COW during BGSAVE, and overhead.
Example: For a pod with memory: 4Gi limit:
spec:
  resources:
    limits:
      memory: 4Gi
  redis:
    maxmemory: "3gb"  # 75% of 4Gi
    maxmemory-policy: "allkeys-lru"
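The 75-80% rule can be sketched as a small calculation. This helper is hypothetical (not part of the operator) and only handles the Mi/Gi quantities used in this guide:

```python
# Compute a Redis maxmemory value as a fraction of a Kubernetes memory limit.
# Hypothetical helper for illustration; only supports Mi/Gi suffixes.

UNITS = {"Mi": 1024**2, "Gi": 1024**3}

def maxmemory_for_limit(limit: str, fraction: float = 0.75) -> str:
    """Return a Redis size string (in mb) for a container limit like '4Gi'."""
    for suffix, factor in UNITS.items():
        if limit.endswith(suffix):
            total_bytes = int(limit[: -len(suffix)]) * factor
            break
    else:
        raise ValueError(f"unsupported quantity: {limit}")
    # Redis accepts sizes like "3072mb"; round down to whole megabytes.
    return f"{int(total_bytes * fraction) // 1024**2}mb"

print(maxmemory_for_limit("4Gi"))  # 3072mb, i.e. 75% of 4Gi
```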

Persistence Tuning

RDB Snapshots

Configure snapshot frequency:
spec:
  redis:
    save: "900 1 300 10 60 10000"  # Default
    # Format: "<seconds> <changes> <seconds> <changes> ..."
    # Save if: >=1 change in 900s, OR >=10 changes in 300s, OR >=10000 changes in 60s
Performance impact:
  • Low frequency (save: "3600 1") - Less disk I/O, more data loss risk
  • High frequency (save: "60 1") - More disk I/O, less data loss risk
  • Disabled (save: "") - No snapshots, fastest (not recommended for production)
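The save directive is a flat list of (seconds, changes) trigger pairs; a snapshot runs when any pair is satisfied. A quick sketch of how the string is interpreted (illustrative code, not taken from Redis itself):

```python
def parse_save(directive: str) -> list[tuple[int, int]]:
    """Split a Redis 'save' directive into (seconds, changes) trigger pairs.

    An empty string means RDB snapshots are disabled.
    """
    fields = directive.split()
    if len(fields) % 2 != 0:
        raise ValueError("save directive must contain seconds/changes pairs")
    nums = [int(f) for f in fields]
    return list(zip(nums[::2], nums[1::2]))

# The default from the example above.
print(parse_save("900 1 300 10 60 10000"))  # [(900, 1), (300, 10), (60, 10000)]
```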
RDB snapshots use COW (copy-on-write). During BGSAVE, memory usage can temporarily double if dataset has high write rate.

AOF (Append-Only File)

Enable AOF for better durability:
spec:
  redis:
    appendonly: "yes"
    appendfsync: "everysec"  # Options: always, everysec, no
fsync modes:
  • always - Fsync after every write (safest, slowest)
  • everysec - Fsync once per second (default, good balance)
  • no - Never fsync, let OS decide (fastest, least safe)
AOF can cause write performance degradation. Use appendfsync: everysec for a balance between durability and performance.

Storage Performance

Use fast storage for Redis data volumes:
Storage Type        IOPS    Throughput   Latency   Use Case
AWS EBS gp3         16k+    1000 MB/s    1ms       Recommended
AWS EBS io2         64k+    4000 MB/s    0.25ms    Mission-critical
GCP PD SSD          30k+    1200 MB/s    1ms       Recommended
Azure Premium SSD   20k+    900 MB/s     1ms       Recommended
Local NVMe          100k+   3000+ MB/s   0.1ms     Best performance
Configure storage class:
spec:
  storage:
    size: 100Gi
    storageClassName: fast-ssd  # Use SSD-backed storage
Local NVMe offers the best performance but lacks data durability across node failures. Use with frequent backups.

Network Performance

Redis is network-intensive for replication and client connections.

Replication Optimization

Configure replication buffer sizes:
spec:
  redis:
    repl-backlog-size: "256mb"  # Default: 1mb (too small!)
    client-output-buffer-limit: "replica 512mb 128mb 60"
Buffer sizing:
  • repl-backlog-size should cover 5-10 minutes of writes during peak traffic
  • Calculate: (avg_write_rate_MB/s) * 600 seconds
  • Example: 1 MB/s write rate → 600 MB backlog
Insufficient replication backlog causes full resyncs, which are expensive (RDB generation + network transfer).
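The backlog calculation above can be written out directly. The `repl_backlog_mb` helper is hypothetical, shown only to make the arithmetic concrete:

```python
def repl_backlog_mb(avg_write_rate_mb_s: float, coverage_seconds: int = 600) -> int:
    """Size repl-backlog-size to cover a window of peak writes.

    coverage_seconds defaults to 600 (10 minutes), per the 5-10 minute guidance.
    """
    return int(avg_write_rate_mb_s * coverage_seconds)

print(repl_backlog_mb(1.0))  # 1 MB/s write rate over 10 minutes -> 600 (MB)
```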

Connection Limits

spec:
  redis:
    maxclients: "10000"  # Default: 10000
Increase for workloads with many concurrent client connections.

Synchronous Replication

Reduce data loss risk with synchronous replication:
spec:
  minSyncReplicas: 1  # Require at least 1 replica to ACK writes
  maxSyncReplicas: 2  # Use up to 2 replicas for sync replication
Trade-offs:
  • Higher minSyncReplicas - Better durability, higher write latency
  • Lower minSyncReplicas - Lower write latency, more data loss risk on failover
Setting minSyncReplicas: N means writes block until N replicas acknowledge. This increases write latency by 1-5ms per replica.

Redis Configuration Best Practices

cluster.yaml
apiVersion: redis.io/v1
kind: RedisCluster
metadata:
  name: production-cluster
spec:
  instances: 3
  imageName: redis:7.2
  
  storage:
    size: 100Gi
    storageClassName: fast-ssd
  
  resources:
    requests:
      cpu: 1000m
      memory: 4Gi
    limits:
      cpu: 2000m
      memory: 8Gi
  
  redis:
    # Memory
    maxmemory: "6gb"  # 75% of 8Gi limit
    maxmemory-policy: "allkeys-lru"
    
    # Persistence
    save: "900 1 300 10 60 10000"
    appendonly: "yes"
    appendfsync: "everysec"
    
    # Replication
    repl-backlog-size: "512mb"
    repl-diskless-sync: "yes"
    repl-diskless-sync-delay: "5"
    
    # Networking
    tcp-backlog: "511"
    tcp-keepalive: "300"
    timeout: "0"
    
    # Performance
    lazyfree-lazy-eviction: "yes"
    lazyfree-lazy-expire: "yes"
    lazyfree-lazy-server-del: "yes"
    replica-lazy-flush: "yes"
  
  minSyncReplicas: 1
  maxSyncReplicas: 1
  
  enablePodDisruptionBudget: true

Lazy Freeing

Enable lazy freeing to avoid blocking on expensive DEL operations:
spec:
  redis:
    lazyfree-lazy-eviction: "yes"    # Async eviction
    lazyfree-lazy-expire: "yes"      # Async key expiration
    lazyfree-lazy-server-del: "yes"  # Async implicit deletes (e.g. during RENAME)
    replica-lazy-flush: "yes"        # Async dataset flush during replica full resync
This moves expensive memory deallocation to background threads (Redis 4.0+).

Diskless Replication

Use diskless replication for faster replica syncs:
spec:
  redis:
    repl-diskless-sync: "yes"
    repl-diskless-sync-delay: "5"  # Wait 5s for more replicas before starting sync
Benefits:
  • Skips intermediate RDB file write to disk
  • Streams snapshot directly to replicas over network
  • Faster sync, lower disk I/O
Drawbacks:
  • Cannot reuse snapshot for multiple replicas simultaneously
  • Delay before sync starts (waiting for replicas)
Diskless replication is recommended when network is faster than disk or when managing many replicas.

Performance Monitoring

Key Metrics to Watch

Throughput:
rate(redis_command_calls_total[5m])
Instantaneous throughput (from INFO stats):
redis_instantaneous_ops_per_sec
Memory usage:
redis_used_memory_bytes / redis_maxmemory_bytes
Hit rate:
rate(redis_keyspace_hits_total[5m]) / 
  (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
Replication lag:
redis_replication_lag_bytes{role="slave"}
Evicted keys:
rate(redis_evicted_keys_total[5m])
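For a quick sanity check outside Prometheus, the hit-rate and memory-utilization ratios can be computed directly from raw counters (illustrative helpers; the inputs correspond to the keyspace_hits/keyspace_misses and used_memory/maxmemory fields of Redis INFO output):

```python
def hit_rate(keyspace_hits: int, keyspace_misses: int) -> float:
    """Fraction of lookups served from the keyspace; 0.0 if no traffic yet."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 0.0

def memory_utilization(used_memory: int, maxmemory: int) -> float:
    """used_memory as a fraction of maxmemory; 0.0 if maxmemory is unset."""
    return used_memory / maxmemory if maxmemory else 0.0

print(hit_rate(9_500, 500))                              # 0.95
print(memory_utilization(3 * 1024**3, 4 * 1024**3))      # 0.75
```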

Grafana Dashboard

The bundled Grafana dashboard at charts/redis-operator/dashboards/redis-overview.json includes performance panels. Import it for instant visibility.

Troubleshooting Performance Issues

High CPU Usage

Possible causes:
  • Too many connections - Check redis_connected_clients
  • Slow commands - Use SLOWLOG GET to identify
  • Background tasks - Check redis_rdb_last_bgsave_duration_seconds
Solutions:
  • Increase CPU limit
  • Optimize application queries
  • Use pipelining to batch commands
  • Add read replicas to distribute load

High Memory Usage

Possible causes:
  • Dataset too large - Check redis_used_memory_bytes
  • Memory fragmentation - Check redis_mem_fragmentation_ratio
  • Replication buffers - Check redis_replication_lag_bytes
Solutions:
  • Increase maxmemory and container limits
  • Enable eviction policy
  • Restart Redis to defragment (requires failover)
  • Increase repl-backlog-size
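The fragmentation ratio mentioned above is used_memory_rss divided by used_memory. A sketch of the check (the thresholds are rough rules of thumb, not hard limits):

```python
def fragmentation_ratio(used_memory_rss: int, used_memory: int) -> float:
    """mem_fragmentation_ratio as reported by INFO memory."""
    return used_memory_rss / used_memory

def assess(ratio: float) -> str:
    # Rough rules of thumb: well below 1 suggests the OS is swapping Redis
    # memory out; well above 1.5 suggests allocator fragmentation.
    if ratio < 1.0:
        return "swapping"
    if ratio > 1.5:
        return "fragmented"
    return "healthy"

print(assess(fragmentation_ratio(7 * 1024**3, 4 * 1024**3)))  # fragmented
```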

Slow Replication

Possible causes:
  • Insufficient network bandwidth
  • Small replication backlog (causing full resyncs)
  • High write rate on primary
Solutions:
  • Increase repl-backlog-size
  • Use faster network (10 Gbps+)
  • Enable diskless replication
  • Reduce write rate or batch writes

Disk I/O Saturation

Possible causes:
  • Frequent BGSAVE operations
  • AOF rewrite operations
  • Slow storage backend
Solutions:
  • Reduce RDB snapshot frequency
  • Disable AOF or use appendfsync: everysec
  • Upgrade to faster storage (SSD/NVMe)
  • Use diskless replication
