Operator Performance
Supported Scale
The operator has been benchmarked to support up to 100 clusters per operator instance (3 pods/cluster = 300 total pods).
Benchmark Results
Based on benchmarks at `internal/controller/cluster/reconciler_test.go`:
| Clusters | Reconcile Latency | Throughput | API Calls/sec | Memory Usage |
|---|---|---|---|---|
| 10 | 55.40 ms | 180.5 reconciles/s | 5,055 | 17.96 MB/op |
| 50 | 751.13 ms | 66.57 reconciles/s | 1,864 | 360.89 MB/op |
| 100 | 2,659.08 ms | 37.61 reconciles/s | 1,053 | 1.36 GB/op |
Benchmark environment: Linux amd64 VM (4 vCPUs, 15 GiB RAM, AMD EPYC 7B13) running Go 1.25.1 on kind v0.29.0.
Recommended Operator Resources
Set operator resource requests/limits based on managed cluster count:
| Managed Clusters | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Up to 10 | 100m | 500m | 128Mi | 256Mi |
| Up to 50 | 250m | 1000m | 512Mi | 1Gi |
| Up to 100 | 500m | 2000m | 1Gi | 2Gi |
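These tiers can be expressed in the chart's values.yaml; a sketch for the up-to-50-clusters tier, assuming a standard Kubernetes `resources` block on the operator deployment:

```yaml
resources:
  requests:
    cpu: 250m        # up-to-50-clusters tier from the table above
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
```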
Chart defaults (100m/128Mi request, 500m/256Mi limit) are appropriate for small deployments with up to 10 clusters.
Concurrency Tuning
The operator uses a work queue with configurable concurrency, set via the chart's values.yaml:
- 5 (default) - Good for most deployments up to 50 clusters
- 10-15 - For 50-100 clusters with fast Kubernetes API server
- 20+ - Only for dedicated operator instances with powerful nodes
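A values.yaml sketch for the 50-100 cluster range; the `maxConcurrentReconciles` key name is an assumption about this chart, not a confirmed setting:

```yaml
operator:
  # Hypothetical key; 10-15 suits 50-100 clusters with a fast API server
  maxConcurrentReconciles: 10
```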
Performance Profiling
Enable pprof for profiling via the chart's values.yaml:
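A sketch of a pprof toggle; the key names are assumptions about this chart:

```yaml
operator:
  pprof:
    enabled: true          # hypothetical flag exposing net/http/pprof handlers
    bindAddress: ":6060"   # conventional pprof port
```

Once enabled, profiles can be pulled after port-forwarding, e.g. `go tool pprof http://localhost:6060/debug/pprof/heap`.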
Leader Election
The operator uses leader election for high availability, configured in the chart's values.yaml:
- Keep `leader-elect=true` for HA deployments
- Use default timing values unless latency is critical
- Run 2-3 operator replicas for redundancy
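A values.yaml sketch for an HA deployment; the key names are assumptions about this chart:

```yaml
operator:
  replicas: 2         # hypothetical key; run 2-3 for redundancy
  leaderElect: true   # hypothetical key mapping to --leader-elect=true
```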
Monitoring Operator Performance
Watch reconciliation metrics such as `controller_runtime_reconcile_total`, `controller_runtime_reconcile_errors_total`, and `workqueue_depth` on the operator's metrics endpoint.
Redis Instance Performance
CPU and Memory
Size Redis pods based on your workload:
| Workload | CPU | Memory | Notes |
|---|---|---|---|
| Development | 100m | 256Mi | Minimal, not for load testing |
| Light production | 500m | 1Gi | 10k ops/sec, 1M keys |
| Medium production | 1000m | 2-4Gi | 50k ops/sec, 10M keys |
| Heavy production | 2000m+ | 8-16Gi | >50k ops/sec, >10M keys |
Redis is single-threaded for command processing. CPU limits above 2000m (2 cores) only benefit background tasks (BGSAVE, replication, etc.).
Memory Management
Configure Redis `maxmemory` and an eviction policy:
- `noeviction` - Return errors when the memory limit is reached (default for persistence)
- `allkeys-lru` - Evict least recently used keys (recommended for caching)
- `volatile-lru` - Evict least recently used keys with a TTL set
- `allkeys-lfu` - Evict least frequently used keys (Redis 4.0+)
- `volatile-ttl` - Evict keys with the shortest TTL
Set `maxmemory` below the container memory limit to leave headroom for replication buffers and copy-on-write during BGSAVE; for example, about 3gb under a memory: 4Gi limit.
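A sketch of that sizing, assuming the cluster spec exposes a `redisConfig` passthrough (field names illustrative):

```yaml
resources:
  limits:
    memory: 4Gi                     # container limit
redisConfig:
  maxmemory: 3gb                    # ~75% of the limit, leaving COW/replication headroom
  maxmemory-policy: allkeys-lru     # recommended for caching workloads
```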
Persistence Tuning
RDB Snapshots
Configure snapshot frequency:
- Low frequency (`save: "3600 1"`) - Less disk I/O, more data-loss risk
- High frequency (`save: "60 1"`) - More disk I/O, less data-loss risk
- Disabled (`save: ""`) - No snapshots, fastest (not recommended for production)
RDB snapshots use COW (copy-on-write). During BGSAVE, memory usage can temporarily double if dataset has high write rate.
AOF (Append-Only File)
Enable AOF for better durability. The `appendfsync` policy controls the durability/latency trade-off:
- `always` - Fsync after every write (safest, slowest)
- `everysec` - Fsync once per second (default, good balance)
- `no` - Never fsync, let the OS decide (fastest, least safe)
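A minimal sketch, assuming the same `redisConfig` passthrough used for memory settings:

```yaml
redisConfig:
  appendonly: "yes"       # enable the append-only file
  appendfsync: everysec   # fsync once per second (default balance)
```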
Storage Performance
Use fast storage for Redis data volumes:
| Storage Type | IOPS | Throughput | Latency | Use Case |
|---|---|---|---|---|
| AWS EBS gp3 | 16k+ | 1000 MB/s | 1ms | Recommended |
| AWS EBS io2 | 64k+ | 4000 MB/s | 0.25ms | Mission-critical |
| GCP PD SSD | 30k+ | 1200 MB/s | 1ms | Recommended |
| Azure Premium SSD | 20k+ | 900 MB/s | 1ms | Recommended |
| Local NVMe | 100k+ | 3000+ MB/s | 0.1ms | Best performance |
Local NVMe offers the best performance but lacks data durability across node failures. Use with frequent backups.
Network Performance
Redis is network-intensive for replication and client connections.
Replication Optimization
Configure replication buffer sizes:
- `repl-backlog-size` should cover 5-10 minutes of writes during peak traffic
- Calculate: (avg_write_rate_MB/s) * 600 seconds
- Example: 1 MB/s write rate → 600 MB backlog
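That calculation as a config sketch, assuming a `redisConfig` passthrough:

```yaml
redisConfig:
  repl-backlog-size: 600mb   # 1 MB/s peak write rate * 600 s window
```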
Connection Limits
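Redis caps simultaneous client connections with the `maxclients` directive (default 10000); each connection consumes memory and a file descriptor. An illustrative override, assuming a `redisConfig` passthrough:

```yaml
redisConfig:
  maxclients: "20000"   # raise only if the pod's memory and fd limits allow it
```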
Synchronous Replication
Reduce data-loss risk with synchronous replication:
- Higher `minSyncReplicas` - Better durability, higher write latency
- Lower `minSyncReplicas` - Lower write latency, more data-loss risk on failover
Setting `minSyncReplicas: N` means writes block until N replicas acknowledge. This increases write latency by 1-5 ms per replica.
Redis Configuration Best Practices
Recommended Production Config
cluster.yaml
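A sketch of a production cluster.yaml pulling together the recommendations on this page; the apiVersion, kind, and spec field names are illustrative assumptions, not the operator's confirmed schema:

```yaml
apiVersion: redis.example.com/v1   # illustrative group/version
kind: RedisCluster                 # illustrative kind
metadata:
  name: redis-prod
spec:
  replicas: 3                # 1 primary + 2 replicas
  minSyncReplicas: 1         # block writes until 1 replica acknowledges
  resources:
    requests: { cpu: 1000m, memory: 4Gi }
    limits: { cpu: 2000m, memory: 4Gi }   # >2 cores only helps background tasks
  redisConfig:
    maxmemory: 3gb                 # headroom below the 4Gi container limit
    maxmemory-policy: allkeys-lru
    appendonly: "yes"
    appendfsync: everysec
    save: "3600 1"                 # low-frequency RDB snapshots
    repl-backlog-size: 64mb
    repl-diskless-sync: "yes"
```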
Lazy Freeing
Enable lazy freeing (`lazyfree-lazy-eviction`, `lazyfree-lazy-expire`, `lazyfree-lazy-server-del`, `replica-lazy-flush`) so expensive deletes run in a background thread instead of blocking the event loop.
Diskless Replication
Use diskless replication for faster replica syncs:
- Skips the intermediate RDB file write to disk
- Streams snapshot directly to replicas over network
- Faster sync, lower disk I/O
- Cannot reuse snapshot for multiple replicas simultaneously
- Delay before sync starts (waiting for replicas)
Diskless replication is recommended when network is faster than disk or when managing many replicas.
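A config sketch using Redis's diskless-sync directives, assuming a `redisConfig` passthrough:

```yaml
redisConfig:
  repl-diskless-sync: "yes"
  repl-diskless-sync-delay: "5"   # seconds to wait so replicas arriving together join one sync
```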
Performance Monitoring
Key Metrics to Watch
Throughput: watch ops/sec and cache hit rate, e.g. `redis_commands_processed_total` and `redis_keyspace_hits_total` vs. `redis_keyspace_misses_total`.
Grafana Dashboard
The bundled Grafana dashboard at `charts/redis-operator/dashboards/redis-overview.json` includes performance panels. Import it for instant visibility.
Troubleshooting Performance Issues
High CPU Usage
Possible causes:
- Too many connections - Check `redis_connected_clients`
- Slow commands - Use `SLOWLOG GET` to identify them
- Background tasks - Check `redis_rdb_last_bgsave_duration_seconds`
Solutions:
- Increase CPU limit
- Optimize application queries
- Use pipelining to batch commands
- Add read replicas to distribute load
High Memory Usage
Possible causes:
- Dataset too large - Check `redis_used_memory_bytes`
- Memory fragmentation - Check `redis_mem_fragmentation_ratio`
- Replication buffers - Check `redis_replication_lag_bytes`
Solutions:
- Increase `maxmemory` and container limits
- Enable an eviction policy
- Restart Redis to defragment (requires failover)
- Increase `repl-backlog-size`
Slow Replication
Possible causes:
- Insufficient network bandwidth
- Small replication backlog (causing full resyncs)
- High write rate on primary
Solutions:
- Increase `repl-backlog-size`
- Use a faster network (10 Gbps+)
- Enable diskless replication
- Reduce write rate or batch writes
Disk I/O Saturation
Possible causes:
- Frequent BGSAVE operations
- AOF rewrite operations
- Slow storage backend
Solutions:
- Reduce RDB snapshot frequency
- Disable AOF or use `appendfsync: everysec`
- Upgrade to faster storage (SSD/NVMe)
- Use diskless replication