Uncloud uses a decentralized approach to state management. Instead of a central control plane, every machine maintains its own copy of the cluster state and synchronizes changes with peers.
This architecture favors Availability and Partition tolerance over Consistency in the CAP theorem. The cluster remains operational even when machines are offline or network partitions occur.
## Corrosion Distributed Database

Uncloud’s state management is powered by Corrosion, a distributed SQLite database built by Fly.io. Corrosion runs on every machine and provides:
- Distributed SQLite: Each machine runs its own SQLite database
- Real-time replication: Changes propagate to other machines in under a second
- Gossip protocol: Machines exchange state updates through peer-to-peer gossip
- Schema management: Automatic database migrations across the cluster
You can think of Corrosion as “SQLite with built-in multi-master replication.”
### Corrosion Architecture

On each machine, Corrosion runs as a systemd service:

```bash
# Check Corrosion status
sudo systemctl status uncloud-corrosion

# View Corrosion logs
sudo journalctl -u uncloud-corrosion -f
```
The Corrosion process:
- Listens on the management IP: Uses the IPv6 management address for gossip traffic
- Stores data locally: Maintains a SQLite database at `/var/lib/uncloud/corrosion/store.db`
- Gossips with peers: Exchanges updates with other machines in the cluster
- Provides an API: Exposes a local API for the Uncloud daemon to query and update state
The daemon never writes to the SQLite file directly. All interactions happen through Corrosion’s API, which handles conflict resolution and replication.
## CRDT-Based Synchronization
Corrosion uses Conflict-free Replicated Data Types (CRDTs) to handle concurrent updates across machines.
CRDTs are special data structures that can be modified independently on different machines and are mathematically guaranteed to converge to the same state when all updates have been applied.
### How CRDTs Work
Consider this scenario:
- Machine A and Machine B both have service `web` with 2 replicas
- Network partition separates the machines
- User scales service on Machine A to 3 replicas
- User scales service on Machine B to 4 replicas
- Network partition heals
With traditional databases, this would be a conflict requiring manual resolution. With CRDTs, the two machines automatically merge their states using predefined rules (for example, “last write wins” based on timestamps, with vector clocks to detect concurrent updates).
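The merge rule in the scenario above can be sketched as a last-write-wins register, one of the simplest CRDTs. This is an illustrative Python model, not Uncloud’s or Corrosion’s actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LWWRegister:
    """Last-write-wins register: a value tagged with (timestamp, node_id).

    The node_id breaks timestamp ties deterministically, so every machine
    picks the same winner regardless of the order merges happen in."""
    value: int
    timestamp: float
    node_id: str

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # The update with the highest (timestamp, node_id) pair wins everywhere.
        return max(self, other, key=lambda r: (r.timestamp, r.node_id))

# During the partition, each machine applies its own update independently:
a = LWWRegister(value=3, timestamp=100.0, node_id="machine-a")  # scaled to 3
b = LWWRegister(value=4, timestamp=100.5, node_id="machine-b")  # scaled to 4

# After the partition heals, merging in either order gives the same result:
assert a.merge(b) == b.merge(a)
print(a.merge(b).value)  # 4 — machine B's slightly later write wins
```

Note that machine B wins only because its timestamp is later; neither user’s intent is consulted, which is exactly the limitation discussed next.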
### CRDT Limitations
CRDTs provide automatic conflict resolution, but the “merged” state might not match what either user intended. In the example above:
- Both users see their update succeed immediately (Availability)
- The final replica count will be consistent across machines (eventual Consistency)
- But the final count (3 or 4) depends on timestamps and may surprise both users
This is the trade-off of a decentralized system. Uncloud mitigates this by:
- Making most operations imperative rather than declarative
- Returning errors immediately when an operation can’t complete
- Directing operations to specific machines when possible
## Eventual Consistency Model
Uncloud’s state is eventually consistent. This means:
- Machines may have slightly different views of the cluster at any given moment
- All machines will converge to the same state after all updates have propagated
- Propagation typically happens in milliseconds to seconds
### What This Means in Practice

When you run `uc deploy`, here’s what happens:
1. Immediate execution: The CLI connects to a machine and executes the deployment
2. Local state update: That machine updates its Corrosion database
3. Gossip propagation: Corrosion gossips the changes to other machines
4. Convergence: Within seconds, all machines have the updated state
If you run `uc ls` immediately after `uc deploy` while connected to a different machine, you might briefly see the old state. Wait a second and run it again, and you’ll see the new containers.
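The stale-read-then-converge behavior can be modeled with a toy two-machine cluster. This is a deliberately simplified sketch (real propagation runs over Corrosion’s gossip protocol, not an in-process queue):

```python
class Machine:
    """Toy model of a machine holding its own local copy of cluster state."""

    def __init__(self, name):
        self.name = name
        self.state = {}   # stand-in for the local SQLite database
        self.inbox = []   # gossip messages sent to us but not yet applied

    def deploy(self, service, replicas, peers):
        self.state[service] = replicas           # local write: visible immediately
        for p in peers:
            p.inbox.append((service, replicas))  # gossip: still in flight

    def sync(self):
        # Apply any gossip that has arrived (the convergence step).
        for service, replicas in self.inbox:
            self.state[service] = replicas
        self.inbox.clear()

a, b = Machine("a"), Machine("b")
a.deploy("web", 3, peers=[b])

print(b.state)   # {} — a read on machine B is briefly stale
b.sync()
print(b.state)   # {'web': 3} — converged after gossip is applied
```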
### Observing State Propagation

You can see state synchronization in action:

```bash
# Terminal 1: Connect to machine A
uc context use machine-a
uc deploy

# Terminal 2: Connect to machine B immediately after
uc context use machine-b
uc ps  # May show old state
sleep 2
uc ps  # Should show new state
```
In practice, propagation is fast enough that you rarely notice the delay.
## Handling Network Partitions
Network partitions are inevitable in distributed systems. A partition occurs when machines can’t communicate with each other due to network failures, firewall issues, or machine crashes.
### Partition Detection

Corrosion detects partitions through its gossip protocol. When a machine stops receiving gossip messages from a peer, it marks that peer as potentially unreachable.

You can see peer status in Corrosion logs:

```bash
sudo journalctl -u uncloud-corrosion | grep -i "peer"
```
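A failure detector of this kind boils down to tracking when each peer was last heard from. A minimal sketch (the 5-second threshold and the class are illustrative assumptions, not Corrosion’s actual settings or internals):

```python
import time

# Seconds of silence before a peer is considered suspect (illustrative value).
SUSPECT_AFTER = 5.0

class PeerTracker:
    """Track the last gossip message seen from each peer."""

    def __init__(self):
        self.last_seen = {}  # peer name -> timestamp of last gossip message

    def heard_from(self, peer, now=None):
        self.last_seen[peer] = now if now is not None else time.monotonic()

    def unreachable(self, now=None):
        now = now if now is not None else time.monotonic()
        return [p for p, t in self.last_seen.items() if now - t > SUSPECT_AFTER]

tracker = PeerTracker()
tracker.heard_from("machine-b", now=0.0)
tracker.heard_from("machine-c", now=8.0)

# At t=10s, machine-b has been silent for 10s and is marked suspect:
print(tracker.unreachable(now=10.0))  # ['machine-b']
```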
### Operating in a Partition
When a partition occurs:
- Each partition remains functional: Machines in each partition can still serve traffic and accept commands
- State diverges: Changes in one partition don’t reach the other partition
- Services continue: Containers keep running and serving requests
For example, if your cluster splits into two partitions:
- Partition A: Machines in US East
- Partition B: Machines in EU West
Users in each region can still:
- Deploy new services
- Scale existing services
- Access services running in their partition
They just can’t see or control services in the other partition.
### Partition Healing
When network connectivity restores:
1. Gossip resumes: Machines start exchanging updates again
2. State reconciliation: CRDTs merge the diverged states
3. Convergence: All machines converge to the same state
The reconciliation process is automatic. You don’t need to manually resolve conflicts or rebuild state.
If you made conflicting changes during a partition (for example, deployed different services with the same name in each partition), the CRDT merge rules will pick one version. The other may disappear after reconciliation. Always check cluster state after a partition heals.
## State Reconciliation
Reconciliation is the process of merging diverged states after a partition or when a machine rejoins the cluster.
### Reconciliation Process
1. Detect divergence: Machines compare their state versions using vector clocks
2. Exchange missing updates: Each machine sends updates the other hasn’t seen
3. Apply CRDT merge: Conflicting updates are resolved using CRDT rules
4. Converge: Both machines reach the same final state
This all happens automatically in the background. You’ll see reconciliation activity in the Corrosion logs:

```bash
sudo journalctl -u uncloud-corrosion | grep -i "sync\|reconcile"
```
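The divergence-detection step can be sketched with vector clocks: each machine counts the updates it has originated, and comparing two clocks tells the machines whether one is simply behind or whether they diverged concurrently and need a CRDT merge. This is an illustrative model; Corrosion’s actual versioning internals may differ:

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks: 'ahead', 'behind', 'equal', or 'concurrent'.

    'concurrent' means neither side has seen all of the other's updates,
    so the missing updates must be exchanged and merged with CRDT rules."""
    keys = set(vc_a) | set(vc_b)
    a_ahead = any(vc_a.get(k, 0) > vc_b.get(k, 0) for k in keys)
    b_ahead = any(vc_b.get(k, 0) > vc_a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "concurrent"
    if a_ahead:
        return "ahead"
    if b_ahead:
        return "behind"
    return "equal"

# Both machines made updates during the partition that the other hasn't seen:
machine_a = {"a": 5, "b": 3}
machine_b = {"a": 3, "b": 4}

print(compare(machine_a, machine_b))  # concurrent → exchange updates and merge
```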
### State Consistency Checks
While state is eventually consistent, Uncloud takes steps to ensure data integrity:
- Schema versioning: Database schema changes are versioned and applied consistently
- Transaction boundaries: Related changes are grouped to prevent partial updates
- Validation: Invalid states are rejected before replication
### Dealing with State Conflicts
If you suspect state inconsistencies across machines:
1. Check Corrosion sync status on each machine
2. Review recent Corrosion logs for errors or warnings
3. Wait for convergence (usually a few seconds)
4. Verify consistency by running `uc ps` on different machines
If issues persist, check for:
- Network connectivity problems between machines
- Corrosion service failures
- Clock skew across machines (can affect timestamp-based CRDT operations)
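Clock skew matters because timestamp-based merge rules trust wall clocks. A sketch of the failure mode, with made-up numbers: machine B’s clock runs 30 seconds fast, so its physically *earlier* write carries the *later* timestamp and wins the merge:

```python
# Machine B's wall clock runs 30 seconds ahead of real time.
skew = 30.0

real_write_time_a = 100.0   # machine A actually wrote last...
real_write_time_b = 95.0

stamped_a = real_write_time_a         # A stamps honest time
stamped_b = real_write_time_b + skew  # B stamps 125.0 due to skew

# A last-write-wins merge keyed on timestamps picks B's older update:
winner = "machine-a" if stamped_a > stamped_b else "machine-b"
print(winner)  # machine-b — the physically earlier write wins because of skew
```

Keeping clocks synchronized with NTP bounds how large this window can get.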
## Performance

Writes happen locally and then replicate:
- Local write latency: 1-10ms (SQLite write)
- Replication latency: 10-100ms (depending on network)
- Convergence time: 100-500ms (for small clusters)
Large state changes (for example, deploying many services) may take longer to fully propagate.
Reads are always local:
- Read latency: Less than 1ms (SQLite query)
- Consistency: May be slightly stale if recent writes haven’t propagated
For the freshest data, query the machine that performed the write.
## Scalability
Corrosion’s gossip protocol scales well for small to medium clusters:
- Optimal cluster size: 3-20 machines
- Maximum tested: 50+ machines
- Gossip overhead: Increases with cluster size (O(n) bandwidth per machine)
For very large clusters, you may need to partition services across multiple independent Uncloud clusters.
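The O(n) claim can be made concrete with a back-of-the-envelope estimate: every machine must eventually receive every other machine’s updates, so inbound gossip traffic per machine grows linearly with cluster size. All rates below are made-up illustration values, not measured Corrosion numbers:

```python
def gossip_inbound_per_machine(n_machines, updates_per_machine_per_sec=2,
                               update_bytes=256):
    """Estimate steady-state inbound gossip bandwidth per machine (bytes/sec).

    Each of the other (n - 1) machines originates updates that must
    eventually reach us, so per-machine bandwidth is O(n)."""
    return (n_machines - 1) * updates_per_machine_per_sec * update_bytes

for n in (3, 20, 50):
    print(f"{n:3} machines: ~{gossip_inbound_per_machine(n)} B/s per machine")
# Bandwidth per machine grows linearly: ~1 KB/s at 3 machines,
# ~25 KB/s at 50 machines under these assumed rates.
```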
## State Storage Location

Corrosion stores all cluster state in:

```
/var/lib/uncloud/corrosion/
├── config.toml      # Corrosion configuration
├── store.db         # SQLite database file
├── schema.sql       # Database schema
└── subscriptions/   # Replication metadata
```
The SQLite database contains:
- Machine registry and network configuration
- Service definitions and replica status
- Volume metadata
- DNS records
- Deployment history
Never manually edit the SQLite database or Corrosion configuration files. Always use the Uncloud CLI to modify cluster state. Direct edits can cause replication conflicts and state corruption.
## Monitoring State Health

To monitor the health of state replication:

```bash
# Check Corrosion service status
sudo systemctl status uncloud-corrosion

# View recent replication activity
sudo journalctl -u uncloud-corrosion --since "5 minutes ago"

# Check database size
du -h /var/lib/uncloud/corrosion/store.db
```
Healthy replication shows:
- Regular gossip messages with all peers
- Recent successful sync operations
- No persistent errors or warnings
- Database size growing reasonably with cluster activity
## Backup and Recovery

Since every machine has a full copy of cluster state, the cluster is inherently resilient to machine failures.

To back up cluster state:

```bash
# On any machine via SSH
sudo cp /var/lib/uncloud/corrosion/store.db ~/cluster-backup-$(date +%Y%m%d).db
```
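Note that a plain `cp` of a database Corrosion is actively writing can capture a torn snapshot. SQLite’s online backup API takes a consistent copy even from a live database; Python’s standard `sqlite3` module exposes it. The demo below runs against a throwaway database; for the real store you would connect to `/var/lib/uncloud/corrosion/store.db` instead:

```python
import os
import sqlite3
import tempfile

def backup_store(src_conn, dst_path):
    """Take a consistent snapshot of a (possibly live) SQLite database
    using SQLite's online backup API."""
    with sqlite3.connect(dst_path) as target:
        src_conn.backup(target)  # copies pages transactionally

# Demo on a throwaway in-memory database standing in for store.db:
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE services (name TEXT, replicas INTEGER)")
src.execute("INSERT INTO services VALUES ('web', 3)")

dst_path = os.path.join(tempfile.mkdtemp(), "cluster-backup.db")
backup_store(src, dst_path)

copy = sqlite3.connect(dst_path)
print(copy.execute("SELECT * FROM services").fetchall())  # [('web', 3)]
```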
To restore state (rarely needed):
1. Stop the Corrosion service
2. Replace the SQLite database file
3. Start the Corrosion service
4. Wait for reconciliation with other machines
In practice, you rarely need backups because:
- Multiple machines have redundant copies
- State can be recreated from machine configs
- Service definitions are typically stored in git (compose files)
## Comparison to Traditional Architectures

| Aspect | Uncloud (Decentralized) | Kubernetes (Centralized) |
|---|---|---|
| Control plane | None (peer-to-peer) | etcd cluster (requires quorum) |
| Consistency | Eventual | Strong (linearizable reads) |
| Partition tolerance | Full (AP system) | Degraded (CP system) |
| Write path | Local → replicate | Consensus → local |
| Read path | Local (may be stale) | Via API server (consistent) |
| Operational complexity | Low | High (manage etcd quorum) |
Uncloud trades strong consistency for operational simplicity. For most applications, eventual consistency is perfectly acceptable and the operational benefits are substantial.