Docker Swarm Clustering

Dokploy leverages Docker Swarm for container orchestration, enabling you to deploy and scale applications across multiple nodes in a cluster.

Swarm Architecture

Docker Swarm provides native clustering capabilities for Docker:

Manager Nodes

Control the cluster, maintain state, and schedule services across worker nodes.

Worker Nodes

Execute tasks and run containerized applications as assigned by managers.

Swarm Initialization

When setting up a deploy server, Dokploy automatically initializes Docker Swarm:

# Automatic IP detection and Swarm initialization
get_ip() {
  # Try IPv4 first
  ip=$(curl -4s --connect-timeout 5 https://ifconfig.io 2>/dev/null)
  
  # Fallback to IPv6 if needed
  if [ -z "$ip" ]; then
    ip=$(curl -6s --connect-timeout 5 https://ifconfig.io 2>/dev/null)
  fi
  
  echo "$ip"
}

advertise_addr=$(get_ip)
docker swarm init --advertise-addr "$advertise_addr"

The script tries IPv4 first and falls back to IPv6 only if no IPv4 address is returned, so the advertise address follows the server's network configuration.

Swarm Validation

Check if a server is already part of a Swarm:

if docker info | grep -q 'Swarm: active'; then
  echo "Already part of a Docker Swarm"
fi

Node Management

Viewing Cluster Nodes

Retrieve all nodes in the cluster:

const nodes = await getNodes(serverId);

Each node includes:

Node ID
Hostname
Role (manager/worker)
Availability (active/pause/drain)
Status (ready/down)
Manager status
Platform information
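
As a rough illustration of how these fields can be used, the sketch below filters a node list down to the nodes that can currently accept new tasks. The SwarmNode interface, the schedulableNodes helper, and the sample data are assumptions made for this example; the objects returned by getNodes may use different property names.

```typescript
// Hypothetical shape for illustration only; not Dokploy's actual return type.
interface SwarmNode {
  id: string;
  hostname: string;
  role: "manager" | "worker";
  availability: "active" | "pause" | "drain";
  status: "ready" | "down";
}

// A node can receive new tasks only when it is ready and its availability is active.
function schedulableNodes(nodes: SwarmNode[]): SwarmNode[] {
  return nodes.filter((node) => node.status === "ready" && node.availability === "active");
}

// Made-up sample data.
const sampleNodes: SwarmNode[] = [
  { id: "n1", hostname: "manager-1", role: "manager", availability: "active", status: "ready" },
  { id: "n2", hostname: "worker-1", role: "worker", availability: "drain", status: "ready" },
  { id: "n3", hostname: "worker-2", role: "worker", availability: "active", status: "down" },
];

console.log(schedulableNodes(sampleNodes).map((node) => node.hostname));
```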

Adding Worker Nodes

Get Worker Join Token

Retrieve the join command with worker token:

const { command, version } = await addWorker(serverId);
// Returns: "docker swarm join --token SWMTKN-1-xxx... 192.168.1.100:2377"

Execute on New Node

Run the join command on the machine you want to add as a worker:

docker swarm join --token SWMTKN-1-xxx... 192.168.1.100:2377

Verify Node Joined

Check the node appears in the cluster:

docker node ls
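
Before running a join command on another machine, it can help to check that the string has the expected shape. The sketch below assumes the command follows exactly the format shown above (docker swarm join --token <token> <host>:<port>); parseJoinCommand is a hypothetical helper, not part of Dokploy.

```typescript
interface JoinCommandParts {
  token: string;
  address: string;
}

// Hypothetical helper: extracts the token and manager address from a join
// command, or returns null when the string does not match the expected format.
function parseJoinCommand(command: string): JoinCommandParts | null {
  const match = /^docker swarm join --token (\S+) (\S+:\d+)$/.exec(command.trim());
  if (match === null) {
    return null;
  }
  return { token: match[1], address: match[2] };
}

// Placeholder token for illustration; real tokens are much longer.
const parsed = parseJoinCommand("docker swarm join --token SWMTKN-1-example 192.168.1.100:2377");
console.log(parsed);
```

Rejecting anything that does not match the pattern avoids executing an unexpected string over SSH.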

Adding Manager Nodes

For high availability, add additional manager nodes:

Get Manager Join Token

const { command, version } = await addManager(serverId);
// Returns: "docker swarm join --token SWMTKN-1-yyy... 192.168.1.100:2377"

Join as Manager

Execute the command on the new manager node:

docker swarm join --token SWMTKN-1-yyy... 192.168.1.100:2377

Maintain an odd number of manager nodes (1, 3, 5, or 7) to ensure proper quorum for cluster decisions.

Removing Nodes

Safely remove nodes from the cluster:

await removeWorker(nodeId, serverId);

The removal process:

Drains the node (stops accepting new tasks)
Removes the node from the cluster

# Executed automatically
docker node update --availability drain <nodeId>
docker node rm <nodeId> --force

Draining stops the node's running tasks and reschedules them on other nodes, so replicated services keep their desired replica count after the node is removed.

Service Deployment

Creating Swarm Services

Dokploy deploys applications as Docker Swarm services for orchestration:

const settings: CreateServiceOptions = {
  Name: appName,
  TaskTemplate: {
    ContainerSpec: {
      Image: imageName,
      Env: environmentVariables,
      Mounts: volumeMounts
    },
    Networks: [{ Target: "dokploy-network" }],
    Placement: {
      Constraints: ["node.role==manager"]
    }
  },
  Mode: {
    Replicated: {
      Replicas: 3
    }
  },
  EndpointSpec: {
    Ports: [{
      TargetPort: 3000,
      PublishedPort: 3000,
      PublishMode: "host",
      Protocol: "tcp"
    }]
  }
};

Placement Constraints

Control where services run using constraints:

Placement: {
  Constraints: ["node.role==manager"]
}
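
Constraints can also match node labels, using Swarm's node.labels.<key> syntax with the == and != operators. The sketch below is illustrative: the labelConstraint helper and the storage and zone labels are hypothetical, and the labels would first have to be set on the nodes (for example with docker node update --label-add storage=ssd <nodeId>).

```typescript
type ConstraintOperator = "==" | "!=";

// Hypothetical helper: builds a Swarm node-label constraint string.
function labelConstraint(key: string, value: string, operator: ConstraintOperator = "=="): string {
  return "node.labels." + key + operator + value;
}

// All constraints must match for a node to be eligible.
const placement = {
  Constraints: [
    "node.role==worker",
    labelConstraint("storage", "ssd"),
    labelConstraint("zone", "edge", "!="),
  ],
};

console.log(placement.Constraints);
```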

Service Scaling

Scale services across the cluster:

Mode: {
  Replicated: {
    Replicas: 5  // Run 5 instances across cluster
  }
}

Or use global mode to run one instance per node:

Mode: {
  Global: {}
}

Traefik with Swarm

Dokploy deploys Traefik as a Swarm service for load balancing:

const settings: CreateServiceOptions = {
  Name: "dokploy-traefik",
  TaskTemplate: {
    ContainerSpec: {
      Image: "traefik:v3.6.7",
      Mounts: [
        {
          Type: "bind",
          Source: "/etc/dokploy/traefik/traefik.yml",
          Target: "/etc/traefik/traefik.yml"
        },
        {
          Type: "bind",
          Source: "/var/run/docker.sock",
          Target: "/var/run/docker.sock"
        }
      ]
    },
    Networks: [{ Target: "dokploy-network" }],
    Placement: {
      Constraints: ["node.role==manager"]
    }
  },
  EndpointSpec: {
    Ports: [
      {
        TargetPort: 80,
        PublishedPort: 80,
        PublishMode: "host",
        Protocol: "tcp"
      },
      {
        TargetPort: 443,
        PublishedPort: 443,
        PublishMode: "host",
        Protocol: "tcp"
      },
      {
        TargetPort: 443,
        PublishedPort: 443,
        PublishMode: "host",
        Protocol: "udp"  // HTTP/3
      }
    ]
  }
};

Traefik runs on manager nodes and automatically discovers services across the entire Swarm cluster.

Networking

Overlay Network

Dokploy creates an overlay network for cross-node communication:

docker network create \
  --driver overlay \
  --attachable \
  dokploy-network

Key Features:

Overlay Driver: Enables multi-host networking
Attachable: Allows standalone containers to connect
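Encryption: Swarm management traffic between nodes is encrypted by default; application traffic on the overlay network can optionally be encrypted by creating the network with --opt encrypted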

Service Discovery

Services can communicate using DNS:

# Service name becomes DNS hostname
curl http://my-service:3000

# tasks.<service> resolves to the IPs of the individual tasks, bypassing the service's virtual IP
curl http://tasks.my-service:3000

Container Queries in Swarm

Service Containers

Retrieve containers for a Swarm service:

const containers = await getServiceContainersByAppName(
  appName,
  serverId
);

Stack Containers

Get all containers in a stack deployment:

const containers = await getStackContainersByAppName(
  appName,
  serverId
);

Application Labels

Query containers by Dokploy labels:

const containers = await getContainersByAppLabel(
  appName,
  "swarm",  // Specify swarm type
  serverId
);

Service Updates

Update running services without downtime:

const service = docker.getService(appName);
const inspect = await service.inspect();

await service.update({
  version: parseInt(inspect.Version.Index),
  ...newSettings,
  TaskTemplate: {
    ...newSettings.TaskTemplate,
    ForceUpdate: inspect.Spec.TaskTemplate.ForceUpdate + 1
  }
});

Incrementing ForceUpdate forces a rolling update even if configuration hasn’t changed.
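
A defensive variation, assuming the inspected task template might not carry a ForceUpdate value (for example when the field is typed as optional): adding 1 to undefined would produce NaN, so the sketch below falls back to 0. TaskTemplateLike and withForcedUpdate are hypothetical names used only for this example.

```typescript
// Minimal stand-in for a task template; only ForceUpdate matters here.
interface TaskTemplateLike {
  ForceUpdate?: number;
  [key: string]: unknown;
}

// Returns the new task template with a ForceUpdate counter one higher than
// the current one, treating a missing counter as 0.
function withForcedUpdate(
  current: TaskTemplateLike | undefined,
  next: TaskTemplateLike
): TaskTemplateLike {
  const previous =
    current !== undefined && typeof current.ForceUpdate === "number" ? current.ForceUpdate : 0;
  return { ...next, ForceUpdate: previous + 1 };
}

console.log(withForcedUpdate(undefined, { ContainerSpec: { Image: "nginx:alpine" } }).ForceUpdate); // 1
console.log(withForcedUpdate({ ForceUpdate: 4 }, {}).ForceUpdate); // 5
```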

Rolling Updates

Swarm automatically performs rolling updates:

Stop old task on one node
Start new task with updated configuration
Wait for health check to pass
Repeat for next node

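With multiple replicas, this keeps the service available throughout the update. Swarm's default update order is stop-first, so a single-replica service is briefly unavailable unless the update order is set to start-first.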
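
Rolling update behavior is controlled by the UpdateConfig object of the service spec, a sibling of TaskTemplate and Mode. The sketch below uses the field names from the Docker Engine API; the values are illustrative assumptions rather than Dokploy's defaults, and minimumRolloutSeconds is a hypothetical helper that only accounts for the delay between batches.

```typescript
interface UpdateConfig {
  Parallelism: number; // tasks updated at once; 0 means all at the same time
  Delay: number; // pause between batches, in nanoseconds
  FailureAction: "pause" | "continue" | "rollback";
  Monitor: number; // how long to watch each updated task for failure, in nanoseconds
  MaxFailureRatio: number;
  Order: "stop-first" | "start-first";
}

const NANOSECONDS_PER_SECOND = 1000000000;

const updateConfig: UpdateConfig = {
  Parallelism: 1,
  Delay: 10 * NANOSECONDS_PER_SECOND,
  FailureAction: "rollback",
  Monitor: 15 * NANOSECONDS_PER_SECOND,
  MaxFailureRatio: 0,
  Order: "start-first", // start the new task before stopping the old one
};

// Lower bound on rollout duration: counts only the configured delay between
// batches and ignores task startup and health-check time.
function minimumRolloutSeconds(replicas: number, config: UpdateConfig): number {
  const batches = config.Parallelism === 0 ? 1 : Math.ceil(replicas / config.Parallelism);
  return Math.max(batches - 1, 0) * (config.Delay / NANOSECONDS_PER_SECOND);
}

console.log(minimumRolloutSeconds(5, updateConfig)); // 40
```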

High Availability

Manager Quorum

Maintain cluster consensus with proper manager count:

Managers	Fault Tolerance	Recommended For
1	0	Development
3	1	Small Production
5	2	Large Production
7	3	Enterprise

Avoid an even number of managers (2, 4, 6): the extra manager does not improve fault tolerance over the next lower odd count, and a network partition that splits the managers evenly leaves neither side with quorum.
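
The fault-tolerance figures follow from the majority rule: a cluster of N managers needs floor(N/2) + 1 of them reachable to keep quorum. A small sketch of that arithmetic (the function names are chosen for this example):

```typescript
// Number of managers that must be reachable for the cluster to make decisions.
function quorumSize(managers: number): number {
  return Math.floor(managers / 2) + 1;
}

// Number of managers that can fail while quorum is still held.
function faultTolerance(managers: number): number {
  return managers - quorumSize(managers);
}

for (let managers = 1; managers <= 7; managers++) {
  console.log(
    managers + " managers: quorum " + quorumSize(managers) + ", tolerates " + faultTolerance(managers)
  );
}
```

Running it shows that 4 managers tolerate the same single failure as 3, and 6 the same two failures as 5.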

Service Restart Policies

Configure automatic restart behavior:

TaskTemplate: {
  RestartPolicy: {
    Condition: "on-failure",
    Delay: 5000000000,  // 5 seconds in nanoseconds
    MaxAttempts: 3
  }
}
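
Because the Docker API expects durations in nanoseconds, a small conversion helper can make these values easier to read. This is a sketch: secondsToNanos is a hypothetical helper, and the Window value (the period over which restart attempts are counted) is an illustrative assumption.

```typescript
interface RestartPolicy {
  Condition: "none" | "on-failure" | "any";
  Delay: number; // nanoseconds between restart attempts
  MaxAttempts: number;
  Window: number; // nanoseconds over which attempts are evaluated
}

// Hypothetical helper: converts seconds to the nanosecond values the API expects.
function secondsToNanos(seconds: number): number {
  return Math.round(seconds * 1e9);
}

const restartPolicy: RestartPolicy = {
  Condition: "on-failure",
  Delay: secondsToNanos(5),
  MaxAttempts: 3,
  Window: secondsToNanos(120),
};

console.log(restartPolicy);
```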

Monitoring Swarm Services

Dokploy’s monitoring service tracks Swarm services:

metricsConfig: {
  containers: {
    refreshRate: 60,
    services: {
      include: ["my-swarm-service"],
      exclude: []
    }
  }
}

When monitoring Docker Compose or Swarm stacks, use the -p flag to properly identify all services within the stack.

Best Practices

Node Distribution

Spread manager nodes across different availability zones
Use at least 3 manager nodes for production
Keep critical services on manager nodes
Distribute worker nodes for load balancing

Resource Allocation

Set resource reservations for critical services
Define resource limits to prevent overconsumption
Use node labels for heterogeneous clusters
Monitor node resource utilization
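
Reservations and limits are expressed in the task template's Resources object, with CPU in units of 10^-9 CPUs (NanoCPUs) and memory in bytes. The sketch below shows the conversion; the resources helper and the specific values are assumptions for illustration.

```typescript
interface ResourceSpec {
  NanoCPUs: number;
  MemoryBytes: number;
}

// Hypothetical helper: converts CPU cores and MiB into the units the API expects.
function resources(cpus: number, memoryMiB: number): ResourceSpec {
  return {
    NanoCPUs: Math.round(cpus * 1e9),
    MemoryBytes: memoryMiB * 1024 * 1024,
  };
}

// Fragment intended for TaskTemplate.Resources.
const serviceResources = {
  Reservations: resources(0.25, 128), // guaranteed minimum used for scheduling
  Limits: resources(0.5, 256), // hard ceiling enforced at runtime
};

console.log(serviceResources);
```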

Network Design

Use overlay networks for multi-host communication
Enable encryption for sensitive traffic
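Open the ports Swarm needs between nodes: 2377/tcp (cluster management), 7946/tcp and 7946/udp (node communication), and 4789/udp (overlay network traffic)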
Use host mode publishing for Traefik

Service Configuration

Use rolling updates for zero-downtime deployments
Configure health checks for all services
Set appropriate restart policies
Implement placement constraints strategically

Backup and Recovery

Regular backups of Swarm state on manager nodes
Document cluster topology
Test disaster recovery procedures
Maintain quorum during maintenance

Troubleshooting

Node Not Joining

Problem: Node fails to join the Swarm

Solutions:

Verify port 2377 is open on manager nodes
Check network connectivity between nodes
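Ensure the join token is still valid (tokens do not expire, but rotating them with docker swarm join-token --rotate invalidates the old ones)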
Verify Docker is running on both nodes
Check for firewall blocking connections

Service Not Starting

Problem: Service tasks fail to start

Solutions:

Check service logs: docker service logs <service>
Verify image is accessible on all nodes
Check placement constraints are satisfiable
Ensure sufficient resources available
Review service configuration

Split-Brain Scenario

Problem: Manager nodes lose quorum

Solutions:

Check network connectivity between managers
Verify time synchronization (NTP)
Ensure odd number of managers
Review manager node status
Force a new cluster from a surviving manager if quorum cannot be restored (last resort): docker swarm init --force-new-cluster

Service Not Updating

Problem: Service update doesn’t take effect

Solutions:

Increment ForceUpdate counter
Check service version number
Verify no deployment conflicts
Review update logs
Try manual service removal and recreation

Network Issues

Problem: Services can’t communicate

Solutions:

Verify overlay network exists: docker network ls
Check service network configuration
Ensure both services on same network
Test DNS resolution within containers
Review network firewall rules
