
Overview

ACT (Action Chunking Transformer) is a policy designed for fine-grained bimanual manipulation tasks. It uses a transformer architecture with a variational objective to predict chunks of actions rather than single steps, making it particularly effective for manipulation tasks that require precise control. The policy was introduced in Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (Zhao et al., 2023) and is optimized for bimanual robots such as ALOHA.

Key Features

  • Action Chunking: Predicts multiple future actions at once (default: 100 steps)
  • Variational Objective: Uses a VAE to capture multimodal action distributions
  • Vision Backbone: ResNet-based image encoding with configurable backbone
  • Temporal Ensembling: Optional exponential weighting scheme for smoother action execution
  • Transformer Architecture: Encoder-decoder structure with configurable layers and dimensions

Architecture

The ACT architecture consists of:
  1. Vision Backbone: ResNet (default: ResNet18) for encoding camera observations
  2. Transformer Encoder: Processes visual and proprioceptive observations (default: 4 layers)
  3. Transformer Decoder: Generates action predictions (default: 1 layer; a bug in the original implementation meant only the first decoder layer was used, so the default of 1 reproduces its behavior)
  4. VAE Encoder (optional): Additional transformer for variational training (default: 4 layers)
The model predicts action “chunks”: sequences of actions that are executed over multiple timesteps, improving temporal consistency.
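
The sketch below illustrates chunked execution at inference time: one forward pass fills a queue with the next n_action_steps actions, and subsequent steps consume the queue. This is a minimal illustration, not LeRobot's internal code; predict_chunk is a dummy stand-in for the policy forward pass.

from collections import deque

import torch

chunk_size, n_action_steps, action_dim = 100, 100, 14

def predict_chunk(observation):
    # Dummy stand-in for the policy forward pass: returns chunk_size future actions.
    return torch.zeros(chunk_size, action_dim)

action_queue = deque()

def select_action(observation):
    # Run the (expensive) forward pass only when the queue is empty;
    # otherwise pop the next queued action.
    if not action_queue:
        action_queue.extend(predict_chunk(observation)[:n_action_steps])
    return action_queue.popleft()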

Training

Basic Training Command

lerobot-train \
  --policy.type=act \
  --dataset.repo_id=lerobot/aloha_mobile_cabinet

Training with Custom Configuration

lerobot-train \
  --policy.type=act \
  --dataset.repo_id=lerobot/aloha_mobile_cabinet \
  --policy.chunk_size=100 \
  --policy.n_action_steps=100 \
  --policy.dim_model=512 \
  --policy.use_vae=true \
  --batch_size=8 \
  --steps=100000

Python API Training Example

from pathlib import Path
import torch
from lerobot.configs.types import FeatureType
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.utils import dataset_to_policy_features
from lerobot.policies.act.configuration_act import ACTConfig
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.policies.factory import make_pre_post_processors

# Set up
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset_id = "lerobot/aloha_mobile_cabinet"

# Configure policy features from dataset
dataset_metadata = LeRobotDatasetMetadata(dataset_id)
features = dataset_to_policy_features(dataset_metadata.features)

output_features = {key: ft for key, ft in features.items() if ft.type is FeatureType.ACTION}
input_features = {key: ft for key, ft in features.items() if key not in output_features}

# Create policy with configuration
cfg = ACTConfig(
    input_features=input_features,
    output_features=output_features,
    chunk_size=100,
    n_action_steps=100,
    dim_model=512,
    n_encoder_layers=4,
    n_decoder_layers=1,
    use_vae=True,
    latent_dim=32
)
policy = ACTPolicy(cfg)
preprocessor, postprocessor = make_pre_post_processors(cfg, dataset_stats=dataset_metadata.stats)

policy.train()
policy.to(device)

# Set up dataset with action chunking
def make_delta_timestamps(delta_indices, fps):
    if delta_indices is None:
        return [0]
    return [i / fps for i in delta_indices]

delta_timestamps = {
    "action": make_delta_timestamps(cfg.action_delta_indices, dataset_metadata.fps),
}
delta_timestamps |= {
    k: make_delta_timestamps(cfg.observation_delta_indices, dataset_metadata.fps)
    for k in cfg.image_features
}

dataset = LeRobotDataset(dataset_id, delta_timestamps=delta_timestamps)

# Create optimizer and dataloader
optimizer = cfg.get_optimizer_preset().build(policy.get_optim_params())
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    pin_memory=device.type != "cpu",
    drop_last=True,
)

# Training loop (one pass over the dataset; loop over epochs or steps as needed)
for batch in dataloader:
    # Move tensors to the policy's device before preprocessing.
    batch = {k: (v.to(device) if isinstance(v, torch.Tensor) else v) for k, v in batch.items()}
    batch = preprocessor(batch)
    loss, _ = policy.forward(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
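
After training, the policy weights and configuration can be saved with the standard Hub mixin (the output directory here is illustrative):

# Save the trained policy (weights + config); the directory name is illustrative.
policy.save_pretrained("outputs/act_aloha_mobile_cabinet")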

Configuration Parameters

Input/Output Structure

  • n_obs_steps (int, default: 1): Number of observation steps to pass to the policy. Currently only 1 is supported.
  • chunk_size (int, default: 100): Size of the action prediction chunk, in environment steps.
  • n_action_steps (int, default: 100): Number of action steps to execute per policy invocation. Must be ≤ chunk_size.

Vision Backbone

  • vision_backbone (str, default: "resnet18"): ResNet variant to use for image encoding (e.g., "resnet18", "resnet34").
  • pretrained_backbone_weights (str | None, default: "ResNet18_Weights.IMAGENET1K_V1"): Pretrained weights from torchvision. Set to None for random initialization.
  • replace_final_stride_with_dilation (bool, default: false): Whether to replace the ResNet's final 2x2 stride with a dilated convolution.
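
For example, to swap in a larger backbone, set both fields together (a sketch reusing the imports and input_features/output_features from the training example above; the weights string follows torchvision's enum naming):

cfg = ACTConfig(
    input_features=input_features,
    output_features=output_features,
    vision_backbone="resnet34",
    pretrained_backbone_weights="ResNet34_Weights.IMAGENET1K_V1",
)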

Transformer Architecture

  • dim_model (int, default: 512): Main hidden dimension of the transformer blocks.
  • n_heads (int, default: 8): Number of attention heads in the transformer blocks.
  • dim_feedforward (int, default: 3200): Dimension of the feed-forward layers in the transformer blocks.
  • feedforward_activation (str, default: "relu"): Activation function in the feed-forward layers.
  • n_encoder_layers (int, default: 4): Number of transformer encoder layers.
  • n_decoder_layers (int, default: 1): Number of transformer decoder layers. Kept at 1 to match the original implementation (see Architecture above).
  • pre_norm (bool, default: false): Whether to use pre-normalization in the transformer blocks.
  • dropout (float, default: 0.1): Dropout rate in the transformer layers.

VAE Configuration

  • use_vae (bool, default: true): Whether to use the variational objective during training.
  • latent_dim (int, default: 32): Dimensionality of the VAE latent space.
  • n_vae_encoder_layers (int, default: 4): Number of transformer layers in the VAE encoder.
  • kl_weight (float, default: 10.0): Weight of the KL-divergence loss term. Total loss = reconstruction_loss + kl_weight * kld_loss.
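
The combined objective can be sketched as follows. This is illustrative only, with dummy tensors; in the real model, mu and log_sigma_x2 (the latent mean and log-variance) come from the VAE encoder, and the reconstruction term is an L1 loss over the predicted action chunk:

import torch
import torch.nn.functional as F

# Dummy shapes for illustration: (batch, chunk_size, action_dim) and (batch, latent_dim).
pred_actions = torch.randn(8, 100, 14)
target_actions = torch.randn(8, 100, 14)
mu, log_sigma_x2 = torch.randn(8, 32), torch.randn(8, 32)
kl_weight = 10.0

recon_loss = F.l1_loss(pred_actions, target_actions)
# Standard KL divergence between N(mu, sigma^2) and N(0, I).
kld_loss = (-0.5 * (1 + log_sigma_x2 - mu.pow(2) - log_sigma_x2.exp())).sum(-1).mean()
loss = recon_loss + kl_weight * kld_loss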

Inference

  • temporal_ensemble_coeff (float | None, default: null): Coefficient for exponential temporal ensembling (a typical value is 0.01). When enabled, n_action_steps must be 1.
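
Under the exponential scheme, each cached prediction for the current timestep is weighted by exp(-coeff * i), where i indexes the predictions from oldest (i = 0) to newest, so a small positive coeff slightly favors older predictions. A minimal sketch of the weighted average with dummy data (not the library's internal code):

import torch

coeff = 0.01
# Suppose 4 earlier policy calls each produced a prediction for the current step.
predictions = torch.randn(4, 14)  # (num_cached_predictions, action_dim)

weights = torch.exp(-coeff * torch.arange(predictions.shape[0]))  # i = 0 is oldest
action = (weights.unsqueeze(-1) * predictions).sum(dim=0) / weights.sum()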

Optimization

  • optimizer_lr (float, default: 1e-5): Learning rate for the optimizer.
  • optimizer_weight_decay (float, default: 1e-4): Weight decay for the optimizer.
  • optimizer_lr_backbone (float, default: 1e-5): Learning rate for the vision backbone; can be set separately from optimizer_lr.
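
Because the backbone learning rate can differ, the backbone parameters form their own optimizer param group. A plain-AdamW sketch of the same idea, reusing the policy object from the training example (illustrative; it mirrors the defaults above rather than reproducing the library's preset code):

import torch

backbone_params = [p for n, p in policy.named_parameters() if "backbone" in n]
other_params = [p for n, p in policy.named_parameters() if "backbone" not in n]

optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": 1e-5},    # optimizer_lr
        {"params": backbone_params, "lr": 1e-5}, # optimizer_lr_backbone
    ],
    weight_decay=1e-4,                           # optimizer_weight_decay
)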

Normalization

  • normalization_mapping (dict): Normalization mode for each feature type. Default: {"VISUAL": "MEAN_STD", "STATE": "MEAN_STD", "ACTION": "MEAN_STD"}.
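
To override the defaults, pass a mapping from feature type to mode when building the config (a sketch assuming the NormalizationMode enum from lerobot.configs.types and the features from the training example above):

from lerobot.configs.types import NormalizationMode

cfg = ACTConfig(
    input_features=input_features,
    output_features=output_features,
    normalization_mapping={
        "VISUAL": NormalizationMode.MEAN_STD,
        "STATE": NormalizationMode.MIN_MAX,   # e.g. min/max scaling for joint states
        "ACTION": NormalizationMode.MIN_MAX,
    },
)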

Usage Example

Loading a Pretrained Model

import torch

from lerobot.policies.act.modeling_act import ACTPolicy

# Load from Hugging Face Hub
policy = ACTPolicy.from_pretrained("lerobot/act_aloha_mobile_cabinet")

# Use for inference
policy.eval()
with torch.no_grad():
    action = policy.select_action(observation)
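
select_action expects a batch dictionary whose keys and shapes match the features the checkpoint was trained on. For an ALOHA-style setup it might look like the sketch below; the key names and shapes are illustrative, not guaranteed by any particular checkpoint:

import torch

# Hypothetical observation batch; real keys/shapes must match the training features.
observation = {
    "observation.state": torch.zeros(1, 14),                # (batch, state_dim)
    "observation.images.top": torch.zeros(1, 3, 480, 640),  # (batch, C, H, W)
}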

Inference with Temporal Ensembling

from lerobot.policies.act.configuration_act import ACTConfig
from lerobot.policies.act.modeling_act import ACTPolicy

cfg = ACTConfig(
    input_features=input_features,
    output_features=output_features,
    temporal_ensemble_coeff=0.01,  # Enable temporal ensembling
    n_action_steps=1  # Required when using temporal ensembling
)
policy = ACTPolicy(cfg)

# Reset ensembler when environment resets
policy.reset()

# Run inference at every step
for step in range(episode_length):
    action = policy.select_action(observation)

File Locations

Source files in the LeRobot repository:
  • Configuration: src/lerobot/policies/act/configuration_act.py
  • Model: src/lerobot/policies/act/modeling_act.py
  • Processor: src/lerobot/policies/act/processor_act.py
  • Examples: examples/tutorial/act/

Citation

@article{zhao2023learning,
  title={Learning fine-grained bimanual manipulation with low-cost hardware},
  author={Zhao, Tony Z and Kumar, Vikash and Levine, Sergey and Finn, Chelsea},
  journal={arXiv preprint arXiv:2304.13705},
  year={2023}
}
