
What is CoCa?

CoCa (Contrastive Captioner) is an extension of CLIP that combines:
  1. Contrastive Learning: Standard CLIP image-text matching
  2. Generative Captioning: Auto-regressive caption generation
This dual objective enables CoCa models to:
  • Perform zero-shot image classification (like CLIP)
  • Generate natural language captions for images
  • Achieve better representations through the combined training signal
Paper: CoCa: Contrastive Captioners are Image-Text Foundation Models
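The combined objective is a weighted sum of an InfoNCE-style contrastive term and a per-token cross-entropy caption term. Here is a minimal, self-contained sketch of that sum (an illustration, not the OpenCLIP implementation; the 1.0/2.0 weights mirror the defaults used in the training commands later on this page):

```python
import math

def softmax_xent(logits, target):
    """Cross-entropy of one row of logits against a target index."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[target] / sum(exps))

def coca_loss(sim, caption_token_losses, w_con=1.0, w_cap=2.0):
    """sim[i][j] is the (scaled) similarity of image i and text j.
    Contrastive term: each image should match its own caption (the diagonal).
    Caption term: mean per-token cross-entropy from the multimodal decoder."""
    n = len(sim)
    contrastive = sum(softmax_xent(sim[i], i) for i in range(n)) / n
    caption = sum(caption_token_losses) / len(caption_token_losses)
    return w_con * contrastive + w_cap * caption

# Well-separated pairs give a tiny contrastive loss, so the total is
# dominated by the caption term (weighted 2x).
loss = coca_loss([[5.0, 0.0], [0.0, 5.0]], [1.2, 0.8])
print(round(loss, 3))  # 2.007
```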

CoCa Architecture

CoCa adds a multimodal text decoder on top of the standard CLIP architecture:
[Image] → Image Encoder → Image Features ─┐
                                          ├─→ Contrastive Loss
[Text]  → Text Encoder  → Text Features ──┘
                               │
                               └─→ Multimodal Decoder ──→ Caption Loss
                                   (cross-attends to Image Features)
Key components:
  • Image Encoder: Same as CLIP (ViT, ResNet, etc.)
  • Unimodal Text Encoder: Encodes text for contrastive learning
  • Multimodal Text Decoder: Cross-attends to image features to generate captions

Available CoCa Models

OpenCLIP provides several CoCa model configurations:

Model Configs

Model                  | Image Encoder | Text Encoder | Multimodal Decoder
---------------------- | ------------- | ------------ | --------------------
coca_base              | ViT-B/16      | Transformer  | 12-layer Transformer
coca_ViT-B-32          | ViT-B/32      | Transformer  | 12-layer Transformer
coca_ViT-L-14          | ViT-L/14      | Transformer  | 12-layer Transformer
coca_roberta-ViT-B-32  | ViT-B/32      | RoBERTa      | 12-layer Transformer

Multimodal Decoder Configuration

Example configuration from coca_ViT-B-32:
"multimodal_cfg": {
    "context_length": 76,
    "vocab_size": 49408,
    "width": 512,
    "heads": 8,
    "layers": 12,
    "latent_dim": 512,
    "attn_pooler_heads": 8
}
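As a back-of-the-envelope check on this config, a standard transformer block has roughly 12 × width² parameters (about 4 × width² for attention and 8 × width² for the MLP), so the decoder size can be estimated as follows (a rough sketch that ignores embeddings, biases, layer norms, and the attentional pooler):

```python
# Rough parameter count for the multimodal decoder config above.
cfg = {"width": 512, "layers": 12}
per_layer = 12 * cfg["width"] ** 2      # ~12 * width^2 per transformer block
decoder_params = cfg["layers"] * per_layer
print(f"~{decoder_params / 1e6:.0f}M parameters")  # ~38M parameters
```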

Training CoCa from Scratch

Basic CoCa Training

Train CoCa with both contrastive and captioning objectives:
python -m open_clip_train.main \
    --model coca_ViT-L-14 \
    --train-data "/data/train.tar" \
    --train-num-samples 10000000 \
    --dataset-type webdataset \
    --batch-size 128 \
    --precision amp \
    --workers 8 \
    --lr 1e-3 \
    --wd 0.1 \
    --warmup 10000 \
    --epochs 32 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0 \
    --report-to wandb
Loss weights:
  • --coca-contrastive-loss-weight 1.0: Weight for CLIP contrastive loss
  • --coca-caption-loss-weight 2.0: Weight for caption generation loss

Multi-GPU CoCa Training

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model coca_ViT-L-14 \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --workers 8 \
    --lr 5e-4 \
    --wd 0.2 \
    --warmup 10000 \
    --epochs 32 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb

Fine-tuning CoCa

Fine-tuning on MSCOCO Captions

OpenCLIP provides a pretrained CoCa model that can be fine-tuned for captioning:
python -m open_clip_train.main \
    --model coca_ViT-L-14 \
    --pretrained laion2b_s13b_b90k \
    --dataset-type csv \
    --train-data "/data/mscoco/train2014.csv" \
    --csv-img-key filepath \
    --csv-caption-key title \
    --csv-separator "\t" \
    --warmup 1000 \
    --batch-size 128 \
    --lr 1e-5 \
    --wd 0.1 \
    --epochs 1 \
    --workers 3 \
    --coca-contrastive-loss-weight 0 \
    --coca-caption-loss-weight 1 \
    --report-to wandb \
    --log-every-n-steps 100
Key changes for fine-tuning:
  • --pretrained laion2b_s13b_b90k: Start from pretrained weights
  • --lr 1e-5: Lower learning rate for fine-tuning
  • --epochs 1: Fine-tune for fewer epochs
  • --coca-contrastive-loss-weight 0: Disable the contrastive loss (captioning only)
  • --coca-caption-loss-weight 1: Train with the captioning objective alone (gradients still flow through the full model)

Preparing MSCOCO Data

Create a CSV file with image paths and captions using CLIP_benchmark:
from clip_benchmark.datasets.builder import build_dataset
import pandas as pd
import os

root_path = "path/to/data/dir"  # Set this to your data directory

# Download and load MSCOCO
ds = build_dataset("mscoco_captions", root=root_path, split="train", task="captioning")
coco = ds.coco
imgs = coco.loadImgs(coco.getImgIds())

# Create CSV with all image-caption pairs
future_df = {"filepath": [], "title": []}
for img in imgs:
    caps = coco.imgToAnns[img["id"]]
    for cap in caps:
        future_df["filepath"].append(img["file_name"])
        future_df["title"].append(cap["caption"])

# Save to CSV
pd.DataFrame.from_dict(future_df).to_csv(
    os.path.join(root_path, "train2014.csv"),
    index=False,
    sep="\t"
)
This creates a tab-separated CSV:
filepath\ttitle
COCO_train2014_000000000009.jpg\tA person on a motorcycle on a dirt road
COCO_train2014_000000000009.jpg\tA man riding a motorcycle down a dirt road
...
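Before launching training, it is worth sanity-checking the TSV with the standard-library csv module (a minimal check; the column names and tab separator must match the --csv-img-key, --csv-caption-key, and --csv-separator flags used above):

```python
import csv
import io

# Inline stand-in for open(os.path.join(root_path, "train2014.csv")).
sample = (
    "filepath\ttitle\n"
    "COCO_train2014_000000000009.jpg\tA person on a motorcycle on a dirt road\n"
    "COCO_train2014_000000000009.jpg\tA man riding a motorcycle down a dirt road\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
assert rows and set(rows[0]) == {"filepath", "title"}, "unexpected columns"
print(f"{len(rows)} image-caption pairs")  # 2 image-caption pairs
```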

Generating Captions with CoCa

Basic Caption Generation

import open_clip
import torch
from PIL import Image

# Load pretrained CoCa model
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)

# Load and preprocess image
im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

# Generate caption
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

# Decode and print
caption = open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", "")
print(caption)
# Output: "a cat sitting on a windowsill"

Batch Caption Generation

import open_clip
import torch
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

# Load multiple images
images = [
    Image.open("cat.jpg").convert("RGB"),
    Image.open("dog.jpg").convert("RGB"),
    Image.open("car.jpg").convert("RGB"),
]

# Preprocess
images_tensor = torch.stack([transform(im) for im in images])

# Generate captions
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(images_tensor)

# Decode all captions
for i, gen in enumerate(generated):
    caption = open_clip.decode(gen).split("<end_of_text>")[0].replace("<start_of_text>", "")
    print(f"Image {i}: {caption}")

Advanced Generation Options

# Generate with custom parameters (top_p only takes effect with
# generation_type="top_p"; the default is beam search)
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(
        im,
        seq_len=77,                 # Maximum sequence length
        temperature=1.0,            # Sampling temperature
        generation_type="top_p",    # Nucleus sampling instead of beam search
        top_p=0.9,                  # Nucleus sampling threshold
    )

CoCa vs CLIP

When to Use CoCa

Use CoCa when:
  • You need both contrastive and generative capabilities
  • Image captioning is important for your application
  • You want richer image-text representations
  • You have data with detailed captions

When to Use CLIP

Use CLIP when:
  • You only need contrastive learning (classification, retrieval)
  • Training speed is critical (CoCa is slower due to caption generation)
  • You have limited compute resources
  • Your captions are short or simple

Training Time Comparison

Model          | Architecture           | Training Speed (relative) | Memory Usage (relative)
-------------- | ---------------------- | ------------------------- | -----------------------
CLIP ViT-L/14  | Dual encoder           | 1.0×                      | 1.0×
CoCa ViT-L/14  | Dual encoder + decoder | 0.6×                      | 1.4×
CoCa is slower due to the autoregressive caption generation during training.
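For planning, the 0.6× figure translates directly into wall-clock time (illustrative arithmetic only; actual throughput depends on hardware, batch size, and sequence length):

```python
# If a CLIP run takes `clip_days`, the same run with CoCa at 0.6x speed:
clip_days = 10.0                 # hypothetical CLIP training time
coca_days = clip_days / 0.6
print(f"~{coca_days:.1f} days")  # ~16.7 days
```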

Example Training Configurations

Small-Scale CoCa Training

# CoCa ViT-B/32 on CC3M
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --model coca_ViT-B-32 \
    --train-data "/data/cc3m/train-{0000..0331}.tar" \
    --train-num-samples 3000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --precision amp \
    --workers 4 \
    --lr 1e-3 \
    --wd 0.1 \
    --warmup 5000 \
    --epochs 30 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0 \
    --report-to tensorboard

Large-Scale CoCa Training

# CoCa ViT-L/14 on LAION-400M
srun python -u src/open_clip_train/main.py \
    --model coca_ViT-L-14 \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --workers 8 \
    --lr 5e-4 \
    --wd 0.2 \
    --warmup 10000 \
    --epochs 32 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb \
    --remote-sync s3://bucket/coca-checkpoints

CoCa with RoBERTa Text Encoder

# Use RoBERTa for the unimodal text encoder
python -m open_clip_train.main \
    --model coca_roberta-ViT-B-32 \
    --train-data "/data/train.tar" \
    --batch-size 256 \
    --precision amp \
    --workers 4 \
    --lr 5e-4 \
    --wd 0.1 \
    --warmup 2000 \
    --epochs 10 \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0

Pretrained CoCa Models

OpenCLIP provides pretrained CoCa models:
import open_clip

# List available CoCa models
models = open_clip.list_pretrained()
coca_models = [m for m in models if 'coca' in m[0].lower()]
for model_name, pretrained in coca_models:
    print(f"{model_name}: {pretrained}")

# Load pretrained CoCa
model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14',
    pretrained='mscoco_finetuned_laion2B-s13B-b90k'
)
Available pretrained weights:
  • laion2b_s13b_b90k: Pretrained on LAION-2B
  • mscoco_finetuned_laion2B-s13B-b90k: LAION-2B pretraining + MSCOCO fine-tuning

Using CoCa for Multiple Tasks

Image Classification (Zero-Shot)

import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14',
    pretrained='mscoco_finetuned_laion2B-s13B-b90k'
)
tokenizer = open_clip.get_tokenizer('coca_ViT-L-14')

image = preprocess(Image.open("dog.jpg")).unsqueeze(0)
text = tokenizer(["a dog", "a cat", "a car"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    
print("Probabilities:", similarity)

Image Captioning

# Same model, different usage
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(image)

caption = open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", "")
print("Caption:", caption)

Image-Text Retrieval

# Encode multiple images and texts
images = torch.stack([preprocess(Image.open(f"image_{i}.jpg")) for i in range(10)])
texts = tokenizer([f"caption {i}" for i in range(100)])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(images)
    text_features = model.encode_text(texts)
    
    # Normalize
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Compute similarity matrix
    similarity = image_features @ text_features.T  # [10, 100]
    
    # Top-5 captions for each image
    top5 = similarity.topk(5, dim=-1)
    print("Top 5 matches:", top5.indices)

Tips for Training CoCa

Training tips:
  • Start with contrastive-only training, then add caption loss gradually
  • Use higher weight for caption loss (2.0 vs 1.0 for contrastive)
  • Fine-tune on high-quality caption datasets (MSCOCO) for best generation
  • Use gradient checkpointing for memory efficiency with large models
  • Caption generation is slower - expect 40-60% of CLIP training speed
Common issues:
  • CoCa requires more memory due to the multimodal decoder
  • Gradient accumulation is not compatible with distillation for CoCa
  • Caption quality depends heavily on training data quality
  • Very short captions may not benefit from the generative objective
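The first tip ("add caption loss gradually") is not a built-in flag; --coca-caption-loss-weight is fixed for a run. A custom training loop could ramp the weight in with a linear schedule like this sketch (caption_weight is a hypothetical helper, not an OpenCLIP API):

```python
def caption_weight(step, ramp_steps=10_000, final_weight=2.0):
    """Linearly ramp the caption loss weight from 0 to final_weight
    over the first ramp_steps optimizer steps, then hold it constant."""
    return final_weight * min(step / ramp_steps, 1.0)

print(caption_weight(0))       # 0.0
print(caption_weight(5_000))   # 1.0
print(caption_weight(20_000))  # 2.0
```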

Credits

CoCa implementation in OpenCLIP:

Next Steps

Training Overview

Learn about general CLIP training

Fine-tuning

Fine-tune CoCa models on custom datasets

Configuration

Explore all CoCa training parameters

Inference

Use pretrained CoCa for captioning and classification
