Quick Start Guide

This guide will help you run your first segmentation with SAM 3 in just a few minutes. We’ll cover both image and video segmentation with text prompts.
Before starting, make sure you have installed SAM 3 and authenticated with Hugging Face to access the model checkpoints.
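If you still need to set up your environment, a typical sequence looks like the following. The package name here is an assumption; check the SAM 3 repository for the exact install instructions.

```shell
# Install the package (name assumed; see the SAM 3 repo for exact install steps)
pip install sam3

# Log in so gated model checkpoints can be downloaded from Hugging Face
huggingface-cli login
```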

Image Segmentation

Let’s start with a simple image segmentation example using a text prompt.
Step 1: Import Dependencies

Import the required modules:
import torch
from PIL import Image
from sam3 import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
Step 2: Enable GPU Optimizations

Enable TensorFloat-32 and automatic mixed precision for faster inference:
# Enable TF32 for Ampere GPUs (A100, RTX 30/40 series)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Turn on bfloat16 automatic mixed precision for the rest of the script.
# Calling __enter__() keeps the autocast context active globally; in your own
# code you can instead wrap inference in
# `with torch.autocast("cuda", dtype=torch.bfloat16):`
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
TF32 provides a good balance between performance and accuracy on Ampere and newer GPUs.
Step 3: Load the Model

Load the SAM 3 image model and create a processor:
# Build the model (downloads checkpoint on first run)
model = build_sam3_image_model()

# Create processor with default settings
processor = Sam3Processor(model)
The model will be automatically downloaded from Hugging Face on first use. This may take a few minutes depending on your internet connection.
Step 4: Load Your Image

Load an image using PIL:
# Load your image
image = Image.open("path/to/your/image.jpg")

# Set the image in the processor
inference_state = processor.set_image(image)
Step 5: Prompt with Text

Use a text prompt to segment objects:
# Segment using a text description
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a person wearing a red shirt"
)

# Extract results
masks = output["masks"]     # Segmentation masks
boxes = output["boxes"]     # Bounding boxes
scores = output["scores"]   # Confidence scores
Step 6: Visualize Results

Display the segmentation results:
import matplotlib.pyplot as plt
import numpy as np

# Show the original image
plt.figure(figsize=(10, 10))
plt.imshow(image)

# Overlay masks
for mask, score in zip(masks, scores):
    if score > 0.5:  # Filter by confidence
        # Convert mask to numpy and show as overlay
        mask_np = mask.cpu().numpy()
        plt.imshow(mask_np, alpha=0.5, cmap='jet')

plt.axis('off')
plt.show()
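Before plotting, it can be useful to sanity-check each mask by measuring how much of the image it covers. The helper below is a plain-Python sketch, not part of the SAM 3 API; with real outputs you would first convert each mask via `mask.cpu().numpy()` and could use vectorized operations instead.

```python
def mask_coverage(mask):
    """Fraction of pixels covered by a 2D binary mask (list of rows)."""
    total = sum(len(row) for row in mask)
    covered = sum(sum(1 for v in row if v) for row in mask)
    return covered / total if total else 0.0

# Tiny 2x4 example mask: 3 of 8 pixels set
example = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
print(mask_coverage(example))  # 0.375
```

A coverage near 1.0 usually means the mask spilled over the whole frame, while a value near 0.0 may indicate a spurious detection worth filtering out.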

Complete Image Example

Here’s the complete code for image segmentation:
import torch
from PIL import Image
from sam3 import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Enable optimizations
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()

# Load model and processor
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load and process image
image = Image.open("your_image.jpg")
inference_state = processor.set_image(image)

# Run segmentation with text prompt
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a cat"
)

# Get results
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
print(f"Found {len(masks)} objects with average score {scores.mean():.2f}")

Video Segmentation

SAM 3 also supports video segmentation with temporal tracking.
Step 1: Import the Video Predictor

Import the video predictor:
from sam3 import build_sam3_video_predictor
Step 2: Build the Video Predictor

Create the video predictor:
video_predictor = build_sam3_video_predictor()
Step 3: Start a Session

Start a video segmentation session:
# Path to video (MP4 file or directory of JPEG frames)
video_path = "path/to/your/video.mp4"

# Start session
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path=video_path,
    )
)

session_id = response["session_id"]
Step 4: Add a Text Prompt

Add a text prompt to segment objects across frames:
# Add prompt on a specific frame
response = video_predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=session_id,
        frame_index=0,  # Frame to prompt on
        text="a dog running",
    )
)

# Get segmentation for all frames
output = response["outputs"]

Complete Video Example

from sam3 import build_sam3_video_predictor

# Build predictor
video_predictor = build_sam3_video_predictor()

# Start session
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path="video.mp4",
    )
)

# Add text prompt
response = video_predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=response["session_id"],
        frame_index=0,
        text="person in blue jacket",
    )
)

# Get tracked segments across all frames
output = response["outputs"]
print(f"Segmented {len(output)} frames")
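The exact structure of `outputs` depends on the predictor version; see the API reference for details. Assuming each entry maps a frame index to per-object masks (an assumption for illustration only), a small post-processing pass might look like:

```python
# Hedged sketch: assumes `outputs` is a dict mapping frame_index -> {object_id: mask}.
def count_objects_per_frame(outputs):
    """Number of tracked objects present on each frame."""
    return {frame: len(objects) for frame, objects in outputs.items()}

# Dummy data standing in for real predictor outputs
dummy_outputs = {0: {1: "mask"}, 1: {1: "mask", 2: "mask"}}
print(count_objects_per_frame(dummy_outputs))  # {0: 1, 1: 2}
```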

Adding Geometric Prompts

You can also use geometric prompts (boxes, points) in addition to or instead of text.

Box Prompts

import torch
from sam3.model.box_ops import box_xywh_to_cxcywh

# Define a box in [x, y, width, height] format (normalized 0-1)
box_xywh = torch.tensor([0.3, 0.3, 0.2, 0.4])

# Convert to the [center_x, center_y, width, height] format used by the processor
box = box_xywh_to_cxcywh(box_xywh)

# Add a positive box prompt
output = processor.add_geometric_prompt(
    box=box,
    label=True,  # True for positive, False for negative
    state=inference_state
)

Combining Text and Geometric Prompts

# First set text prompt
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a car"
)

# Then refine with a box prompt
box = [0.5, 0.5, 0.3, 0.2]
output = processor.add_geometric_prompt(
    box=box,
    label=True,
    state=inference_state
)

Batch Processing

Process multiple images efficiently in batch:
import PIL.Image

# Load multiple images
images = [
    PIL.Image.open("image1.jpg"),
    PIL.Image.open("image2.jpg"),
    PIL.Image.open("image3.jpg"),
]

# Set image batch
inference_state = processor.set_image_batch(images)

# Run batch inference with text prompts
prompts = ["a dog", "a cat", "a bird"]
outputs = processor.set_text_prompt_batch(
    prompts=prompts,
    state=inference_state
)

Configuration Options

Customize the processor for your use case:
# Custom processor settings
processor = Sam3Processor(
    model=model,
    resolution=1008,              # Input resolution (default: 1008)
    device="cuda",                # Device (cuda/cpu)
    confidence_threshold=0.5      # Minimum confidence score
)

Tips for Best Results

Prompting
  • Be specific: “a person in a red jacket” works better than “person”
  • Use descriptive attributes: colors, positions, actions
  • For multiple objects: “all dogs in the image” or simply “dogs”
  • Negative examples: prompts with no matches return empty results

Performance
  • Use batch processing for multiple images
  • Enable TF32 and mixed precision (bfloat16)
  • Lower the resolution for faster inference (at some cost in quality)
  • Wrap inference in torch.inference_mode() or torch.no_grad() contexts

If you get no results
  • Adjust confidence_threshold in the processor settings
  • Try a more specific text prompt
  • Combine text with geometric prompts for better accuracy
  • Some objects may genuinely not be present in the image

Video
  • Start prompts on frames where objects are clearly visible
  • Use multiple prompts across different frames for better tracking
  • The video source can be an MP4 file or a directory of JPEG frames
  • Session management allows processing multiple videos
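Several of the tips above amount to post-filtering detections by score. Here is a minimal plain-Python sketch of that idea; real `masks` and `scores` are tensors, so in practice you would use boolean indexing such as `masks[scores > 0.5]` instead.

```python
def filter_by_confidence(masks, scores, threshold=0.5):
    """Keep only detections whose score exceeds the threshold."""
    kept = [(m, s) for m, s in zip(masks, scores) if s > threshold]
    if not kept:
        return [], []
    masks_out, scores_out = zip(*kept)
    return list(masks_out), list(scores_out)

masks, scores = ["m1", "m2", "m3"], [0.9, 0.3, 0.7]
print(filter_by_confidence(masks, scores))  # (['m1', 'm3'], [0.9, 0.7])
```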

Next Steps

Now that you’ve run your first segmentation, explore more advanced features:

API Reference

Detailed API documentation for all SAM 3 components

Batched Inference

Process multiple images efficiently

Video Tracking

Deep dive into video segmentation and tracking

Interactive Refinement

Learn to refine segmentations with points and boxes
