This guide will get you up and running with OpenCLIP in just a few minutes. You’ll learn how to load a pretrained model and perform zero-shot image classification.
Here’s a complete example of loading a model and classifying an image:
### 1. Import libraries

Import OpenCLIP and required dependencies:

```python
import torch
from PIL import Image
import open_clip
```
### 2. Load model and preprocessing

Create a model with pretrained weights and get the preprocessing transform:

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()  # Set to evaluation mode

# Get the tokenizer for text
tokenizer = open_clip.get_tokenizer('ViT-B-32')
```
Models are in training mode by default, which affects BatchNorm and dropout layers. Always call `model.eval()` for inference.
### 3. Prepare image and text

Load and preprocess an image, then tokenize text labels:

```python
# Load and preprocess an image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define candidate labels
text = tokenizer(["a diagram", "a dog", "a cat"])
```
### 4. Compute embeddings and similarity

Run inference to get image-text similarity scores:

```python
# autocast enables mixed precision on GPU; it is a no-op without CUDA
with torch.no_grad(), torch.autocast("cuda"):
    # Encode image and text
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize features to unit length
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Calculate similarity and get probabilities
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", text_probs)
# Output: Label probabilities: tensor([[0.9927, 0.0038, 0.0035]])
```
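The arithmetic inside that block can be illustrated without torch: L2-normalize each embedding, take dot products (which are cosine similarities once the vectors are unit length), scale by CLIP's logit factor of 100, and apply softmax. A minimal sketch with made-up 3-dimensional vectors (real ViT-B-32 embeddings are 512-dimensional):

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up toy embeddings, standing in for model.encode_image / encode_text output
image_emb = normalize([0.9, 0.1, 0.2])
text_embs = [normalize(t) for t in ([0.8, 0.2, 0.3],
                                    [0.1, 0.9, 0.1],
                                    [0.2, 0.1, 0.9])]

# Cosine similarity = dot product of unit vectors; 100.0 is CLIP's logit scale
logits = [100.0 * sum(a * b for a, b in zip(image_emb, t)) for t in text_embs]
probs = softmax(logits)
print(probs)  # the closest text embedding gets nearly all the probability mass
```

The large logit scale is why the output probabilities are so peaked: even modest differences in cosine similarity become near-certainty after softmax.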
### 5. Interpret results

Get the most likely label:

```python
# Get the top prediction
labels = ["a diagram", "a dog", "a cat"]
top_prob, top_idx = text_probs[0].max(dim=0)
print(f"Predicted: {labels[top_idx]} ({top_prob.item():.1%} confidence)")
# Output: Predicted: a diagram (99.3% confidence)
```
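Because `text_probs[0]` is just a vector of probabilities aligned with the label list, you can rank every candidate rather than only taking the top one. A plain-Python sketch using the probabilities from the tutorial's output:

```python
# Probabilities copied from the tutorial's example output, one per label
labels = ["a diagram", "a dog", "a cat"]
probs = [0.9927, 0.0038, 0.0035]

# Rank all labels from most to least likely
ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
for label, prob in ranked:
    print(f"{label}: {prob:.1%}")
```

The same pattern works on the real tensor via `text_probs[0].tolist()`.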
OpenCLIP provides 80+ pretrained models. List them all:

```python
import open_clip

# Get all available models and their pretrained variants
models = open_clip.list_pretrained()

# Print the first 10 (model_name, pretrained_tag) pairs
for model_name, pretrained in models[:10]:
    print(f"{model_name}: {pretrained}")
```
Each model can have multiple pretrained versions trained on different datasets (OpenAI, LAION-400M, LAION-2B, DataComp) with different training configurations.
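Since `list_pretrained()` returns flat `(model_name, pretrained_tag)` pairs, you can group the tags by model with plain Python. A sketch over a small hand-written sample of real pairs (in practice, replace `pairs` with `open_clip.list_pretrained()`):

```python
from collections import defaultdict

# Sample (model_name, pretrained_tag) pairs; in real use this comes
# from open_clip.list_pretrained()
pairs = [
    ("ViT-B-32", "openai"),
    ("ViT-B-32", "laion400m_e31"),
    ("ViT-B-32", "laion2b_s34b_b79k"),
    ("ViT-L-14", "openai"),
    ("ViT-L-14", "laion2b_s32b_b82k"),
]

# Group pretrained tags under each model name
tags_by_model = defaultdict(list)
for model_name, tag in pairs:
    tags_by_model[model_name].append(tag)

for model_name, tags in tags_by_model.items():
    print(f"{model_name}: {', '.join(tags)}")
```

This makes it easy to see at a glance which datasets a given architecture was trained on before picking a `pretrained` tag.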
Here's a complete zero-shot classification example using your own label set:

```python
from PIL import Image
import open_clip
import torch

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='openai')
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Define your custom classes
classes = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird",
    "a photo of a fish",
    "a photo of a horse",
]

image = preprocess(Image.open("animal.jpg")).unsqueeze(0)
text = tokenizer(classes)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]

# Print results
for class_name, prob in zip(classes, probs):
    print(f"{class_name}: {prob.item():.2%}")
```