Available models
| Model | Context length | Embedding dimensions | Model ID |
|---|---|---|---|
| Voyage Multimodal 3 | 32,000 tokens | 1024 | voyage-multimodal-3 |
With the multimodal model, images are counted as tokens using a rate of approximately 560 pixels per token. Each input must not exceed 32,000 tokens, and the total across all inputs must not exceed 320,000 tokens.
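The two limits above can be checked before a request is sent. The following is a minimal sketch, assuming you already have an estimated token count for each input; the function name `check_token_budget` is hypothetical and not part of any library.

```python
# Limits quoted in the text above.
MAX_TOKENS_PER_INPUT = 32_000
MAX_TOKENS_PER_REQUEST = 320_000


def check_token_budget(token_counts):
    """Return a list of problems; an empty list means the batch fits.

    `token_counts` holds one estimated token count per input
    (text tokens plus image tokens for that input).
    """
    problems = []
    for index, count in enumerate(token_counts):
        if count > MAX_TOKENS_PER_INPUT:
            problems.append(
                f"input {index}: {count} tokens exceeds {MAX_TOKENS_PER_INPUT}"
            )
    total = sum(token_counts)
    if total > MAX_TOKENS_PER_REQUEST:
        problems.append(
            f"request total: {total} tokens exceeds {MAX_TOKENS_PER_REQUEST}"
        )
    return problems
```

Note that a batch can fail the total limit even when every input passes the per-input limit, for example eleven inputs of 30,000 tokens each.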
Image constraints
When working with images, keep these limits in mind:
- Maximum 1,000 values per request
- Maximum 16 million pixels per image
- Maximum 20 MB per image
- Supported formats: PNG, JPEG, WebP, GIF
- Images must be provided as URLs or Base64-encoded data URLs
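As an illustration of the limits above, the sketch below validates a local image and wraps it in a Base64 data URL. It uses only the Python standard library, so the caller supplies the pixel dimensions. The helper name `to_data_url` is hypothetical, and the exact readings of "16 million pixels" and "20 MB" are assumptions.

```python
import base64

MAX_PIXELS = 16_000_000       # "16 million pixels", read as 16 * 10**6 (assumption)
MAX_BYTES = 20 * 1000 * 1000  # "20 MB", read as decimal megabytes (assumption)

# Supported formats from the list above, mapped to their MIME types.
MIME_TYPES = {
    "png": "image/png",
    "jpeg": "image/jpeg",
    "jpg": "image/jpeg",
    "webp": "image/webp",
    "gif": "image/gif",
}


def to_data_url(image_bytes, extension, width, height):
    """Validate an image against the documented limits and return a data URL."""
    ext = extension.lower().lstrip(".")
    if ext not in MIME_TYPES:
        raise ValueError(f"unsupported format: {extension!r}")
    if len(image_bytes) > MAX_BYTES:
        raise ValueError(f"image is {len(image_bytes)} bytes, limit is {MAX_BYTES}")
    if width * height > MAX_PIXELS:
        raise ValueError(f"image has {width * height} pixels, limit is {MAX_PIXELS}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{MIME_TYPES[ext]};base64,{encoded}"
```

In practice `image_bytes` would come from reading the file in binary mode, and the dimensions from whatever imaging library you already use.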
Usage examples
Embed a single image
You can embed individual images by providing URLs or Base64-encoded strings.
Embed multiple images as a single embedding
You can combine multiple images into a single embedding by passing an array.
Combine text and images
For true multimodal embeddings, combine text and images in a single input.
Model settings
You can customize the multimodal model behavior with these settings.
Available settings
The type of input being embedded (inputType). This affects the prompt prepended to your inputs:
- For query: “Represent the query for retrieving supporting documents: ” is prepended
- For document: “Represent the document for retrieval: ” is prepended
Use query when embedding search queries and document when embedding items to be searched.
The encoding format for the output embeddings. When set to base64, embeddings are returned as Base64-encoded NumPy arrays of single-precision floats instead of raw float arrays. This can be useful for efficient transmission and storage.
Whether to automatically truncate inputs that exceed the context length limit. When enabled, long inputs are trimmed to fit within the 32,000 token limit.
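When embeddings are returned in the base64 encoding, they need to be unpacked before use. The sketch below decodes such a string with the standard library. It assumes the payload is a raw buffer of little-endian 4-byte floats with no file header; that layout is an assumption and should be verified against a real response. The name `decode_base64_embedding` is hypothetical.

```python
import base64
import struct


def decode_base64_embedding(encoded):
    """Decode a Base64 string of packed float32 values into a list of floats."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("byte length is not a multiple of 4")
    count = len(raw) // 4
    # "<" = little-endian, "f" = 4-byte single-precision float (assumed layout).
    return list(struct.unpack(f"<{count}f", raw))


# Round-trip check with values that are exactly representable in float32.
sample = base64.b64encode(struct.pack("<3f", 0.5, -1.0, 2.25)).decode("ascii")
print(decode_base64_embedding(sample))  # [0.5, -1.0, 2.25]
```

With NumPy available, `numpy.frombuffer(base64.b64decode(encoded), dtype=numpy.float32)` does the same job under the same layout assumption.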
Use cases
Visual search
Enable users to search using images by embedding both queries and documents with images.
Image similarity
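Comparing embeddings, whether for visual search or for image similarity, usually comes down to a vector similarity measure. The sketch below uses cosine similarity, a common choice but not one the source prescribes, and ranks stored embeddings against a query embedding. The function names and the toy vectors in the usage note are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar.

    `candidates` maps a name (for example a file path) to its embedding.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, `rank_by_similarity([1.0, 0.0], {"a": [0.9, 0.1], "b": [0.0, 1.0]})` places `"a"` first. Real embeddings from this model would have 1024 dimensions rather than two.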
Find similar images by comparing their embeddings.
Multimodal document understanding
Create rich document embeddings that capture both textual and visual information.
Best practices
Image preparation
- Optimize image size: Resize images to reasonable dimensions before encoding to stay within the 16 million pixel limit
- Use appropriate formats: JPEG for photos, PNG for graphics with transparency, WebP for optimal compression
- Cache Base64 encodings: Pre-encode frequently used images to reduce processing time
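To make the resizing advice concrete, the sketch below computes target dimensions that keep the aspect ratio while staying within the pixel limit. It only does the arithmetic; the actual resampling is left to your imaging library. The name `fit_within_pixel_limit` is hypothetical, and 16 million is read as 16 * 10**6.

```python
import math

MAX_PIXELS = 16_000_000  # "16 million pixels", read as 16 * 10**6 (assumption)


def fit_within_pixel_limit(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled down, if needed, to fit max_pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    if width * height <= max_pixels:
        return width, height  # already within the limit
    # Scale both sides by the same factor so the area shrinks to max_pixels.
    scale = math.sqrt(max_pixels / (width * height))
    new_width = max(1, math.floor(width * scale))
    new_height = max(1, math.floor(height * scale))
    # Guard against floating-point rounding pushing the area over the limit.
    while new_width * new_height > max_pixels and new_width > 1:
        new_width -= 1
    return new_width, new_height
```

For instance, an 8000x8000 image comes out as 4000x4000, which is exactly 16 million pixels.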
Token management
With images consuming approximately 560 pixels per token:
- A 1024x1024 image ≈ 1,872 tokens
- A 512x512 image ≈ 468 tokens
- A 256x256 image ≈ 117 tokens
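The estimates above follow from dividing the pixel count by 560. A minimal sketch of that arithmetic, assuming simple rounding to the nearest token (the service's exact rounding is not stated here) and using the hypothetical name `estimate_image_tokens`:

```python
PIXELS_PER_TOKEN = 560  # approximate rate quoted in the text


def estimate_image_tokens(width, height):
    """Estimate the token cost of an image from its pixel dimensions."""
    return round(width * height / PIXELS_PER_TOKEN)


for side in (256, 512, 1024):
    print(f"{side}x{side}: ~{estimate_image_tokens(side, side)} tokens")
```

Because the rate is approximate, treat the result as a planning figure and leave some headroom below the 32,000 token per-input limit.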
Input type selection
Always set inputType based on your use case: query when embedding search queries, and document when embedding items to be searched.
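The choice can be captured in a small helper so that indexing code and search code cannot mix the two values up. The sketch below is illustrative only: `choose_input_type` and `prompt_prefix_for` are hypothetical names, and the prefix strings are copied from the settings description. The prefix is applied for you based on the input type, so the lookup here serves only to show what each value implies.

```python
# Prompt prefixes as quoted in the settings description.
PROMPT_PREFIXES = {
    "query": "Represent the query for retrieving supporting documents: ",
    "document": "Represent the document for retrieval: ",
}


def choose_input_type(is_search_query):
    """Pick the input type: 'query' for searches, 'document' for indexed items."""
    return "query" if is_search_query else "document"


def prompt_prefix_for(input_type):
    """Return the prefix associated with an input type, for inspection only."""
    if input_type not in PROMPT_PREFIXES:
        raise ValueError(f"unknown input type: {input_type!r}")
    return PROMPT_PREFIXES[input_type]


# Indexing a product catalogue uses "document"; the user's search uses "query".
print(choose_input_type(is_search_query=False))
print(choose_input_type(is_search_query=True))
```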