Context Caching
Context caching allows you to store frequently used input tokens in a dedicated cache, eliminating the need to repeatedly pass the same tokens to the model. This significantly reduces costs and improves response times for applications with repeated context.

Why Use Context Caching?
- Cost Reduction: cached tokens cost ~90% less than regular input tokens
- Lower Latency: skip re-processing of large, repeated content
- Long Context: efficiently handle millions of tokens of context
Caching Types
Vertex AI offers two caching mechanisms:

Implicit Caching (Automatic)
- Enabled by default for all Gemini 2.5 and 3 models
- No explicit setup required
- Automatic cost savings for repeated prefixes
- Minimum tokens: 2,048 for Gemini 2.5 models
- Best for: Applications with consistent prefixes
Explicit Caching (Manual)
- Developer-controlled cache creation and management
- Guaranteed cost savings with predictable pricing
- Minimum tokens: 2,048 for all supported models
- Cache duration: 60-minute TTL by default; configurable to be shorter or longer
- Best for: Large documents, system instructions, few-shot examples
Supported Models
| Model | Implicit Caching | Explicit Caching |
|---|---|---|
| gemini-3.1-pro-preview | ✓ | ✓ |
| gemini-3-flash-preview | ✓ | ✓ |
| gemini-2.5-pro | ✓ (cost savings) | ✓ |
| gemini-2.5-flash | ✓ (cost savings) | ✓ |
Implicit Caching
How It Works
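As a minimal sketch (the manual text below stands in for ≥ 2,048 tokens of real content, and the model and project names in the comment are placeholders), structuring every request with the stable content first lets consecutive calls share a cacheable prefix:

```python
# Keep the large, stable context at the front of every request; the implicit
# cache matches on common prefixes, so only the tail should vary.
PRODUCT_MANUAL = "... imagine >= 2,048 tokens of manual text here ..."

def build_contents(question: str) -> list[str]:
    """Stable prefix first, variable question last."""
    return [PRODUCT_MANUAL, question]

# Usage (assumes the google-genai SDK and Vertex AI credentials):
# from google import genai
# client = genai.Client(vertexai=True, project="my-project", location="us-central1")
# for q in ["What is the warranty period?", "How do I reset the device?"]:
#     resp = client.models.generate_content(model="gemini-2.5-flash",
#                                           contents=build_contents(q))
```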
Implicit caching automatically detects when requests share a common prefix and serves those tokens from cache, with no setup required.

Optimization Tips

- Put large, stable content (documents, instructions) at the start of the prompt and variable content (the user's question) at the end, so requests share a common prefix.
- Send requests that share the same prefix close together in time to improve the chance of a hit.
Check Cache Status
Cache hits appear in each response's usage metadata as cached_content_token_count.

Explicit Caching
Create a Cache
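A hedged sketch using the google-genai SDK (`pip install google-genai`); `client` is assumed to be a `genai.Client(vertexai=True, ...)` and the display name is hypothetical:

```python
def create_document_cache(client, model: str, document: str, ttl_seconds: int = 3600):
    """Create a named cache holding one large document (>= 2,048 tokens)."""
    from google.genai import types  # requires the google-genai package
    return client.caches.create(
        model=model,  # the cache is tied to a specific model
        config=types.CreateCachedContentConfig(
            display_name="manual-cache",   # hypothetical name
            contents=[document],
            ttl=f"{ttl_seconds}s",         # TTL as a duration string, e.g. "3600s"
        ),
    )

# cache = create_document_cache(client, "gemini-2.5-flash", manual_text)
# print(cache.name)  # resource name used to reference the cache later
```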
A cache is created once with a display name and TTL, then reused across requests.

Use a Cache
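A sketch of referencing the cache in a request, again assuming the google-genai SDK; only the new tokens in `question` are processed from scratch:

```python
def ask_with_cache(client, model: str, cache_name: str, question: str):
    """Generate a response on top of previously cached content."""
    from google.genai import types
    return client.models.generate_content(
        model=model,       # must match the model the cache was created for
        contents=question,
        config=types.GenerateContentConfig(cached_content=cache_name),
    )

# resp = ask_with_cache(client, "gemini-2.5-flash", cache.name,
#                       "Summarize chapter 2 of the manual.")
```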
Subsequent requests reference the cache by its resource name; the cached tokens are billed at the reduced rate.

Cache with System Instructions
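System instructions go into the cache configuration itself; a sketch under the same google-genai SDK assumption, with a hypothetical instruction string:

```python
def create_instructed_cache(client, model: str, document: str, instructions: str):
    """Cache a document together with the system instructions that govern it."""
    from google.genai import types
    return client.caches.create(
        model=model,
        config=types.CreateCachedContentConfig(
            system_instruction=instructions,  # cached alongside the content
            contents=[document],
            ttl="3600s",
        ),
    )

# cache = create_instructed_cache(client, "gemini-2.5-flash", manual_text,
#                                 "You are a support agent. Answer only from the manual.")
```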
System instructions can be stored in the cache along with the content, so they are not resent on every request.

Cache Management
Retrieve a Cache
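A one-call sketch (google-genai SDK assumed):

```python
def get_cache(client, cache_name: str):
    """Fetch a cache's metadata (model, expiry, etc.) by resource name."""
    return client.caches.get(name=cache_name)

# cache = get_cache(client, existing_cache_name)
# print(cache.display_name, cache.expire_time)
```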
Retrieve a cache's details by its resource name.

List All Caches
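Listing works the same way; a sketch assuming the google-genai SDK's iterable pager:

```python
def list_caches(client):
    """Print every cache visible to the client's project and location."""
    for cache in client.caches.list():
        print(cache.name, cache.model, cache.expire_time)
```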
List every cache in your project to audit usage and expiration times.

Update Cache Expiration
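Updating the TTL resets the expiration clock; a sketch under the same SDK assumption:

```python
def extend_cache(client, cache_name: str, ttl_seconds: int):
    """Reset the cache's TTL, measured from now."""
    from google.genai import types
    return client.caches.update(
        name=cache_name,
        config=types.UpdateCachedContentConfig(ttl=f"{ttl_seconds}s"),
    )

# extend_cache(client, cache.name, 2 * 3600)  # keep it for two more hours
```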
Extend a cache's lifetime before it expires.

Delete a Cache
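Deletion is a single call (google-genai SDK assumed):

```python
def delete_cache(client, cache_name: str) -> None:
    """Delete the cache immediately rather than waiting for expiry."""
    client.caches.delete(name=cache_name)
```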
Remove a cache as soon as it is no longer needed to avoid further storage charges.

Context Caching in Chat
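For chat, the cache reference goes into the session's generation config; a sketch assuming the google-genai SDK's `chats` interface:

```python
def start_cached_chat(client, model: str, cache_name: str):
    """Open a chat session whose every turn reuses the cached context."""
    from google.genai import types
    return client.chats.create(
        model=model,
        config=types.GenerateContentConfig(cached_content=cache_name),
    )

# chat = start_cached_chat(client, "gemini-2.5-flash", cache.name)
# chat.send_message("What does section 3 cover?")
# chat.send_message("How does that compare to section 4?")
```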
Caching works naturally with multi-turn conversations, where every turn shares the same underlying context.

Cache with Multiple Documents
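Multiple documents can be cached as parts of a single content block; a sketch assuming the google-genai SDK, with hypothetical Cloud Storage URIs:

```python
def create_corpus_cache(client, model: str, pdf_uris: list[str]):
    """Cache several Cloud Storage PDFs as one reusable context block."""
    from google.genai import types
    parts = [types.Part.from_uri(file_uri=uri, mime_type="application/pdf")
             for uri in pdf_uris]
    return client.caches.create(
        model=model,
        config=types.CreateCachedContentConfig(
            contents=[types.Content(role="user", parts=parts)],
            ttl="7200s",
        ),
    )

# cache = create_corpus_cache(client, "gemini-2.5-pro",
#                             ["gs://my-bucket/paper1.pdf", "gs://my-bucket/paper2.pdf"])
```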
Large corpora can be cached once and reused across many RAG queries.

Cache Expiration Strategies
Time-to-Live (TTL)
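TTLs are sent as duration strings; a tiny helper (the update call in the comment assumes the google-genai SDK):

```python
def ttl_string(seconds: int) -> str:
    """Format a relative TTL the way the API expects, e.g. 7200 -> '7200s'."""
    return f"{int(seconds)}s"

# client.caches.update(name=cache.name,
#                      config=types.UpdateCachedContentConfig(ttl=ttl_string(2 * 3600)))
```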
A TTL sets the expiration relative to the time of the request.

Absolute Expiration
Alternatively, set an absolute expiration timestamp.

Cost Analysis
Calculate Savings
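A back-of-the-envelope estimator built only on the ~90% discount cited above; the per-1k-token price is an illustrative placeholder, not a real rate card:

```python
def estimate_savings(prompt_tokens: int, cached_tokens: int,
                     price_per_1k: float = 0.30, cached_discount: float = 0.90) -> dict:
    """Compare the cost of a prompt with and without cached tokens."""
    regular = prompt_tokens / 1000 * price_per_1k
    with_cache = ((prompt_tokens - cached_tokens) / 1000 * price_per_1k
                  + cached_tokens / 1000 * price_per_1k * (1 - cached_discount))
    return {"regular": regular, "with_cache": with_cache,
            "saved": regular - with_cache}

# estimate_savings(100_000, 90_000)  # 90% of a 100k-token prompt served from cache
```

Note that this ignores cache storage charges, which should be subtracted for a full comparison.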
Best Practices
- Cache Large Content: only content of at least 2,048 tokens is eligible for caching
- Monitor Expiration: refresh caches before they expire for uninterrupted service
- Use Appropriate TTL: balance token savings against cache storage costs
- Track Usage: monitor cached_content_token_count in responses
When to Use Caching
✅ Good Use Cases:

- Large, static documents (manuals, papers, codebases)
- Repeated system instructions
- Few-shot examples in prompts
- RAG applications with fixed knowledge bases
- Multi-turn conversations with shared context
❌ Poor Use Cases:

- Frequently changing content
- Single-use queries
- Small prompts (less than 2,048 tokens)
- Real-time data that must be fresh
Error Handling
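The usual failure mode is generating against a cache that has expired or been deleted. A hedged sketch of a retry path (the `errors.ClientError` class follows the google-genai SDK and is an assumption; `recreate_cache` is a caller-supplied hypothetical helper):

```python
def generate_with_cache_retry(client, model: str, question: str,
                              cache_name: str, recreate_cache):
    """Retry once with a fresh cache if the referenced one is gone."""
    from google.genai import errors, types
    try:
        return client.models.generate_content(
            model=model, contents=question,
            config=types.GenerateContentConfig(cached_content=cache_name),
        )
    except errors.ClientError:
        # The cache likely expired or was deleted: rebuild it and retry once.
        return client.models.generate_content(
            model=model, contents=question,
            config=types.GenerateContentConfig(cached_content=recreate_cache()),
        )
```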
Advanced Patterns
Auto-Refreshing Cache
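One sketch, assuming the google-genai SDK: a daemon timer that re-extends the TTL at 75% of its length, so the cache is renewed before it can lapse:

```python
import threading

def refresh_interval(ttl_seconds: int, safety_margin: float = 0.75) -> float:
    """Refresh well before expiry (default: at 75% of the TTL)."""
    return ttl_seconds * safety_margin

def start_refresher(client, cache_name: str, ttl_seconds: int = 3600):
    """Renew the cache's TTL on a repeating daemon timer."""
    def _refresh():
        from google.genai import types
        client.caches.update(
            name=cache_name,
            config=types.UpdateCachedContentConfig(ttl=f"{ttl_seconds}s"))
        timer = threading.Timer(refresh_interval(ttl_seconds), _refresh)
        timer.daemon = True  # do not keep the process alive for refreshes
        timer.start()
    _refresh()
```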
A background task can periodically renew the TTL so a long-running service never loses its cache.

Next Steps
- Batch Prediction: combine caching with batch processing
- Function Calling: cache function declarations
- Multimodal: cache large media files
- Grounding: cache grounded data sources