## Overview
LiteLLM provides robust fallback mechanisms to ensure high availability of your LLM applications. When a model fails or is unavailable, LiteLLM automatically retries with fallback models or deployments.
## How Fallbacks Work
Fallbacks are attempted, in the order you define them, when any of the following occurs (see the sketch after this list):
- API returns an error (rate limit, timeout, service unavailable)
- Model deployment is down
- Context window is exceeded
- Content policy violations occur
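Each trigger class maps to its own Router option. As a hedged sketch (the aliases and keys below are placeholders, not working values): generic API errors are handled by `fallbacks`, oversized prompts by `context_window_fallbacks`, and policy blocks by `content_policy_fallbacks`, all of which are covered in detail later on this page:

```python
from litellm import Router

# Illustrative only: each failure class routes through its own fallback list.
router = Router(
    model_list=[
        {"model_name": "primary", "litellm_params": {"model": "gpt-4", "api_key": "sk-..."}},
        {"model_name": "backup", "litellm_params": {"model": "claude-2", "api_key": "sk-ant-..."}},
        {"model_name": "big-context", "litellm_params": {"model": "gpt-3.5-turbo-16k", "api_key": "sk-..."}},
    ],
    fallbacks=[{"primary": ["backup"]}],                      # rate limits, timeouts, 5xx errors
    context_window_fallbacks=[{"primary": ["big-context"]}],  # prompt too long
    content_policy_fallbacks=[{"primary": ["backup"]}],       # content policy blocks
)
```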
## Basic Fallback Configuration

### Single Model Fallback
```python
from litellm import completion

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    fallbacks=["gpt-3.5-turbo", "claude-2"]
)
# Tries: gpt-4 -> gpt-3.5-turbo -> claude-2
```
### Router Fallbacks
The Router provides advanced fallback logic across multiple deployments:
```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {
                "model": "gpt-4",
                "api_key": "sk-..."
            }
        },
        {
            "model_name": "gpt-3.5",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "api_key": "sk-..."
            }
        },
        {
            "model_name": "claude",
            "litellm_params": {
                "model": "claude-2",
                "api_key": "sk-ant-..."
            }
        }
    ],
    fallbacks=[
        {"gpt-4": ["gpt-3.5", "claude"]},
        {"gpt-3.5": ["claude"]}
    ],
    max_fallbacks=5  # Maximum fallback attempts per request
)

response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
```
## Fallback Types

### 1. Default Fallbacks
Apply to all models globally:
```python
from litellm import Router

router = Router(
    model_list=[...],
    default_fallbacks=["gpt-3.5-turbo", "claude-2"]
)
# Any failing model will try these fallbacks
```
### 2. Model-Specific Fallbacks
Define fallbacks per model:
```python
router = Router(
    model_list=[...],
    fallbacks=[
        {"gpt-4": ["gpt-4-turbo", "gpt-3.5-turbo"]},
        {"claude-3-opus": ["claude-3-sonnet", "claude-2"]},
        {"gemini-pro": ["gpt-3.5-turbo"]}
    ]
)
```
### 3. Context Window Fallbacks
Automatic fallback when context window is exceeded:
```python
router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5",
            "litellm_params": {"model": "gpt-3.5-turbo"}  # 4K context
        },
        {
            "model_name": "gpt-3.5-16k",
            "litellm_params": {"model": "gpt-3.5-turbo-16k"}  # 16K context
        },
        {
            "model_name": "claude-100k",
            "litellm_params": {"model": "claude-2"}  # 100K context
        }
    ],
    context_window_fallbacks=[
        {"gpt-3.5": ["gpt-3.5-16k", "claude-100k"]}
    ]
)

# Automatically falls back if the prompt exceeds 4K tokens
response = router.completion(
    model="gpt-3.5",
    messages=[{"role": "user", "content": very_long_prompt}]
)
```
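If you want to check prompt size yourself before relying on the automatic fallback, here is a minimal sketch using litellm's token helpers, assuming `token_counter` and `get_model_info` behave as in recent litellm releases:

```python
import litellm

very_long_prompt = "word " * 5000  # stand-in for a large prompt
messages = [{"role": "user", "content": very_long_prompt}]

# Count prompt tokens with the model's tokenizer (litellm falls back to a default tokenizer).
prompt_tokens = litellm.token_counter(model="gpt-3.5-turbo", messages=messages)

# Compare against the model's advertised input limit.
model_info = litellm.get_model_info("gpt-3.5-turbo")
if prompt_tokens >= model_info.get("max_input_tokens", 0):
    print(f"Prompt ({prompt_tokens} tokens) would trigger a context window fallback")
```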
### 4. Content Policy Fallbacks
Fallback when content policy violations occur:
```python
router = Router(
    model_list=[...],
    content_policy_fallbacks=[
        {"gpt-4": ["claude-2"]},  # Claude may apply different content policies
        {"gpt-3.5-turbo": ["llama-2"]}
    ]
)
```
## Advanced Fallback Configuration

### Fallback with Custom Parameters
Pass different parameters to fallback models:
```python
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    fallbacks=[
        {
            "model": "gpt-3.5-turbo",
            "temperature": 0.5,
            "max_tokens": 100
        },
        {
            "model": "claude-2",
            "temperature": 0.7
        }
    ]
)
```
### Controlling Fallback Behavior
```python
router = Router(
    model_list=[...],
    max_fallbacks=3,         # Maximum number of fallback attempts per request
    retry_after=5,           # Minimum seconds to wait before retrying a failed request
    allowed_fails=2,         # Failures allowed before a deployment is marked unhealthy
    cooldown_time=60,        # Cooldown period for unhealthy deployments (seconds)
    disable_cooldowns=False  # Set True to turn off the cooldown mechanism entirely
)
```
## Fallback Policies

### Allowed Fails Policy
Control when a deployment enters cooldown:
```python
from litellm import Router
from litellm.types.router import AllowedFailsPolicy

router = Router(
    model_list=[...],
    allowed_fails=3,  # Default number of failures before cooldown
    allowed_fails_policy=AllowedFailsPolicy(
        BadRequestErrorAllowedFails=0,      # Immediate cooldown for bad requests
        AuthenticationErrorAllowedFails=0,  # Immediate cooldown for auth errors
        TimeoutErrorAllowedFails=5,         # More tolerance for timeouts
        RateLimitErrorAllowedFails=2,       # Moderate tolerance for rate limits
        ContentPolicyViolationErrorAllowedFails=1
    )
)
```
### Deployment Cooldown
When a deployment fails multiple times, it enters a cooldown period:
```python
router = Router(
    model_list=[...],
    allowed_fails=2,    # Fail 2 times before cooldown
    cooldown_time=300,  # Cool down for 5 minutes
    disable_cooldowns=False
)
```
During cooldown, the deployment is excluded from routing but can still be used as a last resort if all others fail.
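To see cooldowns in action, here is a minimal sketch; the duplicated alias and the deliberately invalid key are illustrative assumptions, not a recommended setup, and note that auth errors may trigger immediate cooldown regardless of `allowed_fails`:

```python
from litellm import Router

# Two deployments share the alias "gpt-4"; the first has a deliberately bad key.
router = Router(
    model_list=[
        {"model_name": "gpt-4", "litellm_params": {"model": "gpt-4", "api_key": "sk-bad-key"}},
        {"model_name": "gpt-4", "litellm_params": {"model": "gpt-4", "api_key": "sk-..."}},
    ],
    allowed_fails=1,  # A single failure sends the bad deployment into cooldown
    cooldown_time=60,
    num_retries=1,    # Retry once so the first call can recover on the healthy deployment
)

# Once the bad deployment is cooling down, requests route to the healthy one.
for _ in range(3):
    response = router.completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "ping"}],
    )
    print(response._hidden_params.get("model_id"))  # Deployment actually used
```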
## Async Fallback Support
Fallbacks work with async operations:
```python
import asyncio
from litellm import acompletion

async def make_request():
    response = await acompletion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}],
        fallbacks=["gpt-3.5-turbo", "claude-2"]
    )
    return response

response = asyncio.run(make_request())
```
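The Router offers the same behavior asynchronously via `acompletion`. A brief sketch, assuming a `router` configured as in the earlier examples:

```python
import asyncio

async def make_router_request():
    # Router fallbacks and cooldowns apply to async calls exactly as to sync ones.
    response = await router.acompletion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    return response

response = asyncio.run(make_router_request())
```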
## Monitoring Fallbacks

### Tracking Fallback Usage
```python
import litellm
from litellm.integrations.custom_logger import CustomLogger

class FallbackLogger(CustomLogger):
    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Compare the model that answered with the model originally requested
        model_used = response_obj.model
        original_model = kwargs.get("model")
        if model_used != original_model:
            print(f"Fallback occurred: {original_model} -> {model_used}")

litellm.callbacks = [FallbackLogger()]
```
Routing information, including which deployment actually served the request, is available on the response's `_hidden_params`:
```python
response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Inspect hidden params for fallback/routing info
print(response._hidden_params.get("model_id"))  # Actual deployment used
print(response._hidden_params.get("api_base"))  # API endpoint used
```
## Best Practices

### Fallback Strategy Recommendations
- Order by capability - Place most capable/expensive models first
- Consider cost - Fallback to cheaper alternatives when appropriate
- Mix providers - Diversify across OpenAI, Anthropic, Google, etc.
- Test thoroughly - Verify fallbacks work as expected
- Monitor cooldowns - Alert when deployments enter cooldown (see the sketch after this list)
- Set reasonable limits - Balance availability vs. cost
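For the cooldown-monitoring recommendation, one option is to reuse the CustomLogger hook from the monitoring section. This sketch alerts on every failed call, which approximates cooldown tracking without reading the Router's internal state; the `FailureAlertLogger` name is illustrative:

```python
import litellm
from litellm.integrations.custom_logger import CustomLogger

class FailureAlertLogger(CustomLogger):
    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        # Repeated failures for the same model hint that its deployment is
        # about to enter (or is already in) cooldown.
        model = kwargs.get("model")
        exception = kwargs.get("exception")  # populated by litellm on failures
        print(f"ALERT: call to {model} failed: {exception}")

litellm.callbacks = [FailureAlertLogger()]
```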
### Common Patterns

#### High Availability Pattern
```python
router = Router(
    model_list=[
        # Primary: multiple GPT-4 deployments
        {"model_name": "gpt-4", "litellm_params": {...}},
        {"model_name": "gpt-4", "litellm_params": {...}},
        # Fallback: GPT-3.5 Turbo
        {"model_name": "gpt-3.5", "litellm_params": {...}},
        {"model_name": "gpt-3.5", "litellm_params": {...}},
        # Final fallback: Claude
        {"model_name": "claude", "litellm_params": {...}}
    ],
    fallbacks=[{"gpt-4": ["gpt-3.5", "claude"]}],
    max_fallbacks=10
)
```
#### Cost-Optimized Pattern
```python
router = Router(
    model_list=[
        {"model_name": "cheap", "litellm_params": {"model": "gpt-3.5-turbo"}},
        {"model_name": "expensive", "litellm_params": {"model": "gpt-4"}}
    ],
    fallbacks=[{"cheap": ["expensive"]}]  # Only use the expensive model when the cheap one fails
)
```
## Error Handling
```python
import litellm
from litellm import completion

# Enable verbose logging to see each fallback attempt
litellm.set_verbose = True

try:
    response = completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        fallbacks=["gpt-3.5-turbo", "claude-2"]
    )
except Exception as e:
    # Raised only after every model, including all fallbacks, has failed.
    # Catch specific classes from litellm.exceptions (e.g. RateLimitError) if needed.
    print(f"All models failed: {e}")
```