The Challenge
Production LLM applications face unique challenges:

- Unpredictable errors: Provider outages, rate limits, and model changes
- Cost volatility: Usage spikes from viral features or abuse
- Quality degradation: Prompt drift, model updates, data issues
- Performance issues: Latency spikes affecting user experience
Solution Overview
Helicone provides a complete monitoring stack:

- Real-time Alerts: Get notified of errors, cost spikes, and latency issues
- Request Observability: View every request/response with full context
- Usage Analytics: Track costs, token usage, and model performance
- User Tracking: Monitor per-user costs and identify abuse
Implementation Guide
1. Instrument Your Application
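As a sketch, the monitoring headers can be built in one place and attached to every call. The header names follow Helicone's documented `Helicone-*` conventions; the user/session values and the OpenAI SDK wiring shown in the comment are placeholder assumptions, not Helicone-prescribed code:

```python
import os

def monitoring_headers(user_id: str, session_id: str) -> dict:
    """Build the Helicone monitoring headers for one production request."""
    return {
        "Helicone-Auth": f"Bearer {os.environ.get('HELICONE_API_KEY', '')}",
        "Helicone-User-Id": user_id,                    # powers per-user tracking
        "Helicone-Session-Id": session_id,              # groups multi-step journeys
        "Helicone-Property-Environment": "production",  # used by the alert filters
    }

# Assumed wiring with the OpenAI SDK, routed through Helicone's gateway:
# client = OpenAI(
#     base_url="https://oai.helicone.ai/v1",
#     default_headers=monitoring_headers("user-123", "sess-abc"),
# )

print(monitoring_headers("user-123", "sess-abc")["Helicone-User-Id"])
```

Centralizing header construction keeps the `Environment` property consistent, which matters because every alert below filters on it.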
Add monitoring headers to all production requests.

2. Set Up Critical Alerts
Create alerts for production issues.

Error Rate Alert

Navigate to Settings → Alerts and create:

Alert Configuration:
- Name: Production Error Rate
- Metric: Error Rate
- Threshold: > 5%
- Time Window: 10 minutes
- Minimum Requests: 10 (avoid false positives)
- Property: Environment = production
- Slack: #production-alerts
- Email: [email protected]
This catches provider outages, rate limit issues, and breaking changes quickly.
Cost Spike Alert
Alert Configuration:
- Name: Production Cost Spike
- Metric: Cost
- Threshold: > $100/day
- Time Window: 1 day
- Property: Environment = production
- Email: [email protected]
- Slack: #cost-alerts
Prevents unexpected bills from usage spikes or abuse.
Latency Alert
Alert Configuration:
- Name: High Latency
- Metric: Latency
- Threshold: P95 > 10000ms
- Time Window: 30 minutes
- Minimum Requests: 20
- Property: Environment = production
- Slack: #performance-alerts
Detects performance degradation affecting user experience.
3. Configure User Monitoring
Track per-user usage to identify abuse and understand behavior:

- Identify users exceeding quotas
- Detect potential abuse patterns
- Understand usage by tier/cohort
- Calculate customer lifetime value
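The quota check itself lives in application code, driven by the per-user numbers Helicone surfaces. A minimal in-memory sketch; the `QuotaTracker` class and the 100k-token limit are hypothetical illustrations, not Helicone features:

```python
from collections import defaultdict

DAILY_TOKEN_QUOTA = 100_000  # hypothetical per-user daily limit

class QuotaTracker:
    """In-memory per-user token counter; swap in your real usage store."""

    def __init__(self) -> None:
        self.usage = defaultdict(int)

    def record(self, user_id: str, tokens: int) -> None:
        self.usage[user_id] += tokens

    def over_quota(self, user_id: str) -> bool:
        return self.usage[user_id] > DAILY_TOKEN_QUOTA

tracker = QuotaTracker()
tracker.record("user-123", 120_000)
print(tracker.over_quota("user-123"))  # True
```

In production you would feed this from the same per-user metrics you see in Helicone, so the dashboard and the enforcement logic agree.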
4. Implement Session Tracking
For multi-step workflows, track complete user journeys:

- See total cost per user interaction
- Debug failures with full context
- Identify expensive workflow patterns
- Measure success rates for complete flows
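Session tracking is also header-driven. A sketch using Helicone's documented session headers; the path and name values here are placeholders, and the path groups steps into a tree within one journey:

```python
import uuid

def session_headers(session_id: str, path: str, name: str) -> dict:
    # Header names follow Helicone's session-tracking conventions;
    # nested paths ("/plan/summarize") group sub-steps under a parent step.
    return {
        "Helicone-Session-Id": session_id,
        "Helicone-Session-Path": path,
        "Helicone-Session-Name": name,
    }

sid = str(uuid.uuid4())                                   # one ID per user journey
step1 = session_headers(sid, "/plan", "Trip Planner")
step2 = session_headers(sid, "/plan/summarize", "Trip Planner")
print(step2["Helicone-Session-Path"])
```

Reusing the same session ID across every step is what lets the dashboard total cost and trace failures per interaction.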
5. Set Up Cost Controls
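Helicone supports header-driven rate limiting. A minimal sketch; the policy string follows Helicone's documented `quota;w=window;s=segment` format, but verify the exact syntax against the current docs, and treat the values as examples:

```python
def rate_limit_policy(quota: int, window_seconds: int, segment: str = "user") -> dict:
    # Policy string format assumed from Helicone's rate-limit docs:
    # quota per window (seconds), segmented per user.
    return {"Helicone-RateLimit-Policy": f"{quota};w={window_seconds};s={segment}"}

headers = rate_limit_policy(1000, 3600)  # 1000 requests/hour per user
print(headers["Helicone-RateLimit-Policy"])
```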
Implement rate limiting and quota management.

6. Enable Caching
Reduce costs and latency for repetitive queries:

- Go to Dashboard → Cache Analytics
- Track hit rate, savings, and performance
- Adjust cache TTL based on update frequency
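Caching is enabled per request via headers. A sketch using Helicone's documented cache controls; the TTL value is an example to adjust per the bullet above:

```python
def cache_headers(ttl_seconds: int = 3600) -> dict:
    # Helicone-Cache-Enabled turns on Helicone's response cache;
    # Cache-Control's max-age sets the TTL for the cached entry.
    return {
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": f"max-age={ttl_seconds}",
    }

print(cache_headers(86400)["Cache-Control"])  # daily-refresh content
```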
Monitoring Dashboard
Key metrics to watch daily:

- Overview Metrics
- Feature Breakdown
- User Insights
Incident Response
When an alert fires:

Assess Severity
- Error rate alert = High severity (affects all users)
- Cost alert = Medium severity (financial impact)
- Latency alert = Medium severity (poor UX)
- Feature-specific = Varies by feature criticality
Investigate in Helicone
- Click alert notification link
- Review affected requests
- Look for patterns:
- Specific users affected?
- Single feature or widespread?
- Started at specific time?
Take Action
For errors:
- Check provider status pages
- Review recent deployments
- Implement fallback/retry logic

For cost spikes:
- Identify top users/features
- Implement temporary rate limits
- Investigate for abuse

For latency:
- Check model selection
- Review prompt sizes
- Consider model switching
Best Practices
Advanced: Custom Dashboards
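A sketch of pulling recent production requests for a custom dashboard. The `/v1/request/query` endpoint reflects Helicone's REST API, but treat the exact payload shape as an assumption to verify against the current API reference:

```python
import json
import os
import urllib.request

HELICONE_API_BASE = "https://api.helicone.ai"

def recent_requests_query(limit: int = 100) -> dict:
    # Minimal payload; "filter": "all" returns unfiltered requests.
    # Richer filters (e.g. by Environment property) are possible; see the docs.
    return {"filter": "all", "limit": limit}

payload = recent_requests_query()

# Only hit the network when a key is configured:
if os.environ.get("HELICONE_API_KEY"):
    req = urllib.request.Request(
        f"{HELICONE_API_BASE}/v1/request/query",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['HELICONE_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)
else:
    print(payload["limit"])
```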
Build custom monitoring using the Helicone API.

Monitoring Checklist
- All production requests instrumented with monitoring headers
- Error rate alert configured (threshold > 5%)
- Cost alert configured (appropriate threshold)
- Feature-specific alerts for critical features
- User tracking enabled (Helicone-User-Id)
- Session tracking for multi-step workflows
- Caching enabled for repetitive queries
- Rate limiting implemented
- Daily dashboard review scheduled
- Incident response playbook documented
Next Steps
- Alerts Documentation: Deep dive into alert configuration options
- User Metrics: Track and analyze per-user behavior
- Debugging Guide: Learn how to investigate production issues
- Cost Optimization: Reduce production costs