Why Evaluation Matters
Measure Quality
Track accuracy, hallucination rates, and custom metrics across your LLM applications
Build Better Models
Use production data to create training datasets and fine-tune models
Catch Regressions
Test changes against consistent evaluation sets before deploying to production
Understand Users
Collect implicit and explicit feedback to learn what responses work best
Evaluation Workflow
Helicone’s evaluation features work together to create a continuous improvement loop:
Create datasets
Select and curate high-quality examples from production traffic for evaluation and fine-tuning. Learn about Datasets →
Score responses
Run evaluation frameworks (RAGAS, LangSmith, custom) and report scores to Helicone for centralized tracking. Learn about Scores →
Collect feedback
Gather user ratings and behavioral signals to identify what works. Learn about Feedback →
Key Features
Datasets
Transform production requests into curated datasets for evaluation and fine-tuning:
- Select from production: Filter requests using custom properties, scores, or feedback ratings
- Curate quality examples: Review and edit request/response pairs before adding to datasets
- Export multiple formats: Download as JSONL for fine-tuning or CSV for analysis
- API integration: Programmatically create and manage datasets
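The selection bullets above amount to a plain filter over logged request records. A minimal sketch in Python — the field names (`properties`, `score`, `feedback`) are hypothetical stand-ins, not Helicone's actual log schema:

```python
# Sketch of dataset curation: keep requests for one task that scored well
# and received positive feedback. Field names are illustrative only.

def curate(requests, task, min_score=0.8):
    """Filter logged requests by custom property and quality signals."""
    return [
        r for r in requests
        if r["properties"].get("task") == task
        and r.get("score", 0) >= min_score
        and r.get("feedback") is True
    ]

logs = [
    {"properties": {"task": "support"}, "score": 0.9, "feedback": True,
     "prompt": "Reset my password", "response": "Go to Settings > Security..."},
    {"properties": {"task": "support"}, "score": 0.4, "feedback": False,
     "prompt": "Cancel my plan", "response": "I can't help with that."},
    {"properties": {"task": "extraction"}, "score": 0.95, "feedback": True,
     "prompt": "Extract the dates", "response": "2024-01-01"},
]

dataset = curate(logs, task="support")
print(len(dataset))  # 1
```

In practice the same filter is expressed through Helicone's UI or API rather than in application code; the point is that curation is a property-plus-quality filter, not manual triage.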
Scores
Report evaluation results from any framework for unified observability:
- Framework agnostic: Works with RAGAS, LangSmith, or custom evaluation logic
- Track over time: Visualize how metrics evolve across deployments
- Compare experiments: Evaluate different prompts, models, or configurations
- Custom metrics: Track any integer or boolean metric (accuracy, hallucination, safety)
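Because scores are restricted to integer or boolean values, it helps to validate metrics before reporting them. A small hypothetical helper (not part of any Helicone SDK):

```python
# Validate that every metric is an int or bool before reporting it as a score.
# Hypothetical helper for illustration; not part of Helicone's SDK.

def validate_scores(scores: dict) -> dict:
    for name, value in scores.items():
        if isinstance(value, bool):
            continue  # booleans are allowed (e.g. hallucination: False)
        if isinstance(value, int):
            continue  # integers are allowed (e.g. accuracy: 92)
        raise TypeError(
            f"score {name!r} must be an int or bool, got {type(value).__name__}"
        )
    return scores

validate_scores({"accuracy": 92, "hallucination": False})  # passes
# validate_scores({"latency": 1.5})  # would raise TypeError
```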
Feedback
Collect user satisfaction signals to understand response quality:
- Explicit ratings: Thumbs up/down, star ratings from users
- Implicit signals: Track acceptance, engagement, and behavioral patterns
- Production insights: Learn what actually works for real users
- Dataset curation: Use highly-rated responses for training examples
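One way to turn the implicit signals above into a rating is a simple mapping from behavior to thumbs-up/down. The signal names here are made up for illustration, not a Helicone schema:

```python
from typing import Optional

# Sketch: derive an implicit rating from behavioral signals.
# Signal names are illustrative only.

def implicit_rating(signals: dict) -> Optional[bool]:
    """Map user behavior to a rating; None when the evidence is ambiguous."""
    if signals.get("copied") or signals.get("accepted"):
        return True   # the user acted on the response
    if signals.get("regenerated") or signals.get("edited_heavily"):
        return False  # the user rejected or reworked it
    return None       # no strong signal either way

print(implicit_rating({"copied": True}))       # True
print(implicit_rating({"regenerated": True}))  # False
```

Ambiguous cases are best left unrated rather than guessed, so that curation thresholds only act on confident signals.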
Common Evaluation Patterns
RAG Evaluation with RAGAS
Evaluate retrieval-augmented generation for accuracy and groundedness.
Replace Expensive Models
Use production logs from premium models to fine-tune cheaper alternatives:
Log premium model outputs
Start logging successful requests from GPT-4, Claude Sonnet, or other expensive models
Create task-specific datasets
Filter and curate examples for specific use cases (support, extraction, generation)
Fine-tune smaller models
Export JSONL and train GPT-4o-mini, Gemini Flash, or other cost-effective models
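The JSONL export in step 3 follows the chat-style fine-tuning format that OpenAI-compatible training APIs accept: one `{"messages": [...]}` object per line. A minimal converter — the system prompt is a placeholder you would tailor per task:

```python
import json

# Convert curated prompt/response pairs into chat fine-tuning JSONL:
# one {"messages": [...]} record per line.

def to_finetune_jsonl(examples, system_prompt="You are a support assistant."):
    lines = []
    for ex in examples:
        record = {"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": ex["prompt"]},
            {"role": "assistant", "content": ex["response"]},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

examples = [
    {"prompt": "Reset my password", "response": "Go to Settings > Security..."},
    {"prompt": "Cancel my plan", "response": "Open Billing and choose Cancel."},
]
jsonl = to_finetune_jsonl(examples)
print(jsonl.count("\n") + 1)  # 2 records
```

Helicone's dataset export produces this format for you; the sketch just shows what the file contains.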
Continuous Improvement Pipeline
Build a data flywheel for ongoing model improvement:
- Tag production traffic with custom properties for segmentation
- Score automatically using evaluation frameworks or LLM-as-judge
- Collect user feedback through explicit ratings and implicit signals
- Filter top performers by combining scores and feedback ratings
- Auto-curate datasets with requests meeting quality thresholds
- Retrain periodically with new high-quality examples
- A/B test improvements before full deployment
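The final A/B step can start as a plain comparison of mean scores on the same evaluation set. A sketch that omits the significance testing you would want in practice:

```python
from statistics import mean

# Promote a new variant only if it beats the baseline by a minimum lift
# on the same evaluation set. No significance test; illustration only.

def compare_variants(scores_a, scores_b, min_lift=0.02):
    """Return the observed lift of B over A and a promotion decision."""
    lift = mean(scores_b) - mean(scores_a)
    return {"lift": round(lift, 4), "promote_b": lift >= min_lift}

result = compare_variants([0.78, 0.81, 0.80], [0.85, 0.84, 0.86])
print(result)
```

With small evaluation sets, add a significance test (or simply more examples) before trusting a lift this size.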
Integration Examples
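As a minimal Python sketch using only the standard library, here is what reporting a score over HTTP might look like. The endpoint path, `scores` body key, and header names are assumptions to verify against Helicone's API reference before use:

```python
import json
import urllib.request

# Build (but do not send) an HTTP request that reports scores for a logged
# request. Endpoint path and headers are assumptions; check Helicone's
# API reference for the authoritative shape.

def build_score_request(api_key: str, request_id: str, scores: dict):
    url = f"https://api.helicone.ai/v1/request/{request_id}/score"
    return urllib.request.Request(
        url,
        data=json.dumps({"scores": scores}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_score_request("sk-helicone-...", "req_abc", {"accuracy": 88})
# urllib.request.urlopen(req)  # send when ready
```

The same call is a one-liner with `requests` or Helicone's SDKs; the point is that score reporting is a single POST per logged request.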
Best Practices
Start Small
Begin with 50-100 carefully curated examples rather than thousands of uncurated ones
Focus on Tasks
Create task-specific datasets and metrics instead of general-purpose evaluations
Combine Signals
Use automated scores AND user feedback for comprehensive quality assessment
Iterate Continuously
Build evaluation into your development workflow, not just during initial testing
Track Over Time
Monitor metrics across deployments to catch regressions early
Test Before Deploy
Evaluate prompt or model changes against consistent test sets
Next Steps
Datasets
Create datasets from production traffic
Scores
Track evaluation metrics and performance
Feedback
Collect user satisfaction signals
RAGAS Integration
Evaluate RAG applications with RAGAS
Experiments
Compare different configurations
API Reference
View API documentation
Evaluation is not a one-time task—it’s an ongoing process. Start with basic metrics, build datasets from production, and continuously improve based on real-world performance.
