Practice
This module includes practical assignments to help you master monitoring and observability for ML systems in production.

Overview
You'll complete three pull requests (PRs) that integrate monitoring tools into your application, plus a document describing your monitoring strategy.

These assignments build on previous modules: you should have a working ML application deployed in a pipeline (Airflow, Kubeflow, or Dagster) from Module 4.
Homework 13: Monitoring and Observability
Key Tasks
PR1: SigNoz Integration
Objective: Add SigNoz monitoring to your application.

Requirements:
- Install SigNoz on your Kubernetes cluster using Helm
- Instrument your application with OpenTelemetry or OpenLLMetry
- Configure tracing to send data to SigNoz
- Verify traces are appearing in the SigNoz UI
- Add custom spans to track important operations
Python Application Example
Kubernetes Deployment
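A sketch of wiring the deployed app to the SigNoz collector via the standard `OTEL_*` environment variables; image, names, and the collector's namespace are assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
        - name: app
          image: ml-inference:latest  # assumption: your app image
          ports:
            - containerPort: 8000
          env:
            - name: OTEL_SERVICE_NAME
              value: ml-inference
            # Assumes SigNoz was installed into the "platform" namespace
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://signoz-otel-collector.platform.svc.cluster.local:4317
```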
FastAPI Integration
Acceptance criteria:
- SigNoz is installed and running in your cluster
- Application sends traces to SigNoz
- At least 3 custom spans are instrumented
- Traces are visible in SigNoz UI
- Screenshots included in PR description
PR2: Grafana Dashboard
Objective: Create a Grafana dashboard for your application.

Requirements:
- Install Prometheus and Grafana using kube-prometheus-stack
- Expose metrics from your application (e.g., using prometheus_client)
- Create a ServiceMonitor to scrape your metrics
- Build a custom dashboard with at least 5 panels
- Export and commit dashboard JSON
Expose Metrics from Python
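A sketch with `prometheus_client`, using metric names that match the PromQL in the dashboard panels below; labels and port are assumptions. Note that the client appends `_total` to counter samples, so naming the counter `ml_predictions_total` yields exactly that series.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter(
    "ml_predictions_total", "Total predictions served", ["model", "status"]
)
LATENCY = Histogram(
    "ml_prediction_duration_seconds", "Prediction latency in seconds"
)

def predict(features, model="baseline"):
    start = time.perf_counter()
    try:
        result = sum(features)  # placeholder for a real model
        PREDICTIONS.labels(model=model, status="success").inc()
        return result
    except Exception:
        PREDICTIONS.labels(model=model, status="error").inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    # Serves the /metrics endpoint for Prometheus to scrape (port is an assumption)
    start_http_server(8001)
```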
ServiceMonitor Configuration
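A sketch of a ServiceMonitor, assuming your Service carries the label `app: ml-inference`, exposes a port named `metrics`, and that your kube-prometheus-stack release selects ServiceMonitors by the `release` label (check your Helm values).

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ml-inference
  labels:
    release: kube-prometheus-stack  # must match your Helm release name
spec:
  selector:
    matchLabels:
      app: ml-inference
  endpoints:
    - port: metrics     # named port on the Service
      path: /metrics
      interval: 30s
```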
Dashboard Panels
Your dashboard should include:
- Request Rate: rate(ml_predictions_total[5m])
- Error Rate: sum(rate(ml_predictions_total{status="error"}[5m])) / sum(rate(ml_predictions_total[5m])) * 100
- Latency Percentiles: histogram_quantile(0.95, rate(ml_prediction_duration_seconds_bucket[5m]))
- Model Distribution: pie chart of predictions by model
- Resource Usage: CPU and memory from Kubernetes metrics
Acceptance criteria:
- Prometheus and Grafana are installed
- Application exposes metrics on the /metrics endpoint
- ServiceMonitor successfully scrapes metrics
- Dashboard has at least 5 meaningful panels
- Dashboard JSON is committed to repository
- Screenshots of dashboard included in PR
PR3: Drift Detection
Objective: Add drift detection to your ML pipeline.

Requirements:
- Choose a drift detection method:
  - Evidently for Python-based detection
  - Seldon Core with Alibi Detect for a Kubernetes-native solution
- Implement drift checking in your pipeline (Airflow/Kubeflow/Dagster)
- Define reference data (baseline distribution)
- Configure alerts when drift is detected
- Log drift metrics to monitoring system
Evidently in Airflow DAG
Seldon with Drift Detector
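A heavily simplified sketch. Seldon Core typically runs an Alibi Detect drift detector as a separate service fed by the predictor's payload logger; the logger URL, model URI, and server type below are all assumptions for your setup.

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: ml-inference
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://my-bucket/models/classifier  # assumption
        logger:
          # Forward request payloads to the drift-detector service
          url: http://drift-detector.default/
          mode: request
```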
Kubeflow Pipeline Component
Acceptance criteria:
- Drift detection is integrated into the pipeline
- Reference data is defined and stored
- Drift checks run automatically on schedule or trigger
- Alerts are sent when drift exceeds threshold
- Drift metrics are logged/visualized
- Documentation explains how to update reference data
Monitoring Plan Document
Objective: Design and document your monitoring strategy.

Requirements: Update your Google Doc with sections covering:
System Monitoring
Document your approach to infrastructure monitoring:
- Metrics tracked: Latency, throughput, error rates, resource usage
- Dashboards: Link to Grafana dashboards
- Alerts: List alert conditions and thresholds
- On-call procedures: How to respond to incidents
ML Monitoring
Document your approach to model monitoring:
- Input monitoring: Feature distributions, data quality checks
- Output monitoring: Prediction distributions, confidence scores
- Performance monitoring: Metrics when ground truth is available
- Drift detection: Methods used, thresholds, remediation actions
Ground Truth Collection
Explain how you collect labels for validation:
- Collection method: User feedback, manual labeling, delayed outcomes, etc.
- Frequency: How often labels are collected
- Coverage: Percentage of predictions that get labeled
- Labeling process: Tools and workflows used
Alert Definitions
List all alerts with details:
| Alert Name | Condition | Severity | Action |
|---|---|---|---|
| High Error Rate | error_rate > 5% | Critical | Page on-call |
| Drift Detected | drift_score > 0.1 | Warning | Investigate data |
| High Latency | p95_latency > 2s | Warning | Check resources |
| Low Accuracy | accuracy < 0.85 | Critical | Trigger retraining |
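The first table row could be expressed as a PrometheusRule for kube-prometheus-stack. A sketch: the condition and severity come from the table, while the resource names and `release` label are assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-inference-alerts
  labels:
    release: kube-prometheus-stack  # must match your Helm release name
spec:
  groups:
    - name: ml-inference
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(ml_predictions_total{status="error"}[5m]))
              / sum(rate(ml_predictions_total[5m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "ML prediction error rate above 5% for 5 minutes"
```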
Incident Response
Document your incident response process:
- Detection: How are issues discovered?
- Triage: Who investigates and assigns priority?
- Diagnosis: What tools and data are used?
- Remediation: Common fixes and rollback procedures
- Post-mortem: How are incidents documented and learnings shared?
Acceptance criteria:
- Document includes all required sections
- Monitoring plan is specific to your application
- Alert thresholds are justified based on requirements
- Ground truth collection strategy is clearly defined
- Incident response procedures are documented
Evaluation Criteria
Your work will be evaluated based on:
- Completeness: all 3 PRs are merged
- Functionality: Monitoring tools work as intended
- Code quality: Clean, well-documented code
- Documentation: Monitoring plan is thorough and clear
- Integration: Monitoring is integrated into existing pipeline
Homework 14: Tools, LLMs, and Data Moat
Key Tasks
PR1: Managed Model Monitoring
Objective: Use a managed model-monitoring service.

Requirements:
- Sign up for a managed monitoring service
- Integrate your application to send data
- Configure dashboards and alerts
- Document the integration process
PR2: LLM Monitoring
Objective: Add specialized monitoring for LLM applications.

Requirements:
- Integrate LangSmith, AgentOps, or a similar tool
- Track token usage and costs
- Monitor prompt-response pairs
- Set up alerts for high costs or latency
PR3: Close the Loop
Objective: Create a labeling pipeline from production data.

Requirements:
- Sample production predictions for labeling
- Create a dataset in a labeling tool (e.g., Argilla, Label Studio)
- Document the labeling workflow
- Show how labeled data feeds back into training
Data Moat Strategy Document
Objective: Plan how to build a competitive advantage with data.

Requirements: Document in your Google Doc:
- Data collection: What production data do you collect?
- Data enrichment: How do you improve data quality over time?
- Feedback loops: How do users provide corrections/labels?
- Model improvement: How is new data used for retraining?
- Competitive advantage: Why is your data unique and valuable?
Reading Materials
Review these resources to deepen your understanding.

Homework 13 Readings
- Underspecification Presents Challenges for Credibility in Modern Machine Learning
- How ML Breaks: A Decade of Outages for One Large ML Pipeline
- Data Distribution Shifts and Monitoring
- Monitoring Machine Learning Systems
- Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
Homework 14 Readings
- Evidently ML Observability Course
- LangSmith Documentation
- RLHF: Reinforcement Learning from Human Feedback
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Tips for Success
Start Early
Monitoring setup can be time-consuming. Begin with simple instrumentation and iterate.
Use Examples
Reference the module examples (sql_app.py, reviewer.py) when implementing observability.
Test Thoroughly
Verify traces and metrics are appearing before submitting PRs.
Document Well
Clear documentation helps reviewers understand your approach and makes maintenance easier.
Getting Help
If you're stuck:
- Review the module documentation and examples
- Check tool documentation (SigNoz, Grafana, Evidently, Seldon)
- Ask in the Discord community
- Look at past student submissions (if available)
Next Steps
Module 8: Cloud Platforms
Learn about deploying ML systems on cloud platforms and buy vs. build decisions