MLOps · ML System Design · Hard
Design a Model Monitoring and Alerting System
Asked at Google, Uber, DoorDash
Your organization has 200+ ML models in production serving critical business functions: fraud detection, recommendation, pricing, search ranking, and customer churn prediction. Each model team currently monitors its own models with ad-hoc dashboards and alerting. The result is inconsistent coverage, alert fatigue, and incidents that surface through customer complaints before any monitoring system catches them. You need to design a centralized model monitoring and alerting platform.
Scale Requirements
- 200+ models across 30+ teams, each generating 10K-10M predictions/day
- Monitor 1,000+ features for data drift across all models
- Detect performance degradation within 15 minutes of onset
- Support both real-time models (sub-100 ms latency) and batch models (hourly or daily predictions)
- Alert noise target: fewer than 5 false alerts per week across all models
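It can help to turn these requirements into rough numbers before designing anything. The sketch below is a back-of-envelope calculation only. It assumes exactly 200 models (the prompt says "200+", so the results are lower bounds) and evenly spread traffic, which real workloads will not have. The helper names are illustrative and not part of the question.

```python
# Back-of-envelope sizing from the stated scale requirements.
# Assumptions (not from the prompt): exactly 200 models, traffic spread
# evenly over the day. Treat every result as a rough lower bound.
SECONDS_PER_DAY = 24 * 60 * 60

MODELS = 200
PER_MODEL_LOW = 10_000        # predictions/day, low end of the stated range
PER_MODEL_HIGH = 10_000_000   # predictions/day, high end of the stated range
FALSE_ALERTS_PER_WEEK = 5     # platform-wide noise budget


def daily_volume(models: int, per_model_per_day: int) -> int:
    """Total predictions per day if every model runs at the given volume."""
    return models * per_model_per_day


def avg_rate_per_second(total_per_day: int) -> float:
    """Average events per second, ignoring peaks."""
    return total_per_day / SECONDS_PER_DAY


low = daily_volume(MODELS, PER_MODEL_LOW)
high = daily_volume(MODELS, PER_MODEL_HIGH)

# Noise budget per model: how many weeks may pass, on average,
# between false alerts for any single model.
weeks_between_false_alerts = MODELS / FALSE_ALERTS_PER_WEEK

print(f"daily volume: {low:,} to {high:,} predictions")
print(f"average rate: {avg_rate_per_second(low):.1f} to "
      f"{avg_rate_per_second(high):.1f} predictions/s")
print(f"per-model false-alert budget: one every "
      f"{weeks_between_false_alerts:.0f} weeks")
```

Under these assumptions the platform ingests between 2 million and 2 billion prediction events per day, and each model can raise a false alert only about once every 40 weeks. That second figure is why per-model static thresholds alone are unlikely to meet the noise target.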
Design Requirements
- Define what should be monitored — design a comprehensive monitoring taxonomy.
- Design the metric collection and computation pipeline.
- Design the alerting system — how to balance sensitivity against alert fatigue.
- Explain your approach to root cause analysis when an alert fires.
- Describe how monitoring integrates with model retraining and rollback workflows.