Design a Model Monitoring and Alerting System — Practice

Your organization has 200+ ML models in production serving critical business functions: fraud detection, recommendation, pricing, search ranking, and customer churn prediction. Each model team currently monitors its own models with ad hoc dashboards and alerting. The result is inconsistent coverage, alert fatigue, and incidents that are detected through customer complaints rather than by monitoring systems. You need to design a centralized model monitoring and alerting platform.

Scale Requirements

200+ models across 30+ teams, each model generating 10K to 10M predictions/day
Monitor 1,000+ features for data drift across all models
Detect performance degradation within 15 minutes of onset
Support both real-time models (sub-100 ms latency) and batch models (hourly or daily predictions)
Alert noise target: fewer than 5 false alerts per week across all models
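
Before designing anything, it can help to turn the scale requirements above into rough numbers. The sketch below is a back-of-envelope estimate, not part of the problem statement. It makes two assumptions: that every model is evaluated once per 15-minute window (taken from the detection target), and that each model has a single alerting check. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_WEEK = 7 * 24 * 60


def prediction_rate_bounds(num_models, per_model_daily_min, per_model_daily_max):
    """Return (low, high) average fleet-wide predictions per second."""
    low = num_models * per_model_daily_min / SECONDS_PER_DAY
    high = num_models * per_model_daily_max / SECONDS_PER_DAY
    return low, high


def false_alert_budget(num_models, checks_per_model, interval_minutes,
                       max_false_alerts_per_week):
    """Return (evaluations per week, tolerable false-positive rate per evaluation)."""
    windows_per_week = MINUTES_PER_WEEK // interval_minutes
    evaluations = num_models * checks_per_model * windows_per_week
    return evaluations, max_false_alerts_per_week / evaluations


# Figures from the scale requirements: 200 models, 10K to 10M predictions/day each.
low, high = prediction_rate_bounds(200, 10_000, 10_000_000)

# Assumption: one check per model, evaluated every 15 minutes, budget of 5 false alerts/week.
evaluations, fp_rate = false_alert_budget(200, 1, 15, 5)

print(f"average load: {low:.0f} to {high:.0f} predictions/sec")
print(f"evaluations per week: {evaluations}")
print(f"tolerable false-positive rate per evaluation: {fp_rate:.2e}")
```

Under these assumptions the fleet averages somewhere between about 23 and about 23,000 predictions per second, and 134,400 alert evaluations per week leave room for a false-positive rate of only about 3.7e-5 per evaluation. If each of the 1,000+ monitored features had its own drift check, the per-evaluation budget would be tighter still, which suggests that independent per-metric thresholds alone are unlikely to meet the noise target.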

Design Requirements

Define what should be monitored — design a comprehensive monitoring taxonomy.
Design the metric collection and computation pipeline.
Design the alerting system — how to balance sensitivity against alert fatigue.
Explain your approach to root cause analysis when an alert fires.
Describe how monitoring integrates with model retraining and rollback workflows.