Design a Model Monitoring and Alerting System

Asked at Google, Uber, DoorDash

Your organization has 200+ ML models in production serving critical business functions: fraud detection, recommendation, pricing, search ranking, and customer churn prediction. Each model team currently monitors its own models with ad-hoc dashboards and alerting, which leads to inconsistent coverage, alert fatigue, and incidents that surface through customer complaints rather than through monitoring. You need to design a centralized model monitoring and alerting platform.

Scale Requirements

  • 200+ models across 30+ teams, each generating 10K-10M predictions/day
  • Monitor 1,000+ features for data drift across all models
  • Detect performance degradation within 15 minutes of onset
  • Support both real-time models (sub-100 ms inference latency) and batch models (hourly or daily prediction runs)
  • Alert noise target: fewer than 5 false alerts per week across all models
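The numbers above can be turned into a rough sizing estimate before any design work. The sketch below is back-of-envelope arithmetic only. It treats "200+" as exactly 200 models, and it assumes one alert evaluation per model every 15 minutes, which is an illustrative choice taken from the detection target rather than something the question specifies. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def capacity_estimate(models=200, low_per_day=10_000, high_per_day=10_000_000):
    """Fleet-wide prediction volume if every model sits at the low or high end."""
    low_total = models * low_per_day
    high_total = models * high_per_day
    return {
        "low_per_day": low_total,
        "high_per_day": high_total,
        # Average rate only; real traffic is bursty, so peaks will be higher.
        "high_avg_per_sec": high_total / SECONDS_PER_DAY,
    }


def false_alert_budget(models=200, false_alerts_per_week=5, check_interval_min=15):
    """Per-check false positive rate implied by the weekly noise target.

    Assumes (hypothetically) one alert evaluation per model per interval.
    """
    checks_per_model_per_week = 7 * 24 * 60 // check_interval_min
    total_checks = models * checks_per_model_per_week
    return {
        "checks_per_week": total_checks,
        "max_false_positive_rate_per_check": false_alerts_per_week / total_checks,
    }


if __name__ == "__main__":
    print(capacity_estimate())
    print(false_alert_budget())
```

Under these assumptions the upper bound is 2 billion predictions per day, and the fleet runs 134,400 alert evaluations per week. Five false alerts across that many checks leaves a per-check false positive budget of a few in 100,000, far stricter than a naive per-metric threshold typically delivers. Evaluating several metrics per model tightens the budget further.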

Design Requirements

  1. Define what should be monitored — design a comprehensive monitoring taxonomy.
  2. Design the metric collection and computation pipeline.
  3. Design the alerting system — how to balance sensitivity against alert fatigue.
  4. Explain your approach to root cause analysis when an alert fires.
  5. Describe how monitoring integrates with model retraining and rollback workflows.
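As one possible starting point for requirements 2 and 3, the sketch below computes a drift score for a single numeric feature and only raises an alert after the score stays above a threshold for several consecutive windows. It uses the Population Stability Index (PSI) as the drift metric. That is one common choice, not something the question prescribes. The 0.2 threshold is a widely used rule of thumb, and the three-window requirement, the class name, and the function name are illustrative assumptions.

```python
import math
from bisect import bisect_right


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges come from quantiles of the reference sample, so each bin
    holds roughly the same share of reference data.
    """
    ref_sorted = sorted(reference)
    # bins - 1 interior edges split the value range into `bins` buckets.
    edges = [ref_sorted[int(len(ref_sorted) * i / bins)] for i in range(1, bins)]

    def fractions(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor at eps so empty buckets do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p = fractions(reference)
    q = fractions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


class SustainedBreachAlert:
    """Fires only after `required_windows` consecutive breaches (assumed policy)."""

    def __init__(self, threshold=0.2, required_windows=3):
        self.threshold = threshold
        self.required_windows = required_windows
        self.streak = 0

    def observe(self, value):
        # Any window below the threshold resets the streak.
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.required_windows


if __name__ == "__main__":
    reference = list(range(1000))
    shifted = [x + 500 for x in reference]
    alert = SustainedBreachAlert()
    for window in range(3):
        score = psi(reference, shifted)
        print(window, round(score, 3), alert.observe(score))
```

Requiring a sustained breach trades detection latency for fewer false alerts: with 5-minute windows, three consecutive breaches still fit inside the 15-minute detection target, while a single noisy window no longer pages anyone.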
