MLOps·ML System Design·Medium
Design a Model Deployment and Rollback Strategy
Asked at Netflix, Google, Amazon
Your team deploys ML models behind REST APIs that serve millions of predictions per hour. A recent model update caused a 3-hour outage because the deployment replaced the old model before verifying the new one worked correctly, and rolling back required a full rebuild of the old serving container. Design a deployment and rollback strategy that enables safe, fast model updates with minimal risk.
Constraints
- 100+ model endpoints serving production traffic
- Deployment frequency: 10-50 deployments/day across all models
- Required: zero-downtime deployments (no dropped requests during model swap)
- Required: rollback in under 2 minutes if a deployed model is degraded
- Models range from 10 MB to 10 GB in size (the largest include embedding tables and large transformer models)
Design Requirements
- Compare deployment strategies (blue-green, canary, rolling, shadow) and recommend one.
- Design the traffic routing system that directs requests to the right model version.
- Design the health check and automated rollback mechanism.
- Address the challenge of large model loading times (a 10 GB model takes minutes to load).
- Explain how you validate a deployment before promoting it to production.
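As a starting point for the routing and automated-rollback requirements above, here is a minimal Python sketch of a canary router. It is an illustration under stated assumptions, not a reference design. The class names (`CanaryRouter`, `VersionStats`), the version labels, and the thresholds (`min_requests`, `max_error_rate`) are all hypothetical. The sketch assumes both model versions are already loaded in memory, so rollback is only a routing change and never waits on a container rebuild or a multi-gigabyte model load.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class VersionStats:
    """Request and error counters for one model version."""
    requests: int = 0
    errors: int = 0

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


class CanaryRouter:
    """Splits traffic between a stable version and an optional candidate.

    Assumption: both versions stay loaded, so rollback only changes routing.
    """

    def __init__(self, stable: str, rng: Optional[random.Random] = None):
        self.stable = stable
        self.candidate: Optional[str] = None
        self.candidate_weight = 0.0
        self.stats: Dict[str, VersionStats] = {stable: VersionStats()}
        self._rng = rng or random.Random()

    def start_canary(self, candidate: str, weight: float) -> None:
        """Begin sending a fraction `weight` of traffic to `candidate`."""
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must be between 0 and 1")
        self.candidate = candidate
        self.candidate_weight = weight
        self.stats[candidate] = VersionStats()

    def route(self) -> str:
        """Pick the version that should serve the next request."""
        if self.candidate is not None and self._rng.random() < self.candidate_weight:
            return self.candidate
        return self.stable

    def record(self, version: str, ok: bool) -> None:
        """Record the outcome of one request served by `version`."""
        stats = self.stats[version]
        stats.requests += 1
        if not ok:
            stats.errors += 1

    def rollback(self) -> None:
        """Stop routing to the candidate; the stable version takes all traffic."""
        self.candidate = None
        self.candidate_weight = 0.0

    def promote(self) -> None:
        """Make the candidate the new stable version."""
        if self.candidate is None:
            raise RuntimeError("no candidate to promote")
        self.stable = self.candidate
        self.rollback()

    def evaluate(self, min_requests: int = 100, max_error_rate: float = 0.02) -> str:
        """Roll back automatically if the candidate's error rate is too high."""
        if self.candidate is None:
            return "idle"
        stats = self.stats[self.candidate]
        if stats.requests < min_requests:
            return "waiting"  # not enough traffic to judge yet
        if stats.error_rate > max_error_rate:
            self.rollback()
            return "rolled_back"
        return "healthy"
```

In use, a serving loop would call `route()` per request, `record()` with the outcome, and `evaluate()` on a timer. A fuller answer would likely also compare latency and prediction-quality metrics against the stable version, not only error rate.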