Tech Abstractions
MLOps·ML System Design·Medium

Design a Model Deployment and Rollback Strategy

Asked at Netflix, Google, Amazon

Your team deploys ML models behind REST APIs that serve millions of predictions per hour. A recent model update caused a 3-hour outage because the deployment replaced the old model before verifying the new one worked correctly, and rolling back required a full rebuild of the old serving container. Design a deployment and rollback strategy that enables safe, fast model updates with minimal risk.

Constraints

  • 100+ model endpoints serving production traffic
  • Deployment frequency: 10-50 deployments/day across all models
  • Required: zero-downtime deployments (no dropped requests during model swap)
  • Required: rollback in under 2 minutes if a deployed model is degraded
  • Model sizes range from 10 MB to 10 GB (the upper end includes embedding tables and large transformer models)

Design Requirements

  1. Compare deployment strategies (blue-green, canary, rolling, shadow) and recommend one.
  2. Design the traffic routing system that directs requests to the right model version.
  3. Design the health check and automated rollback mechanism.
  4. Address the challenge of large model loading times (a 10GB model takes minutes to load).
  5. Explain how you validate a deployment before promoting it to production.
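
As one possible starting point for requirements 2 and 3, the sketch below shows canary-style routing with an automated rollback trigger. It is a minimal illustration, not a reference answer. The class and method names (CanaryRouter, start_canary, record) and the thresholds (5% error rate, 100-request minimum) are hypothetical. It assumes both model versions are already loaded and warm in the serving fleet, so a rollback is only a routing change and does not require rebuilding a container.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VersionStats:
    """Request and error counters for one model version."""
    requests: int = 0
    errors: int = 0

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


@dataclass
class CanaryRouter:
    """Hypothetical router: sends a fixed percentage of traffic to a candidate."""
    stable: str
    candidate: Optional[str] = None
    canary_percent: int = 0
    max_error_rate: float = 0.05   # assumed rollback threshold
    min_requests: int = 100        # assumed minimum sample before judging
    stats: Dict[str, VersionStats] = field(default_factory=dict)

    def start_canary(self, version: str, percent: int) -> None:
        self.candidate = version
        self.canary_percent = percent
        self.stats[version] = VersionStats()

    def route(self, request_id: str) -> str:
        """Pick a version deterministically from a hash of the request id."""
        if self.candidate is None:
            return self.stable
        digest = hashlib.sha256(request_id.encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return self.candidate if bucket < self.canary_percent else self.stable

    def record(self, version: str, ok: bool) -> None:
        """Record one outcome and roll back if the candidate looks degraded."""
        s = self.stats.setdefault(version, VersionStats())
        s.requests += 1
        if not ok:
            s.errors += 1
        if (
            version == self.candidate
            and s.requests >= self.min_requests
            and s.error_rate > self.max_error_rate
        ):
            self.rollback()

    def rollback(self) -> None:
        """Stop routing to the candidate; the stable version keeps serving."""
        self.candidate = None
        self.canary_percent = 0

    def promote(self) -> None:
        """Make the candidate the new stable version."""
        if self.candidate is not None:
            self.stable = self.candidate
        self.candidate = None
        self.canary_percent = 0
```

Because the stable version stays loaded for the whole canary window, rollback here is a pointer change rather than a redeploy, which is one way to approach the under-2-minute rollback constraint. A fuller answer would also need to cover latency and prediction-quality signals, not only error rate, and how large models are pre-loaded before they receive traffic.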
