MLOps · ML System Design · Hard
Design a Continuous Training Pipeline
Asked at Netflix, Spotify, Uber
Your team manually retrains ML models every 2-3 weeks, which causes two problems. First, model performance degrades noticeably between retraining cycles as the data drifts. Second, the manual process is error-prone: a recent retraining run accidentally used stale features and degraded a production model for 4 hours before it was caught. Design an automated continuous training pipeline that keeps models fresh without sacrificing reliability.
Scale Requirements
- 50 models to be continuously trained, each with different data sources and retraining cadences
- Training data volumes: 1 GB to 500 GB per model per training run
- Training compute: mix of CPU (gradient boosting, scikit-learn) and GPU (deep learning, transformers)
- New training data arrives continuously: 10 TB/day total across all models
- Maximum acceptable time from "model needs retraining" to "new model in production": 4 hours
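The scale requirements above imply a few throughput figures that are worth working out before designing anything. The sketch below derives them using only the numbers in the list; decimal units (1 TB = 1000 GB) and the helper name `mb_per_second` are assumptions made for illustration.

```python
# Back-of-envelope numbers derived only from the scale requirements above.
# Assumption: decimal units (1 TB = 1000 GB, 1 GB = 1000 MB).

SECONDS_PER_DAY = 24 * 60 * 60
SLA_SECONDS = 4 * 60 * 60  # 4-hour limit from "needs retraining" to production


def mb_per_second(gigabytes: float, seconds: float) -> float:
    """Average throughput in MB/s needed to move `gigabytes` in `seconds`."""
    return gigabytes * 1000 / seconds


# Average ingest rate for 10 TB/day across all models.
avg_ingest_mb_s = mb_per_second(10 * 1000, SECONDS_PER_DAY)

# Lower bound on read throughput for the largest (500 GB) training run,
# if the whole 4-hour window were spent only on reading the data.
worst_case_read_mb_s = mb_per_second(500, SLA_SECONDS)

# Average new data per model per day if the 10 TB were spread evenly
# over 50 models (the real distribution is not given in the requirements).
avg_daily_gb_per_model = 10 * 1000 / 50

print(f"average ingest:        {avg_ingest_mb_s:.1f} MB/s")
print(f"500 GB within the SLA: {worst_case_read_mb_s:.1f} MB/s minimum")
print(f"even split per model:  {avg_daily_gb_per_model:.0f} GB/day")
```

The 500 GB figure is only a floor: data generation, training, validation, and deployment all share the same 4-hour window, so the read stage would need to run considerably faster than this in practice.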
Design Requirements
- Design the trigger mechanism: what events should trigger a retraining run?
- Design the automated training data generation pipeline.
- Design the model validation and deployment gating system.
- Design how this integrates with experiment tracking and model registry.
- Explain your failure recovery strategy: what if a retraining run produces a worse model?
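To make the gating and failure-recovery requirements above more concrete, here is one possible shape for a promotion gate: a candidate model is promoted only if its features are fresh and it does not score worse than the current production model. This is a minimal sketch, not the expected answer. The names `EvalResult` and `gate`, the single "higher is better" metric, and the default thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical evaluation record for one model version."""
    model_version: str
    metric: float             # assumed: higher is better, same held-out set
    feature_age_hours: float  # age of the newest feature snapshot used


def gate(
    candidate: EvalResult,
    champion: EvalResult,
    min_gain: float = 0.0,                # hypothetical threshold
    max_feature_age_hours: float = 24.0,  # hypothetical threshold
) -> Tuple[bool, str]:
    """Return (promote?, reason). A rejection keeps the champion serving."""
    # Guard against the stale-feature incident described in the prompt.
    if candidate.feature_age_hours > max_feature_age_hours:
        return False, "stale features"
    # Reject candidates that do not at least match the production model.
    if candidate.metric < champion.metric + min_gain:
        return False, "no improvement over champion"
    return True, "promote"


champion = EvalResult("v1", metric=0.80, feature_age_hours=2.0)
print(gate(EvalResult("v2", 0.82, 1.0), champion))
print(gate(EvalResult("v3", 0.90, 48.0), champion))
print(gate(EvalResult("v4", 0.79, 1.0), champion))
```

Because a rejected candidate never replaces the champion, a bad retraining run fails closed: production keeps serving the previous version. A fuller design would also need rollback for regressions that only appear after deployment.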