Tech Abstractions
MLOps·ML System Design·Hard

Design a Continuous Training Pipeline

Asked at Netflix, Spotify, Uber

Your team manually retrains ML models every 2-3 weeks, which causes two problems. First, model performance degrades noticeably between retraining cycles as data drifts. Second, the manual process is error-prone: a recent retraining run accidentally used stale features and degraded a production model for 4 hours before the problem was caught. Design an automated continuous training pipeline that keeps models fresh without sacrificing reliability.

Scale Requirements

  • 50 models to be continuously trained, each with different data sources and retraining cadences
  • Training data volumes: 1 GB to 500 GB per model per training run
  • Training compute: mix of CPU (gradient boosting, scikit-learn) and GPU (deep learning, transformers)
  • New training data arrives continuously: 10 TB/day total across all models
  • Maximum acceptable time from "model needs retraining" to "new model in production": 4 hours
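
The scale figures above translate into rough throughput targets. The following is a back-of-envelope sketch, not part of the original question. It assumes decimal units (1 TB = 10^12 bytes), and the function names and the `read_fraction` parameter are hypothetical, introduced only for illustration.

```python
# Back-of-envelope throughput estimates from the stated scale requirements.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 GB = 10**9, 1 MB = 10**6).

SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12
GB = 10**9
MB = 10**6


def avg_ingest_mb_per_s(tb_per_day: float) -> float:
    """Average sustained ingest rate in MB/s for a given daily data volume."""
    return tb_per_day * TB / SECONDS_PER_DAY / MB


def min_read_mb_per_s(dataset_gb: float, budget_hours: float,
                      read_fraction: float = 1.0) -> float:
    """Minimum sustained read rate in MB/s to scan one dataset.

    read_fraction is the (assumed) share of the time budget that can be
    spent reading data, with the rest left for training, validation, and
    deployment.
    """
    read_seconds = budget_hours * 3600 * read_fraction
    return dataset_gb * GB / read_seconds / MB


if __name__ == "__main__":
    print(f"Average ingest: {avg_ingest_mb_per_s(10):.2f} MB/s")
    print(f"500 GB, full 4 h budget: {min_read_mb_per_s(500, 4):.2f} MB/s")
    print(f"500 GB, 25% of 4 h budget: {min_read_mb_per_s(500, 4, 0.25):.2f} MB/s")
```

Under these assumptions, 10 TB/day averages out to roughly 116 MB/s of ingest, and reading the largest 500 GB dataset within a quarter of the 4-hour budget needs roughly 139 MB/s of sustained read throughput. Real traffic is likely burstier than the average, so these numbers are lower bounds rather than provisioning targets.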

Design Requirements

  1. Design the trigger mechanism — what events should trigger a retraining run?
  2. Design the automated training data generation pipeline.
  3. Design the model validation and deployment gating system.
  4. Design how this integrates with experiment tracking and model registry.
  5. Explain your failure recovery strategy — what if a retraining produces a worse model?
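
As one possible starting point for requirements 1, 3, and 5, the sketch below shows a trigger check and a deployment gate. It is a minimal illustration under stated assumptions, not a reference answer: every name, field, and threshold here (`ModelStatus`, `retrain_reason`, `gate`, the drift score, the metric convention "higher is better") is hypothetical and would need to be adapted per model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass
class ModelStatus:
    """Hypothetical monitoring snapshot for one production model."""
    last_trained: datetime
    max_age: timedelta        # per-model retraining cadence
    drift_score: float        # any drift statistic; assumed higher = more drift
    drift_threshold: float
    live_metric: float        # assumed convention: higher is better
    baseline_metric: float    # metric measured when the model was deployed
    max_metric_drop: float


def retrain_reason(status: ModelStatus, now: datetime) -> Optional[str]:
    """Return why a retraining run should start, or None if none is needed."""
    if status.baseline_metric - status.live_metric > status.max_metric_drop:
        return "performance_drop"
    if status.drift_score > status.drift_threshold:
        return "data_drift"
    if now - status.last_trained > status.max_age:
        return "schedule"
    return None


def gate(candidate_metric: float, champion_metric: float,
         feature_age: timedelta, max_feature_age: timedelta,
         min_gain: float = 0.0) -> Tuple[bool, str]:
    """Decide whether a candidate model may replace the current champion.

    The feature freshness check comes first so that a run built on stale
    features is rejected regardless of its offline metric.
    """
    if feature_age > max_feature_age:
        return False, "stale_features"
    if candidate_metric < champion_metric + min_gain:
        return False, "no_improvement"
    return True, "promote"


if __name__ == "__main__":
    status = ModelStatus(
        last_trained=datetime(2024, 1, 1), max_age=timedelta(days=7),
        drift_score=0.1, drift_threshold=0.2,
        live_metric=0.90, baseline_metric=0.91, max_metric_drop=0.02,
    )
    print(retrain_reason(status, datetime(2024, 1, 15)))
    print(gate(0.92, 0.91, timedelta(hours=1), timedelta(hours=6)))
```

In this sketch a rejected candidate simply never replaces the champion, which covers the "worse model" case before deployment. A fuller design would also need post-deployment monitoring and rollback to the previous registry version, which this example does not attempt.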
