Design a Continuous Training Pipeline — Practice

Your team manually retrains ML models every 2-3 weeks, which causes two problems. First, model performance degrades noticeably between retraining cycles as data drifts. Second, the manual process is error-prone: a recent retraining run accidentally used stale features and degraded a production model for 4 hours before it was caught. Design an automated continuous training pipeline that keeps models fresh without sacrificing reliability.

Scale Requirements

50 models to be continuously trained, each with its own data sources and retraining cadence
Training data volumes: 1GB to 500GB per model per training run
Training compute: mix of CPU (gradient boosting, scikit-learn) and GPU (deep learning, transformers)
New training data arrives continuously — 10TB/day total across all models
Maximum acceptable time from "model needs retraining" to "new model in production": 4 hours
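
As a rough sanity check on the numbers above, the following sketch converts them into sustained throughput figures. It assumes decimal units (1 TB = 10^12 bytes) and, purely for illustration, a hypothetical split in which data preparation gets 25% of the 4-hour budget; neither assumption comes from the problem statement.

```python
# Back-of-envelope sizing for the stated scale requirements.
# Assumption: decimal units (1 TB = 1e12 bytes, 1 GB = 1e9, 1 MB = 1e6).
TB, GB, MB = 1e12, 1e9, 1e6

DAILY_INGEST_BYTES = 10 * TB      # new data per day, all models combined
LARGEST_RUN_BYTES = 500 * GB      # largest single training run
BUDGET_SECONDS = 4 * 3600         # "needs retraining" -> "in production"

# Hypothetical: fraction of the 4-hour budget spent reading/preparing data.
DATA_PREP_SHARE = 0.25


def mb_per_second(num_bytes: float, seconds: float) -> float:
    """Sustained throughput needed to move num_bytes within seconds."""
    return num_bytes / seconds / MB


# Average ingest rate if arrivals were spread evenly over the day.
avg_ingest_mb_s = mb_per_second(DAILY_INGEST_BYTES, 24 * 3600)

# Lower bound: reading the largest dataset once using the whole budget.
min_read_mb_s = mb_per_second(LARGEST_RUN_BYTES, BUDGET_SECONDS)

# Same read, but confined to the hypothetical data-prep share of the budget.
prep_read_mb_s = mb_per_second(LARGEST_RUN_BYTES, BUDGET_SECONDS * DATA_PREP_SHARE)

print(f"average ingest:        {avg_ingest_mb_s:.1f} MB/s")
print(f"largest run, full 4 h: {min_read_mb_s:.1f} MB/s")
print(f"largest run, 25% prep: {prep_read_mb_s:.1f} MB/s")
```

Real traffic is likely burstier than the daily average suggests, so these figures are floors rather than provisioning targets.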

Design Requirements

Design the trigger mechanism — what events should trigger a retraining run?
Design the automated training data generation pipeline.
Design the model validation and deployment gating system.
Design how this integrates with experiment tracking and model registry.
Explain your failure recovery strategy — what if a retraining run produces a worse model?
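
One possible starting point for the gating and failure-recovery questions is a promotion check that blocks a candidate when its features are stale or its metric falls below the production baseline. This is a minimal sketch, not a reference answer: `Candidate`, `promotion_gate`, the 24-hour freshness limit, and the "higher metric is better" convention are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of a freshly trained model."""
    metric: float                     # assumption: higher is better
    feature_snapshot_time: datetime   # when its training features were built


def promotion_gate(
    candidate: Candidate,
    production_metric: float,
    now: datetime,
    max_feature_age: timedelta = timedelta(hours=24),  # hypothetical limit
    min_improvement: float = 0.0,                      # hypothetical margin
) -> tuple[bool, list[str]]:
    """Return (promote?, reasons for blocking). An empty list means promote."""
    reasons = []
    # Guard against the stale-feature incident described in the scenario.
    if now - candidate.feature_snapshot_time > max_feature_age:
        reasons.append("stale_features")
    # Keep the current production model if the candidate is not at least as good.
    if candidate.metric < production_metric + min_improvement:
        reasons.append("metric_regression")
    return (not reasons, reasons)


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
fresh_good = Candidate(0.91, now - timedelta(hours=2))
stale_worse = Candidate(0.85, now - timedelta(days=3))

print(promotion_gate(fresh_good, production_metric=0.90, now=now))
print(promotion_gate(stale_worse, production_metric=0.90, now=now))
```

In this sketch a blocked candidate simply leaves the existing production model in place; a fuller design would also need a rollback path for regressions that only show up after deployment.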