MLOps·ML System Design·Medium
Design an ML Experiment Tracking Platform
Asked at Weights & Biases, MLflow, Neptune
Your ML team runs hundreds of experiments weekly — different model architectures, hyperparameters, feature sets, and training data configurations. Currently, results are tracked in spreadsheets and shared via Slack messages. Experiments are frequently lost ("what learning rate did we use for that good run last month?"), results can't be compared systematically, and reproducing a past experiment requires detective work. Design an experiment tracking platform to solve these problems.
Scale Requirements
- 50 data scientists running 500+ experiments/week
- Each experiment logs 10-100 metrics (loss, accuracy, AUC, custom metrics), each recorded over 100-10,000 steps
- Total metric data volume: 10M+ data points/day
- Experiments run anywhere from 5 minutes to 7 days (the longest being distributed training jobs)
- Users need to compare 10-50 experiments simultaneously in visualizations
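As a rough sanity check on these numbers, the sketch below bounds the daily metric volume implied by the bullets above. It assumes, purely for illustration, that experiments are spread evenly across a 7-day week and that every metric is logged at every step; the function and constant names are hypothetical.

```python
# Back-of-envelope ingest estimate from the scale requirements.
# Assumptions (not from the problem statement): even load across 7 days,
# and every metric logged once per step.
EXPERIMENTS_PER_WEEK = 500
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(metrics_per_experiment: int, steps_per_metric: int) -> float:
    """Average data points per day for a given per-experiment logging profile."""
    experiments_per_day = EXPERIMENTS_PER_WEEK / 7
    return experiments_per_day * metrics_per_experiment * steps_per_metric


def average_write_rate(daily_points: float) -> float:
    """Average points per second if the daily volume were spread evenly."""
    return daily_points / SECONDS_PER_DAY


low = points_per_day(10, 100)          # fewest metrics, fewest steps
high = points_per_day(100, 10_000)     # most metrics, most steps
target = 10_000_000                    # the stated 10M+ points/day

print(f"low estimate:  {low:,.0f} points/day")
print(f"high estimate: {high:,.0f} points/day")
print(f"10M/day average rate: {average_write_rate(target):.1f} points/sec")
```

Under these assumptions the range runs from about 71 thousand to about 71 million points per day, so the stated 10M+ figure sits inside it. Averaged over a day, 10M points is only about 116 writes per second; the harder part is likely burstiness, since many runs can log at once.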
Design Requirements
- Design the experiment data model — what metadata should each experiment track?
- Design the logging API that data scientists use from their training code.
- Design the visualization and comparison system for analyzing experiments.
- Address scalability — how to handle 10M+ data points per day?
- Design collaboration features that help teams work together on experiments.
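To make the first two requirements concrete, here is one possible starting point, offered as a minimal sketch rather than a reference answer. All names (`Run`, `MetricPoint`, `RunLogger`, `sink`) and fields are hypothetical, the `git_commit` value is a placeholder, and the in-memory list stands in for whatever ingestion service a real design would use. The one idea it illustrates is client-side batching, which keeps per-step logging cheap for training code.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MetricPoint:
    """One value of one metric at one training step."""
    run_id: str
    key: str
    step: int
    value: float
    timestamp: float


@dataclass
class Run:
    """Metadata needed to find, compare, and reproduce an experiment."""
    name: str
    params: Dict[str, object] = field(default_factory=dict)   # hyperparameters
    tags: Dict[str, str] = field(default_factory=dict)
    git_commit: str = ""          # code version
    dataset_version: str = ""     # training data version
    status: str = "running"
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class RunLogger:
    """Buffers metric points and hands them to `sink` in batches."""

    def __init__(self, run: Run, sink: Callable[[List[MetricPoint]], None],
                 batch_size: int = 100) -> None:
        self.run = run
        self._sink = sink
        self._batch_size = batch_size
        self._buffer: List[MetricPoint] = []

    def log_metrics(self, metrics: Dict[str, float], step: int) -> None:
        now = time.time()
        for key, value in metrics.items():
            self._buffer.append(
                MetricPoint(self.run.run_id, key, step, float(value), now))
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._sink(batch)

    def finish(self, status: str = "finished") -> None:
        self.flush()              # send any remaining points before closing
        self.run.status = status


# Usage: an in-memory list stands in for the ingestion backend.
received: List[List[MetricPoint]] = []
run = Run(name="baseline",
          params={"learning_rate": 3e-4, "batch_size": 64},
          git_commit="abc123")   # placeholder commit id
logger = RunLogger(run, sink=received.append, batch_size=4)
for step in range(3):
    logger.log_metrics({"loss": 1.0 / (step + 1), "accuracy": 0.5 + 0.1 * step},
                       step=step)
logger.finish()
```

A fuller answer would also need to cover artifacts, environment capture, retries on failed flushes, and how the backend stores and downsamples the metric series for comparison views.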