
Design an ML Experiment Tracking Platform

Asked at Weights & Biases, MLflow, Neptune

Your ML team runs hundreds of experiments every week, varying model architectures, hyperparameters, feature sets, and training data configurations. Results are currently tracked in spreadsheets and shared via Slack messages. As a result, experiments are frequently lost ("what learning rate did we use for that good run last month?"), results can't be compared systematically, and reproducing a past experiment requires detective work. Design an experiment tracking platform to solve these problems.

Scale Requirements

  • 50 data scientists running 500+ experiments/week
  • Each experiment logs 10-100 metrics (loss, accuracy, AUC, custom metrics), each recorded over 100-10,000 training steps
  • Total metric data volume: 10M+ data points/day
  • Experiments may run from 5 minutes to 7 days (distributed training)
  • Users need to compare 10-50 experiments simultaneously in visualizations
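
The figures above can be sanity-checked with a quick back-of-envelope calculation. The sketch below assumes every experiment logs every metric at every step and that load is spread evenly across the week. The "typical" experiment of 50 metrics over 2,800 steps is an assumption chosen for illustration, not a number from the problem statement.

```python
# Back-of-envelope estimate of daily metric volume from the stated scale.
EXPERIMENTS_PER_WEEK = 500  # lower bound from "500+ experiments/week"
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(metrics_per_experiment: int, steps_per_metric: int) -> float:
    """Average data points per day, assuming an even spread over 7 days."""
    per_week = EXPERIMENTS_PER_WEEK * metrics_per_experiment * steps_per_metric
    return per_week / 7


low = points_per_day(10, 100)          # smallest experiments in the stated range
high = points_per_day(100, 10_000)     # largest experiments in the stated range
typical = points_per_day(50, 2_800)    # assumed mid-range experiment (illustrative)

print(f"low:     {low:>13,.0f} points/day")
print(f"typical: {typical:>13,.0f} points/day")
print(f"high:    {high:>13,.0f} points/day")
print(f"typical average write rate: {typical / SECONDS_PER_DAY:,.1f} points/s")
```

Under these assumptions the stated 10M+ points/day sits in the middle of the possible range (about 71 thousand to about 71 million per day) and corresponds to an average of only around a hundred writes per second. Real traffic is likely burstier than this average, so the design probably needs headroom for peaks rather than for the mean.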

Design Requirements

  1. Design the experiment data model — what metadata should each experiment track?
  2. Design the logging API that data scientists use from their training code.
  3. Design the visualization and comparison system for analyzing experiments.
  4. Address scalability — how to handle 10M+ data points per day?
  5. Design collaboration features that help teams work together on experiments.
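
As a starting point for requirements 1 and 2, here is a minimal sketch of what a run record and a client-side logging call might look like. It is not a model answer and not the API of any real tracking library: `Run`, `MetricPoint`, `Tracker`, and all field names are hypothetical, and the in-memory list stands in for whatever backend the platform would actually write to. The one design idea it illustrates is buffering metric points on the client and sending them in batches, which is one common way to keep logging cheap inside a training loop.

```python
from __future__ import annotations

import time
import uuid
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class MetricPoint:
    """One logged value: which run, which metric, at which step."""
    run_id: str
    name: str
    step: int
    value: float
    timestamp: float


@dataclass
class Run:
    """Hypothetical experiment record; the fields are illustrative only."""
    name: str
    params: dict[str, object]
    code_version: str = "unknown"      # e.g. a git commit hash
    dataset_version: str = "unknown"   # e.g. a dataset snapshot id
    tags: list[str] = field(default_factory=list)
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)


class Tracker:
    """Buffers metric points and hands them to a sink in batches."""

    def __init__(
        self,
        run: Run,
        sink: Callable[[list[MetricPoint]], None],
        batch_size: int = 100,
    ) -> None:
        self.run = run
        self._sink = sink
        self._batch_size = batch_size
        self._buffer: list[MetricPoint] = []

    def log(self, step: int, **metrics: float) -> None:
        """Record one value per named metric at the given step."""
        now = time.time()
        for name, value in metrics.items():
            self._buffer.append(
                MetricPoint(self.run.run_id, name, step, float(value), now)
            )
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send any buffered points to the sink and clear the buffer."""
        if self._buffer:
            self._sink(self._buffer)
            self._buffer = []


# Usage: an in-memory list stands in for the real storage backend.
stored: list[MetricPoint] = []
run = Run(name="baseline", params={"learning_rate": 3e-4, "batch_size": 64})
tracker = Tracker(run, sink=stored.extend, batch_size=4)

for step in range(3):
    tracker.log(step, loss=1.0 / (step + 1), accuracy=0.5 + 0.1 * step)
tracker.flush()  # make sure the final partial batch is written

print(len(stored), "points stored for run", run.name)
```

A real answer would still need to address what the sketch leaves out, such as artifacts, environment capture, run status, retries when the backend is unavailable, and how the stored points are indexed for comparison queries.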
