MLOps·ML System Design·Easy

Design a Data Validation Pipeline for ML

Asked at Google, Airbnb, Uber

Your ML models keep breaking in production due to data quality issues: a new data provider changes the schema of an API response, a logging bug introduces null values in a critical feature, and a seasonal shift changes the distribution of user behavior data. By the time anyone notices, the model has been making bad predictions for hours. Design a data validation system that catches data quality issues before they corrupt model predictions.
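As an illustration of the first two failure modes in this scenario (a changed schema and unexpected nulls), a batch-level check might look like the sketch below. The field names in `EXPECTED_SCHEMA`, the `validate_batch` helper, and the 1% null-rate threshold are hypothetical choices made for this example, not part of the question.

```python
# Hypothetical expected schema: field name -> Python type.
EXPECTED_SCHEMA = {"user_id": int, "country": str, "session_length": float}


def validate_batch(records, schema, max_null_rate=0.01):
    """Return a list of human-readable violations for a batch of dict records."""
    violations = []
    null_counts = {field: 0 for field in schema}

    for i, rec in enumerate(records):
        # Schema check: fields that disappeared or appeared unexpectedly.
        missing = schema.keys() - rec.keys()
        extra = rec.keys() - schema.keys()
        if missing:
            violations.append(f"record {i}: missing fields {sorted(missing)}")
        if extra:
            violations.append(f"record {i}: unexpected fields {sorted(extra)}")

        # Type check, counting nulls separately so they can be rate-limited.
        for field, expected_type in schema.items():
            if field not in rec:
                continue
            value = rec[field]
            if value is None:
                null_counts[field] += 1
            elif not isinstance(value, expected_type):
                violations.append(
                    f"record {i}: {field} expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )

    # Null-rate check: tolerate a small fraction of nulls per field.
    for field, count in null_counts.items():
        rate = count / len(records) if records else 0.0
        if rate > max_null_rate:
            violations.append(
                f"{field}: null rate {rate:.1%} exceeds {max_null_rate:.1%}"
            )
    return violations
```

A non-empty result would typically block the batch or raise an alert before the data reaches the model; which of the two is appropriate depends on how critical the affected feature is.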

Scale Requirements

  • Validate 100+ data pipelines feeding ML models
  • Data volumes: 1GB to 1TB/day per pipeline
  • Detect schema violations and data drift within 5 minutes of data arrival
  • Support batch data (daily/hourly) and streaming data (real-time features)

Design Requirements

  1. Define the types of data validation checks your system should perform.
  2. Design the validation pipeline — when and how are checks executed?
  3. Explain how to detect drift between training data and production data.
  4. Design the alerting and incident response workflow for data quality issues.
  5. Discuss how validation rules are created, maintained, and evolved over time.
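
For requirement 3, one common way to compare training and production data is the Population Stability Index (PSI), computed per feature over a shared set of bins. The following is a minimal sketch, not a prescribed answer: the `psi` function name, the equal-width binning derived from the training range, and the `eps` floor for empty bins are all assumptions of this example.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty numeric samples.

    `expected` is the reference (training) sample and `actual` is the
    production sample. Bin edges come from the reference range; production
    values outside that range are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clip out-of-range values
            counts[idx] += 1
        # Floor each fraction at eps so the log term below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A commonly cited rule of thumb treats a PSI below 0.1 as stable, 0.1 to 0.25 as worth investigating, and above 0.25 as significant drift. These cutoffs are conventions rather than guarantees, so they would need tuning per feature. For streaming features, the same calculation could be run over a sliding window of recent values against a stored training histogram.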
