MLOps · ML System Design · Easy
Design a Data Validation Pipeline for ML
Asked at Google, Airbnb, Uber
Your ML models keep breaking in production due to data quality issues: a new data provider changes the schema of an API response, a logging bug introduces null values in a critical feature, and a seasonal shift changes the distribution of user behavior data. By the time anyone notices, the model has been making bad predictions for hours. Design a data validation system that catches data quality issues before they corrupt model predictions.
Scale Requirements
- Validate 100+ data pipelines feeding ML models
- Data volumes: 1 GB to 1 TB per day per pipeline
- Detect schema violations and data drift within 5 minutes of data arrival
- Support batch data (daily/hourly) and streaming data (real-time features)
Design Requirements
- Define the types of data validation checks your system should perform.
- Design the validation pipeline: when and how are checks executed?
- Explain how to detect drift between training data and production data.
- Design the alerting and incident response workflow for data quality issues.
- Discuss how validation rules are created, maintained, and evolved over time.