ML System Design · Medium
Design a Multi-Tenant ML Inference Platform
Asked at Microsoft, Databricks, SageMaker
Your company is building an internal ML inference platform that will serve 200+ models across 30+ product teams. Each team has different requirements for latency, throughput, model framework (PyTorch, TensorFlow, ONNX), and GPU needs. You need to design a multi-tenant platform that provides isolation, fair resource allocation, and operational simplicity.
Constraints
- 200+ models in production, growing 20% quarterly
- Workloads range from sub-10ms real-time inference to batch inference on TB-scale datasets
- GPU clusters are expensive, so utilization must exceed 70%
- Teams must not interfere with each other (noisy neighbor problem)
- Platform must support canary deployments, A/B testing, and automatic rollback
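To get a feel for the growth constraint above, here is a rough sketch that compounds 20% quarterly growth from a starting point of 200 models. The function name `project_model_count` is illustrative, and it assumes growth stays constant and that "200+" is treated as exactly 200.

```python
def project_model_count(initial: int, quarterly_growth: float, quarters: int) -> float:
    """Compound the model count over a number of quarters."""
    return initial * (1 + quarterly_growth) ** quarters


# Assumption: exactly 200 models today and a steady 20% growth per quarter.
for quarters in (0, 4, 8):
    count = round(project_model_count(200, 0.20, quarters))
    print(f"after {quarters} quarters: ~{count} models")
# after 0 quarters: ~200 models
# after 4 quarters: ~415 models
# after 8 quarters: ~860 models
```

Under these assumptions the fleet roughly doubles every year, so any per-model manual step in onboarding or capacity planning is likely to become a bottleneck quickly.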
Design Requirements
- Design the tenant isolation model — how do you separate teams' workloads?
- Design the model deployment pipeline from model artifact to serving endpoint.
- Explain your auto-scaling strategy for heterogeneous inference workloads.
- Design the API surface that client teams use to deploy and query models.
- Describe how you would implement monitoring and cost attribution per tenant.
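As one possible starting point for the API-surface requirement, the sketch below models a deployment request as a plain data structure with basic validation. Every name here (`DeploymentRequest`, `validate`, the field names, the default values) is hypothetical and chosen for illustration; only the framework list and the need for canary traffic splits come from the problem statement.

```python
from dataclasses import dataclass
from typing import List

# Frameworks named in the problem statement.
SUPPORTED_FRAMEWORKS = {"pytorch", "tensorflow", "onnx"}


@dataclass(frozen=True)
class DeploymentRequest:
    """Hypothetical payload a tenant team submits to deploy a model."""
    tenant: str
    model_name: str
    framework: str
    artifact_uri: str
    gpu_count: int = 0             # 0 means CPU-only serving
    latency_slo_ms: float = 100.0  # assumed default, not from the source
    canary_percent: int = 0        # share of traffic sent to the new version


def validate(req: DeploymentRequest) -> List[str]:
    """Return a list of human-readable problems; empty means the request is acceptable."""
    errors = []
    if req.framework not in SUPPORTED_FRAMEWORKS:
        errors.append(f"unsupported framework: {req.framework}")
    if req.gpu_count < 0:
        errors.append("gpu_count must be non-negative")
    if req.latency_slo_ms <= 0:
        errors.append("latency_slo_ms must be positive")
    if not 0 <= req.canary_percent <= 100:
        errors.append("canary_percent must be between 0 and 100")
    return errors


ok = DeploymentRequest("search", "ranker", "onnx", "s3://example-bucket/ranker/v2",
                       gpu_count=1, latency_slo_ms=10.0, canary_percent=5)
bad = DeploymentRequest("ads", "ctr", "jax", "s3://example-bucket/ctr/v1",
                        canary_percent=150)
print(validate(ok))   # []
print(validate(bad))  # two errors: framework and canary_percent
```

Keeping the tenant name and resource ask in the request itself is one way to give the platform a single place to hang quota checks and cost attribution; a real design would also need authentication, versioning, and rollback policy, which this sketch leaves out.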