Tech Abstractions
ML System Design · Medium

Design a Multi-Tenant ML Inference Platform

Asked at Microsoft, Databricks, SageMaker

Your company is building an internal ML inference platform that will serve 200+ models across 30+ product teams. Each team has different requirements for latency, throughput, model framework (PyTorch, TensorFlow, ONNX), and GPU needs. You need to design a multi-tenant platform that provides isolation, fair resource allocation, and operational simplicity.

Constraints

  • 200+ models in production, growing 20% quarterly
  • Workloads range from sub-10ms real-time inference to batch inference on TB-scale datasets
  • GPU clusters are expensive — utilization must exceed 70%
  • Teams must not interfere with each other (noisy neighbor problem)
  • Platform must support canary deployments, A/B testing, and automatic rollback
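As a rough sizing aid for the constraints above, note that 20% quarterly growth compounds: 200 models become about 415 after four quarters (200 × 1.2^4 = 414.72). The sketch below is illustrative only. The function names and the sample GPU-hour figures are assumptions and are not part of the question.

```python
def projected_models(initial: int, quarterly_growth: float, quarters: int) -> float:
    """Compound growth: initial * (1 + g) ** quarters."""
    return initial * (1.0 + quarterly_growth) ** quarters


def gpu_utilization(busy_gpu_hours: float, provisioned_gpu_hours: float) -> float:
    """Fraction of provisioned GPU time that actually served work."""
    if provisioned_gpu_hours <= 0:
        raise ValueError("provisioned_gpu_hours must be positive")
    return busy_gpu_hours / provisioned_gpu_hours


if __name__ == "__main__":
    # Model count today, after one year, and after two years.
    for quarters in (0, 4, 8):
        print(quarters, round(projected_models(200, 0.20, quarters)))

    # Hypothetical figures: 7,500 busy GPU-hours out of 10,000 provisioned.
    print(gpu_utilization(7_500, 10_000) > 0.70)
```

A design that only fits 200 models will need to handle roughly twice as many within a year, so the growth figure is worth carrying into the scaling and cost discussions.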

Design Requirements

  1. Design the tenant isolation model — how do you separate teams' workloads?
  2. Design the model deployment pipeline from model artifact to serving endpoint.
  3. Explain your auto-scaling strategy for heterogeneous inference workloads.
  4. Design the API surface that client teams use to deploy and query models.
  5. Describe how you would implement monitoring and cost attribution per tenant.
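To make the scope of requirements 2 and 4 more concrete, here is one possible shape for a deployment request, with a canary traffic split. This is a minimal sketch under stated assumptions, not a reference answer: `DeploymentRequest`, `Framework`, `split_traffic`, and every field name are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Framework(Enum):
    # The three frameworks named in the problem statement.
    PYTORCH = "pytorch"
    TENSORFLOW = "tensorflow"
    ONNX = "onnx"


@dataclass(frozen=True)
class DeploymentRequest:
    """Hypothetical payload a tenant submits to deploy a model."""
    tenant: str
    model_name: str
    framework: Framework
    gpu_count: int
    latency_slo_ms: float
    canary_percent: int = 0  # share of traffic sent to the new version

    def __post_init__(self) -> None:
        if self.gpu_count < 0:
            raise ValueError("gpu_count must be non-negative")
        if self.latency_slo_ms <= 0:
            raise ValueError("latency_slo_ms must be positive")
        if not 0 <= self.canary_percent <= 100:
            raise ValueError("canary_percent must be between 0 and 100")


def split_traffic(request: DeploymentRequest, total_rps: float) -> tuple[float, float]:
    """Return (stable_rps, canary_rps) for the requested canary share."""
    canary_rps = total_rps * request.canary_percent / 100
    return total_rps - canary_rps, canary_rps


if __name__ == "__main__":
    req = DeploymentRequest(
        tenant="search-team",  # hypothetical tenant name
        model_name="ranker",
        framework=Framework.ONNX,
        gpu_count=1,
        latency_slo_ms=10.0,
        canary_percent=5,
    )
    print(split_traffic(req, 1000.0))
```

Validating the request up front keeps malformed specs out of the deployment pipeline. A fuller answer would also need to cover authentication, quotas, and rollback triggers.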
