Design a Multi-Tenant ML Inference Platform — Practice

Your company is building an internal ML inference platform that will serve 200+ models across 30+ product teams. Each team has different requirements for latency, throughput, model framework (PyTorch, TensorFlow, ONNX), and GPU resources. You need to design a multi-tenant platform that provides isolation, fair resource allocation, and operational simplicity.

Constraints

200+ models in production, growing 20% quarterly
Workloads range from sub-10ms real-time inference to batch inference on TB-scale datasets
GPU clusters are expensive — utilization must exceed 70%
Teams must not interfere with each other (noisy neighbor problem)
Platform must support canary deployments, A/B testing, and automatic rollback
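
To get a feel for the growth constraint, the compounding can be worked out directly. The sketch below is only an illustration: it assumes growth compounds once per quarter at a constant 20% from a baseline of 200 models, and the function name project_model_count is hypothetical, not part of any platform API.

```python
import math


def project_model_count(baseline: int, quarterly_growth: float, quarters: int) -> int:
    """Project the model count after compounding growth, rounded up."""
    # Assumption: growth compounds once per quarter at a constant rate.
    return math.ceil(baseline * (1 + quarterly_growth) ** quarters)


if __name__ == "__main__":
    # Print the projected count after one year and after two years.
    for quarters in (4, 8):
        print(quarters, project_model_count(200, 0.20, quarters))
```

Under these assumptions the platform would serve about 415 models after one year and about 860 after two, so the design should keep per-model operational overhead low.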

Design Requirements

Design the tenant isolation model — how do you separate teams' workloads?
Design the model deployment pipeline from model artifact to serving endpoint.
Explain your auto-scaling strategy for heterogeneous inference workloads.
Design the API surface that client teams use to deploy and query models.
Describe how you would implement monitoring and cost attribution per tenant.
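
As a starting point for the cost attribution requirement, one possible approach is to split the cluster bill in proportion to each tenant's GPU time. The sketch below assumes usage is already metered as GPU-seconds per tenant; UsageRecord and attribute_cost are hypothetical names for illustration, and a real design may also need to weight by GPU type or handle reserved capacity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    tenant: str         # hypothetical tenant (team) identifier
    gpu_seconds: float  # GPU time consumed by this tenant's workloads


def attribute_cost(records: list[UsageRecord], cluster_cost: float) -> dict[str, float]:
    """Split cluster_cost across tenants in proportion to GPU-seconds used."""
    usage: dict[str, float] = defaultdict(float)
    for record in records:
        usage[record.tenant] += record.gpu_seconds

    total = sum(usage.values())
    if total == 0:
        # No metered usage in this period: nothing to attribute.
        return {tenant: 0.0 for tenant in usage}

    # Assumption: idle capacity is charged to tenants in the same proportion
    # as their active usage, so the shares always sum to cluster_cost.
    return {tenant: cluster_cost * used / total for tenant, used in usage.items()}


if __name__ == "__main__":
    records = [
        UsageRecord("search", 200.0),
        UsageRecord("ads", 100.0),
        UsageRecord("search", 100.0),
    ]
    print(attribute_cost(records, cluster_cost=1000.0))
```

A proportional split is simple to explain to teams, but it passes the cost of idle GPUs on to active tenants. An alternative worth discussing is charging for reserved capacity rather than consumed capacity.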