Scaling an AI gateway to 14B requests/month
When we crossed ten billion monthly requests, the hard problems stopped being about throughput and started being about consistency. A gateway that's fast on average but unpredictable at the tail is worse than a slightly slower one that never surprises you. Here's how we keep p99 flat while routing across a dozen upstream providers.
1. Treat routing as a budget, not a switch
Every request carries an implicit latency and cost budget. Instead of hard-coding a provider, the gateway scores candidates in real time on health, price and recent latency, then spends the budget on the best fit. The special auto model exposes this directly to customers.
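To make the budget idea concrete, here is a minimal sketch of budget-aware scoring. Every name in it (Candidate, Budget, score, pick), the equal 50/50 weighting, and the use of median latency are assumptions for illustration, not the gateway's actual scoring function.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    health: float          # 0.0 (down) to 1.0 (fully healthy)
    price_per_1k: float    # cost per 1k tokens, arbitrary currency unit
    p50_latency_ms: float  # recent median latency


@dataclass(frozen=True)
class Budget:
    max_latency_ms: float    # must be > 0
    max_price_per_1k: float  # must be > 0


def score(c: Candidate, b: Budget) -> Optional[float]:
    """Return a score in [0, 1], or None if the candidate exceeds the budget."""
    if c.health <= 0.0:
        return None
    if c.p50_latency_ms > b.max_latency_ms or c.price_per_1k > b.max_price_per_1k:
        return None
    # Headroom is the fraction of each budget left unspent.
    latency_headroom = 1.0 - c.p50_latency_ms / b.max_latency_ms
    price_headroom = 1.0 - c.price_per_1k / b.max_price_per_1k
    # Hypothetical weighting: equal weight on latency and price, scaled by health.
    return c.health * (0.5 * latency_headroom + 0.5 * price_headroom)


def pick(candidates: Sequence[Candidate], b: Budget) -> Optional[Candidate]:
    """Pick the highest-scoring candidate that fits the budget, if any."""
    scored = [(s, c) for c in candidates if (s := score(c, b)) is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

The point of the shape: a candidate that exceeds the budget is excluded outright rather than merely penalized, so a cheap but slow provider can never win a latency-sensitive request.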
2. Fail over before the client notices
Upstreams degrade far more often than they hard-fail: they slow down or stall mid-stream instead of returning an error. We watch streaming token cadence, and when the first request stalls past a percentile threshold we fire a hedged second request and abort the slow one, so a slow provider never becomes a slow response.
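A hedged request can be sketched in a few lines of asyncio. This version is a simplification: it hedges on a fixed delay for the whole response, not on per-token cadence, and the names hedged, primary, backup and hedge_after_s are hypothetical. In practice the delay would come from a latency percentile.

```python
import asyncio
from typing import Awaitable, Callable


async def hedged(
    primary: Callable[[], Awaitable[str]],
    backup: Callable[[], Awaitable[str]],
    hedge_after_s: float,
) -> str:
    """Run primary; if it has not finished after hedge_after_s, race a backup."""
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()

    # Primary is slow: start the hedge and take whichever finishes first.
    second = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    # Abort the loser and let its cancellation complete.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    winner = first if first in done else second
    return winner.result()
```

This sketch returns the first task to finish even if it finished with an exception; a production version would likely fall back to the other task in that case.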
3. Cache what's safe, never what isn't
Semantic caching is powerful and dangerous: a wrong hit can serve one tenant's response to another. We scope cache keys by tenant, model and a content hash, and we let customers mark routes as never-cache. The result: double-digit percentage cost savings with zero cross-tenant bleed.
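The key scoping can be illustrated with a small in-memory sketch. It uses an exact content hash, not semantic similarity, and the names cache_key and TenantCache, the route strings, and the dict-backed store are all assumptions made for the example.

```python
import hashlib
import json
from typing import Dict, FrozenSet, Iterable, Optional


def cache_key(tenant_id: str, model: str, prompt: str) -> str:
    """Build a key scoped by tenant, model and a hash of the content."""
    content_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    # JSON-encode the fields so a delimiter inside one field cannot
    # make two different (tenant, model) pairs collide.
    raw = json.dumps([tenant_id, model, content_hash])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class TenantCache:
    def __init__(self, never_cache_routes: Iterable[str] = ()) -> None:
        self._store: Dict[str, str] = {}
        self._never: FrozenSet[str] = frozenset(never_cache_routes)

    def get(self, route: str, tenant_id: str, model: str, prompt: str) -> Optional[str]:
        if route in self._never:
            return None  # never-cache routes always miss
        return self._store.get(cache_key(tenant_id, model, prompt))

    def put(self, route: str, tenant_id: str, model: str, prompt: str, response: str) -> None:
        if route in self._never:
            return  # never-cache routes are never written
        self._store[cache_key(tenant_id, model, prompt)] = response
```

Because the tenant ID is part of the key itself, an identical prompt from a different tenant is a miss by construction; isolation does not depend on a separate access check.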
4. Make every hop observable
You can't tune what you can't see. Each request emits a structured trace — auth, route decision, cache status, upstream timing — that powers both our dashboards and customer live-tail. Observability isn't a bolt-on; it's the substrate everything else stands on.
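As a rough illustration, a per-request trace might look like the record below. The field names, the RequestTrace class and the upstream_timer helper are hypothetical; the real trace schema is not described here.

```python
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Iterator


@dataclass
class RequestTrace:
    tenant_id: str
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    auth_ok: bool = False
    route_decision: str = ""     # which upstream the router chose
    cache_status: str = "miss"   # e.g. "hit", "miss", "bypass"
    upstream_ms: float = 0.0     # wall-clock time spent in the upstream call

    def to_json(self) -> str:
        """Serialize as one JSON line, suitable for a log pipeline."""
        return json.dumps(asdict(self), sort_keys=True)


@contextmanager
def upstream_timer(trace: RequestTrace) -> Iterator[None]:
    """Record upstream duration on the trace, even if the call raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.upstream_ms = (time.perf_counter() - start) * 1000.0
```

Usage is a matter of wrapping the upstream call in `with upstream_timer(trace):` and emitting `trace.to_json()` once the request completes, so dashboards and live-tail read the same record.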
The throughline is boring on purpose: predictability beats peak performance. If you're building on top of multiple AI providers and want this for free, the quickstart takes about five minutes.