Testing and Deployment

Beyond the replica-level benchmarking you did while configuring the engine, a production system must be tested end-to-end before it is deployed, and then deployed in a way that can't take down live traffic.

Testing strategies

Manual testing — scripts (or button clicks) sending synthetic traffic to the service.
Load testing — automatically sending a large volume to test scaling and sustained performance.
Shadow traffic — copying live traffic to a deployment to measure performance under real-world conditions, without affecting users.

Testing inference is expensive — engineering time to build/measure it, plus GPUs to serve the test traffic. Minimize the cost: e.g. start shadow testing with a random sample of production traffic, followed by a shorter load test. And remember AI usage fluctuates on daily and weekly cycles — a test at 3 a.m. Sunday tells you little about Monday's peak.
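
One way to take a sample of production traffic for shadow testing is to hash each request ID and mirror only the requests that fall under a threshold. The sketch below is illustrative: `in_shadow_sample`, the `request_id` field, and the choice of SHA-256 are assumptions, not part of any particular serving stack. Hashing is pseudo-random rather than truly random, but it makes the decision repeatable for a given request.

```python
import hashlib


def in_shadow_sample(request_id: str, sample_rate: float) -> bool:
    """Return True for roughly `sample_rate` of request IDs (hypothetical helper)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare in integer space so a rate of 1.0 always matches and 0.0 never does.
    return bucket < int(sample_rate * 2**64)
```

Starting with a low rate and raising it keeps the GPU cost of the shadow deployment proportional to the fraction of traffic mirrored.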

7.4.1 Zero-downtime deployment

The traditional high-availability deploy is blue-green: two identical environments — the live blue and the updated green. When green is ready, all traffic cuts over at once; blue stays ready for rollback.

Blue-green doesn't fit large-scale inference

It doubles your GPUs. If blue runs 100 GPUs, green needs another 100 before you can cut over — the same capacity/cost problem that makes inference testing hard. For GPU-heavy services, blue-green is usually unaffordable.

The GPU-frugal alternative is a canary deployment (named for the canaries that warned coal miners) — catch errors on a small slice of traffic before they hit everyone:

            ┌──► OLD DEPLOYMENT  (most traffic)
 TRAFFIC ───┤
            └──► NEW DEPLOYMENT  (small, growing %)   ← watch it here

Build the new deployment and get it ready for requests.
Route a small percentage of live traffic to it.
Monitor — is it handling traffic correctly? Revert if not.
Gradually raise its share, monitoring, until it serves 100%.
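
The four steps above can be sketched as a small control loop. This is a minimal illustration, not a real router integration: `set_canary_share` and `is_healthy` are hypothetical callbacks you would wire to your own traffic router and monitoring, and the step percentages are arbitrary. In practice `is_healthy` would observe the canary for some window before returning.

```python
from typing import Callable, Sequence


def run_canary(
    set_canary_share: Callable[[float], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[float] = (0.01, 0.05, 0.25, 0.50, 1.00),
) -> bool:
    """Raise the canary's traffic share step by step; revert on the first failed check."""
    for share in steps:
        set_canary_share(share)      # route this fraction of live traffic to the canary
        if not is_healthy():         # monitor before going any further
            set_canary_share(0.0)    # revert: all traffic back to the old deployment
            return False
    return True                      # the new deployment now serves 100%
```

Passing a shorter or longer `steps` sequence is how the same loop covers both a ramp in minutes and a slow, cautious rollout.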

A canary can ramp in minutes or roll out slowly for stability. Unlike blue-green, it adds little cost at scale: as traffic shifts off the old deployment, autoscaling scales it down, so you're not paying for two full fleets.

Keep the canary warm

With autoscaling, a brand-new deployment starts at minimum replicas. During the ramp, make sure the new deployment always has enough active replicas for the traffic you're sending it — otherwise its requests queue behind cold starts and users see a latency spike. Ramp the replicas alongside the traffic.
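
A rough way to size the canary at each ramp step is to derive a replica floor from the traffic share you are about to send it. The sketch below assumes you already know the sustainable requests per second of one replica from benchmarking; the function name, the `headroom` parameter, and its default are illustrative assumptions.

```python
import math


def replicas_needed(
    total_rps: float,
    canary_share: float,
    rps_per_replica: float,
    headroom: float = 0.2,
) -> int:
    """Minimum warm replicas for the canary at a given traffic share (hypothetical helper)."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    canary_rps = total_rps * canary_share
    # Pad the estimate so a small burst does not land on a cold start,
    # and never go below one replica while the canary receives traffic.
    return max(1, math.ceil(canary_rps * (1 + headroom) / rps_per_replica))
```

Setting the new deployment's minimum replica count to this value before each share increase keeps replicas ramping alongside the traffic.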

7.4.2 Cost estimation

Moving from a public API to dedicated GPUs changes how you think about cost — the whole point is to escape per-token pricing and own your unit economics, but it makes estimation harder.

Public API cost is simple — a linear function of usage:

# per-token API
total_input_tokens  = 1000   # millions
total_output_tokens = 500    # millions
price_per_million_in  = 1.25
price_per_million_out = 10

input_cost  = total_input_tokens  * price_per_million_in    # 1250
output_cost = total_output_tokens * price_per_million_out   # 5000
total_cost  = input_cost + output_cost                      # $6,250

Dedicated cost is a function of many variables — batch sizing (latency vs throughput tuning), traffic patterns (are GPUs saturated or idle?), and sequence lengths (input/output tokens, average and outlier). Rather than reverse-engineer a per-token price from your GPU bill, convert the other way — turn your token usage into a total and compare:

# dedicated deployment
total_gpu_hours    = 1600
price_per_gpu_hour = 3.50
total_cost = total_gpu_hours * price_per_gpu_hour           # $5,600

Here dedicated ($5,600) beats the API ($6,250), but only at this usage level and utilization. Below some volume the API wins; the crossover is the "are we ready for dedicated infra?" decision.
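
To locate the crossover, one simplifying assumption is that the dedicated fleet is fixed (the GPU hours do not shrink with usage) while API cost scales linearly with tokens. Under that assumption, the break-even point is the fraction of current usage at which the two bills are equal. The function names below are illustrative; the prices and volumes are the ones from the examples above.

```python
def api_cost(input_millions: float, output_millions: float,
             price_in: float = 1.25, price_out: float = 10.0) -> float:
    """Per-token API bill in dollars; token counts are in millions."""
    return input_millions * price_in + output_millions * price_out


def dedicated_cost(gpu_hours: float, price_per_gpu_hour: float = 3.50) -> float:
    """Dedicated bill in dollars for a fixed number of GPU hours."""
    return gpu_hours * price_per_gpu_hour


def breakeven_usage_fraction(input_millions: float, output_millions: float,
                             gpu_hours: float) -> float:
    """Fraction of current usage at which the API bill equals the fixed GPU bill."""
    return dedicated_cost(gpu_hours) / api_cost(input_millions, output_millions)


fraction = breakeven_usage_fraction(1000, 500, 1600)
print(f"API is cheaper below {fraction:.1%} of current usage")
```

If the fleet can scale down with traffic, the real crossover sits lower than this estimate, so treat the result as an upper bound on the break-even volume.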

Use a long horizon, and count engineering time (TCO)

Estimate over at least a week to smooth daily/weekly cycles — a single day misleads. And the GPU bill isn't the whole story: the engineering time to build and maintain the inference system is a real cost. Add it to the GPU spend for true total cost of ownership (TCO). Dedicated inference buys reliability, security, and control — but those have a payroll line, not just a hardware line.
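
As a sketch of that calculation, the function below sums GPU spend over at least seven days and adds engineering time on top. All parameter names are hypothetical, and the engineering hours and hourly rate are inputs you would have to supply from your own payroll data.

```python
from typing import Sequence


def weekly_tco(
    daily_gpu_hours: Sequence[float],
    price_per_gpu_hour: float,
    engineering_hours: float,
    engineering_hourly_rate: float,
) -> float:
    """GPU spend plus engineering spend over the measured horizon, in dollars."""
    if len(daily_gpu_hours) < 7:
        raise ValueError("use at least 7 days to cover the weekly cycle")
    gpu_spend = sum(daily_gpu_hours) * price_per_gpu_hour
    engineering_spend = engineering_hours * engineering_hourly_rate
    return gpu_spend + engineering_spend
```

Comparing this total, rather than the GPU bill alone, against the API bill for the same period gives a fairer picture of whether dedicated infrastructure pays off.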

7.4.3 Observability

Inference is mission-critical, so monitor it like any mission-critical component — alerting, logs, and observability at the right level of abstraction.

What to measure:

Metric	What it tells you
Total volume	requests a deployment is receiving
Request/response sizes	input and output sequence lengths
Response codes	counts of 2XX / 4XX / 5XX from the model server
Latency	TTFT, TPS, end-to-end — at p50, p90, p99
Replica count	instances serving + instances starting up
Utilization	CPU, host memory, GPU, GPU memory
Queue depth	requests enqueued and waiting (for async traffic)
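
For the latency row, percentiles matter more than averages because a few slow requests dominate the tail. The sketch below computes p50, p90, and p99 from a list of raw latency samples using the nearest-rank method; that method is a choice made here for simplicity, and monitoring systems often use interpolation or histogram buckets instead.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize latency samples at the three percentiles named in the table."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
```

The same summary applies to TTFT and end-to-end latency alike; only the input samples change.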

Metrics are only useful together

They're interdependent — a latency spike could be request volume, or it could be a few long-input-sequence requests. Seeing the metrics side by side is what turns "what is happening" into "why." A p99 latency alert next to a flat volume graph but a spiking input-size graph tells the whole story at a glance.
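
The same side-by-side reading can be expressed as a toy heuristic. The function below is only an illustration of the reasoning, not a diagnostic tool: the name, the ratio inputs (current value divided by baseline over the alert window), and the 1.5x threshold are all assumptions.

```python
def likely_cause(volume_ratio: float, input_size_ratio: float,
                 threshold: float = 1.5) -> str:
    """Guess what is driving a latency spike from two current/baseline ratios."""
    volume_up = volume_ratio >= threshold
    inputs_up = input_size_ratio >= threshold
    if volume_up and inputs_up:
        return "volume and input size"
    if volume_up:
        return "request volume"
    if inputs_up:
        return "input size"
    return "unknown"  # neither metric moved; look at other metrics
```

A flat volume graph with tripled input sizes, for example `likely_cause(1.0, 3.0)`, points at long input sequences rather than load.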

When things break, you need logs — both server logs and audit logs (who changed the inference service, and when) — delivered in real time.

Don't silo inference observability

Build it with deep integration into your existing tooling — Grafana, Datadog, PagerDuty, Sentry — so inference metrics sit in context next to the rest of the application. An inference dashboard nobody looks at because it's in a separate system is worse than no dashboard.
