Apprentice docs: cut LLM cost without losing quality

This page explains a capability that is coming soon. Today Apprentice ships the prompt-optimization SDK. The shadow router, canary rollout, and automatic takeover described here are in development. The design is settled; the public flow is not live yet.

Replacing a frontier model with a smaller one is where most cost-cutting attempts quietly lose quality. Apprentice’s answer is to never let the smaller model take traffic on faith. Every shift is gated by your own evaluation, and the frontier model is always one step behind, ready to take back over.

The problem with a naive swap

A 4B model is cheap to serve, but a blind swap trades a known-good model for an unknown one. Quality regressions on real traffic are easy to miss until a customer finds them. The cost saving is real; the risk is that you cannot see the regression until it has already shipped.

The gate

The smaller model only earns traffic by passing a gate measured on gold rows only, never on model-generated data. This is the same honesty rule as the data tiers: the model is never graded against another model’s output. If it does not clear your threshold on human-verified rows, your frontier model keeps serving and nothing changes for your users. You set the thresholds. They are yours to configure per task, because what “good enough” means is your call, not ours.

Why a canary, not a switch

Even a model that passes the gate goes live gradually, not all at once. A small share of live traffic moves to the smaller model first, measured against the frontier model it is replacing. Only when it holds up does its share grow. This is the standard safe-rollout pattern: contain the blast radius of any surprise to a fraction of requests.

Why fallbacks count against the saving

When the smaller model is unsure and the request falls back to the frontier model, that fallback is counted as a failed replacement in the savings math. Hiding fallbacks would inflate the reported saving and quietly reintroduce the cost you were trying to cut. Counting them keeps the number honest: the saving you see is the saving you got.

Why rollback is always one click

The frontier model stays on hot standby for the life of the takeover. If anything looks wrong, traffic reverts in under a second, with an audit-log entry on every promotion and rollback. The takeover is reversible by design, so adopting it is never a one-way door.

The bigger picture

The throughline is that cost reduction must not cost you quality, and the only way to prove that is to measure against verified truth before and during every shift. Prompt optimization, which ships today, is the first half of this: it improves the prompt against the same verified dataset. Model replacement is the second half, and it reuses the same gate.

Data tiers: gold, silver, raw Python SDK reference

⌘I

​The problem with a naive swap

​The gate

​Why a canary, not a switch

​Why fallbacks count against the saving

​Why rollback is always one click

​The bigger picture

The problem with a naive swap

The gate

Why a canary, not a switch

Why fallbacks count against the saving

Why rollback is always one click

The bigger picture