Apprentice docs: cut LLM cost without losing quality

Apprentice sorts every row in a task’s dataset by tier. The tier decides what a row is allowed to do: drive optimization, or certify quality. Getting this right is the difference between a real result and a flattering one.

The three tiers that matter

Gold is human-verified. A person confirmed the output is correct for the input. Gold is the only tier trusted to certify quality.
Silver is frontier-model output that passed a deterministic check (for example, valid JSON for a JSON task). It is plausible but not human-confirmed.
Raw is everything else: live captured traffic that no one has verified yet.

A captured trace arrives as raw, and you promote it to gold by verifying it. Separately, you can upload already-curated rows as silver. A row you reject during review is marked rejected and drops out of every count.
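The lifecycle above can be sketched as a small state model. This is an illustration only: Tier, Row, verify, reject, and count_by_tier are hypothetical names chosen for this example, not part of the Apprentice SDK.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    GOLD = "gold"
    SILVER = "silver"
    RAW = "raw"
    REJECTED = "rejected"


@dataclass
class Row:
    """Hypothetical stand-in for one dataset row (not the SDK's type)."""
    input: str
    output: str
    tier: Tier = Tier.RAW  # captured traces arrive as raw


def verify(row: Row) -> Row:
    """A person confirmed the output is correct: promote to gold."""
    row.tier = Tier.GOLD
    return row


def reject(row: Row) -> Row:
    """Rejected during review: the row drops out of every count."""
    row.tier = Tier.REJECTED
    return row


def count_by_tier(rows: list[Row]) -> dict[str, int]:
    """Count gold, silver, and raw rows; rejected rows are skipped."""
    counts = {"gold": 0, "silver": 0, "raw": 0}
    for row in rows:
        if row.tier is not Tier.REJECTED:
            counts[row.tier.value] += 1
    return counts


rows = [
    verify(Row("q1", "a1")),               # captured, then human-verified
    Row("q2", "a2", tier=Tier.SILVER),     # uploaded as already-curated
    Row("q3", "a3"),                       # captured, not yet reviewed
    reject(Row("q4", "a4")),               # reviewed and rejected
]
print(count_by_tier(rows))
```

The rejected row is still in the list, but it contributes to none of the three counts.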

What each tier is allowed to do

Two rules hold across the product, and they do not overlap:

Optimization uses verified rows: gold plus silver. More verified rows give the optimizer more signal, so silver helps here.
Eval gates and model promotion use gold only. Quality is certified against human-verified rows, never against model-generated ones.

Raw never counts toward either. It is a candidate for verification, not evidence.
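As a minimal sketch, the two rules amount to two different filters over the same dataset. The row shape (a dict with a "tier" key) and the function names below are assumptions made for illustration; they are not the SDK's API.

```python
# Hypothetical tier sets that encode the two rules.
OPTIMIZATION_TIERS = {"gold", "silver"}  # verified rows drive optimization
CERTIFYING_TIERS = {"gold"}              # only human-verified rows certify


def rows_for_optimization(rows):
    """Rows the optimizer may learn from: gold plus silver."""
    return [r for r in rows if r["tier"] in OPTIMIZATION_TIERS]


def rows_for_eval_gate(rows):
    """Rows that may certify quality: gold only."""
    return [r for r in rows if r["tier"] in CERTIFYING_TIERS]


dataset = [
    {"id": 1, "tier": "gold"},
    {"id": 2, "tier": "silver"},
    {"id": 3, "tier": "raw"},
    {"id": 4, "tier": "gold"},
]

print([r["id"] for r in rows_for_optimization(dataset)])  # [1, 2, 4]
print([r["id"] for r in rows_for_eval_gate(dataset)])     # [1, 4]
```

The raw row (id 3) appears in neither list, which matches the rule that raw is a candidate for verification and not evidence.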

Why the split matters

If silver could certify quality, you would be grading the model with the model’s own homework: frontier output checking frontier output. That hides regressions instead of catching them. By letting silver help you optimize but never letting it certify, the gate stays honest. A run that does not beat the baseline on gold is a real result, not a tuning artifact.

The DatasetStatus returned by the SDK reports the counts directly: gold, silver, raw, and ready_for_optimization. See the Python SDK reference for the fields.
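To make the gate argument concrete, here is a toy comparison of a candidate against a baseline, scored once on gold only and once on gold plus silver. Everything in it is assumed for illustration: the row shape, the 0/1 correctness fields, the function names, and the reading of "beat the baseline" as a strictly higher score. It is not how Apprentice computes its gate.

```python
def accuracy(rows, key):
    """Fraction of rows where the 0/1 field `key` is 1."""
    if not rows:
        raise ValueError("no rows to score")
    return sum(r[key] for r in rows) / len(rows)


def beats_baseline(rows):
    """Assumed gate rule: candidate must score strictly higher."""
    return accuracy(rows, "candidate") > accuracy(rows, "baseline")


# Toy data. On silver rows, "correct" means "agrees with the frontier
# output", which no person has confirmed.
rows = [
    {"tier": "gold", "baseline": 1, "candidate": 1},
    {"tier": "gold", "baseline": 1, "candidate": 0},  # real regression
    {"tier": "silver", "baseline": 0, "candidate": 1},
    {"tier": "silver", "baseline": 0, "candidate": 1},
]

gold_only = [r for r in rows if r["tier"] == "gold"]

print(beats_baseline(gold_only))  # False: the regression is caught
print(beats_baseline(rows))       # True: silver agreement masks it
```

On gold alone the candidate scores 0.5 against a baseline of 1.0, so the gate fails. Mixing in silver raises the candidate to 0.75 against 0.5, and the same regression would pass unnoticed.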
