The three tiers that matter
- Gold is human-verified. A person confirmed the output is correct for the input. Gold is the only tier trusted to certify quality.
- Silver is frontier-model output that passed a deterministic check (for example, valid JSON for a JSON task). It is plausible but not human-confirmed.
- Raw is everything else: live captured traffic that no one has verified yet.
A row can also be rejected, in which case it drops out of every count.
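As a rough illustration of the tier definitions above, the sketch below assigns a tier to a single row. The `Row` fields, the `source` labels, and the `assign_tier` helper are hypothetical names invented for this example, not part of the SDK, and JSON validity stands in for whatever deterministic check a task actually uses. Rejection is not modeled here.

```python
import json
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical row shape, for illustration only.
    output: str
    source: str                  # "frontier" or "live" (assumed labels)
    human_verified: bool = False


def is_valid_json(text: str) -> bool:
    """Deterministic check for a JSON task: does the output parse?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def assign_tier(row: Row) -> str:
    """Return "gold", "silver", or "raw" for a row (rejection not modeled)."""
    if row.human_verified:
        return "gold"            # a person confirmed the output
    if row.source == "frontier" and is_valid_json(row.output):
        return "silver"          # plausible, but not human-confirmed
    return "raw"                 # everything else
```

Human verification is checked first, so a row that a person has confirmed is gold regardless of where it came from.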
What each tier is allowed to do
Two rules hold across the product, and they do not overlap:
- Optimization uses verified rows: gold plus silver. More verified rows give the optimizer more signal, so silver helps here.
- Eval gates and model promotion use gold only. Quality is certified against human-verified rows, never against model-generated ones.
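The two rules above can be sketched as a pair of filters over tier-labeled rows. This is a minimal sketch, not the product's implementation: the dictionary row shape and the function names are assumptions made for the example.

```python
# Tiers each use case may draw from, per the two rules above.
OPTIMIZATION_TIERS = {"gold", "silver"}   # verified rows
EVAL_TIERS = {"gold"}                     # human-verified rows only


def rows_for_optimization(rows):
    """Rows the optimizer may learn from: gold plus silver."""
    return [r for r in rows if r["tier"] in OPTIMIZATION_TIERS]


def rows_for_eval(rows):
    """Rows that may certify quality: gold only."""
    return [r for r in rows if r["tier"] in EVAL_TIERS]


rows = [
    {"id": 1, "tier": "gold"},
    {"id": 2, "tier": "silver"},
    {"id": 3, "tier": "raw"},
]
optimization_ids = [r["id"] for r in rows_for_optimization(rows)]
eval_ids = [r["id"] for r in rows_for_eval(rows)]
```

Raw rows appear in neither set, and silver rows never reach the eval side.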
Why the split matters
If silver could certify quality, you would be grading the model with the model’s own homework: frontier output checking frontier output. That hides regressions instead of catching them. By letting silver help you optimize but never letting it certify, the gate stays honest. A run that does not beat the baseline on gold is a real result, not a tuning artifact.
The DatasetStatus returned by the SDK reports the counts directly: gold, silver, raw, and ready_for_optimization. See the Python SDK reference for the fields.
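To make the argument for a gold-only gate concrete, here is a hedged sketch of a promotion check. The result shape, the use of a mean score, and the strict "must beat the baseline" comparison are assumptions for illustration; the real gate may score and compare differently.

```python
from statistics import mean


def gold_score(results):
    """Mean score over gold rows only; silver and raw scores are ignored."""
    gold = [r["score"] for r in results if r["tier"] == "gold"]
    if not gold:
        raise ValueError("no gold rows: quality cannot be certified")
    return mean(gold)


def passes_gate(candidate, baseline):
    """A candidate is promoted only if it beats the baseline on gold."""
    return gold_score(candidate) > gold_score(baseline)


baseline = [
    {"tier": "gold", "score": 0.8},
    {"tier": "silver", "score": 0.5},
]
candidate = [
    {"tier": "gold", "score": 0.7},    # regressed on human-verified rows
    {"tier": "silver", "score": 1.0},  # looks better on model-generated rows
]
promoted = passes_gate(candidate, baseline)
```

In this example the candidate improves on silver but regresses on gold, so it is not promoted. A gate that averaged in silver would have hidden that regression.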