The console is one pipeline. Get your data in, verify it, optimize the prompt, then later replace the model. This page is the map, so you always know what the next step is and why. It is not a set of instructions. For those, start with the quickstart.

The pipeline

Every task moves through the same stages, in order. You do not have to finish all of them. Most teams stop after the optimized prompt and ship that. The task hub mirrors this exactly. The one active stage is always your next step, and each stage tells you what unlocks the next.

Two ways to get data in

Every task starts with examples, and there are two doors. Pick the one that fits how you work.

Upload a dataset

Fastest path to a score. You already have a CSV or JSONL of inputs and the outputs you want back. Upload it and optimize in the same step.

Connect the SDK

You have a live app. Wire the runapprentice SDK once and it captures real requests as examples over time. Best for learning from production traffic.
Both feed the same pipeline. Upload is faster to a first result; capture is better for keeping a task learning from what your app actually sees.
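
For the upload path, a dataset is just pairs of inputs and the outputs you want back. The sketch below writes such pairs as JSONL using only the Python standard library. The field names `input` and `output` and the example rows are assumptions for illustration, not a documented schema; check the quickstart for the column names the console actually expects.

```python
import io
import json

# Hypothetical example rows. The "input" and "output" field names are
# assumptions, not a documented Apprentice schema.
rows = [
    {"input": "Invoice 1042, due 2024-03-01, total 310.00",
     "output": {"invoice_id": "1042", "total": 310.0}},
    {"input": "Invoice 1043, due 2024-03-15, total 95.50",
     "output": {"invoice_id": "1043", "total": 95.5}},
]

def to_jsonl(examples):
    """Serialize examples as JSONL: one JSON object per line."""
    buffer = io.StringIO()
    for example in examples:
        buffer.write(json.dumps(example) + "\n")
    return buffer.getvalue()

jsonl_text = to_jsonl(rows)
```

Writing `jsonl_text` to a file gives you something you could upload, provided the field names are adjusted to whatever the console asks for.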

What “verified” means

Every row carries a tier. The tier decides what a row is allowed to do, so a score is never built on data no one checked. Optimization runs on verified rows, meaning gold plus silver. Training runs on gold only. This split is why the two stages unlock at different points. Optimizing improves the prompt, so silver is trusted enough. Training and promoting a model is a quality claim, so it uses human-verified rows alone. For the full reasoning, see data tiers.
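
The tier rule above can be pictured as a simple filter. This is a minimal sketch, not how Apprentice implements it: the tier names `gold` and `silver` come from this page, while the `Row` shape, the `unverified` tier name, and the function name are hypothetical.

```python
from dataclasses import dataclass

# "gold" and "silver" come from the text above. "unverified" and the Row
# shape are hypothetical, used only to make the sketch runnable.
@dataclass
class Row:
    input: str
    output: str
    tier: str

# Which tiers each stage accepts, as described above.
ALLOWED_TIERS = {
    "optimize": {"gold", "silver"},  # verified rows
    "train": {"gold"},               # human-verified rows only
}

def rows_for_stage(rows, stage):
    """Return only the rows whose tier the given stage is allowed to use."""
    allowed = ALLOWED_TIERS[stage]
    return [row for row in rows if row.tier in allowed]

rows = [
    Row("a", "1", "gold"),
    Row("b", "2", "silver"),
    Row("c", "3", "unverified"),
]
optimize_rows = rows_for_stage(rows, "optimize")  # gold and silver rows
train_rows = rows_for_stage(rows, "train")        # gold row only
```

Because training accepts a strict subset of what optimization accepts, a task can have enough rows to optimize before it has enough to train.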
You never get a number Apprentice cannot back up. Scores are measured on rows the optimizer never saw, and a run that does not improve is reported as it is, not hidden. A real result you can trust beats a flattering one you cannot.
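
To make "rows the optimizer never saw" concrete, here is a minimal held-out evaluation sketch. The 20% holdout fraction, the exact-match metric, and every name in it are assumptions for illustration; the page does not say how Apprentice splits or scores rows.

```python
import random

def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle rows deterministically and split into (optimize, holdout)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    holdout_size = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[holdout_size:], shuffled[:holdout_size]

def accuracy(predict, rows):
    """Fraction of rows where predict(input) equals the expected output."""
    correct = sum(1 for inp, expected in rows if predict(inp) == expected)
    return correct / len(rows)

# Toy dataset of (input, expected output) pairs.
rows = [(str(n), str(n * 2)) for n in range(10)]
optimize_rows, holdout_rows = split_holdout(rows)

def predict(text):
    """Stand-in for a prompt plus model; a real one would call an LLM."""
    return str(int(text) * 2)

# The score is computed only on rows kept out of optimization.
score = accuracy(predict, holdout_rows)
```

Keeping the two sets disjoint is what lets the score say something about inputs the prompt was not tuned on.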

Where to go next

JSON extraction, end to end

Run the whole pipeline on a real dataset and pull the optimized prompt back into your code.

Why replacement is eval-gated

The reasoning behind the model swap: the model is switched only when your evals show no quality regression.