Apprentice does two things, in order:
  1. Optimize the prompt. Give it a dataset of inputs and correct outputs for one task. It runs prompt optimization (DSPy GEPA) and reports the score change on held-out rows.
  2. Replace the model. Once you have enough verified data, train a small model to take over from the frontier model. The switch is gated on evals, with instant rollback. This second feature is still being built; pages that describe it are marked Building.
Today you start with the first feature, prompt optimization.
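To make "score change on held-out rows" concrete, here is a minimal sketch in plain Python. It does not use the Apprentice SDK or DSPy. The names `split_rows`, `exact_match_score`, and `score_change` are hypothetical, and exact match is only an assumed metric. The two prompts are stood in for by ordinary callables.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle rows deterministically and split them into (train, held_out)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def exact_match_score(predict, rows):
    """Percentage of (input, expected) rows where predict(input) == expected."""
    if not rows:
        return 0.0
    hits = sum(1 for inp, expected in rows if predict(inp) == expected)
    return 100.0 * hits / len(rows)


def score_change(baseline, optimized, held_out):
    """Score both predictors on the same held-out rows and return the change."""
    before = exact_match_score(baseline, held_out)
    after = exact_match_score(optimized, held_out)
    return before, after, after - before


if __name__ == "__main__":
    # Toy task (an assumption for illustration): the correct output doubles the input.
    rows = [(i, i * 2) for i in range(100)]
    train, held_out = split_rows(rows)

    baseline = lambda x: x * 2 if x % 2 == 0 else -1  # wrong on odd inputs
    optimized = lambda x: x * 2                       # always right

    before, after, delta = score_change(baseline, optimized, held_out)
    print(f"held-out rows: {len(held_out)}, before: {before:.1f}, "
          f"after: {after:.1f}, change: {delta:+.1f}")
```

Scoring both versions on rows the optimizer never saw is what makes the reported change meaningful: a gain measured on the training rows could just reflect overfitting to them.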

Quickstart

Go from a CSV to an optimized prompt in under ten lines.

JSON extraction

A full run for the first task class, end to end.

Capture from LangChain

Log your production calls with a single callback and no other code changes.

Python SDK reference

Every method, its parameters, and what it returns.

What you can prove today

The prompt-optimization layer is real and reproducible. On a public JSON extraction set (100 examples, 70 train, 30 held out), GPT-4o-mini went from 83.1 to 85.6 with GEPA, and a fine-tuned Qwen3.5-4B went from 69.1 to 88.9. You can run it yourself: apprentice-benchmark.
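The size of each improvement follows directly from the figures quoted above. The short sketch below just does that arithmetic. The dictionary layout and the `deltas` helper are illustrative assumptions, not part of apprentice-benchmark.

```python
# Before/after scores exactly as quoted in the paragraph above.
reported = {
    "GPT-4o-mini": (83.1, 85.6),
    "Qwen3.5-4B": (69.1, 88.9),
}


def deltas(scores):
    """Return the score change per model, rounded to one decimal place."""
    return {name: round(after - before, 1) for name, (before, after) in scores.items()}


if __name__ == "__main__":
    for name, change in deltas(reported).items():
        print(f"{name}: {change:+.1f} points")
    # GPT-4o-mini: +2.5 points
    # Qwen3.5-4B: +19.8 points
```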

How we write these docs

Every number, tier, and behavior on this site matches the code. If a feature is not shipped, the page says so. A run that does not improve is reported as a real result, not hidden. If you find a claim that drifts from what the SDK does, it is a bug; tell us.