apprentice-sdk. Import name: apprentice. This page covers the v0.1 surface. Two rules hold across the SDK:
- Control-plane calls (create, upload, optimize, report) raise ApprenticeError on failure. The caller is a developer who needs to know.
- Capture calls never raise into your app. They run in your request path and fail open.
Client
| Parameter | Type | Default | Notes |
|---|---|---|---|
| api_key | str \| None | env APPRENTICE_API_KEY | Sent as a bearer token. |
| base_url | str \| None | env APPRENTICE_BASE_URL | Required. Raises ApprenticeError if missing. |
| timeout | float | 10.0 | Per-request timeout in seconds. |
| capture_async | bool | True | Async delivery off the caller thread. Set False for inline delivery in tests or short scripts. |
client.close() flushes buffered traces and closes the HTTP client.
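
The defaults in the table above suggest a resolution order: an explicit argument first, then the environment variable, with a missing base_url treated as an error. The following is a minimal sketch of that logic, not the SDK's actual code. `resolve_settings` and its `env` parameter are hypothetical names, and `ApprenticeError` is redefined locally so the block runs without the SDK installed.

```python
import os


class ApprenticeError(Exception):
    """Local stand-in for the SDK's error type (assumption, for a standalone sketch)."""


def resolve_settings(api_key=None, base_url=None, env=None):
    """Hypothetical helper: explicit arguments win, then environment variables."""
    env = os.environ if env is None else env
    api_key = api_key or env.get("APPRENTICE_API_KEY")
    base_url = base_url or env.get("APPRENTICE_BASE_URL")
    if not base_url:
        # base_url is documented as required, so fail loudly with a fix-it message.
        raise ApprenticeError(
            "base_url is required: pass base_url= or set APPRENTICE_BASE_URL"
        )
    return {"api_key": api_key, "base_url": base_url}


# The URL below is a placeholder, not a real endpoint.
settings = resolve_settings(env={"APPRENTICE_BASE_URL": "https://apprentice.example"})
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.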
Tasks
A task is one repeatable LLM job. Create a separate task per job you want to improve.

| Method | Parameters | Returns | Raises |
|---|---|---|---|
| tasks.create(name, metric="auto") | name: str, metric: str | dict with the task, including a created flag | ApprenticeError on API failure |
See the Metrics section for metric values.
Datasets
| Method | Parameters | Returns | Raises |
|---|---|---|---|
| datasets.upload(task, path=, rows=, input_col="input", output_col="output", prompt=None) | one of path or rows | DatasetStatus | ApprenticeError if both or neither of path/rows, or on API failure |
| datasets.status(task) | task: str | DatasetStatus | ApprenticeError on API failure |
Pass path (a CSV) or rows (a list of dicts). Rows use either of these shapes:
context must be exactly what the model saw. The optimizer cannot improve a prompt against context the model never received.
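
The "both or neither" rule for path and rows from the table above can be sketched as a small check. This is an illustration under assumptions, not the SDK's implementation: `pick_source` is a hypothetical name, `ApprenticeError` is a local stand-in, and the example row simply uses the default column names input and output.

```python
class ApprenticeError(Exception):
    """Local stand-in for the SDK's error type (assumption, for a standalone sketch)."""


def pick_source(path=None, rows=None):
    """Hypothetical helper: accept exactly one of path or rows."""
    # True when both are given or both are missing, which are the two error cases.
    if (path is None) == (rows is None):
        raise ApprenticeError("pass exactly one of path= or rows=")
    return ("path", path) if path is not None else ("rows", rows)


# Row keys follow the documented defaults input_col="input", output_col="output".
example_rows = [{"input": "What is the refund window?", "output": "30 days."}]
kind, value = pick_source(rows=example_rows)
```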
Row tiers. Uploaded rows are silver (you curated them). Captured traces are raw (live, unverified). Human-verified rows are gold. Optimization uses verified rows, which is gold plus silver. Eval gates and model promotion use gold only. Silver can help you optimize, but it cannot certify quality.
DatasetStatus fields: task, gold, silver, raw, ready_for_optimization.
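
The tier arithmetic above is easy to get wrong when reading counts off a status object, so here is a small sketch. The `DatasetStatus` dataclass below is a local stand-in with the documented field names; the field types (integer counts, a boolean flag) are assumptions, and `rows_for_optimization` and `rows_for_gates` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class DatasetStatus:
    """Local stand-in mirroring the documented fields; types are assumed."""
    task: str
    gold: int
    silver: int
    raw: int
    ready_for_optimization: bool


def rows_for_optimization(status: DatasetStatus) -> int:
    # Optimization uses verified rows: gold plus silver. Raw traces are excluded.
    return status.gold + status.silver


def rows_for_gates(status: DatasetStatus) -> int:
    # Eval gates and model promotion use gold only.
    return status.gold


status = DatasetStatus(task="support-answers", gold=12, silver=40, raw=300,
                       ready_for_optimization=True)
verified = rows_for_optimization(status)
certifying = rows_for_gates(status)
```

With these made-up counts, 52 rows are available to the optimizer but only 12 can back an eval gate.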
Prompts
| Method | Parameters | Returns | Raises |
|---|---|---|---|
| prompts.register(task, template) | template: a dict artifact or a LangChain ChatPromptTemplate | dict | ImportError if a LangChain object is passed without the [langchain] extra; ValueError for unsupported template formats; ApprenticeError on API failure |
| prompts.get(task, version=None) | version: int \| None | PromptVersion | ApprenticeError if no prompt or on API failure |
| prompts.history(task) | task: str | list[PromptVersion] | ApprenticeError on API failure |
| prompts.to_langchain(task=, artifact=) | one of task or artifact | LangChain ChatPromptTemplate | ApprenticeError if neither is given; ImportError without the [langchain] extra; ValueError for a bad artifact |
A dict artifact has format, messages, and input_variables. Only f-string templates are supported. PromptVersion fields: task, version, text, score.
Optimize jobs
| Method | Parameters | Returns | Raises |
|---|---|---|---|
| optimize(task) | task: str | Job | ApprenticeError on API failure, including too few verified rows |
| job(job_id) | job_id: str | Job | ApprenticeError on API failure |
| Job.refresh() | none | Job | ApprenticeError on API failure |
| Job.wait(poll_seconds=5.0, timeout_seconds=3600.0) | floats | Job | ApprenticeError on timeout or API failure |
| Job.report() | none | OptimizationReport | ApprenticeError if the job has no report |
Job properties: job_id, status (queued, running, succeeded, failed).
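
The poll-until-terminal behavior that Job.wait describes can be sketched as a plain loop. This is an illustration of the general pattern under assumptions, not the SDK's code: `wait_for`, `fetch_status`, `sleep`, and `clock` are hypothetical names, and `ApprenticeError` is a local stand-in. Treating succeeded and failed as the terminal statuses is inferred from the status list above.

```python
import time


class ApprenticeError(Exception):
    """Local stand-in for the SDK's error type (assumption, for a standalone sketch)."""


TERMINAL = {"succeeded", "failed"}


def wait_for(fetch_status, poll_seconds=5.0, timeout_seconds=3600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Hypothetical helper: poll fetch_status() until a terminal status or timeout."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise ApprenticeError(f"job still {status} after {timeout_seconds}s")
        sleep(poll_seconds)


# Simulated status sequence so the sketch runs without a backend.
statuses = iter(["queued", "running", "succeeded"])
final = wait_for(lambda: next(statuses), poll_seconds=0, sleep=lambda s: None)
```

A failed job returns normally from this loop, so callers still need to check the final status before asking for a report.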
OptimizationReport fields: task, baseline_score, optimized_score, examples_used, optimized_prompt, optimized_template, detail. Scores can be None when the run did not produce a comparable number, so check before formatting.
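
Because either score can be None, a format spec such as `:.3f` applied directly would raise a TypeError. A minimal sketch of a guard follows; `format_score` and `summarize` are hypothetical helpers that take the two score values as plain arguments rather than an SDK object.

```python
def format_score(score):
    """Render a score, or a placeholder when the run produced no comparable number."""
    return "n/a" if score is None else f"{score:.3f}"


def summarize(baseline_score, optimized_score):
    """Build a one-line summary; only compute a delta when both scores exist."""
    line = (f"baseline={format_score(baseline_score)} "
            f"optimized={format_score(optimized_score)}")
    if baseline_score is not None and optimized_score is not None:
        line += f" delta={optimized_score - baseline_score:+.3f}"
    return line


both = summarize(0.5, 0.75)       # "baseline=0.500 optimized=0.750 delta=+0.250"
missing = summarize(None, 0.75)   # "baseline=n/a optimized=0.750"
```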
Capture
The capture path logs one completed call. It never raises into your application.

| Method | Parameters | Returns | Behavior |
|---|---|---|---|
| capture(task, output, *, input=, inputs=, model=, latency_ms=, prompt_tokens=, completion_tokens=, error=, metadata=) | task and output required; one of input or inputs | trace_id or None | Never raises. In async mode returns None and delivers off-thread; in sync mode returns the trace_id. |
| flush_captures(timeout=2.0) | timeout: float | None | Blocks until buffered traces are delivered. No-op in sync mode. |
| post_trace_failopen(record) | record: TraceRecord | trace_id or None | The low-level escape hatch. Never raises. |
For calls with retrieved context, pass inputs={"question": ..., "context": exact_context}, not a single rendered prompt string. Do not wrap capture in a try/except to protect your app; it is already fail-open.
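
To make "fail-open" concrete, here is a generic sketch of the pattern: the delivery function swallows its own failures and returns None, so the caller needs no try/except. This shows the idea only and is not how the SDK is implemented. `fail_open`, `send_trace`, and the record contents are hypothetical.

```python
import logging

logger = logging.getLogger("capture_sketch")


def fail_open(send):
    """Wrap a delivery function so any failure is logged and turned into None."""
    def wrapper(*args, **kwargs):
        try:
            return send(*args, **kwargs)
        except Exception:
            # Log for debugging, but never let the error reach the request path.
            logger.debug("trace delivery failed; continuing", exc_info=True)
            return None
    return wrapper


@fail_open
def send_trace(record):
    # Simulate an unreachable backend.
    raise ConnectionError("backend unreachable")


record = {
    "task": "support-answers",
    "inputs": {"question": "What is the refund window?", "context": "Refunds: 30 days."},
    "output": "30 days.",
}
result = send_trace(record)  # returns None instead of raising
```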
Feedback
| Method | Parameters | Returns | Raises |
|---|---|---|---|
| feedback(trace_id, good=None, score=None, note=None) | trace_id: str and at least one signal | None | ApprenticeError on API failure |
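
The "at least one signal" requirement has one subtle case: good=False is a real signal, so a truthiness check would wrongly reject it. The sketch below checks against None instead. `build_feedback` is a hypothetical helper, and raising ValueError for an empty call is an assumption; the table above does not say what the SDK does in that case.

```python
def build_feedback(trace_id, good=None, score=None, note=None):
    """Hypothetical helper: keep only the signals that were actually provided."""
    signals = {"good": good, "score": score, "note": note}
    # Compare against None so that good=False and score=0.0 still count.
    present = {key: value for key, value in signals.items() if value is not None}
    if not present:
        raise ValueError("give at least one of good=, score=, or note=")
    return {"trace_id": trace_id, **present}


thumbs_down = build_feedback("trace-123", good=False)
scored = build_feedback("trace-123", score=0.0, note="wrong policy cited")
```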
LangChain callback
| Constructor | Parameters | Behavior |
|---|---|---|
| ApprenticeCallback(task, client, redact=None) | task: str, client: Apprentice, redact: Callable[[str], str] \| None | Captures chat-model calls for the task. Records retrieved context for simple LangChain RAG chains. Never raises into your chain. |
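
The redact parameter takes any function from str to str. As an illustration, here is a small redactor that masks email addresses. `redact_emails` and its pattern are examples only; the pattern is deliberately simple and will not catch every valid address.

```python
import re

# Simple email pattern for illustration; not a full RFC 5322 matcher.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[email]", text)


cleaned = redact_emails("Contact jane.doe@example.com for access.")
```

A function like this would be passed as redact=redact_emails when constructing the callback.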
Metrics
| metric | Scored by | Use for |
|---|---|---|
| auto | inferred from your rows | let the backend choose |
| json_f1 | deterministic | JSON or structured extraction |
| text_f1 | deterministic | short free-text answers |
| semantic_f1 | LLM judge (estimate) | free-text or RAG answers, the default for RAG |
| rag_faithfulness | LLM judge (estimate) | whether every claim is supported by the context |
| rag_composite | LLM judge (estimate) | RAG grounding and correct refusal together |
For RAG tasks the default is semantic_f1; pass metric="rag_composite" when you want grounding and refusal optimized together.
Errors and debugging
ApprenticeError is raised for control-plane failures, with a message that tells you what to do. Turn on request logging with: