Python SDK reference

Package: apprentice-sdk. Import name: apprentice. This page covers the v0.1 surface. Two rules hold across the SDK:

Control-plane calls (create, upload, optimize, report) raise ApprenticeError on failure. The caller is a developer who needs to know.
Capture calls never raise into your app. They run in your request path and fail open.

Client

from apprentice import Apprentice

client = Apprentice(api_key="ap_live_...", base_url="https://your-backend")

Parameter	Type	Default	Notes
`api_key`	`str \| None`	env `APPRENTICE_API_KEY`	Sent as a bearer token.
`base_url`	`str \| None`	env `APPRENTICE_BASE_URL`	Required. Raises `ApprenticeError` if missing.
`timeout`	`float`	`10.0`	Per-request timeout in seconds.
`capture_async`	`bool`	`True`	Async delivery off the caller thread. Set `False` for inline delivery in tests or short scripts.

client.close() flushes buffered traces and closes the HTTP client.

Tasks

A task is one repeatable LLM job. Create a separate task per job you want to improve.

client.tasks.create("invoice-json", metric="json_f1")

Method	Parameters	Returns	Raises
`tasks.create(name, metric="auto")`	`name: str`, `metric: str`	`dict` with the task, including a `created` flag	`ApprenticeError` on API failure

See Metrics for the accepted metric values.

Datasets

status = client.datasets.upload("invoice-json", path="golden.csv", prompt="...")

Method	Parameters	Returns	Raises
`datasets.upload(task, path=, rows=, input_col="input", output_col="output", prompt=None)`	one of `path` or `rows`	`DatasetStatus`	`ApprenticeError` if both or neither of `path`/`rows`, or on API failure
`datasets.status(task)`	`task: str`	`DatasetStatus`	`ApprenticeError` on API failure

Provide exactly one of path (a CSV) or rows (a list of dicts). Rows use either of these shapes:

# legacy: a single input string
{"input": "raw text", "output": "expected output"}

# structured: the canonical shape, required for multi-field and RAG tasks
{"inputs": {"question": "...", "context": "..."}, "output": "expected output"}

For RAG, context must be exactly what the model saw. The optimizer cannot improve a prompt against context the model never received. Row tiers. Uploaded rows are silver (you curated them). Captured traces are raw (live, unverified). Human-verified rows are gold. Optimization uses verified rows, which is gold plus silver. Eval gates and model promotion use gold only. Silver can help you optimize, but it cannot certify quality. DatasetStatus fields: task, gold, silver, raw, ready_for_optimization.

Prompts

client.prompts.register("invoice-json", template)
best = client.prompts.get("invoice-json")

Method	Parameters	Returns	Raises
`prompts.register(task, template)`	`template`: a dict artifact or a LangChain `ChatPromptTemplate`	`dict`	`ImportError` if a LangChain object is passed without the `[langchain]` extra; `ValueError` for unsupported template formats; `ApprenticeError` on API failure
`prompts.get(task, version=None)`	`version: int \| None`	`PromptVersion`	`ApprenticeError` if no prompt or on API failure
`prompts.history(task)`	`task: str`	`list[PromptVersion]`	`ApprenticeError` on API failure
`prompts.to_langchain(task=, artifact=)`	one of `task` or `artifact`	LangChain `ChatPromptTemplate`	`ApprenticeError` if neither is given; `ImportError` without the `[langchain]` extra; `ValueError` for a bad artifact

The raw template artifact is a dict with format, messages, and input_variables. Only f-string templates are supported. PromptVersion fields: task, version, text, score.

Optimize jobs

report = client.optimize("invoice-json").wait().report()

Method	Parameters	Returns	Raises
`optimize(task)`	`task: str`	`Job`	`ApprenticeError` on API failure, including too few verified rows
`job(job_id)`	`job_id: str`	`Job`	`ApprenticeError` on API failure
`Job.refresh()`	none	`Job`	`ApprenticeError` on API failure
`Job.wait(poll_seconds=5.0, timeout_seconds=3600.0)`	floats	`Job`	`ApprenticeError` on timeout or API failure
`Job.report()`	none	`OptimizationReport`	`ApprenticeError` if the job has no report

Job properties: job_id, status (queued, running, succeeded, failed). OptimizationReport fields: task, baseline_score, optimized_score, examples_used, optimized_prompt, optimized_template, detail. Scores can be None when the run did not produce a comparable number, so check before formatting.

Capture

The capture path logs one completed call. It never raises into your application.

trace_id = client.capture(
    "support-bot",
    output=answer,
    inputs={"question": question, "context": context},
    model="gpt-5-mini",
)

Method	Parameters	Returns	Behavior
`capture(task, output, *, input=, inputs=, model=, latency_ms=, prompt_tokens=, completion_tokens=, error=, metadata=)`	`task` and `output` required; one of `input` or `inputs`	`trace_id` or `None`	Never raises. In async mode returns `None` and delivers off-thread; in sync mode returns the `trace_id`.
`flush_captures(timeout=2.0)`	`timeout: float`	`None`	Blocks until buffered traces are delivered. No-op in sync mode.
`post_trace_failopen(record)`	`record: TraceRecord`	`trace_id` or `None`	The low-level escape hatch. Never raises.

For RAG, pass inputs={"question": ..., "context": exact_context}, not a single rendered prompt string. Do not wrap capture in a try/except to protect your app, it is already fail-open.

Feedback

client.feedback(trace_id, good=True, note="resolved the ticket")

Method	Parameters	Returns	Raises
`feedback(trace_id, good=None, score=None, note=None)`	`trace_id: str` and at least one signal	`None`	`ApprenticeError` on API failure

LangChain callback

from apprentice.langchain import ApprenticeCallback

model = init_chat_model("gpt-5-mini", callbacks=[ApprenticeCallback("support-bot", client)])

Constructor	Parameters	Behavior
`ApprenticeCallback(task, client, redact=None)`	`task: str`, `client: Apprentice`, `redact: Callable[[str], str] \| None`	Captures chat-model calls for the task. Records retrieved context for simple LangChain RAG chains. Never raises into your chain.

Metrics

`metric`	Scored by	Use for
`auto`	inferred from your rows	let the backend choose
`json_f1`	deterministic	JSON or structured extraction
`text_f1`	deterministic	short free-text answers
`semantic_f1`	LLM judge (estimate)	free-text or RAG answers, the default for RAG
`rag_faithfulness`	LLM judge (estimate)	whether every claim is supported by the context
`rag_composite`	LLM judge (estimate)	RAG grounding and correct refusal together

Judge-scored metrics are advisory estimates, not deterministic truth. RAG rows auto-route to semantic_f1; pass metric="rag_composite" when you want grounding and refusal optimized together.

Errors and debugging

ApprenticeError is raised for control-plane failures, with a message that tells you what to do. Turn on request logging with:

import apprentice
apprentice.enable_debug_logging()   # or set APPRENTICE_DEBUG=1

​Client

​Tasks

​Datasets

​Prompts

​Optimize jobs

​Capture

​Feedback

​LangChain callback

​Metrics

​Errors and debugging