Skip to main content
Package: apprentice-sdk. Import name: apprentice. This page covers the v0.1 surface. Two rules hold across the SDK:
  • Control-plane calls (create, upload, optimize, report) raise ApprenticeError on failure. The caller is a developer who needs to know.
  • Capture calls never raise into your app. They run in your request path and fail open.

Client

from apprentice import Apprentice

client = Apprentice(api_key="ap_live_...", base_url="https://your-backend")
ParameterTypeDefaultNotes
api_keystr | Noneenv APPRENTICE_API_KEYSent as a bearer token.
base_urlstr | Noneenv APPRENTICE_BASE_URLRequired. Raises ApprenticeError if missing.
timeoutfloat10.0Per-request timeout in seconds.
capture_asyncboolTrueAsync delivery off the caller thread. Set False for inline delivery in tests or short scripts.
client.close() flushes buffered traces and closes the HTTP client.

Tasks

A task is one repeatable LLM job. Create a separate task per job you want to improve.
client.tasks.create("invoice-json", metric="json_f1")
MethodParametersReturnsRaises
tasks.create(name, metric="auto")name: str, metric: strdict with the task, including a created flagApprenticeError on API failure
See Metrics for the accepted metric values.

Datasets

status = client.datasets.upload("invoice-json", path="golden.csv", prompt="...")
MethodParametersReturnsRaises
datasets.upload(task, path=, rows=, input_col="input", output_col="output", prompt=None)one of path or rowsDatasetStatusApprenticeError if both or neither of path/rows, or on API failure
datasets.status(task)task: strDatasetStatusApprenticeError on API failure
Provide exactly one of path (a CSV) or rows (a list of dicts). Rows use either of these shapes:
# legacy: a single input string
{"input": "raw text", "output": "expected output"}

# structured: the canonical shape, required for multi-field and RAG tasks
{"inputs": {"question": "...", "context": "..."}, "output": "expected output"}
For RAG, context must be exactly what the model saw. The optimizer cannot improve a prompt against context the model never received. Row tiers. Uploaded rows are silver (you curated them). Captured traces are raw (live, unverified). Human-verified rows are gold. Optimization uses verified rows, which is gold plus silver. Eval gates and model promotion use gold only. Silver can help you optimize, but it cannot certify quality. DatasetStatus fields: task, gold, silver, raw, ready_for_optimization.

Prompts

client.prompts.register("invoice-json", template)
best = client.prompts.get("invoice-json")
MethodParametersReturnsRaises
prompts.register(task, template)template: a dict artifact or a LangChain ChatPromptTemplatedictImportError if a LangChain object is passed without the [langchain] extra; ValueError for unsupported template formats; ApprenticeError on API failure
prompts.get(task, version=None)version: int | NonePromptVersionApprenticeError if no prompt or on API failure
prompts.history(task)task: strlist[PromptVersion]ApprenticeError on API failure
prompts.to_langchain(task=, artifact=)one of task or artifactLangChain ChatPromptTemplateApprenticeError if neither is given; ImportError without the [langchain] extra; ValueError for a bad artifact
The raw template artifact is a dict with format, messages, and input_variables. Only f-string templates are supported. PromptVersion fields: task, version, text, score.

Optimize jobs

report = client.optimize("invoice-json").wait().report()
MethodParametersReturnsRaises
optimize(task)task: strJobApprenticeError on API failure, including too few verified rows
job(job_id)job_id: strJobApprenticeError on API failure
Job.refresh()noneJobApprenticeError on API failure
Job.wait(poll_seconds=5.0, timeout_seconds=3600.0)floatsJobApprenticeError on timeout or API failure
Job.report()noneOptimizationReportApprenticeError if the job has no report
Job properties: job_id, status (queued, running, succeeded, failed). OptimizationReport fields: task, baseline_score, optimized_score, examples_used, optimized_prompt, optimized_template, detail. Scores can be None when the run did not produce a comparable number, so check before formatting.

Capture

The capture path logs one completed call. It never raises into your application.
trace_id = client.capture(
    "support-bot",
    output=answer,
    inputs={"question": question, "context": context},
    model="gpt-5-mini",
)
MethodParametersReturnsBehavior
capture(task, output, *, input=, inputs=, model=, latency_ms=, prompt_tokens=, completion_tokens=, error=, metadata=)task and output required; one of input or inputstrace_id or NoneNever raises. In async mode returns None and delivers off-thread; in sync mode returns the trace_id.
flush_captures(timeout=2.0)timeout: floatNoneBlocks until buffered traces are delivered. No-op in sync mode.
post_trace_failopen(record)record: TraceRecordtrace_id or NoneThe low-level escape hatch. Never raises.
For RAG, pass inputs={"question": ..., "context": exact_context}, not a single rendered prompt string. Do not wrap capture in a try/except to protect your app, it is already fail-open.

Feedback

client.feedback(trace_id, good=True, note="resolved the ticket")
MethodParametersReturnsRaises
feedback(trace_id, good=None, score=None, note=None)trace_id: str and at least one signalNoneApprenticeError on API failure

LangChain callback

from apprentice.langchain import ApprenticeCallback

model = init_chat_model("gpt-5-mini", callbacks=[ApprenticeCallback("support-bot", client)])
ConstructorParametersBehavior
ApprenticeCallback(task, client, redact=None)task: str, client: Apprentice, redact: Callable[[str], str] | NoneCaptures chat-model calls for the task. Records retrieved context for simple LangChain RAG chains. Never raises into your chain.

Metrics

metricScored byUse for
autoinferred from your rowslet the backend choose
json_f1deterministicJSON or structured extraction
text_f1deterministicshort free-text answers
semantic_f1LLM judge (estimate)free-text or RAG answers, the default for RAG
rag_faithfulnessLLM judge (estimate)whether every claim is supported by the context
rag_compositeLLM judge (estimate)RAG grounding and correct refusal together
Judge-scored metrics are advisory estimates, not deterministic truth. RAG rows auto-route to semantic_f1; pass metric="rag_composite" when you want grounding and refusal optimized together.

Errors and debugging

ApprenticeError is raised for control-plane failures, with a message that tells you what to do. Turn on request logging with:
import apprentice
apprentice.enable_debug_logging()   # or set APPRENTICE_DEBUG=1