> ## Documentation Index
> Fetch the complete documentation index at: https://docs.runapprentice.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Python SDK reference

> Every public method in the apprentice package: parameters, return values, and what it raises.

Package: `apprentice-sdk`. Import name: `apprentice`. This page covers the v0.1 surface. Two rules hold across the SDK:

* **Control-plane calls** (create, upload, optimize, report) raise `ApprenticeError` on failure. The caller is a developer who needs to know.
* **Capture calls** never raise into your app. They run in your request path and fail open.

## Client

```python theme={null}
from apprentice import Apprentice

client = Apprentice(api_key="ap_live_...", base_url="https://your-backend")
```

| Parameter       | Type          | Default                   | Notes                                                                                            |
| --------------- | ------------- | ------------------------- | ------------------------------------------------------------------------------------------------ |
| `api_key`       | `str \| None` | env `APPRENTICE_API_KEY`  | Sent as a bearer token.                                                                          |
| `base_url`      | `str \| None` | env `APPRENTICE_BASE_URL` | Required. Raises `ApprenticeError` if missing.                                                   |
| `timeout`       | `float`       | `10.0`                    | Per-request timeout in seconds.                                                                  |
| `capture_async` | `bool`        | `True`                    | Async delivery off the caller thread. Set `False` for inline delivery in tests or short scripts. |

`client.close()` flushes buffered traces and closes the HTTP client.

## Tasks

A task is one repeatable LLM job. Create a separate task per job you want to improve.

```python theme={null}
client.tasks.create("invoice-json", metric="json_f1")
```

| Method                              | Parameters                 | Returns                                          | Raises                           |
| ----------------------------------- | -------------------------- | ------------------------------------------------ | -------------------------------- |
| `tasks.create(name, metric="auto")` | `name: str`, `metric: str` | `dict` with the task, including a `created` flag | `ApprenticeError` on API failure |

See [Metrics](#metrics) for the accepted `metric` values.

## Datasets

```python theme={null}
status = client.datasets.upload("invoice-json", path="golden.csv", prompt="...")
```

| Method                                                                                     | Parameters              | Returns         | Raises                                                                   |
| ------------------------------------------------------------------------------------------ | ----------------------- | --------------- | ------------------------------------------------------------------------ |
| `datasets.upload(task, path=, rows=, input_col="input", output_col="output", prompt=None)` | one of `path` or `rows` | `DatasetStatus` | `ApprenticeError` if both or neither of `path`/`rows`, or on API failure |
| `datasets.status(task)`                                                                    | `task: str`             | `DatasetStatus` | `ApprenticeError` on API failure                                         |

Provide exactly one of `path` (a CSV) or `rows` (a list of dicts). Rows use either of these shapes:

```python theme={null}
# legacy: a single input string
{"input": "raw text", "output": "expected output"}

# structured: the canonical shape, required for multi-field and RAG tasks
{"inputs": {"question": "...", "context": "..."}, "output": "expected output"}
```

For RAG, `context` must be exactly what the model saw. The optimizer cannot improve a prompt against context the model never received.

**Row tiers.** Uploaded rows are **silver** (you curated them). Captured traces are **raw** (live, unverified). Human-verified rows are **gold**. Optimization uses verified rows, which is gold plus silver. Eval gates and model promotion use gold only. Silver can help you optimize, but it cannot certify quality.

`DatasetStatus` fields: `task`, `gold`, `silver`, `raw`, `ready_for_optimization`.

## Prompts

```python theme={null}
client.prompts.register("invoice-json", template)
best = client.prompts.get("invoice-json")
```

| Method                                   | Parameters                                                      | Returns                        | Raises                                                                                                                                                         |
| ---------------------------------------- | --------------------------------------------------------------- | ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `prompts.register(task, template)`       | `template`: a dict artifact or a LangChain `ChatPromptTemplate` | `dict`                         | `ImportError` if a LangChain object is passed without the `[langchain]` extra; `ValueError` for unsupported template formats; `ApprenticeError` on API failure |
| `prompts.get(task, version=None)`        | `version: int \| None`                                          | `PromptVersion`                | `ApprenticeError` if no prompt or on API failure                                                                                                               |
| `prompts.history(task)`                  | `task: str`                                                     | `list[PromptVersion]`          | `ApprenticeError` on API failure                                                                                                                               |
| `prompts.to_langchain(task=, artifact=)` | one of `task` or `artifact`                                     | LangChain `ChatPromptTemplate` | `ApprenticeError` if neither is given; `ImportError` without the `[langchain]` extra; `ValueError` for a bad artifact                                          |

The raw template artifact is a dict with `format`, `messages`, and `input_variables`. Only `f-string` templates are supported. `PromptVersion` fields: `task`, `version`, `text`, `score`.

## Optimize jobs

```python theme={null}
report = client.optimize("invoice-json").wait().report()
```

| Method                                               | Parameters    | Returns              | Raises                                                            |
| ---------------------------------------------------- | ------------- | -------------------- | ----------------------------------------------------------------- |
| `optimize(task)`                                     | `task: str`   | `Job`                | `ApprenticeError` on API failure, including too few verified rows |
| `job(job_id)`                                        | `job_id: str` | `Job`                | `ApprenticeError` on API failure                                  |
| `Job.refresh()`                                      | none          | `Job`                | `ApprenticeError` on API failure                                  |
| `Job.wait(poll_seconds=5.0, timeout_seconds=3600.0)` | floats        | `Job`                | `ApprenticeError` on timeout or API failure                       |
| `Job.report()`                                       | none          | `OptimizationReport` | `ApprenticeError` if the job has no report                        |

`Job` properties: `job_id`, `status` (`queued`, `running`, `succeeded`, `failed`).

`OptimizationReport` fields: `task`, `baseline_score`, `optimized_score`, `examples_used`, `optimized_prompt`, `optimized_template`, `detail`. Scores can be `None` when the run did not produce a comparable number, so check before formatting.

## Capture

The capture path logs one completed call. It never raises into your application.

```python theme={null}
trace_id = client.capture(
    "support-bot",
    output=answer,
    inputs={"question": question, "context": context},
    model="gpt-5-mini",
)
```

| Method                                                                                                                  | Parameters                                               | Returns              | Behavior                                                                                                 |
| ----------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- | -------------------- | -------------------------------------------------------------------------------------------------------- |
| `capture(task, output, *, input=, inputs=, model=, latency_ms=, prompt_tokens=, completion_tokens=, error=, metadata=)` | `task` and `output` required; one of `input` or `inputs` | `trace_id` or `None` | Never raises. In async mode returns `None` and delivers off-thread; in sync mode returns the `trace_id`. |
| `flush_captures(timeout=2.0)`                                                                                           | `timeout: float`                                         | `None`               | Blocks until buffered traces are delivered. No-op in sync mode.                                          |
| `post_trace_failopen(record)`                                                                                           | `record: TraceRecord`                                    | `trace_id` or `None` | The low-level escape hatch. Never raises.                                                                |

For RAG, pass `inputs={"question": ..., "context": exact_context}`, not a single rendered prompt string. Do not wrap `capture` in a `try/except` to protect your app, it is already fail-open.

## Feedback

```python theme={null}
client.feedback(trace_id, good=True, note="resolved the ticket")
```

| Method                                                 | Parameters                              | Returns | Raises                           |
| ------------------------------------------------------ | --------------------------------------- | ------- | -------------------------------- |
| `feedback(trace_id, good=None, score=None, note=None)` | `trace_id: str` and at least one signal | `None`  | `ApprenticeError` on API failure |

## LangChain callback

```python theme={null}
from apprentice.langchain import ApprenticeCallback

model = init_chat_model("gpt-5-mini", callbacks=[ApprenticeCallback("support-bot", client)])
```

| Constructor                                     | Parameters                                                                | Behavior                                                                                                                         |
| ----------------------------------------------- | ------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `ApprenticeCallback(task, client, redact=None)` | `task: str`, `client: Apprentice`, `redact: Callable[[str], str] \| None` | Captures chat-model calls for the task. Records retrieved context for simple LangChain RAG chains. Never raises into your chain. |

## Metrics

| `metric`           | Scored by               | Use for                                         |
| ------------------ | ----------------------- | ----------------------------------------------- |
| `auto`             | inferred from your rows | let the backend choose                          |
| `json_f1`          | deterministic           | JSON or structured extraction                   |
| `text_f1`          | deterministic           | short free-text answers                         |
| `semantic_f1`      | LLM judge (estimate)    | free-text or RAG answers, the default for RAG   |
| `rag_faithfulness` | LLM judge (estimate)    | whether every claim is supported by the context |
| `rag_composite`    | LLM judge (estimate)    | RAG grounding and correct refusal together      |

Judge-scored metrics are advisory estimates, not deterministic truth. RAG rows auto-route to `semantic_f1`; pass `metric="rag_composite"` when you want grounding and refusal optimized together.

## Errors and debugging

`ApprenticeError` is raised for control-plane failures, with a message that tells you what to do. Turn on request logging with:

```python theme={null}
import apprentice
apprentice.enable_debug_logging()   # or set APPRENTICE_DEBUG=1
```