# Blazen — Full Documentation Concatenation of every Blazen documentation page. Each entry below names the source URL so you can cite specific pages. Companion index: https://blazen.dev/llms.txt --- # Introduction Source: https://blazen.dev/docs/getting-started/introduction Section: getting-started ## What is Blazen? Blazen is an event-driven AI workflow engine written in Rust with native Python and Node.js bindings. It gives you a structured, type-safe way to build complex LLM-powered workflows that can pause, resume, branch, and compose together into larger pipelines. ## Why Blazen? Building production AI workflows means dealing with unreliable providers, long-running tasks, human review gates, and the need to orchestrate multiple models across a single pipeline. Blazen solves these problems at the engine level: - **Type-safe events** -- Every piece of data flowing through your workflow is validated at compile time in Rust and at runtime in Python and Node.js. - **15+ LLM providers** -- OpenAI, Anthropic, Google, Mistral, Groq, and more, with a unified interface and automatic fallback. - **Multi-workflow pipelines** -- Chain workflows together with conditional routing, fan-out, and aggregation. - **Pause and resume** -- Persist workflow state to disk or a database and pick up exactly where you left off. - **Human-in-the-loop** -- Built-in support for approval gates, review steps, and manual intervention points. ## Architecture Overview Blazen is built around three core abstractions. **Events** carry typed data between processing nodes. **Steps** consume events, perform work (calling an LLM, transforming data, waiting for human input), and emit new events. **Workflows** wire Steps together into a directed graph. When you need to go bigger, **Pipelines** compose multiple Workflows into an end-to-end system with shared context and cross-workflow routing. ## Core Crates The engine is split into focused crates, each with a single responsibility: - **blazen-events** -- Event definitions, serialization, and type registry - **blazen-core** -- Workflow runtime, step execution, and scheduling - **blazen-macros** -- Derive macros for events, steps, and configuration - **blazen-llm** -- Unified LLM client with provider adapters and retry logic - **blazen-pipeline** -- Multi-workflow orchestration and routing - **blazen-persist** -- State serialization, checkpointing, and recovery - **blazen-prompts** -- Prompt templating, variable injection, and versioning ## Polyglot by Design Blazen exposes the same concepts across Rust, Python, and Node.js. Each language gets idiomatic APIs -- PyO3 bindings for Python, NAPI-RS for Node.js -- so you write natural code in your language of choice while the Rust engine handles the heavy lifting. ## License Blazen is open source under the **AGPL-3.0** license. --- # Installation Source: https://blazen.dev/docs/getting-started/installation Section: getting-started ## Rust ```bash cargo add blazen ``` To enable specific provider features: ```bash cargo add blazen --features openai,anthropic ``` ## Python Preferred (using [uv](https://docs.astral.sh/uv/)): ```bash uv add blazen ``` Or with pip: ```bash pip install blazen ``` ## Node.js ```bash pnpm add blazen ``` Or with npm/yarn: ```bash npm install blazen yarn add blazen ``` ## WebAssembly ```bash npm install @blazen/sdk ``` Or with pnpm/yarn: ```bash pnpm add @blazen/sdk yarn add @blazen/sdk ``` The WASM SDK runs in the browser, Node.js, Deno, and edge runtimes. 
See the [WASM Quickstart](/docs/guides/wasm/quickstart) for usage. ## CLI ```bash cargo install blazen-cli ``` --- # Core Concepts Source: https://blazen.dev/docs/getting-started/concepts Section: getting-started ## Events Events are the fundamental data units that flow between steps in a workflow. Blazen provides several built-in event types: `StartEvent` triggers a workflow, `StopEvent` terminates it with a result, and `InputRequestEvent`/`InputResponseEvent` enable human-in-the-loop interactions. You can also define custom events to carry domain-specific data between steps. Events fall into two categories: **routing events** that control flow between steps, and **stream events** that are observed externally by consumers without affecting the workflow's execution path. `StopEvent(result=x)` preserves identity for non-JSON values. You can pass class instances, Pydantic models, and live DB connections through the result event and read them back unchanged on the other side (in the Python and WASM bindings). ## Steps Steps are async functions that receive an event and a context, then return one or more events. Each step declares which event types it accepts, allowing the workflow router to dispatch events correctly. A step can return a single event to continue the flow, a list of events to fan out to multiple downstream steps, or null/None to perform a side-effect without emitting further events. ## Workflows A workflow is a collection of steps wired together by an event router that dispatches events to the appropriate step based on type. Every workflow begins with a `StartEvent` and completes when a `StopEvent` is produced. Workflows support snapshotting, meaning they can be paused at any point and resumed later from the saved state. ## Context The context is a shared key-value store available to all steps within a single workflow run. Use `ctx.set(key, value)` and `ctx.get(key)` to share state between steps without coupling them directly. Values can be structured data (strings, numbers, booleans, arrays, objects), raw binary data (`bytes` / `Uint8Array`), or platform-native objects -- each SDK handles serialization appropriate to its runtime. For raw binary data, use `ctx.set_bytes(key, data)` and `ctx.get_bytes(key)` -- these persist through pause/resume/checkpoint with no extra serialization. The context also provides `ctx.send_event()` for manually routing events to other steps and `ctx.write_event_to_stream()` for publishing events to external consumers. `Context` exposes two explicit namespaces alongside the smart-routing `ctx.set` / `ctx.get` shortcuts: - **`ctx.state`** -- persistable values. Survives `pause()` / `resume()` and checkpoint stores. Use this for anything that must outlive the current run (counters, input paths, intermediate JSON results). - **`ctx.session`** -- live in-process references. Identity is preserved within a single workflow run -- `ctx.session["conn"]` returns the *same* Python object (or WASM JS value) on every access. Values in `ctx.session` are deliberately excluded from snapshots; what happens to them at pause time is governed by the workflow-level `session_pause_policy`. The mental model: anything that **must** survive pause/resume goes in `ctx.state`; anything that **won't** survive (live DB connections, file handles, in-memory caches) goes in `ctx.session`. 
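To make the split concrete, here is a minimal sketch in the Node SDK's API shape (the step names, event types, and context keys are illustrative only, not part of Blazen):

```typescript
import { Workflow } from "blazen";

const wf = new Workflow("ingest");

wf.addStep("load", ["blazen::StartEvent"], async (event, ctx) => {
  // Must survive pause/resume and checkpoints -> state namespace.
  await ctx.state.set("inputPath", event.path);
  await ctx.state.set("rowsProcessed", 0);

  // Only meaningful inside this run -> session namespace (excluded from snapshots).
  await ctx.session.set("requestId", "req-123");

  return { type: "ProcessEvent" };
});

wf.addStep("process", ["ProcessEvent"], async (event, ctx) => {
  const path = await ctx.state.get("inputPath");    // still there after a resume
  const reqId = await ctx.session.get("requestId"); // dropped once the run is snapshotted
  return { type: "blazen::StopEvent", result: { path, reqId } };
});
```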
Identity preservation through `ctx.session` works in the Python and WASM bindings; in Node, session values are routed through JSON, so object identity is not preserved (a napi-rs threading constraint). For structured state with mixed serializable and non-serializable fields, use `BlazenState` -- a protocol (or typed base class, depending on the SDK) that stores each field individually and provides a `restore()` hook to recreate transient fields (like database connections) after deserialization. Fields can be marked as transient to exclude them from storage, and custom persistence strategies can be assigned per field via `storeBy`. See the language-specific context guides for details. ## Streaming Steps can publish non-routing events to an external stream via `ctx.write_event_to_stream()`. External consumers subscribe to this stream to receive updates in real time without interfering with the workflow's internal event routing. This is useful for progress reporting, logging, and delivering incremental results to a user interface. ## Pipelines Pipelines compose multiple workflows into sequential or parallel stages, where each stage's output feeds into the next stage's input. This allows you to break complex processes into discrete, reusable workflows that can be tested and reasoned about independently. Pipelines handle the orchestration of data flow between stages automatically. --- # Quickstart Source: https://blazen.dev/docs/guides/node/quickstart Language: node Section: guides # Quickstart Get a Blazen workflow running in Node.js in under five minutes. ## Installation ```bash pnpm add blazen ``` ## Your first workflow Create a file called `greeter.ts`: ```typescript import { Workflow } from "blazen"; const wf = new Workflow("greeter"); wf.addStep("parse_input", ["blazen::StartEvent"], async (event, ctx) => { const name = event.name || "World"; return { type: "GreetEvent", name }; }); wf.addStep("greet", ["GreetEvent"], async (event, ctx) => { return { type: "blazen::StopEvent", result: { greeting: `Hello, ${event.name}!` }, }; }); const result = await wf.run({ name: "Blazen" }); console.log(result.data); // => { greeting: "Hello, Blazen!" } ``` Run it: ```bash npx tsx greeter.ts ``` ## How it works **Events** are plain objects with a `type` field. Blazen provides two built-in event types: - `"blazen::StartEvent"` -- emitted when the workflow begins. The object you pass to `wf.run()` is merged onto this event. - `"blazen::StopEvent"` -- returning an object with this type ends the workflow. Attach your final output to the `result` property. **Steps** are registered with `addStep(name, eventTypes, handler)`: - `name` -- a unique identifier for the step. - `eventTypes` -- an array of event type strings that trigger this step. - `handler` -- an async function receiving `(event, ctx)`. Return an event object to emit the next event in the workflow. **Context** lets steps share state across the workflow. Both `ctx.set()` and `ctx.get()` are async and must be awaited. 
`ctx.get()` returns all value types -- strings, numbers, booleans, arrays, objects, and binary data -- nothing is silently dropped: ```typescript wf.addStep("store_value", ["blazen::StartEvent"], async (event, ctx) => { await ctx.set("user", event.name); return { type: "NextEvent" }; }); wf.addStep("read_value", ["NextEvent"], async (event, ctx) => { const user = await ctx.get("user"); // StateValue | null return { type: "blazen::StopEvent", result: { user }, }; }); ``` **Result** -- `wf.run()` resolves with an object containing `.type` (always `"blazen::StopEvent"`) and `.data` (the `result` you returned from the final step). ## Next steps You now have a working workflow. From here you can: - Chain more steps together to build complex pipelines. - Use context to pass data between non-adjacent steps. - Integrate LLM calls, database queries, or any async operation inside step handlers. --- # Quickstart Source: https://blazen.dev/docs/guides/python/quickstart Language: python Section: guides ## Installation Preferred (using [uv](https://docs.astral.sh/uv/)): ```bash uv add blazen ``` Or with pip: ```bash pip install blazen ``` ## Define Your Events Blazen routes work through **typed events**. Subclass `Event` to declare the data each step expects: ```python from blazen import Event class GreetEvent(Event): name: str ``` Subclassing `Event` automatically sets `event_type` to the class name (`"GreetEvent"`). Fields are declared as plain type annotations -- Blazen uses them for validation and attribute access. ## Define Your Steps Create a simple greeter workflow with two steps: one to parse input and one to produce a greeting. ```python import asyncio from blazen import Workflow, step, Event, StopEvent, Context class GreetEvent(Event): name: str @step async def parse_input(ctx: Context, ev: Event): return GreetEvent(name=ev.name or "World") @step async def greet(ctx: Context, ev: GreetEvent): return StopEvent(result={"greeting": f"Hello, {ev.name}!"}) ``` The `@step` decorator reads the **type hint** on the `ev` parameter to decide which events a step receives. `parse_input` accepts the base `Event` (so it handles the initial start event), while `greet` accepts `GreetEvent` specifically. No `accepts=` argument is needed. Because `GreetEvent` has a typed `name` field, you access it directly as `ev.name` -- no `.to_dict()` unpacking required. ## Build and Run ```python async def main(): wf = Workflow("greeter", [parse_input, greet]) handler = await wf.run(name="Blazen") result = await handler.result() print(result.result) asyncio.run(main()) ``` Running this prints: ``` {'greeting': 'Hello, Blazen!'} ``` ## Using Context Steps can share state through the `Context` object. Context access is **synchronous** -- no `await` needed: ```python @step async def parse_input(ctx: Context, ev: Event): ctx.set("request_count", ctx.get("request_count", 0) + 1) return GreetEvent(name=ev.name or "World") @step async def greet(ctx: Context, ev: GreetEvent): count = ctx.get("request_count", 0) return StopEvent(result={"greeting": f"Hello, {ev.name}!", "request_number": count}) ``` `ctx.set(key, value)` stores a value and `ctx.get(key)` (or `ctx.get(key, default)`) retrieves it. Both are plain synchronous calls you can use anywhere inside a step. `ctx.set()` accepts any Python value -- not just JSON-serializable types. 
Bytes are stored as raw binary, Pydantic models and other complex objects are pickled automatically, and unpicklable objects (DB connections, file handles, sockets) are kept as live in-process references. `ctx.get()` returns the original type transparently.

For new code, prefer the explicit `ctx.state` and `ctx.session` namespaces:

```python
ctx.state.set("counter", 5)                          # persistable JSON
ctx.session.set("db", sqlite3.connect(":memory:"))   # live object, identity-preserving
```

See the [Context guide](./context/) for details on the four storage tiers and the `state` vs `session` split.

## Key Concepts

- **Event subclasses** -- `class MyEvent(Event): ...` gives you typed fields, automatic `event_type` naming, and direct attribute access (`ev.field`).
- **Type-hint routing** -- `@step` inspects the `ev` parameter's type hint to route events. A hint of `Event` receives the start event; a hint of `GreetEvent` receives only `GreetEvent` instances.
- **Context is synchronous** -- `ctx.set("key", value)` and `ctx.get("key")` do not require `await`.
- **Alternative syntax** -- `Event("GreetEvent", name=value)` still works for inline, one-off events when you don't need a dedicated class.

---

# Quickstart

Source: https://blazen.dev/docs/guides/rust/quickstart
Language: rust
Section: guides

## Install

Add Blazen to your project:

```bash
cargo add blazen
```

You will also need `tokio`, `serde`, `serde_json`, and `anyhow`:

```bash
cargo add tokio --features full
cargo add serde --features derive
cargo add serde_json anyhow
```

## Define a Custom Event

Events are the data that flows between steps. Derive `Event` alongside `Serialize` and `Deserialize` to create your own:

```rust
use blazen::prelude::*;

#[derive(Debug, Clone, Serialize, Deserialize, Event)]
struct GreetEvent {
    name: String,
}
```

## Define Steps

Each step is an async function annotated with `#[step]`. It takes an event and a context, then returns the next event in the chain:

```rust
#[step]
async fn parse_input(event: StartEvent, _ctx: Context) -> Result<GreetEvent> {
    let name = event.data["name"].as_str().unwrap_or("World").to_string();
    Ok(GreetEvent { name })
}

#[step]
async fn greet(event: GreetEvent, _ctx: Context) -> Result<StopEvent> {
    Ok(StopEvent {
        result: serde_json::json!({ "greeting": format!("Hello, {}!", event.name) }),
    })
}
```

## Build and Run the Workflow

Wire the steps together with `WorkflowBuilder`. The `#[step]` macro generates a `*_registration()` function for each step that you pass to the builder:

```rust
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let workflow = WorkflowBuilder::new("greeter")
        .step(parse_input_registration())
        .step(greet_registration())
        .build()?;

    let result = workflow.run(serde_json::json!({ "name": "Blazen" })).await?.result().await?;
    println!("{}", result.event.to_json());
    Ok(())
}
```

## How It Works

The workflow follows a linear event chain:

1. **`StartEvent`** -- the workflow begins with the JSON payload you pass to `run()`.
2. **`GreetEvent`** -- `parse_input` extracts the name and emits a `GreetEvent`.
3. **`StopEvent`** -- `greet` produces the final greeting and terminates the workflow.

The event router automatically dispatches each event to the step that accepts it. No manual wiring between individual steps is required -- the types handle the routing.

---

# WASM Quickstart

Source: https://blazen.dev/docs/guides/wasm/quickstart
Language: wasm
Section: guides

## What is blazen-wasm-sdk?

The Blazen WebAssembly SDK is the core Rust library compiled to WASM.
It runs the same completion, streaming, agent, and workflow logic as the native Rust crate -- directly in the browser or in any JavaScript runtime that supports WebAssembly (Node.js, Deno, Bun, Cloudflare Workers). The package is published as `@blazen/sdk` on npm. ## Installation ```bash npm install @blazen/sdk ``` Or with pnpm/yarn: ```bash pnpm add @blazen/sdk yarn add @blazen/sdk ``` ## Basic usage Initialize the WASM module, then use `CompletionModel` and `ChatMessage` the same way you would in the Node.js SDK: ```typescript import init, { CompletionModel, ChatMessage } from '@blazen/sdk'; await init(); // The WASM SDK reads API keys from the runtime environment only // (OPENROUTER_API_KEY, OPENAI_API_KEY, etc.). No arguments are accepted. const model = CompletionModel.openrouter(); const response = await model.complete([ChatMessage.user('Hello!')]); console.log(response.content); ``` `init()` loads and instantiates the WASM binary. Call it once before using any other export. In most bundlers (Vite, webpack 5, etc.) the binary is resolved automatically. ## Streaming Stream tokens as they arrive using `model.stream()`: ```typescript import init, { CompletionModel, ChatMessage } from '@blazen/sdk'; await init(); const model = CompletionModel.openai(); // reads OPENAI_API_KEY from env await model.stream( [ChatMessage.user('Write a haiku about WebAssembly')], (chunk) => { if (chunk.delta) { document.getElementById('output').textContent += chunk.delta; } } ); ``` Each chunk has the shape `{ delta?: string, finishReason?: string, toolCalls: ToolCall[] }`. ## Completion options Pass options like `temperature` and `maxTokens` through `completeWithOptions()`: ```typescript const response = await model.completeWithOptions( [ChatMessage.user('Explain WASM in one sentence.')], { temperature: 0.3, maxTokens: 100 } ); ``` ## API key security WASM runs client-side in the browser. Never embed raw API keys in frontend code shipped to users. Common strategies: - **Proxy server** -- route requests through your own backend that injects the key. - **Short-lived tokens** -- issue scoped, time-limited tokens from your server. - **Server-side only** -- run the WASM SDK in Node.js / Deno / edge functions where keys stay secret. ## Next steps - Build multi-step pipelines with [WASM Workflows](/docs/guides/wasm/workflows). - Add tool calling with the [WASM Agent](/docs/guides/wasm/agent). - Ship to production with the [Edge Deployment](/docs/guides/wasm/deployment) guide. --- # Events Source: https://blazen.dev/docs/guides/node/events Language: node Section: guides ## Custom Events Events are plain JavaScript objects with a `type` field: ```javascript const event = { type: "AnalyzeEvent", text: "hello", score: 0.9 }; ``` No classes needed — any object with `type` is an event. 
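Because events are just objects, TypeScript users can layer ordinary type definitions on top for editor support; a small sketch (the event shapes below are hypothetical, not part of the SDK):

```typescript
// Hypothetical typings -- Blazen itself only requires the `type` field.
type AnalyzeEvent = { type: "AnalyzeEvent"; text: string; score: number };
type ReportEvent = { type: "ReportEvent"; summary: string };
type AppEvent = AnalyzeEvent | ReportEvent;

const ev: AppEvent = { type: "AnalyzeEvent", text: "hello", score: 0.9 };

// The `type` field narrows the union as usual.
if (ev.type === "AnalyzeEvent") {
  console.log(ev.score); // number
}
```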
## Built-in Events - `"blazen::StartEvent"` — the input event, carries your workflow input as properties - `"blazen::StopEvent"` — terminates the workflow, must have a `result` field ```javascript // Start event (created by wf.run()): { type: "blazen::StartEvent", message: "hello" } // Stop event (returned by a step): { type: "blazen::StopEvent", result: { answer: 42 } } ``` ## Event Routing Steps declare which event types they handle in the `addStep` call: ```javascript wf.addStep("first", ["blazen::StartEvent"], async (event, ctx) => { return { type: "AnalyzeEvent", text: event.message }; }); wf.addStep("second", ["AnalyzeEvent"], async (event, ctx) => { return { type: "blazen::StopEvent", result: { text: event.text } }; }); ``` ## Fan-out (Multiple Events) Return an array to dispatch multiple events: ```javascript wf.addStep("fan", ["blazen::StartEvent"], async (event, ctx) => { return [ { type: "BranchA", value: "a" }, { type: "BranchB", value: "b" }, ]; }); ``` ## Side-Effect Steps Return `null` and use `ctx.sendEvent()` for manual routing: ```javascript wf.addStep("side_effect", ["blazen::StartEvent"], async (event, ctx) => { await ctx.set("processed", true); await ctx.sendEvent({ type: "Continue" }); return null; }); ``` --- # Events Source: https://blazen.dev/docs/guides/python/events Language: python Section: guides ## Defining Custom Events Subclass `Event` to define typed, self-documenting events. The `event_type` is automatically set to the class name, and fields are type-annotated for clarity but stored as JSON internally. ```python from blazen import Event class AnalyzeEvent(Event): text: str score: float ev = AnalyzeEvent(text="hello", score=0.9) print(ev.event_type) # "AnalyzeEvent" print(ev.text) # "hello" print(ev.to_dict()) # {"text": "hello", "score": 0.9} ``` ## Built-in Events `StartEvent` and `StopEvent` are provided by the framework. Every workflow begins with a `StartEvent` and terminates when a `StopEvent` is returned. ```python from blazen import StartEvent, StopEvent start = StartEvent(message="hello") # event_type: "blazen::StartEvent" stop = StopEvent(result={"answer": 42}) # event_type: "blazen::StopEvent" ``` ## Event Routing via Type Hints The type hint on the `ev` parameter controls which events a step accepts. No explicit `accepts` list is needed. - **`ev: Event`** (or no type hint) -- the step handles `StartEvent` by default. - **`ev: SomeCustomEvent`** -- the step automatically accepts only `SomeCustomEvent`. ```python from blazen import Event, StartEvent, StopEvent, Context, step class AnalyzeEvent(Event): text: str @step async def first(ctx: Context, ev: Event): # ev: Event -> accepts StartEvent return AnalyzeEvent(text=ev.message) @step async def second(ctx: Context, ev: AnalyzeEvent): # ev: AnalyzeEvent -> automatically accepts=["AnalyzeEvent"] return StopEvent(result={"text": ev.text}) ``` ## Fan-out (Multiple Events) Return a list to dispatch multiple events simultaneously. Each event routes to its matching step independently. ```python from blazen import Event, Context, step class BranchA(Event): value: str class BranchB(Event): value: str @step async def fan_out(ctx: Context, ev: Event): return [BranchA(value="a"), BranchB(value="b")] ``` ## Side-Effect Steps Return `None` and use `ctx.send_event()` when a step needs to perform a side effect without directly producing a next event in the return value. 
```python
from blazen import Event, Context, step

class ContinueEvent(Event):
    pass

@step
async def side_effect(ctx: Context, ev: Event):
    ctx.set("processed", True)
    ctx.send_event(ContinueEvent())
    return None
```

## Alternative: Inline Events

For quick prototyping you can still create events inline without a subclass:

```python
ev = Event("AnalyzeEvent", text="hello", score=0.9)
```

This is equivalent to defining a one-off event class. Prefer subclasses for production code since they give you type safety and self-documenting fields.

---

# Events

Source: https://blazen.dev/docs/guides/rust/events
Language: rust
Section: guides

## Defining Custom Events

```rust
use blazen::prelude::*;

#[derive(Debug, Clone, Serialize, Deserialize, Event)]
struct AnalyzeEvent {
    text: String,
    score: f64,
}
```

The `#[derive(Event)]` macro generates `event_type()`, serialization, and routing support.

## Built-in Events

- `StartEvent` — carries `data: serde_json::Value`, triggers workflow start
- `StopEvent` — carries `result: serde_json::Value`, terminates workflow
- `InputRequestEvent` — requests human input, auto-pauses workflow
- `InputResponseEvent` — carries human response, matched by `request_id`

## Event Flow

Steps declare their input/output types:

```rust
#[step]
async fn analyze(event: StartEvent, _ctx: Context) -> Result<AnalyzeEvent> {
    Ok(AnalyzeEvent {
        text: event.data["text"].as_str().unwrap().into(),
        score: 0.95,
    })
}
```

The router matches event types to steps automatically.

## Multiple Output Events

Use `#[step(emits = [...])]` and `StepOutput`:

```rust
#[step(emits = [PositiveEvent, NegativeEvent])]
async fn route(event: AnalyzeEvent, _ctx: Context) -> Result<StepOutput> {
    if event.score > 0.5 {
        Ok(StepOutput::Single(Box::new(PositiveEvent { text: event.text })))
    } else {
        Ok(StepOutput::Single(Box::new(NegativeEvent { text: event.text })))
    }
}
```

---

# WASM Workflows

Source: https://blazen.dev/docs/guides/wasm/workflows
Language: wasm
Section: guides

## Overview

Blazen workflows run natively inside the WASM module. Steps, events, and the context store all execute locally -- no server round-trips for orchestration. Step handlers are plain JavaScript async functions that the WASM runtime calls back into.

## Creating a workflow

```typescript
import init, { Workflow, ChatMessage, CompletionModel } from '@blazen/sdk';

await init();

const wf = new Workflow('summarizer');

wf.addStep('fetch_text', ['blazen::StartEvent'], async (event, ctx) => {
  ctx.set('source', event.url);
  const text = await fetch(event.url).then(r => r.text());
  return { type: 'SummarizeEvent', text };
});

wf.addStep('summarize', ['SummarizeEvent'], async (event, ctx) => {
  // The WASM SDK reads OPENROUTER_API_KEY from the runtime environment.
  // Factory methods do not accept runtime keys -- configure the env var
  // at deploy time (see the Edge Deployment guide for strategies).
  const model = CompletionModel.openrouter();
  const response = await model.complete([
    ChatMessage.system('Summarize the following text in 2-3 sentences.'),
    ChatMessage.user(event.text),
  ]);
  return {
    type: 'blazen::StopEvent',
    result: { summary: response.content },
  };
});

const result = await wf.run({ url: 'https://example.com/article' });
console.log(result.data.summary);
```

## Event-driven architecture

Events are plain objects with a `type` field. The WASM event router dispatches each event to the step whose `eventTypes` list includes that type -- identical to the Rust and Node.js SDKs.

Built-in event types:

- `"blazen::StartEvent"` -- emitted when the workflow begins.
The object passed to `wf.run()` is merged onto it. - `"blazen::StopEvent"` -- returning this from a step ends the workflow. Attach your output to the `result` property. ## Context The `ctx` parameter is a `WasmContext` instance -- a real object with methods for sharing state, emitting events, and inspecting the current run. All methods are **synchronous** (unlike the Node.js SDK, which is async). ### Storing and retrieving values ```typescript wf.addStep('store', ['blazen::StartEvent'], async (event, ctx) => { ctx.set('user', event.name); ctx.set('count', 42); ctx.set('tags', ['intro', 'demo']); return { type: 'NextEvent' }; }); wf.addStep('read', ['NextEvent'], async (event, ctx) => { const user = ctx.get('user'); // 'Alice' const count = ctx.get('count'); // 42 const missing = ctx.get('nope'); // null return { type: 'blazen::StopEvent', result: { user, count } }; }); ``` `ctx.set(key, value)` auto-detects `Uint8Array` values and stores them as binary. Everything else is stored as-is. `ctx.get(key)` returns the original value, or `null` if the key is missing. Values can be any `StateValue`: ```typescript type StateValue = string | number | boolean | null | Uint8Array | StateValue[] | { [key: string]: StateValue }; ``` ### State vs Session namespaces `Context` exposes two explicit namespaces alongside the legacy smart-routing `ctx.set` / `ctx.get`: - **`ctx.state`** -- persistable values. Routes through the same dispatch as `ctx.set`, so anything that survives the `StateValue` round-trip belongs here. - **`ctx.session`** -- live in-process JS references. **Identity IS preserved** within a run. ```typescript wf.addStep('share_live', ['blazen::StartEvent'], (event, ctx) => { ctx.state.set('counter', 5); const liveObj = { tag: 'live', count: 0 }; ctx.session.set('shared', liveObj); console.log(ctx.session.get('shared') === liveObj); // true -- identity preserved liveObj.count += 1; console.log(ctx.session.get('shared').count); // 1 -- same object reference return { type: 'blazen::StopEvent', result: {} }; }); ``` Because the WASM runtime is single-threaded, session values are stored as raw `JsValue` in a dedicated map. This is a key differentiator from the Node bindings, where session values are routed through `serde_json::Value` and identity is NOT preserved (due to napi-rs threading constraints). Session values are deliberately excluded from any snapshot -- WASM does not currently support cross-process snapshot/resume of session entries. Use `ctx.state` for anything that needs to survive a snapshot, and `ctx.session` for live handles (open sockets, in-flight caches, framework instances) that only make sense within the current run. All namespace methods are **synchronous**, just like the rest of the WASM `Context` API. 
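As a rough illustration of that split (the step names, keys, and `jobId` input are invented for this sketch; writes go through `ctx.state.set` and reads through the smart-routing `ctx.get` documented above):

```typescript
wf.addStep('start_job', ['blazen::StartEvent'], (event, ctx) => {
  // Plain StateValue data -- survives a snapshot.
  ctx.state.set('jobId', event.jobId);
  ctx.state.set('pagesDone', 0);

  // Live handle for this run only -- never serialized.
  ctx.session.set('cache', new Map());

  return { type: 'CrawlEvent' };
});

wf.addStep('crawl', ['CrawlEvent'], (event, ctx) => {
  const cache = ctx.session.get('cache'); // same Map instance within this run
  cache.set('page-1', 'fetched');

  const done = (ctx.get('pagesDone') as number) + 1;
  ctx.state.set('pagesDone', done);
  return { type: 'blazen::StopEvent', result: { pagesDone: done } };
});
```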
### Binary data For explicit binary storage, use `ctx.setBytes()` and `ctx.getBytes()`: ```typescript wf.addStep('binary', ['blazen::StartEvent'], async (event, ctx) => { const data = new Uint8Array([0x48, 0x65, 0x6c, 0x6c, 0x6f]); // Either approach works for storing binary: ctx.set('via_set', data); // auto-detected as binary ctx.setBytes('via_explicit', data); // explicit binary storage // Both return Uint8Array on read: const a = ctx.get('via_set'); // Uint8Array const b = ctx.getBytes('via_explicit'); // Uint8Array | null return { type: 'blazen::StopEvent', result: { ok: true } }; }); ``` ### Emitting events `ctx.sendEvent(event)` queues an event into the workflow's event loop, allowing a step to trigger other steps mid-execution: ```typescript wf.addStep('kickoff', ['blazen::StartEvent'], async (event, ctx) => { ctx.sendEvent({ type: 'SideTask', payload: 'extra work' }); return { type: 'MainTask' }; }); ``` ### Run ID Each workflow run is assigned a unique UUID v4. Access it with `ctx.runId()`: ```typescript wf.addStep('log', ['blazen::StartEvent'], async (event, ctx) => { console.log('Run:', ctx.runId()); return { type: 'blazen::StopEvent', result: {} }; }); ``` ### Workflow name The workflow name is available as a getter property: ```typescript console.log(ctx.workflowName); // 'summarizer' ``` ### Method summary | Method | Return type | Description | |---|---|---| | `ctx.set(key, value)` | `void` | Store a value; auto-detects `Uint8Array` for binary | | `ctx.get(key)` | `StateValue \| null` | Retrieve a value, or `null` if missing | | `ctx.setBytes(key, data)` | `void` | Explicitly store binary data (`Uint8Array`) | | `ctx.getBytes(key)` | `Uint8Array \| null` | Retrieve binary data | | `ctx.sendEvent(event)` | `void` | Queue an event into the workflow event loop | | `ctx.writeEventToStream(event)` | `void` | No-op in WASM (present for API compatibility) | | `ctx.runId()` | `string` | Unique UUID v4 for the current run | | `ctx.workflowName` | `string` | Getter property for the workflow name | ### BlazenState For structured state that mixes serializable fields with non-serializable ones (database connections, caches, handles), the WASM SDK supports a **BlazenState protocol**. Any plain object with the `__blazen_state__: true` marker is automatically decomposed by `ctx.set()` and reconstructed by `ctx.get()` -- no explicit `saveTo()`/`loadFrom()` calls needed. 
The protocol reads metadata from the object's **constructor** via a static `meta` property: ```typescript class SessionState { url = ''; token = ''; conn = null; // non-serializable -- will be recreated requestCount = 0; static meta = { transient: ['conn'], // excluded from storage storeBy: {}, // custom FieldStore per field (optional) restore: 'reconnect', // method name called after reconstruction }; reconnect() { if (this.url && this.token) { this.conn = createConnection(this.url, this.token); } } } ``` Mark instances with `__blazen_state__: true` before storing: ```typescript wf.addStep('init', ['blazen::StartEvent'], async (event, ctx) => { const state = new SessionState(); state.url = event.url; state.token = event.token; state.__blazen_state__ = true; ctx.set('session', state); // auto-decomposes per field return { type: 'ProcessEvent' }; }); wf.addStep('process', ['ProcessEvent'], async (event, ctx) => { const state = ctx.get('session'); // auto-reconstructs, calls reconnect() console.log(state.url); // preserved console.log(state.conn); // recreated by reconnect() return { type: 'blazen::StopEvent', result: { requests: state.requestCount } }; }); ``` How it works: - **`ctx.set(key, state)`** -- iterates `Object.keys(state)`, skips `__blazen_state__` and any field listed in `meta.transient`, then stores each remaining field at `{key}.{fieldName}`. Fields with a custom `FieldStore` in `meta.storeBy` are persisted via `store.save(fieldKey, value, ctx)` instead of the default path. A metadata entry at `{key}.__blazen_meta__` records the field list, class name, and configuration. - **`ctx.get(key)`** -- checks for a `{key}.__blazen_meta__` entry. If found, it loads each field individually (or via the custom `FieldStore.load()`), reassembles the object, sets the `__blazen_state__` marker, and calls the `restore` method (if specified in the metadata). All operations are **synchronous** -- the `FieldStore.save()` and `FieldStore.load()` callbacks in WASM must return values directly (no `Promise`s). The `storeBy` option lets you route specific fields through a custom persistence strategy: ```typescript const externalCache = { save(key, value, ctx) { localStorage.setItem(key, JSON.stringify(value)); }, load(key, ctx) { return JSON.parse(localStorage.getItem(key) ?? 'null'); }, }; class AppState { preferences = {}; activeTab = 'home'; static meta = { storeBy: { preferences: externalCache }, }; } ``` ## Streaming events In the WASM SDK, `ctx.writeEventToStream()` is a **no-op** -- it exists for API compatibility with the Node.js and Rust SDKs but does not emit events to an external stream. You can still use `Workflow.runStreaming()` to receive events routed via `ctx.sendEvent()`: ```typescript wf.addStep('process', ['blazen::StartEvent'], async (event, ctx) => { for (let i = 0; i < 5; i++) { ctx.sendEvent({ type: 'Progress', step: i }); } return { type: 'blazen::StopEvent', result: { done: true } }; }); // Callback-style streaming: invoke the callback for every event the // workflow emits. The promise resolves with the final StopEvent payload. const result = await wf.runStreaming({}, (event) => { console.log(event.eventType, event); }); ``` `Workflow.runStreaming(input, callback)` is the callback-driven counterpart to the handler-based API below. The callback fires synchronously from the WASM event loop for each emitted event (start, custom, stop, input requests, errors). Use it when you do not need fine-grained control -- only observation. 
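For example, an observation-only consumer might filter on the event types it cares about and collect the rest (the `Progress` event name comes from the snippet above; the resolved value is the final StopEvent payload, as described):

```typescript
const seen: string[] = [];

const result = await wf.runStreaming({}, (event) => {
  if (event.eventType === 'Progress') {
    console.log(`progress: step ${event.step}`);
  }
  seen.push(event.eventType);
});

console.log('observed event types:', seen);
console.log('final payload:', result);
```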
## Handler-based execution For workflows that need pause/resume, snapshotting, human-in-the-loop input, or external cancellation, use `Workflow.runWithHandler(input)`. It returns a `WorkflowHandler` immediately while the workflow runs in the background: ```typescript const handler = await wf.runWithHandler({ url: 'https://example.com' }); // Stream events through the handler instead of waiting for a final result. await handler.streamEvents((ev) => { console.log(ev.eventType); }); ``` ### Human-in-the-loop input Steps can request input from the host by emitting an `InputRequested` event. The handler's `respondToInput(requestId, response)` method delivers the answer back to the paused step: ```typescript const handler = await wf.runWithHandler(input); await handler.streamEvents((ev) => { if (ev.eventType === 'InputRequested') { // Reply to the specific request by id. The step's awaiter resumes // with the supplied response value. handler.respondToInput(ev.requestId, 'answer'); } }); ``` ### Snapshot, pause, and resume `handler.snapshot()` captures the current workflow state (queued events, context state, step progress) without halting execution. To pause and later resume in the same process, call `handler.resumeInPlace()` after the workflow has paused: ```typescript const handler = await wf.runWithHandler(input); // Capture state mid-run without affecting execution. const snap = await handler.snapshot(); // ...later, after the workflow has paused itself... await handler.resumeInPlace(); ``` To cancel a run outright, call `handler.abort()`. Any in-flight step handlers will see the cancellation propagate as the next event-loop tick. ### WorkflowHandler method summary | Method | Description | |---|---| | `handler.streamEvents(callback)` | Callback-style event stream covering all emitted events | | `handler.respondToInput(requestId, response)` | Answer a HITL `InputRequested` event | | `handler.snapshot()` | Capture serializable workflow state without pausing | | `handler.resumeInPlace()` | Resume a paused workflow in the current process | | `handler.abort()` | Cancel the running workflow | ## Session pause policy By default, a workflow's `ctx.session` map is cleared whenever the workflow pauses, since live JS references generally cannot survive a snapshot/resume boundary. If your steps store handles that remain valid across pauses (e.g. a long-lived database client owned by the host), opt into preservation with the session pause policy: ```typescript // On the builder before constructing the workflow: const workflow = builder.setSessionPausePolicy('pause').build(); // Or directly on an already-built workflow: wf.setSessionPausePolicy('continue'); ``` Accepted values: - `"continue"` (default) -- session-ref state is dropped on pause. Safe; matches snapshot semantics. - `"pause"` -- session-ref state is preserved across pauses, surviving until the workflow resumes in the same process. Both `Workflow.setSessionPausePolicy(policy)` and `WorkflowBuilder.setSessionPausePolicy(policy)` accept the same policy strings, so you can configure the behavior at whichever construction stage is convenient. ## Resuming from a serialized snapshot `handler.snapshot()` returns a structure that can be persisted (e.g. to KV, IndexedDB, or a Durable Object). To bring it back to life in a fresh process, rebuild the workflow with `Workflow.fromBuilder(...)` and call `resumeWithSerializableRefs(snapshot, deserializers)`: ```typescript // Persist the snapshot somewhere durable. 
const snap = await handler.snapshot(); await kv.put('wf-snap', JSON.stringify(snap)); // Later, in a new process or worker invocation: const restored = await Workflow.fromBuilder(builder).resumeWithSerializableRefs( snap, { // Map session-ref type names to their deserializers. Each callback // receives the bytes that were captured at snapshot time and must // return the live JS value to reinstall into ctx.session. MyType: (bytes) => myDeserialize(bytes), }, ); ``` The `deserializers` map is keyed by the type tag recorded for each session entry at snapshot time. Any session-ref types you have registered as serializable must have a matching entry in the map -- unknown types are skipped and logged. ## Branching Return an array of events to fan out to multiple steps: ```typescript wf.addStep('classify', ['blazen::StartEvent'], async (event, ctx) => { return [ { type: 'PositiveEvent', text: event.text }, { type: 'NegativeEvent', text: event.text }, ]; }); ``` ## Timeouts Set a maximum execution time with `setTimeout()`: ```typescript wf.setTimeout(30); // 30 seconds ``` ## Next steps - Add tool-calling agents to your workflows with the [WASM Agent](/docs/guides/wasm/agent) guide. - Deploy workflows to the edge with the [Edge Deployment](/docs/guides/wasm/deployment) guide. --- # Streaming Source: https://blazen.dev/docs/guides/node/streaming Language: node Section: guides ## Stream vs Routing Events Routing events flow between steps — they drive the workflow forward. Stream events are published for external observation without affecting the workflow graph. ## Publishing Stream Events Use `ctx.writeEventToStream()` inside a step to emit events to external consumers: ```javascript wf.addStep("process", ["blazen::StartEvent"], async (event, ctx) => { for (let i = 0; i < 3; i++) { await ctx.writeEventToStream({ type: "Progress", step: i, message: `Processing step ${i}` }); } return { type: "blazen::StopEvent", result: { done: true } }; }); ``` `ctx.writeEventToStream()` is async — always `await` it. ## Consuming Stream Events Call `wf.runStreaming()` instead of `wf.run()` to receive streamed events via a callback: ```javascript const collected = []; const result = await wf.runStreaming({}, (event) => { console.log(`[${event.type}] step=${event.step}`); collected.push(event); }); console.log("Final result:", result.data); console.log("Streamed events:", collected.length); ``` `wf.runStreaming(input, callback)` calls the callback for each streamed event and resolves with the final workflow result. ## Handler-Based Event Streaming When you need control over a running workflow (pause, snapshot, abort) alongside event observation, use `runWithHandler()` and subscribe via `streamEvents()`: ```typescript const handler = await wf.runWithHandler({ message: "hello" }); await handler.streamEvents((event) => { console.log(`[${event.type}]`, event); }); const result = await handler.result(); ``` `streamEvents(onEvent)` resolves once the underlying broadcast subscription is wired up; the callback continues to fire for each event published by `ctx.writeEventToStream()` until the workflow finishes. Subscribe **before** calling `result()` — events emitted before the subscription is attached are not replayed. ## Typed Local Inference Streams Local-inference providers expose typed chunk streams in addition to the callback-style `stream()` method. 
The wrappers are `InferenceChunkStream` (mistral.rs) and `LlamaCppInferenceChunkStream` (llama.cpp), and each yields strongly-typed `InferenceChunk` / `LlamaCppInferenceChunk` values you pull with `await stream.next()`. ```typescript import type { InferenceChunkStream } from "blazen"; const stream: InferenceChunkStream = await provider.inferStream(messages); while (true) { const chunk = await stream.next(); if (chunk === null) break; if (chunk.delta) process.stdout.write(chunk.delta); if (chunk.reasoningDelta) process.stderr.write(chunk.reasoningDelta); if (chunk.finishReason) console.log(`\n[done: ${chunk.finishReason}]`); } ``` `LlamaCppInferenceChunkStream` follows the same shape with a narrower `LlamaCppInferenceChunk` payload (`delta`, `finishReason`): ```typescript for (let chunk = await stream.next(); chunk !== null; chunk = await stream.next()) { process.stdout.write(chunk.delta ?? ""); } ``` `stream.next()` resolves to `null` once generation is exhausted; engine-level errors are thrown as awaited rejections, so wrap the loop in `try`/`catch` if you need to recover mid-stream. --- # Streaming Source: https://blazen.dev/docs/guides/python/streaming Language: python Section: guides ## Stream vs Routing Events Routing events flow between steps inside a workflow. Stream events are published outward for external consumers -- they do not affect step routing. ## Publishing Stream Events Define a custom event by subclassing `Event`, then use `ctx.write_event_to_stream()` inside any step to push it to the external stream: ```python class ProgressEvent(Event): step: int message: str @step async def process(ctx: Context, ev: Event): for i in range(3): ctx.write_event_to_stream(ProgressEvent(step=i, message=f"Processing step {i}")) return StopEvent(result={"done": True}) ``` `ctx.write_event_to_stream()` is synchronous -- do not `await` it. ## Consuming Stream Events After starting a workflow, iterate over `handler.stream_events()` to receive stream events as they arrive: ```python handler = await wf.run() async for event in handler.stream_events(): print(f"[{event.event_type}] step={event.step}") result = await handler.result() ``` `handler.stream_events()` returns an async iterator. It completes once the workflow finishes and all buffered events have been yielded. ## Streaming Local Inference The local-inference providers (`MistralRsProvider`, `LlamaCppProvider`) stream tokens incrementally through a callback. Each call to `provider.stream(messages, on_chunk, options)` resolves once the model has finished generating; the callback is invoked once per chunk in between. ```python from blazen import ChatMessage, MistralRsOptions, MistralRsProvider provider = MistralRsProvider(options=MistralRsOptions("mistralai/Mistral-7B-Instruct-v0.3")) def on_chunk(chunk: dict) -> None: if delta := chunk.get("delta"): print(delta, end="", flush=True) if reason := chunk.get("finish_reason"): print(f"\n[finish: {reason}]") await provider.stream([ChatMessage.user("Tell me a joke")], on_chunk) ``` The chunk dict mirrors the typed `InferenceChunk` class registered in `blazen`: it carries `delta` (incremental text), `reasoning_delta` (incremental reasoning content for thinking models), `tool_calls` (any tool calls completed in this chunk), and `finish_reason` (set on the final chunk only). ### llama.cpp Backend `LlamaCppProvider.stream()` follows the same callback pattern. 
Chunks mirror the typed `LlamaCppInferenceChunk`, which exposes `delta` and `finish_reason` (no reasoning split, no native tool calls): ```python from blazen import ChatMessage, LlamaCppOptions, LlamaCppProvider provider = LlamaCppProvider( options=LlamaCppOptions(model_path="/models/llama-3.2-1b-q4_k_m.gguf") ) def on_chunk(chunk: dict) -> None: if delta := chunk.get("delta"): print(delta, end="", flush=True) await provider.stream([ChatMessage.user("Hello!")], on_chunk) ``` ### Typed Async Iterators Alongside the callback API, `blazen` exposes typed async iterators -- `InferenceChunkStream` (mistral.rs) and `LlamaCppInferenceChunkStream` (llama.cpp) -- that yield `InferenceChunk` / `LlamaCppInferenceChunk` instances directly. They implement `__aiter__` / `__anext__` so they consume cleanly in an `async for` loop: ```python async for chunk in stream: # InferenceChunkStream or LlamaCppInferenceChunkStream if chunk.delta: print(chunk.delta, end="", flush=True) if chunk.finish_reason: break ``` Iteration terminates with `StopAsyncIteration` once the underlying engine stream is exhausted. Use `InferenceChunkStream` for mistral.rs models and `LlamaCppInferenceChunkStream` for llama.cpp; the per-chunk fields differ (see `InferenceChunk` vs `LlamaCppInferenceChunk` above), but the iteration shape is identical. --- # Streaming Source: https://blazen.dev/docs/guides/rust/streaming Language: rust Section: guides ## Stream vs Routing Events Routing events flow between steps -- they are the internal data that drives the workflow forward. Stream events are different: they are published for external observation without affecting the workflow's execution path. Use `ctx.write_event_to_stream()` inside any step to publish a stream event. ## Publishing Stream Events Define a stream event the same way as any other event, then write it to the stream from within a step: ```rust use blazen::prelude::*; #[derive(Debug, Clone, Serialize, Deserialize, Event)] struct ProgressEvent { step: usize, message: String, } #[step] async fn process(event: StartEvent, ctx: Context) -> Result { for i in 0..3 { ctx.write_event_to_stream(ProgressEvent { step: i, message: format!("Processing step {}", i), }); } Ok(StopEvent { result: serde_json::json!({"done": true}) }) } ``` Stream events do not need to match any step's input type. They are silently forwarded to any active subscriber. ## Subscribing to Events After starting a workflow, call `stream_events()` on the handler to receive stream events as they are published: ```rust let handler = workflow.run(input).await?; let mut stream = handler.stream_events(); while let Some(event) = stream.next().await { println!("Stream event: {:?}", event); } let result = handler.result().await?; println!("{}", result.event.to_json()); ``` The stream completes automatically when the workflow finishes. You can consume stream events and await the final result independently. --- # WASM Agent Source: https://blazen.dev/docs/guides/wasm/agent Language: wasm Section: guides ## Overview The `runAgent` function executes an agentic tool-calling loop entirely within the WASM module. The model decides which tools to call, the WASM runtime invokes each tool's JavaScript `handler` function, feeds the result back, and repeats until the model finishes or `maxIterations` is reached. ## Basic agent Each tool object is `{ name, description, parameters, handler }`. The `handler` is called with the parsed argument object and may be sync or async. 
```typescript
import init, { CompletionModel, ChatMessage, runAgent } from '@blazen/sdk';

await init();

// The WASM SDK reads OPENAI_API_KEY from the runtime environment.
const model = CompletionModel.openai();

const tools = [
  {
    name: 'getWeather',
    description: 'Get the current weather for a city',
    parameters: {
      type: 'object',
      properties: { city: { type: 'string' } },
      required: ['city'],
    },
    handler: async (args) => {
      // Bare value: dispatcher wraps as { data: <value>, llm_override: null }.
      return { temp: 22, condition: 'cloudy', city: args.city };
    },
  },
];

const result = await runAgent(
  model,
  [ChatMessage.user('What is the weather in Tokyo?')],
  tools,
  { maxIterations: 5 },
);

console.log(result.content);
console.log(`Iterations: ${result.iterations}`);
```

## Tool handlers

A tool's `handler` is called with the parsed argument object that matches the tool's JSON Schema. It must return one of the following:

- **A bare value.** Anything JSON-serializable — string, number, object, array. The WASM dispatcher wraps it automatically as a `ToolOutput { data: <value>, llm_override: null }`.
- **A structured `ToolOutput`.** A literal `{ data, llmOverride? }` object (or snake-cased `{ data, llm_override? }` — both are accepted). `data` is what your code sees programmatically; `llmOverride` is what gets sent back to the model on the next turn.

```typescript
const tools = [
  {
    name: 'search',
    description: 'Search the documentation',
    parameters: {
      type: 'object',
      properties: { query: { type: 'string' } },
      required: ['query'],
    },
    handler: async (args) => {
      const hits = await fetch(`/api/search?q=${args.query}`).then((r) => r.json());
      // Bare-value return: hits flow through to the model verbatim.
      return hits;
    },
  },
  {
    name: 'calculate',
    description: 'Evaluate an arithmetic expression',
    parameters: {
      type: 'object',
      properties: { expression: { type: 'string' } },
      required: ['expression'],
    },
    handler: (args) => ({ result: eval(args.expression) }),
  },
];
```

If a tool throws (or rejects), the runtime surfaces the error as a tool-result message containing the error text, allowing the model to recover. If the handler returns nothing (`undefined`), the dispatcher records an empty result.

## Structured tool results

For larger payloads, return a `ToolOutput` literal so the caller-visible `data` and the model-visible override can diverge. Both `llmOverride` (camelCase) and `llm_override` (snake_case) are accepted by the dispatcher; the spelling is normalized before deserialization.

```typescript
const tools = [
  {
    name: 'fetchProfile',
    description: 'Fetch a full user profile',
    parameters: {
      type: 'object',
      properties: { userId: { type: 'string' } },
      required: ['userId'],
    },
    handler: async (args) => {
      const profile = await db.users.findById(args.userId);
      return {
        // Caller (your application code) gets the full record.
        data: profile,
        // Model sees a compact text summary on the next turn.
        llmOverride: {
          kind: 'text',
          text: `User ${profile.name} (id=${profile.id}, ${profile.role})`,
        },
      };
    },
  },
];
```

The `LlmPayload` shape used by `llmOverride` has four variants — see the [WASM API reference](/docs/api/wasm#llmpayload) for the full table:

```typescript
type LlmPayload =
  | { kind: 'text'; text: string }
  | { kind: 'json'; value: any }
  | { kind: 'parts'; parts: ContentPart[] }
  | { kind: 'provider_raw'; provider: ProviderId; value: any };
```

### Inspecting `tool_result` on returned messages

After `runAgent` resolves, every entry in `result.messages` matches the tsify-generated `ChatMessage` interface (snake_case fields).
Tool-result messages carry a `tool_result?: ToolOutput` whenever the handler returned a non-string `data` or supplied an `llm_override`: ```typescript const result = await runAgent(model, messages, tools, { maxIterations: 5 }); for (const msg of result.messages) { if (msg.role !== 'tool') continue; if (msg.tool_result) { // Structured payload: full data plus optional override. console.log('tool', msg.name, 'returned', msg.tool_result.data); if (msg.tool_result.llm_override) { console.log(' (model saw:', msg.tool_result.llm_override, ')'); } } else { // Plain string return: the result lives in content as text. console.log('tool', msg.name, 'returned', msg.content); } } ``` > **Field naming.** The tsify-generated interface preserves Rust snake_case, so the field is `tool_result` and the override is `llm_override`. The wasm-bindgen `ChatMessage` class additionally exposes `.toolCallId` and `.name` getters in camelCase. ## Agent options Pass an options object as the fourth argument: ```typescript const result = await runAgent(model, messages, tools, { toolConcurrency: 2, maxIterations: 5, systemPrompt: 'You are a helpful research assistant.', temperature: 0.3, maxTokens: 1000, addFinishTool: true, }); ``` | Option | Type | Default | Description | |---|---|---|---| | `toolConcurrency` | `number` | `0` | Max concurrent tool calls per round (`0` = unlimited) | | `maxIterations` | `number` | `10` | Maximum tool-calling rounds | | `systemPrompt` | `string` | -- | System prompt prepended to the conversation | | `temperature` | `number` | -- | Sampling temperature | | `maxTokens` | `number` | -- | Max tokens per completion call | | `addFinishTool` | `boolean` | `false` | Add a built-in "finish" tool the model can call to signal completion | ## Structured output Combine tool calling with a single-purpose tool to extract structured data: ```typescript const extractTools = [ { name: 'extractContact', description: 'Extract contact information from text', parameters: { type: 'object', properties: { name: { type: 'string' }, email: { type: 'string' }, phone: { type: 'string' }, }, required: ['name'], }, handler: async (args) => { // The model fills `args` with the extracted fields; we just echo // them back as the tool result so the loop can terminate. return args; }, }, ]; const result = await runAgent( model, [ChatMessage.user('Extract: John Doe, john@example.com, 555-1234')], extractTools, { maxIterations: 1 }, ); // The extracted fields live on the tool-result message. const toolMsg = result.messages.find((m) => m.role === 'tool'); const extracted = toolMsg?.tool_result?.data ?? JSON.parse(toolMsg?.content ?? '{}'); console.log(extracted); // { name: "John Doe", email: "john@example.com", phone: "555-1234" } ``` ## Agent result `runAgent` resolves with an `AgentResult`: ```typescript interface AgentResult { content?: string; // Final text response messages: ChatMessage[]; // Full message history (tsify shape) iterations: number; // Number of tool-calling iterations totalUsage?: TokenUsage; // Aggregated token usage totalCost?: number; // Aggregated cost in USD } ``` ## Next steps - Combine agents with workflows in the [WASM Workflows](/docs/guides/wasm/workflows) guide. - Deploy to edge platforms with the [Edge Deployment](/docs/guides/wasm/deployment) guide. --- # Context Source: https://blazen.dev/docs/guides/node/context Language: node Section: guides ## What is Context? Context is a key-value store shared across all steps in a workflow run. 
Every step handler receives a `ctx` object as its second argument, giving each step access to the same shared state. `Context` exposes two explicit namespaces alongside the legacy smart-routing shortcuts (`ctx.set` / `ctx.get` / `ctx.setBytes` / `ctx.getBytes`): - **`ctx.state`** -- persistable values. Survives `pause()` / `resume()` and checkpoint stores. - **`ctx.session`** -- in-process-only values. Excluded from snapshots. Use for request IDs, rate-limit counters, ephemeral caches, and anything that should not survive pause/resume. The legacy `ctx.set` / `ctx.get` shortcuts still work and route into the state namespace under the hood. ## StateValue Type All context values are represented by the `StateValue` type: ```typescript type StateValue = string | number | boolean | null | Buffer | StateValue[] | { [key: string]: StateValue }; ``` `ctx.set()` accepts any JSON-serializable subset of `StateValue` (everything except `Buffer`). For binary data, use the dedicated `setBytes`/`getBytes` methods described below. `ctx.get()` returns `Promise` and now returns data for **all** `StateValue` variants, including arrays and nested objects. It no longer silently drops bytes or native data -- if a key was stored via `setBytes`, `get()` will return the data as a `Buffer`. ## Setting and Getting Values ```javascript wf.addStep("store_data", ["blazen::StartEvent"], async (event, ctx) => { await ctx.set("user_id", "user_123"); await ctx.set("doc_count", 5); await ctx.set("tags", ["rust", "workflow"]); await ctx.set("config", { retries: 3, verbose: true }); return { type: "NextEvent" }; }); wf.addStep("use_data", ["NextEvent"], async (event, ctx) => { const userId = await ctx.get("user_id"); // "user_123" const docCount = await ctx.get("doc_count"); // 5 const tags = await ctx.get("tags"); // ["rust", "workflow"] const config = await ctx.get("config"); // { retries: 3, verbose: true } return { type: "blazen::StopEvent", result: { user: userId, docs: docCount } }; }); ``` **Important:** `ctx.set()` and `ctx.get()` are async — always use `await`. ## Run ID Each workflow execution is assigned a unique run ID. Access it from the context: ```javascript const runId = await ctx.runId(); // Returns a UUID string ``` ## Binary Storage While `ctx.get()` now returns data for all value types (including binary), you can still use `ctx.setBytes()` and `ctx.getBytes()` for explicit binary storage. These methods are useful when you want to make it clear that a value is raw binary data, or when you need to store data that should not be JSON-serialized. Binary data persists through pause/resume/checkpoint. 
```javascript // Store raw binary data const pixels = Buffer.from([0xff, 0x00, 0x00, 0xff]); await ctx.setBytes("image-pixels", pixels); // Retrieve it in another step const data = await ctx.getBytes("image-pixels"); // Buffer | null ``` ## Manual Event Routing Use `ctx.sendEvent()` to emit an event manually instead of returning one from the step handler: ```javascript await ctx.sendEvent({ type: "Continue" }); // Async return null; // Don't return an event when using sendEvent ``` ## State vs Session `Context` exposes two explicit namespaces that make your intent clear at the call site: | Namespace | Survives pause/resume | Use for | |---|---|---| | **`ctx.state`** | yes | persistable values (JSON, bytes) | | **`ctx.session`** | no (see pause policy) | in-process-only values, request-scoped state | ```typescript wf.addStep("setup", ["blazen::StartEvent"], async (event, ctx) => { // Persistable state -- survives pause/resume and checkpoints. await ctx.state.set("inputPath", "data.csv"); await ctx.state.set("rowCount", 0); await ctx.state.setBytes("thumbnail", Buffer.from([0x89, 0x50, 0x4e, 0x47])); // In-process-only state -- excluded from snapshots. await ctx.session.set("reqId", "abc123"); await ctx.session.set("rateLimitCount", 0); const hasReq = await ctx.session.has("reqId"); const reqId = await ctx.session.get("reqId"); await ctx.session.remove("rateLimitCount"); return { type: "blazen::StopEvent", result: { ok: true } }; }); ``` `ctx.state` routes through the same dispatch as `ctx.set` and exposes `set / get / setBytes / getBytes`. `ctx.session` exposes `set / get / has / remove`. The legacy `ctx.set` / `ctx.get` still work as smart-routing shortcuts and target the state namespace under the hood. :::caution[JS object identity is NOT preserved on Node] Unlike the Python and WASM bindings, Node stores session values as `serde_json::Value` rather than as a live JS reference. The reason: napi-rs's `Reference` is `!Send` (its `Drop` must run on the v8 main thread), and tokio worker threads cannot safely hold live JS references -- this is a documented architectural limitation. In practice, `await ctx.session.set("k", {name: "alice"})` followed by `await ctx.session.get("k")` returns a **plain object equal to** `{name: "alice"}`, not the same object. `ctx.session` is still functionally distinct from `ctx.state` -- session values are excluded from snapshots, state values are not -- but for true identity preservation of live JS objects across steps you must use the Python or WASM bindings. ::: ### Pause policy for `ctx.session` Session entries are deliberately excluded from snapshots. When you call `handler.pause()`, the workflow's `session_pause_policy` (default `pickle_or_error`; other policies: `warn_drop`, `hard_error`) governs what happens to them. The practical rule: put anything that **must** survive `pause()` / `resume()` in `ctx.state`, and everything else in `ctx.session`. ## BlazenState `BlazenState` is a base class for typed state objects that store each field individually in the workflow context. Instead of manually calling `ctx.set()` and `ctx.get()` for every piece of state, you define a class with typed fields and let Blazen handle serialization, storage tiers, and restoration automatically. 
### Defining a State Class

Extend `BlazenState` and declare a static `meta` property to control how fields are stored:

```typescript
import { BlazenState, BlazenStateMeta, CallbackFieldStore } from "blazen";

class AgentState extends BlazenState {
  static meta: BlazenStateMeta = {
    // Fields that should not survive pause/resume snapshots
    transient: ["scratchpad"],
    // Per-field storage overrides
    storeBy: {
      embeddings: new CallbackFieldStore({
        saveFn: async (key, value, ctx) => {
          await ctx.setBytes(key, Buffer.from(JSON.stringify(value)));
        },
        loadFn: async (key, ctx) => {
          const buf = await ctx.getBytes(key);
          return buf ? JSON.parse(buf.toString()) : null;
        },
      }),
    },
  };

  conversationHistory: string[] = [];
  embeddings: number[][] = [];
  scratchpad: Map<string, unknown> = new Map();
  retryCount: number = 0;

  // Called after loadFrom() to recreate transient fields
  restore(): void {
    this.scratchpad = new Map();
  }
}
```

### Saving and Loading State

Use `saveTo()` to persist the state into a workflow context under a given key, and the static `loadFrom()` to restore it:

```typescript
wf.addStep("init", ["blazen::StartEvent"], async (event, ctx) => {
  const state = new AgentState();
  state.conversationHistory.push("Hello");
  state.retryCount = 1;
  await state.saveTo(ctx, "state");
  return { type: "ProcessEvent" };
});

wf.addStep("process", ["ProcessEvent"], async (event, ctx) => {
  const state = await AgentState.loadFrom(ctx, "state");
  console.log(state.conversationHistory); // ["Hello"]
  console.log(state.retryCount);          // 1
  state.conversationHistory.push("World");
  await state.saveTo(ctx, "state");
  return { type: "blazen::StopEvent", result: { history: state.conversationHistory } };
});
```

### How Field Storage Works

Each field on the state object is stored individually in the context. This means you can assign different storage strategies to different fields using the `storeBy` record. Any field listed in `storeBy` uses its corresponding `FieldStore` implementation; all other (non-transient) fields use the default context `set`/`get` methods.

The `FieldStore` interface has two methods:

```typescript
interface FieldStore {
  save(key: string, value: any, ctx: Context): Promise<void>;
  load(key: string, ctx: Context): Promise<any>;
}
```

`CallbackFieldStore` is a convenience class that constructs a `FieldStore` from a pair of callbacks:

```typescript
new CallbackFieldStore({
  saveFn: async (key, value, ctx) => { /* custom save logic */ },
  loadFn: async (key, ctx) => { /* custom load logic */ },
});
```

### Transient Fields

Fields listed in the `transient` array are excluded from `saveTo()` and will not be persisted. After `loadFrom()` restores the saved fields, it calls `restore()` on the instance, giving you a place to recreate transient state such as caches, open connections, or in-memory indexes. Transient fields do not survive pause/resume snapshots, but `restore()` ensures they are always initialized when the state is loaded.

---

# Context

Source: https://blazen.dev/docs/guides/python/context
Language: python
Section: guides

## What is Context?

A key-value store shared across all steps in a workflow run.

Blazen exposes two explicit namespaces alongside the smart-routing `ctx.set` / `ctx.get` shortcuts: **`ctx.state`** for persistable values (survives pause/resume and checkpoints) and **`ctx.session`** for live in-process references (identity-preserving within a run, excluded from snapshots). See [State vs Session](#state-vs-session) below.

## Setting and Getting Values

`ctx.set(key, value)` stores any Python value using a 4-tier dispatch:

1.
**`bytes` / `bytearray`** → raw binary (survives snapshots) 2. **JSON-serializable** (`dict`, `list`, `str`, `int`, `float`, `bool`, `None`) → JSON (survives snapshots) 3. **Picklable objects** (Pydantic models, dataclasses, custom classes) → pickled automatically (survives snapshots) 4. **Unpicklable objects** (DB connections, file handles, sockets, lambdas) → live in-process reference (same-process only, excluded from snapshots) `ctx.get` returns the original Python type for all four tiers. ```python from pydantic import BaseModel class UserProfile(BaseModel): name: str score: float class NextEvent(Event): pass @step async def store_data(ctx: Context, ev: Event): # JSON-serializable values (stored as JSON) ctx.set("user_id", "user_123") ctx.set("doc_count", 5) ctx.set("tags", ["admin", "active"]) # Raw bytes (stored as binary) ctx.set("thumbnail", b"\x89PNG\r\n...") # Pydantic model (pickled automatically) ctx.set("profile", UserProfile(name="Alice", score=0.95)) return NextEvent() @step async def use_data(ctx: Context, ev: NextEvent): user_id = ctx.get("user_id") # str doc_count = ctx.get("doc_count") # int thumbnail = ctx.get("thumbnail") # bytes profile = ctx.get("profile") # UserProfile return StopEvent(result={"user": user_id, "name": profile.name}) ``` **Important:** `ctx.set()` and `ctx.get()` are synchronous — no `await`. ## Run ID ```python run_id = ctx.run_id() # Synchronous, returns a UUID string ``` ## Binary Storage Since `ctx.set()` now handles `bytes` and `bytearray` natively (stored as raw binary), you can pass binary data directly: ```python @step async def store(ctx: Context, ev: Event): ctx.set("model", b"\x00\x01\x02...") # stored as raw bytes return NextEvent() @step async def load(ctx: Context, ev: NextEvent): raw = ctx.get("model") # bytes return StopEvent(result=raw) ``` `ctx.set_bytes()` and `ctx.get_bytes()` remain available as explicit convenience aliases for binary data. They behave identically to calling `ctx.set()` / `ctx.get()` with `bytes` values. Binary data persists through pause/resume/checkpoint. ## Manual Event Routing ```python ctx.send_event(ContinueEvent()) # Synchronous, routes manually return None # Don't return an event when using send_event ``` ## Streaming Events Externally ```python ctx.write_event_to_stream(ProgressEvent(...)) # Synchronous, external broadcast ``` ## State vs Session `Context` exposes two explicit namespaces that make your intent clear at the call site: | Namespace | Survives pause/resume | Use for | |---|---|---| | **`ctx.state`** | yes | persistable values (JSON, bytes, picklable objects) | | **`ctx.session`** | no (see pause policy) | live in-process references — identity-preserving | ```python import sqlite3 from blazen import step, Context, StartEvent, StopEvent @step async def setup(ctx: Context, ev: StartEvent) -> StopEvent: # Persistable state — survives pause/resume and checkpoints. ctx.state["input_path"] = "data.csv" ctx.state["row_count"] = 0 # Live in-process references — identity is preserved. conn = sqlite3.connect(":memory:") ctx.session["db"] = conn assert ctx.session["db"] is conn # same object, always return StopEvent(result={"ok": True}) ``` Both namespaces support the Python dict protocol (`ctx.state["k"] = v`, `"k" in ctx.session`, etc.). `ctx.state` routes through the same 4-tier dispatch as `ctx.set` and exposes `set / get / set_bytes / get_bytes` plus the dict protocol. `ctx.session` exposes `set / get / has / remove` plus the dict protocol. 
The legacy `ctx.set` / `ctx.get` still work as smart-routing shortcuts. > `result.result` preserves `is`-identity for non-JSON values — you can pass class instances, Pydantic models, and live DB connections through `StopEvent.result` and get the *same* object back. ### Pause policy for `ctx.session` Because session entries are live references, they are deliberately excluded from snapshots. When you call `handler.pause()`, the workflow's `session_pause_policy` governs what happens to them. The default (`pickle_or_error`) attempts to pickle each entry into the snapshot and raises a clear error if any entry can't be serialised. Other policies (`warn_drop`, `hard_error`) let workflows opt into ephemeral or strict behaviour. The practical rule: put anything that *must* survive `pause()` / `resume()` in `ctx.state`, and everything else in `ctx.session`. ## BlazenState For most cases prefer `ctx.state` and `ctx.session` directly — they're simpler and cover the common patterns. `BlazenState` is for codebases that want typed, structured state with per-field custom storage (e.g. mapping one field to a file on disk, another to a database row). If your state is just a bag of values, use the namespaces. `BlazenState` is a base class for typed workflow state with per-field context storage. Instead of manually calling `ctx.set()` for each piece of data, you define a `@dataclass` subclass and let Blazen store each field individually using the optimal storage tier. ### Basic Example ```python import sqlite3 from dataclasses import dataclass, field from blazen import BlazenState, Context, Event, StartEvent, StopEvent, step, Workflow @dataclass class PipelineState(BlazenState): input_path: str = "" doc_count: int = 0 conn: sqlite3.Connection | None = None class Meta: transient = {"conn"} store_by = {} def restore(self): if self.input_path: self.conn = sqlite3.connect(self.input_path) class ProcessEvent(Event): pass @step async def setup(ctx: Context, ev: Event): state = PipelineState(input_path="/tmp/data.db", doc_count=0) state.restore() # Opens the sqlite3 connection ctx.set("state", state) return ProcessEvent() @step async def process(ctx: Context, ev: ProcessEvent): state = ctx.get("state") # PipelineState with all fields restored cursor = state.conn.cursor() # Transient field recreated by restore() cursor.execute("SELECT count(*) FROM docs") state.doc_count = cursor.fetchone()[0] ctx.set("state", state) # Persist updated state return StopEvent(result={"docs": state.doc_count}) ``` When you call `ctx.set("state", my_state)` with a `BlazenState` subclass, Blazen stores each field individually under the hood. When you call `ctx.get("state")`, it reconstructs the object field-by-field and then calls `restore()` to recreate any transient fields. ### Storage Tiers Each field is automatically stored using the best tier based on its type: | Tier | When Used | Survives Snapshots | |---|---|---| | **JSON** | `str`, `int`, `float`, `bool`, `None`, `dict`, `list` | Yes | | **Bytes** | `bytes`, `bytearray` | Yes | | **Pickle** | Pydantic models, dataclasses, other serializable objects | Yes | | **Live reference** | Objects listed in `Meta.transient` | No | Transient fields (like database connections, file handles, sockets) are excluded from serialization entirely. They are set to `None` in the snapshot and recreated by your `restore()` method when the state is loaded back. 
### Custom Persistence with FieldStore

For fields that need custom storage logic (e.g., writing large blobs to S3 instead of the context), implement the `FieldStore` protocol or use `CallbackFieldStore`:

```python
from blazen import BlazenState, CallbackFieldStore

def save_to_s3(key: str, value: bytes) -> None:
    s3.put_object(Bucket="my-bucket", Key=key, Body=value)

def load_from_s3(key: str) -> bytes:
    return s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()

@dataclass
class ModelState(BlazenState):
    name: str = ""
    weights: bytes = b""

    class Meta:
        transient = set()
        store_by = {
            "weights": CallbackFieldStore(
                save_fn=save_to_s3,
                load_fn=load_from_s3,
            ),
        }
```

When Blazen stores `ModelState`, the `weights` field is routed through your `CallbackFieldStore` instead of the default context storage. All other fields use their automatic tier.

### Key Points

- **Transient fields** are excluded from serialization. They are `None` after a snapshot restore until `restore()` recreates them.
- **`restore()`** is called automatically by `ctx.get()` after all serializable fields are populated. Override it to reconnect databases, reopen files, or rebuild caches.
- **Per-field storage** means each field is an independent context entry. Updating one field and calling `ctx.set()` again only overwrites the changed fields, not the entire object.

---

# Context

Source: https://blazen.dev/docs/guides/rust/context
Language: rust
Section: guides

## What is Context?

Context is a key-value store shared across all steps in a workflow run. Use it to pass data between steps that doesn't fit neatly into events -- configuration, intermediate results, or anything you want accessible from any step without threading it through event payloads.

## Setting and Getting Values

Store values with `ctx.set()` and retrieve them with `ctx.get()`. Values are stored as `serde_json::Value`, so any serializable type works:

```rust
use blazen::prelude::*;

#[step]
async fn store_data(event: StartEvent, ctx: Context) -> Result<NextEvent, WorkflowError> {
    ctx.set("user_id", serde_json::json!("user_123"));
    ctx.set("doc_count", serde_json::json!(5));
    Ok(NextEvent { /* ... */ })
}

#[step]
async fn use_data(event: NextEvent, ctx: Context) -> Result<StopEvent, WorkflowError> {
    let user_id: String = serde_json::from_value(ctx.get("user_id").unwrap()).unwrap();
    let doc_count: i64 = serde_json::from_value(ctx.get("doc_count").unwrap()).unwrap();
    Ok(StopEvent { result: serde_json::json!({"user": user_id, "docs": doc_count}) })
}
```

## Run ID

Each workflow run is assigned a unique identifier. Access it from any step via `ctx.run_id()`:

```rust
#[step]
async fn log_run(event: StartEvent, ctx: Context) -> Result<StopEvent, WorkflowError> {
    println!("Executing run: {}", ctx.run_id());
    Ok(StopEvent { result: serde_json::json!({"run": ctx.run_id()}) })
}
```

## Binary Storage

Use `ctx.set_bytes()` and `ctx.get_bytes()` to store raw binary data. No serialization requirement -- store any type by converting to bytes yourself (e.g., MessagePack, protobuf, bincode). Binary data persists through pause/resume/checkpoint via efficient `serde_bytes` serialization.

```rust
#[step]
async fn store_model(event: StartEvent, ctx: Context) -> Result<NextEvent, WorkflowError> {
    let weights: Vec<u8> = vec![0x01, 0x02, 0x03, 0x04];
    ctx.set_bytes("model-weights", weights);
    Ok(NextEvent { /* ... */ })
}

#[step]
async fn use_model(event: NextEvent, ctx: Context) -> Result<StopEvent, WorkflowError> {
    let weights = ctx.get_bytes("model-weights").expect("weights should exist");
    Ok(StopEvent { result: serde_json::json!({"weight_count": weights.len()}) })
}
```

## StateValue and Direct Access

Under the hood, every context value is stored as a `StateValue` enum:

```rust
pub enum StateValue {
    Json(serde_json::Value), // structured, serializable data
    Bytes(BytesWrapper),     // raw binary data
    Native(BytesWrapper),    // platform-serialized opaque objects
}
```

The `Json` and `Bytes` variants correspond to the typed and binary APIs shown above. The `Native` variant holds opaque bytes produced by a platform serializer (e.g., Python pickle). It exists so that a value set in a Python or Node.js binding step can round-trip through Rust steps without losing fidelity -- Rust code can inspect or forward the raw bytes, but shouldn't attempt to deserialize them as JSON.

Use `ctx.set_value()` and `ctx.get_value()` to work with `StateValue` directly:

```rust
use blazen::prelude::*;
use blazen::context::StateValue;

#[step]
async fn store_raw(event: StartEvent, ctx: Context) -> Result<NextEvent, WorkflowError> {
    // Store a JSON StateValue directly
    ctx.set_value("config", StateValue::Json(serde_json::json!({"retries": 3})));

    // Store raw bytes as a Native value (e.g., forwarded from a Python step)
    let opaque: Vec<u8> = vec![0x80, 0x04, 0x95]; // pickle header bytes
    ctx.set_value("py_obj", StateValue::Native(opaque.into()));

    Ok(NextEvent { /* ... */ })
}

#[step]
async fn read_raw(event: NextEvent, ctx: Context) -> Result<StopEvent, WorkflowError> {
    match ctx.get_value("config") {
        Some(StateValue::Json(v)) => println!("retries = {}", v["retries"]),
        Some(StateValue::Bytes(b)) => println!("got {} raw bytes", b.len()),
        Some(StateValue::Native(b)) => println!("got {} native bytes", b.len()),
        None => println!("key not found"),
    }
    Ok(StopEvent { result: serde_json::json!("done") })
}
```

**When to use each API:**

| Method | Stored as | Best for |
|--------|-----------|----------|
| `ctx.set(key, value)` / `ctx.get(key)` | `StateValue::Json` | Typed Rust data (`String`, structs, numbers) |
| `ctx.set_bytes(key, data)` / `ctx.get_bytes(key)` | `StateValue::Bytes` | Raw binary blobs (model weights, protobuf) |
| `ctx.set_value(key, sv)` / `ctx.get_value(key)` | Any `StateValue` | Direct control, cross-language interop, forwarding `Native` values |

## Manual Event Routing

Instead of returning an event from a step, you can use `ctx.send_event()` to route events programmatically. This is useful when a step needs to emit multiple events or decide at runtime which path the workflow takes:

```rust
#[step]
async fn branch(event: StartEvent, ctx: Context) -> Result<(), WorkflowError> {
    let score: f64 = serde_json::from_value(ctx.get("score").unwrap()).unwrap();
    if score > 0.8 {
        ctx.send_event(HighScoreEvent { score });
    } else {
        ctx.send_event(LowScoreEvent { score });
    }
    Ok(())
}
```

When using `send_event`, the step returns `()` since routing is handled explicitly rather than through the return type.

## BlazenState

`BlazenState` is primarily a Python, Node.js, and WASM concept. In those languages, you define a class that extends `BlazenState` to get automatic per-field persistence -- each field is individually stored and retrieved from the context without manual key management.

In Rust, there is no `BlazenState` base class.
Instead, you achieve the same per-field storage by calling `set_value()` and `get_value()` with explicit keys, using the `StateValue` variants described in the [StateValue and Direct Access](#statevalue-and-direct-access) section above. This approach is idiomatic Rust: explicit, zero-cost, and fully type-safe.

### Per-Field Storage in Rust

A common pattern is to store some fields as JSON (for structured, inspectable data) and others as raw bytes (for binary or pre-serialized data):

```rust
use blazen::prelude::*;
use blazen::context::StateValue;

/// A struct whose fields are stored individually in the context.
struct PipelineState {
    config: serde_json::Value,
    embeddings: Vec<u8>,
}

impl PipelineState {
    fn save(&self, ctx: &Context) {
        ctx.set_value("pipeline:config", StateValue::Json(self.config.clone()));
        ctx.set_value("pipeline:embeddings", StateValue::Bytes(self.embeddings.clone().into()));
    }

    fn load(ctx: &Context) -> Option<Self> {
        let config = match ctx.get_value("pipeline:config")? {
            StateValue::Json(v) => v,
            _ => return None,
        };
        let embeddings = match ctx.get_value("pipeline:embeddings")? {
            StateValue::Bytes(b) => b.to_vec(),
            _ => return None,
        };
        Some(Self { config, embeddings })
    }
}
```

### Binding Authors: Using `StateValue::Native`

If you are building a language binding (e.g., via PyO3 or napi-rs), the `Native(BytesWrapper)` variant lets you store platform-serialized objects that Rust steps can forward without deserializing:

```rust
use blazen::context::StateValue;

// In a binding layer: serialize a Python/Node object to bytes and store it
fn store_platform_object(ctx: &Context, key: &str, serialized: Vec<u8>) {
    ctx.set_value(key, StateValue::Native(serialized.into()));
}

// A Rust step can forward the value without interpreting its contents
fn forward_native(ctx: &Context, src_key: &str, dst_key: &str) {
    if let Some(val @ StateValue::Native(_)) = ctx.get_value(src_key) {
        ctx.set_value(dst_key, val);
    }
}
```

This is how the Python and Node.js `BlazenState` implementations persist fields that have no natural JSON representation -- they serialize to platform-native bytes and store them as `Native` values that survive round-trips through Rust workflow steps.

---

# Edge Deployment

Source: https://blazen.dev/docs/guides/wasm/deployment
Language: wasm
Section: guides

## Overview

`blazen-wasm` is the deployment-ready WASM component designed for ZLayer and other edge platforms. It packages your Blazen workflows and agents into a single WASM binary that runs at the edge with minimal cold-start overhead.

## ZImagefile

A `ZImagefile` defines the build and deployment spec for your WASM component, similar to a Dockerfile:

```
FROM blazen-wasm:latest
COPY ./src /app/src
COPY ./package.json /app/package.json
RUN npm install --production
RUN blazen-wasm build --entry /app/src/index.ts --output /app/dist/handler.wasm
EXPOSE 8080
ENTRYPOINT ["blazen-wasm", "serve", "--port", "8080"]
```

## Project structure

A typical edge deployment looks like this:

```
my-blazen-edge/
  src/
    index.ts      # Entry point -- exports the request handler
    tools.ts      # Tool handler implementations
  ZImagefile
  package.json
```

## Entry point

Your entry point exports a handler function that receives HTTP requests:

```typescript
import init, { CompletionModel, ChatMessage, runAgent } from '@blazen/sdk';
import { tools } from './tools'; // assumes tools.ts exports the tool definitions (with handlers) used below

await init();

export async function handler(request: Request): Promise<Response> {
  const { prompt } = await request.json();

  // The WASM SDK reads OPENROUTER_API_KEY from the runtime env.
  // Set it via your platform's secret store before invoking the handler.
  const model = CompletionModel.openrouter();
  const result = await runAgent(
    model,
    [ChatMessage.user(prompt)],
    tools,
    { maxIterations: 5 }
  );

  return new Response(JSON.stringify({ content: result.content }), {
    headers: { 'Content-Type': 'application/json' },
  });
}
```

## API key strategies

Edge functions need access to provider API keys without exposing them to clients.

### Secrets (recommended)

Store keys as platform secrets. ZLayer injects them as environment variables at runtime:

```bash
zlayer secret set OPENAI_API_KEY sk-...
```

```typescript
function getApiKey(): string {
  return process.env.OPENAI_API_KEY!;
}
```

### Encrypted config

Bundle an encrypted config file and decrypt at startup using a platform-provided key:

```typescript
import { decrypt } from './crypto';

const config = decrypt(await Deno.readFile('./config.enc'), process.env.DECRYPT_KEY!);
const apiKey = JSON.parse(config).openaiKey;
```

### Proxy

Route all LLM requests through a central proxy that injects keys server-side. The edge function never sees the raw upstream key. Configure your proxy to accept requests on an OpenAI-compatible `/v1` endpoint, then supply the short-lived proxy token via `OPENAI_API_KEY` in the runtime env:

```bash
# Set OPENAI_API_KEY to the per-deployment proxy token.
zlayer secret set OPENAI_API_KEY edge-token
```

```typescript
// The WASM SDK reads OPENAI_API_KEY from the runtime environment.
// The proxy at proxy.yourcompany.com intercepts requests,
// swaps 'edge-token' for the real API key, and forwards upstream.
const model = CompletionModel.openai();
```

### Runtime injection

The WASM SDK does **not** accept API keys as constructor arguments -- every `CompletionModel` factory reads from the runtime environment only. If you need to swap keys per-request, point the SDK at a proxy (see above) and have the proxy decide which upstream key to use based on per-request metadata such as a header:

```typescript
const callerToken = request.headers.get('X-Caller-Token');
// Forward callerToken to your proxy via the request URL or a custom header;
// the proxy uses it to look up the correct upstream API key.
```

## Deploying to ZLayer

```bash
# Build the WASM component
zlayer build

# Deploy
zlayer deploy --name my-blazen-agent --region us-east-1

# Check status
zlayer status my-blazen-agent
```

ZLayer handles TLS termination, auto-scaling, and geographic routing.

## Scaling

WASM components on ZLayer scale to zero by default and spin up in under 5ms. Key considerations:

- **Cold start** -- the WASM binary is pre-compiled to native code at deploy time. No JIT warmup.
- **Memory** -- each instance gets a dedicated linear memory. Default limit is 256 MB, configurable in the ZImagefile.
- **Concurrency** -- each instance handles one request at a time. ZLayer spawns additional instances automatically.
- **Regions** -- deploy to multiple regions with `--region us-east-1,eu-west-1,ap-northeast-1`.

## Other edge platforms

The `@blazen/sdk` WASM module works on any platform that supports WebAssembly:

- **Cloudflare Workers** -- import the SDK and call `init()` in your worker entry.
- **Vercel Edge Functions** -- use the SDK in an edge-runtime function.
- **Deno Deploy** -- import from npm and call `init()`.
- **Fastly Compute** -- compile the Rust crate directly to `wasm32-wasi`.

## Next steps

- See the full [WASM API Reference](/docs/api/wasm) for all available exports.
- Browse [WASM Examples](/docs/examples/wasm) for complete runnable projects.
--- # Branching Source: https://blazen.dev/docs/guides/node/branching Language: node Section: guides ## Conditional Branching Return different event types based on conditions: ```javascript wf.addStep("classify", ["blazen::StartEvent"], async (event, ctx) => { if (event.text.toLowerCase().includes("good")) { return { type: "PositiveEvent", text: event.text }; } else { return { type: "NegativeEvent", text: event.text }; } }); ``` ## Handling Branches ```javascript wf.addStep("handle_positive", ["PositiveEvent"], async (event, ctx) => { return { type: "blazen::StopEvent", result: { sentiment: "positive", text: event.text } }; }); wf.addStep("handle_negative", ["NegativeEvent"], async (event, ctx) => { return { type: "blazen::StopEvent", result: { sentiment: "negative", text: event.text } }; }); ``` ## Fan-out Return an array to dispatch to multiple branches simultaneously: ```javascript wf.addStep("fan", ["blazen::StartEvent"], async (event, ctx) => { return [ { type: "BranchA", value: "a" }, { type: "BranchB", value: "b" }, ]; }); ``` Both branches execute concurrently. First `StopEvent` wins. --- # Branching Source: https://blazen.dev/docs/guides/python/branching Language: python Section: guides ## Conditional Branching Define typed event subclasses and return them based on conditions: ```python class PositiveEvent(Event): text: str class NegativeEvent(Event): text: str @step async def classify(ctx: Context, ev: Event): if "good" in ev.text.lower(): return PositiveEvent(text=ev.text) else: return NegativeEvent(text=ev.text) ``` ## Handling Branches Each handler declares the specific event subclass it accepts via the type annotation: ```python @step async def handle_positive(ctx: Context, ev: PositiveEvent): return StopEvent(result={"sentiment": "positive", "text": ev.text}) @step async def handle_negative(ctx: Context, ev: NegativeEvent): return StopEvent(result={"sentiment": "negative", "text": ev.text}) ``` ## Fan-out Define branch event subclasses and return a list to dispatch to multiple branches simultaneously: ```python class BranchA(Event): value: str class BranchB(Event): value: str @step async def fan_out(ctx: Context, ev: Event): return [BranchA(value="a"), BranchB(value="b")] ``` Both branches execute concurrently. First `StopEvent` wins. 
---

# Branching

Source: https://blazen.dev/docs/guides/rust/branching
Language: rust
Section: guides

## Conditional Branching

Use `#[step(emits = [...])]` to declare multiple possible output types, then return a `StepOutput::Single`:

```rust
#[derive(Debug, Clone, Serialize, Deserialize, Event)]
struct PositiveEvent { text: String }

#[derive(Debug, Clone, Serialize, Deserialize, Event)]
struct NegativeEvent { text: String }

#[step(emits = [PositiveEvent, NegativeEvent])]
async fn classify(event: AnalyzeEvent, _ctx: Context) -> Result<StepOutput, WorkflowError> {
    if event.score > 0.5 {
        Ok(StepOutput::Single(Box::new(PositiveEvent { text: event.text })))
    } else {
        Ok(StepOutput::Single(Box::new(NegativeEvent { text: event.text })))
    }
}
```

## Handling Branches

Each branch has its own step:

```rust
#[step]
async fn handle_positive(event: PositiveEvent, _ctx: Context) -> Result<StopEvent, WorkflowError> {
    Ok(StopEvent { result: serde_json::json!({"sentiment": "positive", "text": event.text}) })
}

#[step]
async fn handle_negative(event: NegativeEvent, _ctx: Context) -> Result<StopEvent, WorkflowError> {
    Ok(StopEvent { result: serde_json::json!({"sentiment": "negative", "text": event.text}) })
}
```

## How It Works

- The router matches event types to steps
- Only one branch executes based on which event is emitted
- First `StopEvent` terminates the workflow

---

# Local Inference

Source: https://blazen.dev/docs/guides/wasm/local_inference
Language: wasm
Section: guides

The Blazen WASM SDK can run AI models entirely in the browser -- no API key required after the initial model download. This guide covers three options:

1. **Built-in `TractEmbedModel`** -- ONNX embeddings shipped directly inside the SDK, no extra JS dependency.
2. **transformers.js** -- bring-your-own embedding pipeline via `EmbeddingModel.fromJsHandler()`.
3. **WebLLM** -- WebGPU-accelerated chat models via `CompletionModel.fromJsHandler()`.

## What is possible

`TractEmbedModel.create()` runs ONNX models inside the WASM module via the [tract](https://github.com/sonos/tract) inference engine. The `EmbeddingModel.fromJsHandler()` and `CompletionModel.fromJsHandler()` factories let you plug any JavaScript inference library into the Blazen pipeline. The model runs on the user's device while Blazen's `Memory`, `CompletionModel.withFallback()`, `withRetry()`, and `withCache()` work exactly the same as they do with hosted APIs.

Use cases:

- **Offline-first apps** -- search and chat without a network connection
- **Privacy-sensitive data** -- embeddings and queries never leave the device
- **Zero marginal cost** -- no per-token charges after the model is cached
- **Hybrid patterns** -- fast local embeddings for search, hosted API for generation

## Browser compatibility (April 2026)

Local inference relies on WebGPU (for LLMs) and WebAssembly (for embeddings). Current support:

| Browser | WebGPU since | Notes |
|---|---|---|
| Chrome / Edge | 113 (May 2023) | Full support |
| Safari | 26 (Sep 2025) | Including iOS and iPadOS |
| Firefox | 141 Windows / 145 macOS ARM | Linux support in progress |

Approximately 65% of global users have WebGPU support. Embeddings via WASM (CPU) work in all modern browsers regardless of WebGPU.

## Built-in embeddings with `TractEmbedModel`

`TractEmbedModel` runs ONNX embedding models directly inside the WASM module -- no extra npm dependency, no `pipeline()` call. The model and tokenizer are fetched from any URL (typically Hugging Face) using `web_sys::fetch`, so the URLs must be reachable with browser-compatible CORS.
### Installation

```bash
npm install @blazen/sdk
```

### Usage

```typescript
import { TractEmbedModel, init } from "@blazen/sdk";

await init();

const model = await TractEmbedModel.create(
  "https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx",
  "https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/tokenizer.json",
);

const result = await model.embed(["hello", "world"]);
console.log(result.embeddings); // number[][]
console.log(model.dimensions);  // e.g. 384
```

The returned `embeddings` is a `number[][]` (one vector per input). `model.dimensions` reports the model's output size after the first inference.

### When to pick `TractEmbedModel` vs transformers.js

| Concern | `TractEmbedModel` | transformers.js |
|---|---|---|
| Dependencies | Just `@blazen/sdk` | `+ @huggingface/transformers` |
| Model fetch | Direct `web_sys::fetch` of ONNX + tokenizer URLs | Hugging Face Hub via JS loader |
| GPU acceleration | CPU only (tract WASM) | WebGPU when available, WASM fallback |
| Bundle impact | None beyond the SDK | ~1.2 MB JS + ~3.5 MB ORT WASM lazy-loaded |

Use `TractEmbedModel` when you want the smallest bundle and full control of the model URL. Use transformers.js when you want WebGPU acceleration or richer pre/post-processing.

### CORS requirement

Because the SDK fetches the ONNX file from the browser, the host must serve `Access-Control-Allow-Origin` headers that permit your origin. Hugging Face's `resolve/main` URLs already do this. If you self-host weights, configure CORS on your CDN.

## Local embeddings with transformers.js

[transformers.js v4](https://huggingface.co/docs/transformers.js) runs Hugging Face models in the browser via ONNX Runtime. For embeddings, it uses WebAssembly SIMD on the CPU -- no GPU required.

### Installation

```bash
npm install @blazen/sdk @huggingface/transformers
```

### Usage

```typescript
import init, { EmbeddingModel, Memory } from '@blazen/sdk';
import { pipeline } from '@huggingface/transformers';

await init();

// Load the transformers.js feature-extraction pipeline
const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Wrap it as a Blazen EmbeddingModel
const embedder = EmbeddingModel.fromJsHandler(
  'all-MiniLM-L6-v2',
  384, // dimensions for this model
  async (texts) => {
    const output = await pipe(texts, { pooling: 'mean', normalize: true });
    return Array.from({ length: texts.length }, (_, i) => {
      const row = output[i];
      return row.data instanceof Float32Array ? row.data : new Float32Array(row.data);
    });
  },
);

// Use it with Memory for semantic search
const memory = new Memory(embedder);
await memory.add('doc1', 'Paris is the capital of France');
await memory.add('doc2', 'Rust is a systems programming language');

const results = await memory.search("What is France's capital?", 5);
console.log(results[0].text); // "Paris is the capital of France"
```

### The `fromJsHandler` API

```typescript
EmbeddingModel.fromJsHandler(
  modelId: string,     // identifier for logging / display
  dimensions: number,  // vector dimensionality (e.g. 384 for MiniLM)
  handler: (texts: string[]) => Promise<Float32Array[]> | Float32Array[]
): EmbeddingModel
```

The handler receives an array of strings and must return one `Float32Array` per input text, each with exactly `dimensions` elements.
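The handler contract is easy to satisfy with something far simpler than a real model, which is handy for unit tests or offline development before you wire up transformers.js. A minimal sketch, assuming only the `fromJsHandler` factory shown above -- the hashing scheme and the `test-hash-embedder` id are purely illustrative, not part of the SDK:

```typescript
import init, { EmbeddingModel } from '@blazen/sdk';

await init();

const DIMS = 8;

// Deterministic stand-in "embeddings": bucket character codes into a fixed-size
// vector and L2-normalize it. Enough to exercise Memory and search in tests.
const testEmbedder = EmbeddingModel.fromJsHandler('test-hash-embedder', DIMS, async (texts) =>
  texts.map((text) => {
    const vec = new Float32Array(DIMS);
    for (let i = 0; i < text.length; i++) {
      vec[text.charCodeAt(i) % DIMS] += 1;
    }
    const norm = Math.hypot(...vec) || 1;
    return vec.map((v) => v / norm);
  }),
);
```

Anything that honors the `dimensions`-sized `Float32Array[]` contract slots into `new Memory(...)` exactly like the transformers.js pipeline above.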
### Performance

| Backend | Latency (single query) | Notes |
|---|---|---|
| WASM (CPU) | ~170 ms | Works everywhere, no WebGPU needed |
| WebGPU | ~35 ms | Chrome 113+, Safari 26+, Firefox 141+ |

transformers.js automatically uses WebGPU when available and falls back to WASM.

### Bundle size

| Component | Size | When loaded |
|---|---|---|
| `@huggingface/transformers` | ~1.2 MB | Import time |
| ONNX Runtime WASM | ~3.5 MB | Lazy, on first inference |
| Model weights | ~23 MB | Cached in browser after first download |

The model weights are cached in the browser's Cache API. Subsequent page loads skip the download.

## Local LLM with WebLLM

[WebLLM](https://webllm.mlc.ai/) runs large language models on WebGPU. It compiles models to the user's specific GPU at first load, then caches the compiled shaders for near-instant subsequent starts.

### Installation

```bash
npm install @blazen/sdk @mlc-ai/web-llm
```

### Usage

```typescript
import init, { CompletionModel, ChatMessage } from '@blazen/sdk';
import * as webllm from '@mlc-ai/web-llm';

await init();

// Create the WebLLM engine (downloads + compiles on first visit)
const engine = await webllm.CreateMLCEngine(
  'Llama-3.2-1B-Instruct-q4f16_1-MLC',
  { initProgressCallback: (p) => console.log(p.text) },
);

// Wrap it as a Blazen CompletionModel
const model = CompletionModel.fromJsHandler(
  'Llama-3.2-1B-Instruct',
  async (request) => {
    const messages = (request.messages || []).map((m) => ({
      role: m.role,
      content: typeof m.content === 'string' ? m.content : m.content?.text || '',
    }));

    const reply = await engine.chat.completions.create({
      messages,
      temperature: request.temperature ?? 0.7,
      max_tokens: request.max_tokens ?? 512,
    });

    return {
      content: reply.choices?.[0]?.message?.content || '',
      toolCalls: [],
      citations: [],
      artifacts: [],
      images: [],
      audio: [],
      videos: [],
      model: 'Llama-3.2-1B-Instruct',
      metadata: {},
    };
  },
);

// Use it like any other Blazen model
const response = await model.complete([
  ChatMessage.user('Explain WebAssembly in one sentence.'),
]);
console.log(response.content);
```

### The `fromJsHandler` API

```typescript
CompletionModel.fromJsHandler(
  modelId: string,
  completeHandler: (request: CompletionRequest) => Promise<CompletionResponse>,
  streamHandler?: (request: CompletionRequest, onChunk: (chunk: StreamChunk) => void) => Promise<void>
): CompletionModel
```

The `completeHandler` receives a `CompletionRequest`-shaped object and must return a `CompletionResponse`-shaped object. The optional `streamHandler` enables token streaming; if omitted, `model.stream()` falls back to calling `completeHandler` and yielding the result as a single chunk.

### Practical model sizes

Stick to 1B-3B parameter models for a usable browser experience:

| Model | Download | Cold start | Tokens/sec |
|---|---|---|---|
| Llama-3.2-1B-Instruct (q4f16) | ~600 MB | ~30 s | 40-60 |
| Llama-3.2-3B-Instruct (q4f16) | ~1.8 GB | ~60 s | 15-30 |
| Llama-3.1-8B-Instruct (q4f16) | ~4.5 GB | 2-3 min | 5-10 |

Models at 7B+ parameters require 4+ GB of GPU memory and have cold starts measured in minutes. They are not recommended unless you know your users have high-end hardware.

## Always ship a hosted API fallback

Not every user has WebGPU. Even those who do may be on low-end hardware or a phone with insufficient memory. The recommended production pattern is to try local inference first and fall back to a cloud API:

```typescript
import init, { CompletionModel, ChatMessage } from '@blazen/sdk';

await init();

let localModel;
try {
  // Try to create local model (WebLLM, etc.)
localModel = CompletionModel.fromJsHandler('local', localHandler); } catch { // WebGPU not available or model too large } const apiModel = CompletionModel.openrouter(); // reads OPENROUTER_API_KEY // If local model loaded, try it first; if it fails, use the API. // If local model did not load, use the API directly. const model = localModel ? CompletionModel.withFallback([localModel, apiModel]) : apiModel; const response = await model.complete([ChatMessage.user('Hello!')]); ``` `CompletionModel.withFallback()` is a static method that takes an array of models and tries them in order. If the first model throws a retryable error, the next is tried. Non-retryable errors (auth failures, invalid input) short-circuit. ## In-browser RAG Combine local embeddings with `Memory` for full retrieval-augmented generation that runs entirely on the device. The simplest in-browser store is `InMemoryBackend`, which keeps vectors in a `Map` for the lifetime of the page: ```typescript import { Memory, InMemoryBackend, TractEmbedModel } from "@blazen/sdk"; const embedder = await TractEmbedModel.create(modelUrl, tokenizerUrl); const memory = Memory.fromBackend(embedder, new InMemoryBackend()); await memory.upsert([{ id: "doc1", content: "..." }]); const results = await memory.query("question", 5); ``` `Memory.fromBackend(embedder, backend)` is the wasm equivalent of the native `MemoryBuilder` -- it pairs any embedding model with any backend that implements the memory backend trait. `InMemoryBackend` is ideal for ephemeral session state and demos; swap in a persistent backend (IndexedDB-backed, server-side, etc.) when the session needs to survive a reload. ### Full RAG pipeline with the built-in embedder ```typescript import init, { Memory, InMemoryBackend, TractEmbedModel, CompletionModel, ChatMessage } from '@blazen/sdk'; await init(); const embedder = await TractEmbedModel.create( 'https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx', 'https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/tokenizer.json', ); const memory = Memory.fromBackend(embedder, new InMemoryBackend()); await memory.upsert([ { id: 'faq-1', content: 'Refunds are processed within 5-7 business days.' }, { id: 'faq-2', content: 'Free shipping on orders over $50.' }, { id: 'faq-3', content: 'Contact support at help@example.com.' }, ]); async function answerQuestion(question: string) { const context = await memory.query(question, 3); const contextText = context.map((r) => r.content).join('\n'); const llm = CompletionModel.openrouter(); const response = await llm.complete([ ChatMessage.system(`Answer using only this context:\n${contextText}`), ChatMessage.user(question), ]); return response.content; } ``` The embedding + similarity search runs locally. Only the final generation call hits the API, and even that can be replaced with a local WebLLM model. ### Same pipeline with a transformers.js embedder ```typescript import init, { EmbeddingModel, Memory, InMemoryBackend, CompletionModel, ChatMessage } from '@blazen/sdk'; import { pipeline } from '@huggingface/transformers'; await init(); const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2'); const embedder = EmbeddingModel.fromJsHandler('MiniLM', 384, async (texts) => { const output = await pipe(texts, { pooling: 'mean', normalize: true }); return Array.from({ length: texts.length }, (_, i) => { const row = output[i]; return row.data instanceof Float32Array ? 
row.data : new Float32Array(row.data); }); }); const memory = Memory.fromBackend(embedder, new InMemoryBackend()); await memory.upsert([{ id: 'doc1', content: 'Paris is the capital of France' }]); const results = await memory.query("What is France's capital?", 5); ``` Either embedder slots into the same `Memory.fromBackend(...)` call -- pick whichever fits your bundle and acceleration constraints. ## Managing memory pressure with `ModelManager` `ModelManager` wraps the same `blazen_manager::ModelManager` used by the native runtime -- an LRU that evicts the least-recently-used model when the byte budget is exceeded. In a long-running browser session that loads several Tract models (e.g. one embedder, one reranker, one classifier), `ModelManager` prevents the page from accumulating unbounded GPU/heap memory. The JS-facing API matches the native one: | Method | Purpose | |---|---| | `register(id, model, vramEstimateBytes, lifecycle)` | Declare a model with `{ load, unload }` lifecycle hooks | | `unregister(id)` | Drop a registration (and any loaded weights) | | `load(id)` | Load (or reuse) and mark as MRU | | `unload(id)` | Free without unregistering | | `isLoaded(id)` | Check whether weights are currently resident | | `usedBytes` / `availableBytes` / `budgetBytes` | Budget telemetry (getters, no parens) | | `status()` | Snapshot of every registration | ```typescript import init, { ModelManager, TractEmbedModel } from '@blazen/sdk'; await init(); const manager = new ModelManager(0.5); // 0.5 GB (512 MB) budget -- constructor takes gigabytes manager.register('mini-lm', null, 90 * 1024 * 1024, { load: async () => TractEmbedModel.create(miniLmOnnx, miniLmTokenizer), unload: async () => {}, }); manager.register('bge-small', null, 130 * 1024 * 1024, { load: async () => TractEmbedModel.create(bgeOnnx, bgeTokenizer), unload: async () => {}, }); const embedder = await manager.load('mini-lm'); // loads + marks MRU console.log(manager.usedBytes, '/', manager.budgetBytes); // getters, no parens ``` When a `load()` call would push `usedBytes()` past `budgetBytes()`, the manager evicts the LRU entry first. Use it whenever your app might keep more than one Tract model alive at a time. ## COOP/COEP trap Some guides for WASM threading recommend adding these HTTP headers: ``` Cross-Origin-Opener-Policy: same-origin Cross-Origin-Embedder-Policy: require-corp ``` **Do not add these headers** unless you specifically need `SharedArrayBuffer` for WASM SIMD multi-threading. They break: - Stripe, PayPal, and other payment iframes - YouTube / Vimeo embeds - Google OAuth popups - Most third-party ad scripts transformers.js and WebLLM work without these headers. They use single-threaded WASM or WebGPU, neither of which requires `SharedArrayBuffer`. If you do need SIMD threads (rare), scope the headers to a dedicated `/inference` path with a service worker rather than applying them site-wide. ## iOS Safari warning WebGPU in Safari 26+ works on iOS and iPadOS, but iPhones have limited unified memory: | Device | RAM | Can run 1B LLM? | Can run 3B LLM? | |---|---|---|---| | iPhone 15 Pro / 16 Pro | 8 GB | Yes | Marginal | | iPhone 15 / 16 | 6 GB | Marginal | No | | iPhone 14 and earlier | 6 GB or less | No | No | Embedding models (~23 MB) work fine on all iPhones. For LLMs, detect available memory or WebGPU adapter limits before attempting to load a model, and fall back to a hosted API. ## Troubleshooting ### `@huggingface/transformers` not found The import must resolve at runtime. 
If you are not using a bundler, add an import map:

```html
<!-- Example CDN URL; pin it to the transformers.js version you actually ship. -->
<script type="importmap">
  {
    "imports": {
      "@huggingface/transformers": "https://cdn.jsdelivr.net/npm/@huggingface/transformers"
    }
  }
</script>
```

### WebGPU initialization failure

```
Error: WebGPU is not supported in this browser
```

The user's browser does not support WebGPU. This affects LLM inference via WebLLM but not embeddings via transformers.js (which uses WASM). Use the fallback pattern described above.

### Model download stalls

Large models (3B+) require a stable connection for the initial download. If the download stalls, the browser's Cache API may have a corrupt entry. Clear site data and retry, or switch to a smaller model.

### `SharedArrayBuffer is not defined`

This error usually means COOP/COEP headers are misconfigured. See the COOP/COEP section above. If you are using a library that requires `SharedArrayBuffer`, verify that both headers are set correctly:

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```

But prefer libraries that do not require them.

### Slow first inference

The first inference after page load is always slower because:

1. **transformers.js**: The ONNX Runtime WASM module (~3.5 MB) loads lazily on first call.
2. **WebLLM**: The model is compiled to GPU shaders on first use (~30-60 seconds).

Subsequent inferences are fast. For transformers.js, you can warm up the pipeline at load time:

```typescript
// Warm up: embed an empty string to trigger WASM loading
await embedder.embed(['']);
```

## Next steps

- See the [Local RAG Browser Example](https://github.com/ZachHandley/Blazen/tree/main/crates/blazen-wasm-sdk/examples/local_rag_browser) for a complete working demo.
- See the [Local Chat Browser Example](https://github.com/ZachHandley/Blazen/tree/main/crates/blazen-wasm-sdk/examples/local_chat_browser) for an on-device chat UI.
- Combine local inference with [WASM Workflows](/docs/guides/wasm/workflows) for multi-step pipelines.
- Add resilience with [Middleware & Composition](/docs/guides/wasm/middleware) (retry, cache, fallback).

---

# Human-in-the-Loop

Source: https://blazen.dev/docs/guides/node/human-in-the-loop
Language: node
Section: guides

## Side-Effect Steps

Return `null` from a step and use `ctx.sendEvent()` for manual routing. This lets you insert asynchronous human review, webhook callbacks, or queue polling before the workflow continues:

```javascript
wf.addStep("review", ["ReadyForReview"], async (event, ctx) => {
  await ctx.set("needs_approval", true);

  // In production: wait for webhook, poll queue, etc.
  const approved = simulateHumanReview(event);
  await ctx.set("approved", approved);

  await ctx.sendEvent({ type: "ReviewComplete" });
  return null;
});
```

**Important:** `ctx.sendEvent()`, `ctx.set()`, and `ctx.get()` are all async — always use `await`.

## Pause and Resume

For long-running human tasks, serialize the workflow state and resume it later:

```javascript
const handler = await wf.runWithHandler(input);
const snapshot = handler.pause(); // Serialize workflow state

// Later...
const handler2 = await wf.resume(snapshot);
const result = await handler2.result();
```

This lets you persist the snapshot to a database or message queue and pick the workflow back up in a separate process or after a deployment.

## Responding to Input Requests

When a step emits an `InputRequestEvent`, the workflow auto-pauses and surfaces the event on the handler's stream.
Subscribe to it and call `handler.respondToInput(requestId, response)` to inject the answer; the workflow resumes with that value as the step result:

```typescript
const handler = await wf.runWithHandler(input);

await handler.streamEvents(async (event) => {
  if (event.type === "blazen::InputRequestEvent") {
    const answer = await prompt(event.prompt);
    handler.respondToInput(event.request_id, answer);
  }
});

const result = await handler.result();
```

The streamed event uses serde's snake_case fields, so read `event.request_id`, `event.prompt`, and `event.metadata` directly off the object. The `requestId` you pass to `respondToInput` is the camelCase napi argument name, but the value itself is whatever string the step assigned to `request_id`.

If you already constructed an `InputResponseEvent` object elsewhere, prefer `respondToInputTyped(event)` -- it accepts the typed `{ requestId, response }` shape and skips the per-field positional call:

```typescript
handler.respondToInputTyped({
  requestId: event.request_id,
  response: { approved: true, note: "looks good" },
});
```

Both methods borrow the handler (they do **not** consume it), so you can answer multiple input requests over the lifetime of one workflow run before finally calling `handler.result()`.

The same `respondToInput(requestId, response)` API is also exposed on `WasmWorkflowHandler` in the `blazen-wasm-sdk` package, so browser HITL flows use an identical pattern -- see the WASM workflows guide for the in-browser variant.

:::caution[ctx.session and pause/resume]
Values stored via `ctx.session.set(...)` are deliberately **excluded** from snapshots. Use `ctx.state.set(...)` for anything that must survive `pause()` / `resume()`, and `ctx.session.set(...)` for ephemeral values (request IDs, rate-limit counters, caches).

The workflow's `session_pause_policy` governs what happens to session entries at pause time:

- **`pickle_or_error`** (default) -- attempt to pickle each session entry into the snapshot; raise a clear error if any entry can't be serialised.
- **`warn_drop`** -- drop session entries from the snapshot and emit a warning. For ephemeral runs.
- **`hard_error`** -- refuse to pause if any session entries are in flight.

**Also note:** on Node, JS object identity through `ctx.session` is **not** preserved -- session values are routed through `serde_json::Value` because napi-rs's `Reference` is `!Send` (its `Drop` must run on the v8 main thread). `await ctx.session.get("k")` returns a plain object equal to the one you passed in, not the same object. For true identity preservation of live JS objects across steps, use the Python or WASM bindings.
:::

---

# Human-in-the-Loop

Source: https://blazen.dev/docs/guides/python/human-in-the-loop
Language: python
Section: guides

## Side-Effect Steps

A step can return `None` and use `ctx.send_event()` to manually route events:

```python
class ReviewComplete(Event):
    pass

@step
async def review(ctx: Context, ev: Event):
    ctx.set("needs_approval", True)

    approved = simulate_human_review(ev)
    ctx.set("approved", approved)

    ctx.send_event(ReviewComplete())
    return None
```

Both `ctx.send_event()` and `ctx.set()` are synchronous.

## Pause and Resume

```python
handler = await wf.run(data="some input")
snapshot = handler.pause()

# Later...
handler = await Workflow.resume(snapshot)
result = await handler.result()
```

:::caution[ctx.session and pause/resume]
If you store live in-process objects in `ctx.session` (DB connections, file handles, sockets), they are deliberately **excluded** from snapshots.
The workflow's `session_pause_policy` governs what happens at pause time:

- **`pickle_or_error`** (default) -- attempt to pickle each live ref into the snapshot; raise a clear error if any entry can't be serialised.
- **`warn_drop`** -- drop live refs from the snapshot and emit a warning. For ephemeral runs.
- **`hard_error`** -- refuse to pause if any live refs are in flight.

The practical rule: put anything that **must** survive `pause()` / `resume()` in `ctx.state`, and everything else in `ctx.session`.
:::

---

# Human-in-the-Loop

Source: https://blazen.dev/docs/guides/rust/human-in-the-loop
Language: rust
Section: guides

## InputRequestEvent and InputResponseEvent

When a step needs human input, it emits an `InputRequestEvent`. The workflow engine automatically pauses execution and waits for a matching `InputResponseEvent` before continuing. Events are correlated by `request_id`.

```rust
use blazen::prelude::*;

#[step]
async fn review(event: AnalyzeEvent, _ctx: Context) -> Result<InputRequestEvent, WorkflowError> {
    Ok(InputRequestEvent {
        request_id: uuid::Uuid::new_v4().to_string(),
        prompt: format!("Approve analysis for '{}'?", event.text),
        payload: serde_json::json!({ "score": event.score }),
    })
}
```

## Input Handlers

Register an input handler when building the workflow. The handler receives each `InputRequestEvent` and must return an `InputResponseEvent` with the matching `request_id`:

```rust
let workflow = WorkflowBuilder::new("assistant")
    .step(ai_step_registration())
    .step(review_step_registration())
    .input_handler(Arc::new(|request| Box::pin(async move {
        println!("AI asks: {}", request.prompt);
        let answer = get_user_input().await;
        Ok(InputResponseEvent {
            request_id: request.request_id,
            response: serde_json::json!(answer),
        })
    })))
    .build()?;
```

The handler is invoked each time any step emits an `InputRequestEvent`. After the handler returns, the workflow resumes with the response routed to the next matching step.

## Pause and Resume

For long-running approvals, workflows support durable pause and resume via snapshots:

```rust
let handler = workflow.run(input).await?;

// Pause the workflow, then take a snapshot
handler.pause()?;
let snapshot = handler.snapshot().await?;
save_to_database(&snapshot).await?;

// Later -- restore and supply the response
let handler = Workflow::resume(snapshot)?;
handler.respond_to_input(pending_request_id, serde_json::json!({ "approved": true }));
handler.resume_in_place();
let result = handler.result().await?;
```

Snapshots capture all in-flight state, so workflows survive process restarts and can be resumed from any instance.

---

# Multimodal Content

Source: https://blazen.dev/docs/guides/wasm/multimodal
Language: wasm
Section: guides

This guide covers the multimodal content subsystem in `@blazen/sdk` -- the WebAssembly build of Blazen that runs in browsers, Cloudflare Workers, Deno, Vercel Edge, Fastly Compute, and any other host with a WASM runtime. The same `ContentStore`, `ContentHandle`, `ImageSource`, and `*Input` schema helpers ship in this package as in the Node binding, with one big difference: there is no filesystem.

If you have not initialized the SDK yet, start with the [WASM Quickstart](/docs/guides/wasm/quickstart) and call `init()` once before constructing a `ContentStore`. See the [WASM API reference](/docs/api/wasm) for the full export list.

## Why content handles?

Models do not stream raw bytes around. They want a URL the provider can fetch, a provider-side file id, or a base64 blob inlined into the request. Each provider expects a different envelope.
Each one has different size limits. A `ContentHandle` is Blazen's neutral pointer to a piece of content. You hand bytes (or a URL) to a `ContentStore`, you get back a handle, and you put the handle wherever you would have put a URL. At wire time the resolver picks the best concrete representation for the provider it is talking to -- a hosted URL when one is available, a provider file id when the store is a provider-files store, otherwise base64. Tools take handles too. The `imageInput`, `audioInput`, `videoInput`, `fileInput`, `threeDInput`, and `cadInput` helpers emit JSON Schema fragments tagged with `x-blazen-content-ref`, and Blazen substitutes the resolved typed content before your handler runs. The model never sees the tag -- providers just see a `string` parameter -- but your handler receives the full handle metadata. ## What's different in the browser? The WASM `ContentStore` exposes a deliberately smaller surface than the native Rust crate or the Node binding: - **No `localFile()` factory.** Browsers do not have a synchronous filesystem. If you need to load a file the user picked, read it with the `File` / `Blob` Web APIs and `put()` the resulting `Uint8Array` into the in-memory store, or upload it to a provider-files store. - **No `metadata()` method.** The four data methods are `put`, `resolve`, `fetchBytes`, and `delete` -- plus `free()` and `[Symbol.dispose]` for explicit cleanup. There is no separate metadata accessor; the `ContentHandle` returned from `put()` already carries `kind`, `mime_type`, `byte_size`, and `display_name`. - **In-memory bytes live in WASM linear memory.** Putting a 100 MB video into `ContentStore.inMemory()` consumes 100 MB of WASM heap. Use a provider-files store (or your own URL) for anything large. - **Binary I/O is `Uint8Array`.** `put()` accepts a `Uint8Array` for byte uploads or a `string` for URL inputs. `fetchBytes()` returns a `Uint8Array`. - **Provider stores work fine.** `openaiFiles`, `anthropicFiles`, `geminiFiles`, and `falStorage` all use the platform `fetch`, so they run unchanged in any WASM host that exposes `fetch` -- Cloudflare Workers, Deno, browsers, Vercel Edge, etc. - **Custom backends work two ways.** Either `ContentStore.custom({ put, resolve, fetchBytes, ... })` for a callback-based factory, or `class MyStore extends ContentStore` for a real subclass. Both routes are wired through the same Rust adapter -- see [Custom backends](#custom-backends) below. - **Always call `init()` first.** The static factories (`ContentStore.inMemory()` and friends) require the WASM module to be instantiated. ## `ContentKind` Every handle carries a `ContentKind` -- a string union tag the resolver and tool-input validator use to route content to the right place. | Value | Typical use | |---|---| | `"image"` | PNG, JPEG, WebP, GIF, etc. | | `"audio"` | MP3, WAV, FLAC, OGG, transcription input | | `"video"` | MP4, WebM, MOV | | `"document"` | PDF, DOCX, plain text | | `"three_d_model"` | GLB, GLTF, OBJ, FBX | | `"cad"` | STEP, IGES, STL, native CAD formats | | `"archive"` | ZIP, TAR, 7z | | `"font"` | TTF, OTF, WOFF | | `"code"` | Source files | | `"data"` | JSON, CSV, Parquet | | `"other"` | Anything else | Pass the wire form (e.g. `"three_d_model"`, not `"3d"`) when supplying a `kindHint` to `put()`. Omit it to let the store auto-detect from the bytes or MIME hint. ## `ContentStore` Every `ContentStore` is constructed via a static factory. The constructor itself is private; you cannot `new ContentStore()`. 
```typescript import init, { ContentStore } from "@blazen/sdk"; import type { ContentHandle } from "@blazen/sdk"; await init(); const store = ContentStore.inMemory(); // Upload bytes. The handle carries the metadata back. const bytes: Uint8Array = await fetch("/sample.png").then((r) => r.arrayBuffer()).then((b) => new Uint8Array(b)); const handle: ContentHandle = await store.put( bytes, "image", // kindHint -- omit to auto-detect "image/png", // mimeType hint "sample.png", // displayName ); console.log(handle.id, handle.kind, handle.mime_type, handle.byte_size, handle.display_name); ``` The handle fields are tsify-generated and preserve Rust snake_case: `id`, `kind`, `mime_type?`, `byte_size?`, `display_name?`. You can also `put()` a public URL as a string -- the in-memory store records it by reference instead of copying bytes: ```typescript const remote = await store.put( "https://example.com/photo.jpg", "image", "image/jpeg", "photo.jpg", ); ``` Resolve a handle to a wire-renderable source, fetch its bytes back, or delete it: ```typescript // Resolve to the concrete shape providers see (URL, base64, provider-file ref, etc.). const source = await store.resolve(handle); console.log(source); // { type: "url", url: "..." } or { type: "base64", data: "..." } // Pull bytes back. Reference-only entries (URLs in the in-memory store) reject. const roundTrip: Uint8Array = await store.fetchBytes(handle); // Drop it. Idempotent on stores that track lifetime; a no-op elsewhere. await store.delete(handle); ``` Free the WASM-side handle when you are done. Either pattern works: ```typescript // Explicit store.free(); // Or use `using` (TypeScript 5.2+ / runtimes with Symbol.dispose). { using s = ContentStore.inMemory(); const h = await s.put(bytes, "image"); // ... use s ... } // s.free() runs automatically here ``` ## Built-in stores | Factory | Backed by | API key | Notes | |---|---|---|---| | `ContentStore.inMemory()` | WASM linear memory | -- | Bytes copied into WASM heap; URLs recorded by reference | | `ContentStore.openaiFiles(apiKey)` | OpenAI Files API | `OPENAI_API_KEY` | `fetch`-based; runs in any WASM host with `fetch` | | `ContentStore.anthropicFiles(apiKey)` | Anthropic Files API | `ANTHROPIC_API_KEY` | Sent as `x-api-key` header | | `ContentStore.geminiFiles(apiKey)` | Google AI / Gemini Files | `GEMINI_API_KEY` | | | `ContentStore.falStorage(apiKey)` | fal.ai Storage | `FAL_KEY` | Returns hosted URLs; ideal for video / image-gen pipelines | There is intentionally no `localFile()` factory in the WASM SDK -- the browser has no synchronous filesystem. 
If you need to ingest a user-picked file, read it with the `File` API: ```typescript const file = (document.querySelector("input[type=file]") as HTMLInputElement).files![0]; const bytes = new Uint8Array(await file.arrayBuffer()); const handle = await store.put(bytes, undefined, file.type, file.name); ``` ## `ImageSource` / handle on the wire `resolve()` returns a `MediaSource` -- a discriminated union (aliased as `ImageSource`) that covers every shape a provider might want: ```typescript import type { ImageSource } from "@blazen/sdk"; type ImageSource = | { type: "url"; url: string } | { type: "base64"; data: string } | { type: "file"; path: string } | { type: "provider_file"; provider: ProviderId; id: string } | { type: "handle"; handle: ContentHandle }; ``` Note that `{ type: "file"; path: string }` is preserved for shape compatibility with the native Rust crate -- it is meaningful for local-only providers (whisper.cpp, diffusers) and is not produced by browser-side stores. When a handle is serialized into a request, Blazen prefers representations in this order: 1. **URL** -- already-hosted content, no extra round-trip for the provider. 2. **Provider file** -- when the store is the same provider's files API (e.g. an `openaiFiles` handle going into an OpenAI completion). 3. **Base64** -- last-resort inline encoding for raw byte stores. You normally do not pick the variant yourself; pass the handle and let the resolver choose. The `{ type: "handle"; ... }` variant exists so handles can travel through messages before being collapsed. ## Tool inputs Tool registrations in the WASM SDK use the same `{ name, description, parameters, handler }` object shape documented in the [WASM Agent guide](/docs/guides/wasm/agent), and tools are passed directly to `runAgent` (or `runAgentWithCallback`). The `imageInput`, `audioInput`, `videoInput`, `fileInput`, `threeDInput`, and `cadInput` helpers build the `parameters` schema for you: ```typescript import init, { CompletionModel, ChatMessage, runAgent, ContentStore, imageInput } from "@blazen/sdk"; await init(); const model = CompletionModel.openai(); const store = ContentStore.inMemory(); const photoBytes = new Uint8Array(await (await fetch("/photo.jpg")).arrayBuffer()); const handle = await store.put(photoBytes, "image", "image/jpeg", "photo.jpg"); const tools = [ { name: "describePhoto", description: "Describe what is in the supplied photo.", parameters: imageInput("photo", "The photo to analyze"), handler: async (args: { photo: any }) => { // `args.photo` has been resolved by Blazen from a handle id string into // a typed content object: { kind, handleId, mimeType, byteSize, displayName, source }. console.log("resolved", args.photo.kind, args.photo.mimeType); return { description: `Saw a ${args.photo.kind} (${args.photo.byteSize ?? "?"} bytes)` }; }, }, ]; const result = await runAgent( model, [ChatMessage.user(`Describe the photo with id ${handle.id}.`)], tools, { maxIterations: 3 }, ); console.log(result.content); ``` Each `*Input` helper returns a JSON Schema fragment of the form: ```typescript { type: "object", properties: { [name]: { type: "string", description, "x-blazen-content-ref": { kind: "image" } } }, required: [name] } ``` The `x-blazen-content-ref` extension is invisible to providers (they ignore unknown JSON Schema keys), but Blazen's resolver intercepts the property, looks the handle up in the active `ContentStore`, and replaces the bare id string with the typed content payload before your handler executes. 
If the handle's `kind` does not match the helper (e.g. an audio handle into `imageInput`), the call is rejected before the handler runs.

## Tool results with multimodal

Tool handlers can return multimodal payloads back to the model by setting `llmOverride` on a `ToolOutput` literal with a `kind: "parts"` `LlmPayload` (or, for Anthropic, native multimodal parts). Cross-provider serialization of that payload is handled by Blazen -- non-Anthropic providers receive the override as a follow-up user message. See the cross-cutting [Multimodal Tool Results](/docs/guides/tool-multimodal) guide for the full pattern.

## Cloudflare Worker example

A Worker is just a fetch handler, but the same `init()` + `ContentStore` flow applies. Provider-files stores work unmodified because they only need `fetch`. Use the in-memory store for ephemeral blobs received in the request, and an OpenAI files store for anything you want pinned for reuse across iterations.

```typescript
import init, {
  CompletionModel,
  ChatMessage,
  ContentStore,
  runAgent,
  imageInput,
} from "@blazen/sdk";

let ready: Promise<void> | null = null;

export default {
  async fetch(request: Request, env: { OPENAI_API_KEY: string }): Promise<Response> {
    ready ??= Promise.resolve(init());
    await ready;

    // Two stores: ephemeral in-memory for this request, persistent OpenAI Files
    // for anything we want re-used across the agent loop.
    using ephemeral = ContentStore.inMemory();
    using persistent = ContentStore.openaiFiles(env.OPENAI_API_KEY);

    const upload = new Uint8Array(await request.arrayBuffer());
    const handle = await persistent.put(upload, "image", request.headers.get("content-type") ?? undefined);

    const model = CompletionModel.openai();
    const tools = [
      {
        name: "describe",
        description: "Describe the uploaded image.",
        parameters: imageInput("photo", "The photo the user just uploaded"),
        handler: async (args: { photo: any }) => ({
          summary: `${args.photo.kind} (${args.photo.mimeType ?? "unknown"})`,
        }),
      },
    ];

    const result = await runAgent(
      model,
      [ChatMessage.user(`Describe the photo with id ${handle.id}.`)],
      tools,
      { maxIterations: 3 },
    );

    return new Response(result.content ?? "", { headers: { "content-type": "text/plain" } });
  },
};
```

For wrangler config, deployment, and the rest of the Workers story, see the [Edge Deployment](/docs/guides/wasm/deployment) guide. The `using` declarations require a Workers runtime with `Symbol.dispose` support; otherwise call `ephemeral.free()` and `persistent.free()` explicitly in a `finally` block.

## Custom backends

If none of the built-in factories fit -- you want IndexedDB, OPFS, a private S3 bucket, an internal CDN, an in-memory cache with custom eviction, etc. -- you can plug your own `ContentStore` implementation in two equivalent ways. Both end up wrapped behind the same Rust adapter, so the agent loop, tool resolver, and provider serializers see identical behavior either way.

The required surface is `put`, `resolve`, and `fetchBytes`. `fetchStream` and `delete` are optional -- if you omit `fetchStream` Blazen falls back to `fetchBytes` (same shape as the Rust trait's default impl), and if you omit `delete` it becomes a no-op.

### Path A -- callback factory

`ContentStore.custom({ ... })` mirrors the Rust `CustomContentStore::builder` and the equivalent factories on the Node and Python bindings.
Hand it an options object with async callbacks:

```typescript
import { ContentStore } from "@blazen/sdk";
import type { ContentHandle } from "@blazen/sdk";

const store = ContentStore.custom({
  // body is { type: "bytes"; data: number[] } | { type: "url"; url: string }
  //   | { type: "provider_file"; provider: string; id: string }.
  //   (No `local_path` variant in WASM -- the browser has no filesystem.)
  // hint is { kind?, mime_type?, display_name? } -- all optional.
  put: async (body, hint) => {
    // ...persist to your backend...
    return {
      id: "blazen_xxx",
      kind: "image",
      mime_type: "image/png",
    } satisfies ContentHandle;
  },
  resolve: async (handle) => ({ type: "url", url: "https://example.com/blob.png" }),
  fetchBytes: async (handle) => new Uint8Array([0xDE, 0xAD]),
  // Optional:
  fetchStream: async (handle) => new Uint8Array([0xBE, 0xEF]),
  delete: async (handle) => { /* no-op */ },
});
```

The callbacks may be plain `async` functions or any function that returns a `Promise`. `put` receives the `body` and `hint` already serialized to plain JS objects (snake_case Rust shape via `serde_wasm_bindgen`). `resolve` must return a `MediaSource`-shaped object (`{ type: "url", url }` / `{ type: "base64", data }` / `{ type: "provider_file", provider, id }`). `fetchBytes` and `fetchStream` must resolve with a `Uint8Array` (or a plain `number[]`).

### Path B -- subclass

`@blazen/sdk` exposes `ContentStore` as a real `wasm-bindgen` class, and JS subclasses dispatch back through the same adapter as the callback path. Call `super()` from your constructor to mark the instance as a subclass; subclasses MUST override `put`, `resolve`, and `fetchBytes`. `fetchStream` and `delete` remain optional.

```typescript
import { ContentStore } from "@blazen/sdk";
import type { ContentHandle, ImageSource } from "@blazen/sdk";

class IndexedDBContentStore extends ContentStore {
  constructor(private dbName = "blazen-content") {
    super();
  }

  async put(body, hint): Promise<ContentHandle> {
    // ...persist to IndexedDB / OPFS / fetch+rehost...
    return { id: "...", kind: "image", mime_type: hint?.mime_type };
  }

  async resolve(handle): Promise<ImageSource> {
    return { type: "url", url: "..." };
  }

  async fetchBytes(handle): Promise<Uint8Array> {
    return new Uint8Array([/* ... */]);
  }

  // Optional:
  async fetchStream(handle): Promise<Uint8Array> {
    return new Uint8Array([/* ... */]);
  }

  async delete(handle): Promise<void> {
    /* no-op */
  }
}

const store = new IndexedDBContentStore();
```

Forgetting to override one of the three required methods raises a clear runtime error the first time the base-class default is hit (`ContentStore subclass must override 'put()' (called the base-class default)`), so missing overrides fail fast rather than silently looping back into the base class.

## Streaming large content

The WASM binding streams chunk-by-chunk in both directions using the platform-native `ReadableStream`. `fetchStream` callbacks can return a `ReadableStream`, and a streaming `put` body arrives as `body.stream`, a `ReadableStream` you read with `getReader()`.

**Downloading.** Call `fetchStream(handle)` on any `ContentStore` wrapper:

```typescript
const rs = await store.fetchStream(handle);
const reader = rs.getReader();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  process(value); // Uint8Array
}
```

When you implement a custom store via `ContentStore.custom({ fetchStream })` or override `fetchStream` on a subclass, you have two options:

1. Return a `Uint8Array` / `number[]` for a single buffered chunk -- still supported.
2. Return a `ReadableStream` for chunk-by-chunk delivery.
The browser's `fetch` already gives you one for free: ```typescript class CdnContentStore extends ContentStore { async fetchStream(handle: ContentHandle) { const response = await fetch(`https://cdn.example/${handle.id}`); return response.body; // ReadableStream } } ``` **Uploading.** When upstream Rust code hands your custom store a `ContentBody::Stream`, your `put(body, hint)` callback receives a body shaped `{ type: "stream", stream: ReadableStream, sizeHint: number | null }`. Read `body.stream` chunk-by-chunk: ```typescript class CdnContentStore extends ContentStore { async put(body, hint) { if (body.type === "stream") { const reader = body.stream.getReader(); while (true) { const { value, done } = await reader.read(); if (done) break; this.uploader.append(value); } return this.uploader.finish(); } // bytes / url / provider_file paths handled below... } } ``` For round-tripping bytes when streaming isn't needed, `fetchBytes` still materializes the full payload as a `Uint8Array`: ```typescript const handle = await store.put( new Uint8Array([/* ... */]), "image", "image/png", ); const bytes = await store.fetchBytes(handle); ``` ## See also - [Multimodal Tool Results](/docs/guides/tool-multimodal) -- cross-cutting tool-result multimodal patterns. - [WASM Agent](/docs/guides/wasm/agent) -- tool registration, handler shapes, and `runAgent` options. - [WASM API Reference](/docs/api/wasm) -- complete export list, including `ContentStore`, `ContentHandle`, `ImageSource`, and the `*Input` helpers. --- # Middleware & Composition Source: https://blazen.dev/docs/guides/node/middleware Language: node Section: guides Blazen models are immutable. Each decorator method (`withRetry()`, `withCache()`, `withFallback()`) returns a **new** `CompletionModel` that wraps the original, so you can layer behaviours without mutating anything. ## Retry Wrap a model with automatic retry on transient failures (rate limits, timeouts, server errors). Retries use exponential backoff with jitter. ```typescript import { CompletionModel } from "blazen"; const model = CompletionModel.openai({ apiKey: "sk-..." }).withRetry({ maxRetries: 5, initialDelayMs: 500, maxDelayMs: 15000, }); ``` All config fields are optional: | Field | Default | Description | |---|---|---| | `maxRetries` | `3` | Maximum retry attempts. | | `initialDelayMs` | `1000` | Delay before the first retry (ms). | | `maxDelayMs` | `30000` | Upper bound on backoff delay (ms). | You can also call `withRetry()` with no argument to use the defaults. ## Cache Cache identical non-streaming requests in memory so repeated prompts are served instantly without hitting the provider. ```typescript const model = CompletionModel.openai({ apiKey: "sk-..." }).withCache({ ttlSeconds: 600, maxEntries: 500, }); ``` | Field | Default | Description | |---|---|---| | `ttlSeconds` | `300` | How long a cached response stays valid. | | `maxEntries` | `1000` | Maximum entries before eviction. | Streaming requests (`model.stream(...)`) bypass the cache and always go to the provider. ## Fallback Route requests through multiple providers in order. If the first provider fails with a transient error, the next one is tried automatically. Non-retryable errors (auth, validation) short-circuit immediately. ```typescript const primary = CompletionModel.openai({ apiKey: "sk-..." }); const backup = CompletionModel.anthropic({ apiKey: "sk-ant-..." 
}); const model = CompletionModel.withFallback([primary, backup]); ``` `withFallback()` is a **static factory method** that takes an array of `CompletionModel` instances and returns a new `CompletionModel`. ## Composing Middleware Because each decorator returns a new `CompletionModel`, you can chain them: ```typescript const model = CompletionModel.openai({ apiKey: "sk-..." }) .withCache({ ttlSeconds: 300 }) .withRetry({ maxRetries: 3 }); ``` The outermost wrapper executes first. In the example above, a request flows through retry first, then cache, then the provider: ``` request -> retry -> cache -> provider -> cache -> retry -> response ``` For maximum resilience, combine all three: ```typescript const primary = CompletionModel.openai({ apiKey: "sk-..." }).withCache().withRetry(); const backup = CompletionModel.anthropic({ apiKey: "sk-ant-..." }).withRetry(); const model = CompletionModel.withFallback([primary, backup]); ``` This gives you caching on the primary, automatic retries on both, and automatic failover from OpenAI to Anthropic. ## Using Decorated Models Decorated models are fully interchangeable with plain models. Pass them to `complete()`, `stream()`, `runAgent()`, or any workflow step: ```typescript import { ChatMessage } from "blazen"; const response = await model.complete([ ChatMessage.user("Explain quantum computing in one sentence."), ]); console.log(response.content); ``` --- # Middleware & Composition Source: https://blazen.dev/docs/guides/python/middleware Language: python Section: guides Blazen models are immutable. Each decorator method (`with_retry()`, `with_cache()`, `with_fallback()`) returns a **new** `CompletionModel` that wraps the original, so you can layer behaviours without mutating anything. ## Retry Wrap a model with automatic retry on transient failures (rate limits, timeouts, server errors). Retries use exponential backoff with jitter. ```python from blazen import CompletionModel, ProviderOptions model = CompletionModel.openai( options=ProviderOptions(api_key="sk-...") ).with_retry( max_retries=5, initial_delay_ms=500, max_delay_ms=15000, ) ``` All parameters are optional and keyword-only: | Parameter | Default | Description | |---|---|---| | `max_retries` | `3` | Maximum retry attempts. | | `initial_delay_ms` | `1000` | Delay before the first retry (ms). | | `max_delay_ms` | `30000` | Upper bound on backoff delay (ms). | The retry layer honours `Retry-After` headers from providers when present. ## Cache Cache identical non-streaming requests in memory so repeated prompts are served instantly without hitting the provider. ```python model = CompletionModel.openai( options=ProviderOptions(api_key="sk-...") ).with_cache( ttl_seconds=600, max_entries=500, ) ``` | Parameter | Default | Description | |---|---|---| | `ttl_seconds` | `300` | How long a cached response stays valid. | | `max_entries` | `1000` | Maximum entries before eviction. | Streaming requests (`model.stream(...)`) bypass the cache and always go to the provider. ## Fallback Route requests through multiple providers in order. If the first provider fails with a transient error, the next one is tried automatically. Non-retryable errors (auth, validation) short-circuit immediately. 
```python primary = CompletionModel.openai(options=ProviderOptions(api_key="sk-...")) backup = CompletionModel.anthropic(options=ProviderOptions(api_key="sk-ant-...")) model = CompletionModel.with_fallback([primary, backup]) ``` `with_fallback()` is a **static method** that takes a list of `CompletionModel` instances and returns a new `CompletionModel`. ## Composing Middleware Because each decorator returns a new `CompletionModel`, you can chain them: ```python model = ( CompletionModel.openai(options=ProviderOptions(api_key="sk-...")) .with_cache(ttl_seconds=300) .with_retry(max_retries=3) ) ``` The outermost wrapper executes first. In the example above, a request flows through retry first, then cache, then the provider: ``` request -> retry -> cache -> provider -> cache -> retry -> response ``` For maximum resilience, combine all three: ```python primary = ( CompletionModel.openai(options=ProviderOptions(api_key="sk-...")) .with_cache() .with_retry() ) backup = CompletionModel.anthropic( options=ProviderOptions(api_key="sk-ant-...") ).with_retry() model = CompletionModel.with_fallback([primary, backup]) ``` This gives you caching on the primary, automatic retries on both, and automatic failover from OpenAI to Anthropic. ## Using Decorated Models Decorated models are fully interchangeable with plain models. Pass them to `complete()`, `stream()`, `run_agent()`, or any workflow step: ```python from blazen import ChatMessage response = await model.complete([ ChatMessage.user("Explain quantum computing in one sentence.") ]) print(response.content) ``` --- # Multimodal Content Source: https://blazen.dev/docs/guides/rust/multimodal Language: rust Section: guides This guide covers Blazen's multimodal layer in Rust: typed **content handles**, the pluggable **`ContentStore`** trait, the built-in stores for OpenAI / Anthropic / Gemini / fal, and the JSON Schema helpers that let tools accept media as first-class arguments. ## Why content handles? Models emit JSON, not bytes. Each provider also has its own file API -- OpenAI's `/v1/files`, Anthropic's Files beta, Gemini's File API, fal's storage endpoint -- each returning its own URI shape. A `ContentHandle` is the single source of truth: a typed reference to a blob (with `kind`, `mime_type`, optional `byte_size`, and `display_name`) that a `ContentStore` resolves into whichever wire form the destination provider expects. You hold one handle and the store routes it. ## `ContentKind` `ContentKind` is the taxonomy Blazen uses to classify content. It is `#[non_exhaustive]` and serializes as snake_case. 
| Variant | Wire tag | Description |
|---|---|---|
| `Image` | `image` | Photos, diagrams, screenshots, PNG/JPEG/WebP |
| `Audio` | `audio` | Speech, music, MP3/WAV/FLAC/OGG |
| `Video` | `video` | MP4/WebM/MOV clips |
| `Document` | `document` | PDFs, plain text, Markdown, office docs |
| `ThreeDModel` | `three_d_model` | glTF/GLB/OBJ/STL meshes |
| `Cad` | `cad` | STEP, IGES, native CAD formats |
| `Archive` | `archive` | ZIP/TAR/7z bundles |
| `Font` | `font` | TTF/OTF/WOFF |
| `Code` | `code` | Source files |
| `Data` | `data` | JSON/CSV/Parquet payloads |
| `Other` | `other` | Anything that does not fit above |

Convert from MIME or file extension, or sniff from raw bytes:

```rust
use blazen_llm::content::{ContentKind, detect_from_bytes};

let from_mime = ContentKind::from_mime("image/png");
assert_eq!(from_mime, ContentKind::Image);

let from_ext = ContentKind::from_extension("glb");
assert_eq!(from_ext, ContentKind::ThreeDModel);

let bytes = std::fs::read("photo.jpg")?;
let (kind, mime) = detect_from_bytes(&bytes);
println!("kind={} mime={:?}", kind.as_str(), mime);
```

For path-based detection on native targets, `detect_from_path` combines extension and magic-number sniffing. The fully general `detect(bytes, mime_hint, filename)` lets you pass any subset of signals.

## `ContentStore`

`ContentStore` is an async trait with five operations:

- `put(body, hint)` -- ingest raw bytes, a URL, a local path, or an existing provider file ID; return a `ContentHandle`.
- `resolve(handle)` -- produce an `ImageSource` (= `MediaSource`) the model providers can consume on the wire.
- `fetch_bytes(handle)` -- pull the underlying bytes back out (used by tools that need to read content directly).
- `metadata(handle)` -- size / MIME / display name (default impl reuses what is already on the handle).
- `delete(handle)` -- best-effort cleanup (default no-op).

`DynContentStore` is just `Arc<dyn ContentStore>` for shared ownership across handlers.

```rust
use blazen_llm::content::{
    ContentBody, ContentHint, ContentKind, ContentStore, InMemoryContentStore,
};

let store = InMemoryContentStore::new();

let handle = store
    .put(
        ContentBody::Url("https://example.com/diagram.png".into()),
        ContentHint::default()
            .with_mime_type("image/png")
            .with_kind(ContentKind::Image)
            .with_display_name("architecture.png"),
    )
    .await?;

let source = store.resolve(&handle).await?;    // -> ImageSource::Url { .. }
let bytes = store.fetch_bytes(&handle).await?; // downloads and caches
```

## Built-in stores

| Store | Use case | `resolve` returns |
|---|---|---|
| `InMemoryContentStore` | Tests, ephemeral content, dev loops | `ImageSource::Base64` (or `Url` if put as a URL) |
| `LocalFileContentStore` | Disk-backed cache rooted at a directory (native only) -- accepts `ContentBody::Stream` for chunked `put`, overrides `fetch_stream` via `tokio_util::io::ReaderStream` | `ImageSource::File` |
| `OpenAiFilesStore` | Upload to OpenAI's Files API; reuse file IDs across requests -- overrides `fetch_stream` via the `HttpClient` trait's `send_streaming` method | `ImageSource::ProviderFile { provider: openai, .. }` |
| `AnthropicFilesStore` | Upload to Anthropic's Files API (beta header managed for you) -- overrides `fetch_stream` via `HttpClient::send_streaming` | `ImageSource::ProviderFile { provider: anthropic, .. }` |
| `GeminiFilesStore` | Upload to Google's File API (resumable) -- uses the buffered default `fetch_stream` because Gemini Files exposes no content-download endpoint | `ImageSource::ProviderFile { provider: google, .. }` |
| `FalStorageStore` | Stage media for fal.ai compute jobs -- overrides `fetch_stream` via `HttpClient::send_streaming` | `ImageSource::Url` (signed fal CDN URL) |
| `CustomContentStore` | Bring-your-own (S3, R2, GCS, internal CDN) -- builder exposes `.put`, `.resolve`, `.fetch_bytes`, `.fetch_stream`, `.delete` callbacks | Whatever your `resolve` closure returns |

Provider-file stores share the same shape -- construct with an API key, then `put` bytes plus a hint:

```rust
use blazen_llm::content::{
    AnthropicFilesStore, ContentBody, ContentHint, ContentKind, ContentStore,
};

let store = AnthropicFilesStore::new(std::env::var("ANTHROPIC_API_KEY")?);

let bytes = std::fs::read("report.pdf")?;
let handle = store
    .put(
        ContentBody::Bytes(bytes),
        ContentHint::default()
            .with_mime_type("application/pdf")
            .with_kind(ContentKind::Document)
            .with_display_name("Q4-report.pdf"),
    )
    .await?;
```

```rust
use blazen_llm::content::{
    ContentBody, ContentHint, ContentKind, ContentStore, OpenAiFilesStore,
};

let store = OpenAiFilesStore::new(std::env::var("OPENAI_API_KEY")?)
    .with_purpose("user_data");

let bytes = std::fs::read("chart.png")?;
let handle = store
    .put(
        ContentBody::Bytes(bytes),
        ContentHint::default()
            .with_mime_type("image/png")
            .with_kind(ContentKind::Image),
    )
    .await?;
```

```rust
use blazen_llm::content::{
    ContentBody, ContentHint, ContentKind, ContentStore, GeminiFilesStore,
};

let store = GeminiFilesStore::new(std::env::var("GOOGLE_API_KEY")?);

let bytes = std::fs::read("clip.mp4")?;
let handle = store
    .put(
        ContentBody::Bytes(bytes),
        ContentHint::default()
            .with_mime_type("video/mp4")
            .with_kind(ContentKind::Video),
    )
    .await?;
```

```rust
use blazen_llm::content::{
    ContentBody, ContentHint, ContentKind, ContentStore, FalStorageStore,
};

let store = FalStorageStore::new(std::env::var("FAL_KEY")?);

let bytes = std::fs::read("voice.wav")?;
let handle = store
    .put(
        ContentBody::Bytes(bytes),
        ContentHint::default()
            .with_mime_type("audio/wav")
            .with_kind(ContentKind::Audio),
    )
    .await?;
```

## `CustomContentStore`

Wire your own backend (S3, GCS, R2, an internal CDN) with closures. Each callback returns a boxed future that yields `Result<_, BlazenError>`. The builder exposes one setter per `ContentStore` method so you can pick exactly which paths you want to override.

```rust
use blazen_llm::content::{
    ContentBody, ContentHandle, ContentHint, ContentStore, CustomContentStore,
};
use blazen_llm::types::MediaSource;
use bytes::Bytes;
use futures_util::stream;
use std::sync::Arc;

let store: Arc<dyn ContentStore> = Arc::new(
    CustomContentStore::builder("my_s3_store")
        .put(|body, hint| Box::pin(async move {
            // upload `body` (bytes / URL / local path / stream / provider file)
            // to your backend, return a fresh ContentHandle.
            todo!()
        }))
        .resolve(|handle| Box::pin(async move {
            // map handle.id back to a wire-renderable MediaSource:
            //   - MediaSource::Url for hosted URLs
            //   - MediaSource::Base64 for inline content
            //   - MediaSource::ProviderFile for native provider file ids
            todo!()
        }))
        .fetch_bytes(|handle| Box::pin(async move {
            // fetch the raw bytes (used by tools that need to read content directly).
            todo!()
        }))
        .fetch_stream(|handle| Box::pin(async move {
            // OPTIONAL: stream the bytes back chunk-by-chunk for large content.
            // When omitted, the trait's default impl buffers fetch_bytes into one chunk.
            let chunks: Vec<Result<Bytes, BlazenError>> = vec![Ok(Bytes::from_static(b"hello"))];
            Ok(Box::pin(stream::iter(chunks)) as blazen_llm::content::ByteStream)
        }))
        .delete(|handle| Box::pin(async move { Ok(()) }))
        .build()
        .unwrap(),
);
```

`build()` validates that `put`, `resolve`, and `fetch_bytes` are all wired; `fetch_stream` and `delete` are optional. When `fetch_stream` is omitted, the trait default buffers `fetch_bytes` into a single-chunk stream so existing callers keep working unchanged.

## `ImageSource` / `MediaSource` variants

`MediaSource` is a type alias for `ImageSource` -- the same enum represents every modality on the wire. It is `#[non_exhaustive]` and serde-tagged with `type` (snake_case).

| Variant | Purpose |
|---|---|
| `Url { url }` | Public or signed HTTPS URL the provider fetches directly |
| `Base64 { data }` | Inline base64 payload, used when the provider supports raw bytes |
| `File { path }` | Native local file path; readers turn this into bytes or upload to a provider |
| `ProviderFile { provider, id }` | Reference to a previously-uploaded provider file (OpenAI / Anthropic / Gemini / fal) |
| `Handle { handle }` | Unresolved `ContentHandle` -- replaced by one of the above when `resolve_handles_with` runs |

`ImageSource::file(path)` is a convenience for the `File` variant.

## Tool inputs

Most tools want to declare "I take an image" without hand-rolling JSON Schema. The helpers in `content::tool_input` produce ready-made schemas with the `x-blazen-content-ref` extension tag baked in. Providers ignore the extension, but Blazen's resolver picks it up.

```rust
use blazen_llm::content::tool_input::image_input;
use blazen_llm::types::ToolDefinition;

let analyze_photo = ToolDefinition {
    name: "analyze_photo".into(),
    description: "Analyze the visual contents of a photo".into(),
    parameters: image_input("photo", "the photo to analyze"),
    ..Default::default()
};
```

Pick the helper that matches the modality:

| Helper | Required arg name | Required arg kind |
|---|---|---|
| `image_input(name, desc)` | the supplied name | `Image` |
| `audio_input(name, desc)` | the supplied name | `Audio` |
| `video_input(name, desc)` | the supplied name | `Video` |
| `file_input(name, desc)` | the supplied name | `Document` |
| `three_d_input(name, desc)` | the supplied name | `ThreeDModel` |
| `cad_input(name, desc)` | the supplied name | `Cad` |

For tools that take media plus other parameters, build a richer schema with `content_ref_required_object` (full object) or splice in `content_ref_property` next to your other properties -- a hand-rolled version of that mixed schema is sketched at the end of this section.

When the model calls the tool, it passes a handle ID as a plain string. Before your handler runs, call `resolve_tool_arguments` to swap that string for a typed object containing `{kind, handle_id, mime_type, byte_size, display_name, source}`:

```rust
use blazen_llm::content::tool_input::resolve_tool_arguments;
use blazen_llm::content::InMemoryContentStore;

let store = InMemoryContentStore::new();

let mut args: serde_json::Value = serde_json::from_str(r#"{ "photo": "handle_abc123" }"#)?;
let schema = serde_json::json!({
    "type": "object",
    "properties": {
        "photo": { "type": "string", "x-blazen-content-ref": { "kind": "image" } }
    }
});

let resolved_count = resolve_tool_arguments(&mut args, &schema, &store).await?;
println!("resolved {resolved_count} handles");
```

The Blazen agent runner does this automatically when a `ContentStore` is wired into the agent, so most callers never invoke `resolve_tool_arguments` directly. Reach for it when running tools outside an agent loop.
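As a concrete illustration of the media-plus-other-parameters case mentioned above, here is a minimal hand-rolled sketch. It spells out the same `x-blazen-content-ref` property shape the helpers emit and places an ordinary string parameter next to it; the tool name, descriptions, and parameter names are illustrative, and the exact output of `content_ref_property` (whose signature is not shown in this guide) may differ in detail. It assumes `parameters` accepts the JSON Schema value directly, as in the `image_input` example above.

```rust
use blazen_llm::types::ToolDefinition;

// Hypothetical tool mixing one content-ref parameter ("photo") with a plain
// string parameter ("question"). The "x-blazen-content-ref" tag is the same
// extension the image_input helper bakes in, so the resolver (or the agent
// runner) swaps the handle id for the typed content object before the handler
// runs, while "question" is passed through untouched.
let inspect_photo = ToolDefinition {
    name: "inspect_photo".into(),
    description: "Answer a question about the supplied photo".into(),
    parameters: serde_json::json!({
        "type": "object",
        "properties": {
            "photo": {
                "type": "string",
                "description": "the photo to inspect",
                "x-blazen-content-ref": { "kind": "image" }
            },
            "question": {
                "type": "string",
                "description": "what to look for in the photo"
            }
        },
        "required": ["photo", "question"]
    }),
    ..Default::default()
};
```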
## Tool results with multimodal

Tools can return `LlmPayload::Parts { parts }` -- a list mixing text, images, and other content. This now serializes correctly across **every** provider: Anthropic native carries the parts inside the tool result, while OpenAI Chat / Responses / Azure / fal / openai-compat / Gemini emit a follow-up multimodal user message immediately after the tool call so the model sees the visual output. See [Tool Multimodal](/guides/tool-multimodal/) for the full pattern.

## Resolving handles before the wire call

If your `CompletionRequest` contains messages with `ImageSource::Handle { .. }` content, call `resolve_handles_with` before sending it to a provider that does not understand handles natively:

```rust
use blazen_llm::content::InMemoryContentStore;
use blazen_llm::types::CompletionRequest;

let store = InMemoryContentStore::new();

let mut request = CompletionRequest::new("gpt-4o");
// ... attach messages with ImageSource::Handle entries ...

let replaced = request.resolve_handles_with(&store).await?;
println!("replaced {replaced} handle(s) with concrete sources");
```

For full conversations -- where you also want the model to know which handles exist by name and kind -- `prepare_request_with_store` does both jobs at once: it resolves every handle and prepends a system note describing them (built from `build_handle_directory_system_note`):

```rust
use blazen_llm::content::visibility::prepare_request_with_store;
use blazen_llm::content::InMemoryContentStore;
use blazen_llm::types::CompletionRequest;

let store = InMemoryContentStore::new();

let mut request = CompletionRequest::new("claude-sonnet-4-5");
// ... append user messages that reference handles ...

let resolved = prepare_request_with_store(&mut request, &store).await?;
println!("{resolved} handles resolved and announced to the model");
```

If you only want the directory note (without resolving), call `collect_visible_handles(&messages)` and feed the result to `build_handle_directory_system_note` yourself.

## Cargo features

The `content-detect` feature is on by default and pulls in the `infer` crate for magic-number sniffing inside `detect_from_bytes` / `detect_from_path`. If you only deal with bytes that already carry a MIME type, disable it for a smaller dependency tree:

```toml
[dependencies]
blazen-llm = { version = "*", default-features = false }
```

## Streaming large content

Multi-gigabyte uploads and downloads should not require buffering the whole payload in memory. Blazen exposes streaming on both ends: `ContentBody::Stream` for `put`, and `ContentStore::fetch_stream` for the read path. The wire type is a single alias:

```rust
pub type ByteStream = Pin<Box<dyn Stream<Item = Result<Bytes, BlazenError>> + Send>>;
```

`ContentBody::Stream { stream: ByteStream, size_hint: Option<u64> }` is the new variant on the input side. `size_hint` lets stores choose between simple and resumable upload paths when the total length is known up front (e.g. from a `Content-Length` header).

`ContentStore::fetch_stream(&handle) -> Result<ByteStream, BlazenError>` is the new trait method on the output side. The default impl calls `fetch_bytes` and wraps the result in `stream::once`, so every existing store keeps compiling without changes. Stores backed by HTTP or disk override it for true incremental streaming:

- `LocalFileContentStore` -- uses `tokio_util::io::ReaderStream` over the on-disk file.
- `OpenAiFilesStore`, `AnthropicFilesStore`, `FalStorageStore` -- use the `HttpClient` trait's `send_streaming` method to forward the response body chunk-by-chunk without buffering.
- `InMemoryContentStore` and `GeminiFilesStore` -- use the buffered default. The in-memory store already holds the full bytes; Gemini Files exposes no content-download endpoint, so streaming wouldn't gain anything. Streaming `put` example: ```rust use blazen_llm::content::{ContentBody, ContentHint, LocalFileContentStore}; use bytes::Bytes; use futures_util::stream; let store = LocalFileContentStore::new("/var/cache/blazen")?; let chunks = vec![ Ok(Bytes::from_static(b"hello ")), Ok(Bytes::from_static(b"streaming world")), ]; let body = ContentBody::Stream { stream: Box::pin(stream::iter(chunks)), size_hint: Some(21), }; let handle = store .put(body, ContentHint::default()) .await?; ``` Streaming `fetch` example: ```rust use futures_util::TryStreamExt; let mut stream = store.fetch_stream(&handle).await?; while let Some(chunk) = stream.try_next().await? { // process chunk: bytes::Bytes } ``` Two caveats on the `Stream` variant: - It is **not** `Clone`. Streams are single-shot by nature; the manual `Clone` impl on `ContentBody` panics with `unreachable!` on the `Stream` arm. Pass streaming bodies by value. - It is **not** `Serialize` / `Deserialize` (the variant is `#[serde(skip)]`). It cannot round-trip through JSON, so the Python and Node bindings drain the stream into bytes when crossing the FFI boundary -- callers that need true end-to-end streaming should stay on the Rust API. ## See also - [Tool Multimodal](/guides/tool-multimodal/) -- returning images and other media from tools across every provider - [Custom Providers](/guides/custom-providers/) -- plug your own completion model into the same content pipeline - [API Reference](/api/rust/) -- full rustdoc for `blazen_llm::content` and `blazen_llm::types` --- # Middleware & Composition Source: https://blazen.dev/docs/guides/wasm/middleware Language: wasm Section: guides The WASM SDK supports the same middleware patterns as the Node.js SDK. Each decorator method returns a **new** `CompletionModel`, keeping the original unchanged. ## Retry Wrap a model with automatic retry on transient failures. The WASM `withRetry()` takes an optional `maxRetries` number (default 3). ```typescript import init, { CompletionModel } from "@blazen/sdk"; await init(); // The WASM SDK reads OPENAI_API_KEY from the runtime environment; // factory methods do not accept arguments. const model = CompletionModel.openai().withRetry(5); ``` ## Cache Cache identical non-streaming requests in memory. ```typescript // ttlSeconds (default 300), maxEntries (default 1000) const model = CompletionModel.openai().withCache(600, 500); ``` Streaming requests always bypass the cache. ## Fallback Route requests through multiple providers in order. ```typescript const primary = CompletionModel.openai(); // reads OPENAI_API_KEY const backup = CompletionModel.groq(); // reads GROQ_API_KEY const model = CompletionModel.withFallback([primary, backup]); ``` When the first provider fails with a transient error (rate limit, timeout, server error), the next provider is tried. Non-retryable errors short-circuit immediately. ## Composing Middleware Chain decorators to layer multiple behaviours: ```typescript const model = CompletionModel.openai() .withCache(300, 1000) .withRetry(3); ``` For maximum resilience, combine all three: ```typescript // The WASM SDK exposes OpenAI-compatible providers only. // Use OpenRouter for Claude/Gemini/etc. via a single key. 
const primary = CompletionModel.openai().withCache().withRetry(); const backup = CompletionModel.openrouter().withRetry(); const model = CompletionModel.withFallback([primary, backup]); ``` ## Using Decorated Models Decorated models work identically to plain models -- pass them to `complete()`, `stream()`, or `runAgent()`: ```typescript import { ChatMessage } from "@blazen/sdk"; const response = await model.complete([ ChatMessage.user("Explain quantum computing in one sentence."), ]); console.log(response.content); ``` --- # Embeddings Source: https://blazen.dev/docs/guides/node/embeddings Language: node Section: guides Blazen provides a unified `EmbeddingModel` interface for generating vector embeddings across multiple providers. The API mirrors `CompletionModel`: create a model with a static factory method, then call `embed()`. ## Create an Embedding Model ```typescript import { EmbeddingModel } from "blazen"; // OpenAI (default: text-embedding-3-small, 1536 dimensions) // Reads OPENAI_API_KEY from the environment by default. const model = EmbeddingModel.openai(); // Or pass an API key explicitly. const model = EmbeddingModel.openai({ apiKey: "sk-..." }); // Together AI const model = EmbeddingModel.together({ apiKey: "tok-..." }); // Cohere const model = EmbeddingModel.cohere({ apiKey: "co-..." }); // Fireworks AI const model = EmbeddingModel.fireworks({ apiKey: "fw-..." }); ``` ## Generate Embeddings Pass an array of strings to `embed()`. It returns an `EmbeddingResponse` with one vector per input text. ```typescript const response = await model.embed(["Hello, world!", "Goodbye, world!"]); console.log(response.embeddings.length); // 2 console.log(response.embeddings[0].length); // 1536 (dimensionality) console.log(response.model); // "text-embedding-3-small" ``` ## EmbeddingResponse The response object has the following fields: | Property | Type | Description | |---|---|---| | `.embeddings` | `number[][]` | One vector per input text. | | `.model` | `string` | Model that produced the embeddings. | | `.usage` | `TokenUsage \| undefined` | Token usage statistics. | | `.cost` | `number \| undefined` | Estimated cost in USD. | | `.timing` | `RequestTiming \| undefined` | Request timing breakdown. | ## Model Properties ```typescript console.log(model.modelId); // "text-embedding-3-small" console.log(model.dimensions); // 1536 ``` ## Local Embeddings Blazen can generate embeddings entirely on your machine using its built-in embed backend. No API key, no network calls after the initial model download, and completely free. Blazen's embed backend runs through ONNX Runtime on glibc/mac/windows and pure-Rust tract on musl -- the facade picks the right underlying implementation automatically for your target. ### Setup Local embeddings are available when Blazen is built with the `embed` feature. The default `npm install blazen` package includes it. 
### Usage ```typescript import { EmbeddingModel } from "blazen"; // Use the default model (BAAI/bge-small-en-v1.5, 384 dimensions) const model = EmbeddingModel.embed(); // Or specify a model and other options explicitly const model = EmbeddingModel.embed({ modelName: "BGESmallENV15", cacheDir: "/tmp/models", maxBatchSize: 256, showDownloadProgress: true, }); const response = await model.embed(["hello", "world"]); console.log(response.embeddings.length); // 2 console.log(response.embeddings[0].length); // 384 ``` ### EmbedOptions | Field | Type | Default | Description | |---|---|---|---| | `modelName` | `string \| undefined` | `"BGESmallENV15"` | Embed model variant name. | | `cacheDir` | `string \| undefined` | backend default | Directory where downloaded models are cached. | | `maxBatchSize` | `number \| undefined` | `256` | Maximum batch size for embedding. | | `showDownloadProgress` | `boolean \| undefined` | `false` | Print a progress bar during model download. | ### Drop-in with Memory A local embedding model is a regular `EmbeddingModel` -- it plugs into `Memory` with no changes: ```typescript import { EmbeddingModel, Memory, InMemoryBackend } from "blazen"; const model = EmbeddingModel.embed(); const memory = new Memory(model, new InMemoryBackend()); await memory.add("doc1", "Paris is the capital of France"); const results = await memory.search("capital of France", 5); ``` ### Model Download The first call to `embed()` (or `memory.add()`) downloads the ONNX model weights. For `BGESmallENV15` the download is roughly 33 MB. After the first run the model is cached locally and no further network access is required. ## Use Cases Embeddings are the building block for semantic search, RAG pipelines, clustering, and classification. A typical pattern inside a workflow step: ```typescript import { EmbeddingModel } from "blazen"; const embedModel = EmbeddingModel.openai({ apiKey: "sk-..." }); wf.addStep("embed_documents", ["DocumentsReady"], async (event, ctx) => { const response = await embedModel.embed(event.documents); await ctx.set("vectors", response.embeddings); return { type: "SearchEvent" }; }); ``` --- # Embeddings Source: https://blazen.dev/docs/guides/python/embeddings Language: python Section: guides Blazen provides a unified `EmbeddingModel` interface for generating vector embeddings across multiple providers. The API mirrors `CompletionModel`: create a model with a static constructor, then call `embed()`. ## Create an Embedding Model ```python from blazen import EmbeddingModel, ProviderOptions # OpenAI (default: text-embedding-3-small, 1536 dimensions) # Reads OPENAI_API_KEY from the environment by default. model = EmbeddingModel.openai() # Or pass an API key explicitly via ProviderOptions. model = EmbeddingModel.openai(options=ProviderOptions(api_key="sk-...")) # OpenAI with a specific model and dimensionality model = EmbeddingModel.openai( options=ProviderOptions(api_key="sk-..."), model="text-embedding-3-large", dimensions=3072, ) # Together AI model = EmbeddingModel.together(options=ProviderOptions(api_key="tok-...")) # Cohere model = EmbeddingModel.cohere(options=ProviderOptions(api_key="co-...")) # Fireworks AI model = EmbeddingModel.fireworks(options=ProviderOptions(api_key="fw-...")) ``` ## Generate Embeddings Pass a list of strings to `embed()`. It returns an `EmbeddingResponse` with one vector per input text. 
```python response = await model.embed(["Hello, world!", "Goodbye, world!"]) print(len(response.embeddings)) # 2 print(len(response.embeddings[0])) # 1536 (dimensionality) print(response.model) # "text-embedding-3-small" ``` ## EmbeddingResponse The response object exposes the following properties: | Property | Type | Description | |---|---|---| | `.embeddings` | `list[list[float]]` | One vector per input text. | | `.model` | `str` | Model that produced the embeddings. | | `.usage` | `TokenUsage \| None` | Token usage statistics. | | `.cost` | `float \| None` | Estimated cost in USD. | | `.timing` | `RequestTiming \| None` | Request timing breakdown. | ## Model Properties ```python print(model.model_id) # "text-embedding-3-small" print(model.dimensions) # 1536 ``` ## Local Embeddings Blazen can generate embeddings entirely on your machine using its built-in embed backend. No API key, no network calls after the initial model download, and completely free. Blazen's embed backend runs through ONNX Runtime on glibc/mac/windows and pure-Rust tract on musl -- the facade picks the right underlying implementation automatically for your target. ### Setup Local embeddings are available when Blazen is built with the `embed` feature. The default `pip install blazen` wheels include it. ### Usage ```python from blazen import EmbeddingModel, EmbedOptions # Use the default model (BAAI/bge-small-en-v1.5, 384 dimensions) model = EmbeddingModel.local() # Or specify a model and other options explicitly model = EmbeddingModel.local( options=EmbedOptions( model_name="BGESmallENV15", cache_dir="/tmp/models", max_batch_size=256, show_download_progress=True, ) ) response = await model.embed(["hello", "world"]) print(len(response.embeddings)) # 2 print(len(response.embeddings[0])) # 384 ``` ### EmbedOptions | Field | Type | Default | Description | |---|---|---|---| | `model_name` | `str \| None` | `"BGESmallENV15"` | Embed model variant name. | | `cache_dir` | `str \| None` | backend default | Directory where downloaded models are cached. | | `max_batch_size` | `int \| None` | `256` | Maximum batch size for embedding. | | `show_download_progress` | `bool \| None` | `False` | Print a progress bar during model download. | ### Drop-in with Memory A local embedding model is a regular `EmbeddingModel` -- it plugs into `Memory` with no changes: ```python from blazen import EmbeddingModel, Memory, InMemoryBackend model = EmbeddingModel.local() memory = Memory(model, InMemoryBackend()) await memory.add("doc1", "Paris is the capital of France") results = await memory.search("capital of France", limit=5) ``` ### Model Download The first call to `embed()` (or `Memory.add()`) downloads the ONNX model weights. For `BGESmallENV15` the download is roughly 33 MB. After the first run the model is cached locally and no further network access is required. ## Use Cases Embeddings are the building block for semantic search, RAG pipelines, clustering, and classification. 
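As a quick, self-contained illustration of the semantic-search case, the sketch below embeds a query and a handful of documents in one `embed()` call and ranks the documents by cosine similarity. Only the `embed()` API shown above is assumed; the documents, query, and helper function are illustrative.

```python
import asyncio
import math

from blazen import EmbeddingModel


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


async def main() -> None:
    # Reads OPENAI_API_KEY from the environment, as above.
    model = EmbeddingModel.openai()

    docs = [
        "Paris is the capital of France",
        "Rust has a strong type system",
        "The Louvre is a museum in Paris",
    ]
    query = "Which city is France's capital?"

    # One call embeds the query and every document together.
    response = await model.embed([query] + docs)
    query_vec, doc_vecs = response.embeddings[0], response.embeddings[1:]

    # Rank documents by similarity to the query -- the essence of semantic search.
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in zip(docs, doc_vecs)),
        reverse=True,
    )
    for score, text in scored:
        print(f"{score:.3f}  {text}")


asyncio.run(main())
```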
A typical pattern inside a workflow step: ```python from blazen import step, Context, Event, EmbeddingModel, ProviderOptions embed_model = EmbeddingModel.openai(options=ProviderOptions(api_key="sk-...")) @step async def embed_documents(ctx: Context, ev: Event): texts = ev.documents response = await embed_model.embed(texts) ctx.set("vectors", response.embeddings) return Event("SearchEvent") ``` --- # Multimodal Content Source: https://blazen.dev/docs/guides/node/multimodal Language: node Section: guides LLMs that accept images, audio, video, or arbitrary files all want the bytes in slightly different shapes -- a public URL, an inline base64 blob, or a file id from the provider's own upload API. Blazen wraps that mess behind a single abstraction: you stash the bytes in a `ContentStore`, hold onto the resulting `ContentHandle`, and let Blazen pick the cheapest wire form per provider at request-build time. This guide covers the Node binding (`blazen` npm package). For the cross-cutting design notes, see [`/guides/tool-multimodal/`](/guides/tool-multimodal/). ## Why content handles? LLMs emit JSON. JSON does not carry binary payloads gracefully -- you either base64 every blob (slow, expensive, hits payload limits) or shuffle around URLs the model can't actually reach. Worse, every provider has its own files API: OpenAI Files, Anthropic Files, Gemini Files, fal.ai storage. Wiring each one into your tool layer means you write the same upload-and-reference dance four times. A `ContentHandle` is Blazen's single source of truth for "a piece of content somewhere." It carries an opaque id, a `ContentKind`, an optional MIME type, byte size, and display name -- enough metadata to route, cost-estimate, and cache, but no bytes inline. When the request hits the provider, Blazen's resolver asks the store: "what's the cheapest way to render this for the active provider?" -- typically URL > providerFile > base64 -- and serializes accordingly. This means a single tool definition that returns a handle works against every provider Blazen supports, and your tool code never has to think about base64. ## `ContentKind` The `JsContentKind` const enum is exported as both `JsContentKind` and `ContentKind` (a type alias). Each variant maps to a snake_case wire string used everywhere kinds are serialized: | Variant | Wire string | |---|---| | `Image` | `"image"` | | `Audio` | `"audio"` | | `Video` | `"video"` | | `Document` | `"document"` | | `ThreeDModel` | `"three_d_model"` | | `Cad` | `"cad"` | | `Archive` | `"archive"` | | `Font` | `"font"` | | `Code` | `"code"` | | `Data` | `"data"` | | `Other` | `"other"` | Both forms work interchangeably -- pass the enum variant when you have it, the string when you're crossing a JSON boundary. ## `ContentStore` A store is a pluggable backend that persists bytes and hands back handles. Construct one via the static factories, then use the per-instance methods to manage content. ```typescript import { ContentStore } from "blazen"; import type { ContentHandle, PutOptions } from "blazen"; const store = ContentStore.inMemory(); // Put bytes -- options carry hints; the store may auto-detect when omitted. const photoBytes = await fetch("https://example.com/cat.png").then((r) => r.arrayBuffer()); const handle: ContentHandle = await store.put(Buffer.from(photoBytes), { kind: "image", mimeType: "image/png", displayName: "cat.png", }); // Resolve to the wire-renderable MediaSource shape (URL > providerFile > base64). 
const wire = await store.resolve(handle); // Pull bytes back -- for tools that actually need to operate on the content // (parse a PDF, transcribe audio, etc.). const bytes: Buffer = await store.fetchBytes(handle); // Cheap metadata lookup without materializing bytes. const meta = await store.metadata(handle); console.log(meta.kind, meta.mimeType, meta.byteSize, meta.displayName); // Optional cleanup -- a no-op on most stores. await store.delete(handle); ``` The `put` body argument is `Buffer | string`. When you pass a string, Blazen looks for `"://"`: if present, the string is recorded as a URL (no upload happens, the store just holds the reference); otherwise the string is treated as a local filesystem path (the store reads or copies it as needed). Pass a `Buffer` when you have raw bytes in memory. `PutOptions` fields are all optional -- `mimeType`, `kind`, `displayName`, `byteSize`. Passing none is fine; the store does its best to detect from the body. Passing an explicit `kind` overrides any auto-detection, which matters if you want a `.bin` blob classified as `Cad` rather than `Other`. ## Built-in stores | Factory | Purpose | |---|---| | `ContentStore.inMemory()` | Ephemeral in-process map. Good for tests and short-lived runs. | | `ContentStore.localFile(root)` | Filesystem-backed under `root` (created if absent). | | `ContentStore.openaiFiles(apiKey, baseUrl?)` | Backed by the OpenAI Files API. | | `ContentStore.anthropicFiles(apiKey, baseUrl?)` | Backed by the Anthropic Files API. | | `ContentStore.geminiFiles(apiKey, baseUrl?)` | Backed by the Gemini Files API. | | `ContentStore.falStorage(apiKey, baseUrl?)` | Backed by fal.ai's storage API. | Stores are cheap to clone -- internally they're an `Arc` -- so you can share one instance across multiple agents and requests without thinking about it. ```typescript import { ContentStore } from "blazen"; // In-memory: fast, ephemeral, lost on process exit. const memStore = ContentStore.inMemory(); await memStore.put(Buffer.from("hello"), { kind: "document", mimeType: "text/plain" }); // Local file: durable, lives under the given root. const fileStore = ContentStore.localFile("/var/lib/blazen/content"); await fileStore.put("/home/me/diagram.png", { kind: "image", mimeType: "image/png" }); // OpenAI Files: the bytes live in OpenAI's file storage; resolve() returns a providerFile reference. const oaiStore = ContentStore.openaiFiles(process.env.OPENAI_API_KEY!); await oaiStore.put(Buffer.from(pdfBytes), { kind: "document", mimeType: "application/pdf", displayName: "report.pdf", }); // Anthropic Files: same shape, Anthropic-side storage. const antStore = ContentStore.anthropicFiles(process.env.ANTHROPIC_API_KEY!); // Gemini Files: same shape, Gemini-side storage. const gemStore = ContentStore.geminiFiles(process.env.GEMINI_API_KEY!); // fal.ai storage: useful when the downstream consumer is fal.ai's own endpoints. const falStore = ContentStore.falStorage(process.env.FAL_API_KEY!); ``` The optional `baseUrl` argument on the four provider-file stores lets you point at a proxy or a regional endpoint. Pass `null` (or omit it) to use the provider's default. ## Custom backends When the built-in factories aren't enough -- you want S3, R2, your own database, or any other backend -- the Node binding gives you two paths that mirror the Rust `CustomContentStore::builder` API. Pick whichever feels more natural; both end up wrapped in the same Rust adapter that dispatches back into JS via threadsafe functions. 
### Path A -- `ContentStore.custom({...})` factory Pass a plain object of async callbacks. `put`, `resolve`, and `fetchBytes` are required; `fetchStream` and `delete` are optional. `name` is a short identifier used in error / tracing messages and defaults to `"custom"`. ```typescript import { ContentStore } from "blazen"; import type { ContentHandle, ContentKind, JsContentMetadata } from "blazen"; const store = ContentStore.custom({ put: async (body, hint) => { // body is one of: // { type: "bytes", data: number[] } // { type: "url", url: string } // { type: "local_path", path: string } // { type: "provider_file", provider: string, id: string } // hint is a ContentHint dict (all fields optional). // Must resolve to a ContentHandle-shaped object. return { id: "blazen_xxx", kind: "image", mimeType: "image/png", }; }, resolve: async (handle) => ({ sourceType: "url", url: "https://example.com/blob.png", }), fetchBytes: async (handle) => Buffer.from("...bytes..."), // Optional: fetchStream: async (handle) => Buffer.from("..."), // or return an AsyncIterable for true streaming -- see "Streaming large content" below delete: async (handle) => { /* no-op */ }, name: "my_s3_store", }); ``` `fetchBytes` (and `fetchStream`) may return a `Buffer`, `Uint8Array`, `number[]`, or a base64 `string` -- the binding accepts all four shapes. ### Path B -- subclass `ContentStore` `class MyStore extends ContentStore` works directly. Subclasses MUST override `put`, `resolve`, and `fetchBytes`; `fetchStream` and `delete` are optional. Don't call `super().put(...)` from a subclass -- the base-class methods raise on a `Subclass` instance because they exist only as a sentinel for the `super()` constructor. ```typescript import { ContentStore } from "blazen"; import type { ContentHandle } from "blazen"; class S3ContentStore extends ContentStore { constructor(bucket: string) { super(); this.bucket = bucket; } async put(body, hint) { // ...upload to S3, mint an id... return { id: "...", kind: "image" }; } async resolve(handle) { return { sourceType: "url", url: "https://my-bucket.s3.amazonaws.com/..." }; } async fetchBytes(handle) { return Buffer.from("..."); } // Optional overrides: async fetchStream(handle) { return Buffer.from("..."); } async delete(handle) { /* no-op */ } } ``` When a subclass instance is handed to a Blazen API that needs a content store, the binding wraps the JS object in an internal adapter that dispatches each call back into your overrides via threadsafe functions. The adapter checks at construction time that the three required methods exist and surfaces a clear error if any are missing. ## `MediaSource` on the wire `store.resolve(handle)` returns a serialized `MediaSource` JS object -- the same JSON shape Blazen's request builders accept. The `sourceType` discriminator tells you which payload form the store picked: ```typescript // URL form -- cheapest when the provider can fetch the URL itself. { sourceType: "url", url: "https://cdn.example.com/cat.png" } // providerFile form -- when the bytes already live in the active provider's // file API (OpenAI Files, Anthropic Files, Gemini Files, fal.ai storage). { sourceType: "providerFile", provider: "openai", id: "file-abc123" } // base64 form -- the fallback when neither URL nor providerFile is available // (e.g. inMemory store + a provider that only takes inline payloads). 
{ sourceType: "base64", data: "" } // handle form -- carried in messages before resolution; the request builder // swaps it for one of the three above when serializing for the active provider. { sourceType: "handle", handleId: "...", handleKind: "image" } ``` You normally don't construct these by hand. Blazen carries `handle`-form sources inside messages, and the resolver swaps them for the cheapest wire form during request build. The `MediaSource` type alias re-exports this object so you can type-narrow on `sourceType` if you ever inspect a resolved value. ## Tool inputs The six helper functions -- `imageInput`, `audioInput`, `videoInput`, `fileInput`, `threeDInput`, `cadInput` -- generate JSON Schema fragments shaped for `runAgent`'s `tools` array. Each returns: ```typescript { type: "object", properties: { [name]: { type: "string", description, "x-blazen-content-ref": { kind: "image" } // or audio / video / document / three_d_model / cad } }, required: [name] } ``` The `x-blazen-content-ref` extension is a custom JSON Schema key. LLM providers ignore unknown keys, so the schema looks like a plain `string` parameter to the model. Blazen's resolver reads the extension and substitutes the handle id the model emits with the resolved typed content shape `{ kind, handleId, mimeType, byteSize, displayName, source }` before your tool handler runs. Your handler never sees raw handle ids -- it sees an already-resolved object. ```typescript import { CompletionModel, ChatMessage, ContentStore, imageInput, runAgent, } from "blazen"; const model = CompletionModel.openai({ apiKey: process.env.OPENAI_API_KEY! }); const store = ContentStore.inMemory(); // Pre-stash a photo and surface its handle id to the model in a user message. const handle = await store.put(Buffer.from(photoBytes), { kind: "image", mimeType: "image/png", displayName: "cat.png", }); const result = await runAgent( model, [ ChatMessage.user( `Here is a photo (handle id: ${handle.id}). Call analyze_photo on it.`, ), ], [ { name: "analyze_photo", description: "Analyze the given photo and describe what you see.", parameters: imageInput("photo", "The photo to analyze"), }, ], async (toolName, args) => { if (toolName === "analyze_photo") { // `args.photo` has already been resolved by Blazen -- it is the typed // content shape, NOT the raw handle-id string. const { kind, handleId, mimeType, byteSize, displayName, source } = args.photo; const bytes = await store.fetchBytes({ id: handleId, kind }); // ...inspect, OCR, classify, whatever the tool actually does... return { description: "A grumpy tabby cat sitting on a keyboard." }; } throw new Error(`Unknown tool: ${toolName}`); }, { maxIterations: 5 }, ); ``` The other five helpers work identically -- swap `imageInput` for `audioInput`, `videoInput`, `fileInput` (for `Document` kind), `threeDInput`, or `cadInput` depending on what the tool consumes. The `x-blazen-content-ref.kind` baked into the schema tells Blazen which `ContentKind` to expect when resolving. `runAgent`'s signature is `runAgent(model, messages, tools, toolHandler, options?)` -- see [`/api/node/`](/api/node/) for the full surface, including `runAgentWithCallback` for event observation. ## Tool results with multimodal When a tool wants to return multimodal content (an image it generated, audio it transcribed, etc.) 
the cross-cutting `ToolOutput` + `LlmPayload` parts override produces the correct multimodal serialization across every provider -- Anthropic gets a native multimodal tool result, others receive an automatic follow-up user message. See [`/guides/tool-multimodal/`](/guides/tool-multimodal/) for the payload shape and worked examples. ## Streaming large content The Node binding streams chunk-by-chunk in both directions across the FFI boundary. `fetchStream` callbacks can return `AsyncIterable`, and a streaming `put` body arrives as `body.stream`, an `AsyncIterable` you consume with `for await`. **Downloading.** Call `fetchStream(handle)` on any `ContentStore` wrapper: ```typescript const iter = await store.fetchStream(handle); for await (const chunk of iter) { process(chunk); // chunk is a Uint8Array } ``` When you implement a custom store via `ContentStore.custom({ fetchStream })` or override `fetchStream` on a subclass, you have two options: 1. Return a `Buffer` / `Uint8Array` / `number[]` / base64 `string` for a single buffered chunk -- still supported. 2. Return any `AsyncIterable` (a Node `Readable` qualifies, since it implements `[Symbol.asyncIterator]`) for chunk-by-chunk delivery: ```typescript class S3ContentStore extends ContentStore { async *fetchStream(handle: ContentHandle) { for await (const chunk of this.s3.getObjectStream(handle.id)) { yield chunk; // Uint8Array } } } ``` **Uploading.** When upstream Rust code hands your custom store a `ContentBody::Stream`, your `put(body, hint)` callback receives a body shaped `{ type: "stream", stream: AsyncIterable, sizeHint: number | null }`. Consume `body.stream` chunk-by-chunk without buffering: ```typescript class S3ContentStore extends ContentStore { async put(body, hint) { if (body.type === "stream") { for await (const chunk of body.stream) { this.uploader.append(chunk); } return this.uploader.finish(); } // bytes / url / local_path / provider_file paths handled below... } } ``` Backpressure is honored across the FFI boundary via a small bounded channel (4 chunks), so a slow consumer pauses the producer naturally. For round-tripping bytes when streaming isn't needed, `fetchBytes` still materializes the full payload as a `Buffer`: ```typescript const handle = await store.put(Buffer.from("..."), { kind: "image", mimeType: "image/png", }); const bytes = await store.fetchBytes(handle); ``` ## See also - [`/guides/tool-multimodal/`](/guides/tool-multimodal/) -- cross-cutting design notes and provider behavior. - [`/api/node/`](/api/node/) -- full Node API reference, including `runAgent`, `Agent`, and the `ContentStore` surface. --- # Multimodal Content Source: https://blazen.dev/docs/guides/python/multimodal Language: python Section: guides This guide covers Blazen's content layer in Python: how to register multimodal payloads with a `ContentStore`, how the `ContentHandle` indirection works, and how to declare tools that accept images, audio, video, documents, 3D models, or CAD files as inputs. ## Why content handles? LLM providers do not agree on how multimodal data crosses the wire. OpenAI exposes a Files API and accepts image URLs inline; Anthropic has its own Files API (beta) plus inline base64; Gemini has Files; fal.ai has its own object storage. On top of that, the model itself only ever emits JSON -- it has no mechanism to attach a 5 MB PNG to a tool call. A `ContentHandle` is the single source of truth that bridges those worlds. You `put` content into a store once and receive a stable handle. 
The handle is the only thing that travels through prompts, tool calls, and step boundaries. When the framework needs to render content for a specific provider, it asks the store to `resolve` the handle into whatever wire form fits -- a URL, base64, or a provider file ID. The same indirection lets tools accept content as input: a tool declares an `image_input` parameter, the model emits a handle ID as a JSON string, and the framework substitutes the resolved content before your tool handler runs. ## `ContentKind` Every piece of content has a kind. Tool-input declarations and store routing both branch on it. | Variant | Wire name | Typical MIME | |---|---|---| | `Image` | `image` | `image/png`, `image/jpeg`, `image/webp` | | `Audio` | `audio` | `audio/mpeg`, `audio/wav`, `audio/ogg` | | `Video` | `video` | `video/mp4`, `video/webm` | | `Document` | `document` | `application/pdf`, `text/plain` | | `ThreeDModel` | `three_d_model` | `model/gltf+json`, `model/obj` | | `Cad` | `cad` | `application/step`, `application/dxf` | | `Archive` | `archive` | `application/zip` | | `Font` | `font` | `font/ttf`, `font/woff2` | | `Code` | `code` | `text/x-python` | | `Data` | `data` | `application/json`, `text/csv` | | `Other` | `other` | unknown | Three lookup helpers cover the common parsing cases: ```python from blazen import ContentKind ContentKind.from_str("image") # ContentKind.Image ContentKind.from_str("three_d_model") # ContentKind.ThreeDModel ContentKind.from_mime("image/png") # ContentKind.Image ContentKind.from_extension("PNG") # ContentKind.Image (case-insensitive, no leading dot) ContentKind.from_mime("application/x-weird") # ContentKind.Other (fallback, never raises) ContentKind.ThreeDModel.name_str # "three_d_model" -- the canonical wire name ``` `from_str` raises `ValueError` on unknown wire names; `from_mime` and `from_extension` fall back to `ContentKind.Other` so they never blow up on unfamiliar inputs. ## `ContentStore` A store is the lifecycle manager for content: register bytes / URLs / paths, resolve handles into wire forms, fetch raw bytes back out, look up metadata, and clean up when you are done. ```python import asyncio from blazen import ContentKind, ContentStore async def main() -> None: store = ContentStore.in_memory() # 1. Register raw bytes -- store auto-detects kind/MIME if you omit the hints. handle = await store.put( b"\x89PNG\r\n\x1a\n...binary png data...", kind=ContentKind.Image, mime_type="image/png", display_name="logo.png", ) print(handle.id) # "blazen_a1b2c3d4..." print(handle.kind) # ContentKind.Image print(handle.mime_type) # "image/png" print(handle.byte_size) # populated when known print(handle.display_name) # "logo.png" # 2. Resolve it into a wire-renderable MediaSource dict. source = await store.resolve(handle) # In-memory store returns base64: # {"type": "base64", "data": "iVBORw0KGgo...", "media_type": "image/png"} # 3. Fetch the raw bytes back. Reference-only stores (URL / provider) may # raise UnsupportedError instead. raw = await store.fetch_bytes(handle) assert isinstance(raw, bytes) # 4. Cheap metadata lookup -- never materializes the bytes. meta = await store.metadata(handle) # {"kind": "image", "mime_type": "image/png", "byte_size": 12345, # "display_name": "logo.png"} # 5. Clean up. await store.delete(handle) asyncio.run(main()) ``` `put` accepts three body shapes: `bytes` (the content itself), a `str` (treated as a URL reference), or a `pathlib.Path` (read from disk on demand by stores that support it). 
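For example, a compact sketch of all three body shapes against a local-file store -- the URL and file paths below are placeholders, not special values:

```python
from pathlib import Path

from blazen import ContentKind, ContentStore

store = ContentStore.local_file("/var/lib/blazen/content")

# 1. Raw bytes -- the store persists the payload itself.
inline = await store.put(
    b"\x89PNG...raw bytes...",
    kind=ContentKind.Image,
    mime_type="image/png",
)

# 2. A str is kept as a URL reference -- nothing is uploaded.
linked = await store.put(
    "https://example.com/cat.png",
    kind=ContentKind.Image,
)

# 3. A pathlib.Path is read from disk on demand by stores that support it.
local = await store.put(
    Path("/home/me/diagram.png"),
    kind=ContentKind.Image,
    mime_type="image/png",
)
```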
All keyword hints (`kind`, `mime_type`, `display_name`, `byte_size`) are optional -- pass what you know, the store fills in what it can. `resolve` returns a serialized `MediaSource` dict whose `type` field tells you which variant came back: | `type` | Other fields | Used for | |---|---|---| | `url` | `url` | Public URL references (fal storage, user-supplied URLs). | | `base64` | `data`, `media_type` | Inline payloads (in-memory store, small files). | | `provider_file` | `provider`, `id` | Pre-uploaded provider Files API entries. | The framework picks the right wire form for each provider automatically when a store is wired to an agent; you rarely call `resolve` by hand outside of debugging. ## Built-in stores | Factory | Backing | Notes | |---|---|---| | `ContentStore.in_memory()` | Process memory | Ephemeral; resolves to base64. Good for tests and short-lived runs. | | `ContentStore.local_file(path)` | Filesystem rooted at `path` | Directory created recursively if missing. Resolves to base64 or path. | | `ContentStore.openai_files(api_key)` | OpenAI Files API | `put` uploads, `resolve` returns a `provider_file` entry. | | `ContentStore.anthropic_files(api_key)` | Anthropic Files API (beta) | Same shape as OpenAI's factory. | | `ContentStore.gemini_files(api_key)` | Google Gemini Files API | Same shape. | | `ContentStore.fal_storage(api_key)` | fal.ai object storage | Resolves to a `url` entry. | Each provider-file factory accepts an optional `base_url` keyword for self-hosted gateways or proxies. The `put` interface is identical across all of them: ```python from blazen import ContentKind, ContentStore # OpenAI Files API openai_store = ContentStore.openai_files(api_key="sk-...") handle = await openai_store.put( b"...bytes...", kind=ContentKind.Image, mime_type="image/png", ) # Anthropic Files API (beta) anthropic_store = ContentStore.anthropic_files(api_key="sk-ant-...") handle = await anthropic_store.put( b"...bytes...", kind=ContentKind.Document, mime_type="application/pdf", ) # Google Gemini Files API gemini_store = ContentStore.gemini_files(api_key="AIza...") handle = await gemini_store.put( b"...bytes...", kind=ContentKind.Video, mime_type="video/mp4", ) # fal.ai object storage (resolves to a public URL) fal_store = ContentStore.fal_storage(api_key="...") handle = await fal_store.put( b"...bytes...", kind=ContentKind.Image, mime_type="image/png", ) ``` A store-backed handle stays valid as long as the upstream entry exists. Provider-file stores will issue an upstream delete call when you call `await store.delete(handle)`; the in-memory and local-file stores drop the local entry. ## Custom stores Two equivalent paths cover arbitrary backends -- S3, an internal blob service, anything you can implement async I/O against. Both produce the same `ContentStore` Python object the rest of the framework consumes; the only difference is whether your code lives in standalone callables or in a class. **Path A -- callback factory.** `ContentStore.custom(...)` mirrors the Rust `CustomContentStore::builder` API. `put`, `resolve`, and `fetch_bytes` are required; `fetch_stream` and `delete` are optional and fall back to sane defaults if you omit them. ```python from blazen import ContentHandle, ContentStore # put receives a serialized ContentBody dict and a ContentHint dict. 
# Body shapes: # {"type": "bytes", "data": [...]} # {"type": "url", "url": "..."} # {"type": "local_path", "path": "..."} # {"type": "provider_file", "provider": "openai", "id": "..."} async def my_put(body: dict, hint: dict) -> ContentHandle: ... return ContentHandle( id="blazen_xxx", kind="image", mime_type="image/png", ) # resolve must return a serialized MediaSource dict. async def my_resolve(handle: ContentHandle) -> dict: return {"type": "url", "url": "https://example.com/blob.png"} async def my_fetch_bytes(handle: ContentHandle) -> bytes: return b"...bytes..." store = ContentStore.custom( put=my_put, resolve=my_resolve, fetch_bytes=my_fetch_bytes, # fetch_stream=... and delete=... are optional name="my_s3_store", ) ``` **Path B -- subclass.** `ContentStore` is subclassable. Override `put`, `resolve`, and `fetch_bytes`; optionally override `fetch_stream`, `delete`, and `metadata`. Subclasses that forget to override a required method get a clear `NotImplementedError` rather than silent recursion. ```python from blazen import ContentHandle, ContentStore class S3ContentStore(ContentStore): def __init__(self, bucket: str): super().__init__() self.bucket = bucket async def put(self, body, hint) -> ContentHandle: ... async def resolve(self, handle) -> dict: ... async def fetch_bytes(self, handle) -> bytes: ... # Optional overrides (defaults are reasonable if you skip them): async def fetch_stream(self, handle): ... async def delete(self, handle) -> None: ... ``` Pick whichever style matches your code -- the framework treats both identically. The callback form is convenient for one-off wiring inside a function; the subclass form is better when the backend carries state (clients, credentials, caches) that you want to keep in `__init__`. ## Tool inputs A tool that accepts content declares the parameter with one of the typed helpers. Each helper returns a JSON Schema fragment shaped like an object with a single required string property carrying the `x-blazen-content-ref` extension: ```python from blazen import image_input schema = image_input("photo", "The photo to analyze") # { # "type": "object", # "properties": { # "photo": { # "type": "string", # "description": "The photo to analyze", # "x-blazen-content-ref": {"kind": "image"} # } # }, # "required": ["photo"] # } ``` The `x-blazen-content-ref` extension is invisible to the model -- providers ignore unknown JSON Schema keys -- but the framework's tool-argument resolver looks for it to know which string properties hold handle IDs. When the model emits a tool call like `{"photo": "blazen_a1b2c3d4..."}` and an agent has a content store wired in, the resolver substitutes the resolved content before your handler runs. Your tool sees a dict shaped roughly like: ```python { "kind": "image", "handle_id": "blazen_a1b2c3d4...", "mime_type": "image/png", "byte_size": 12345, "display_name": "logo.png", "source": {"type": "base64", "data": "iVBOR...", "media_type": "image/png"}, } ``` The `source` field carries the same serialized `MediaSource` dict that `await store.resolve(handle)` would return. There is one helper per common kind. 
They all have the signature `(name: str, description: str) -> dict`: ```python from blazen import ( audio_input, cad_input, file_input, image_input, three_d_input, video_input, ) image_input("photo", "A photo to analyze") # kind: image audio_input("clip", "An audio clip to transcribe") # kind: audio video_input("scene", "A video scene to summarize") # kind: video file_input("doc", "A document to extract from") # kind: document three_d_input("mesh", "A 3D model to inspect") # kind: three_d_model cad_input("part", "A CAD part to validate") # kind: cad ``` For kinds without a dedicated helper (`Archive`, `Font`, `Code`, `Data`, `Other`), use `content_ref_property` to build the property fragment yourself, or `content_ref_required_object` to build a full object schema. See the next section. ## Composing extra fields Tools that take content **plus** other arguments need a richer schema than the single-property helpers produce. `content_ref_required_object` is the building block: it builds the same object shape as `image_input` & friends, but lets you mix in additional non-content properties via `extra_properties`. ```python from blazen import ContentKind, content_ref_required_object schema = content_ref_required_object( "photo", ContentKind.Image, "Photo to analyze", extra_properties={ "note": { "type": "string", "description": "Optional caller-supplied note about the photo.", }, "include_exif": { "type": "boolean", "description": "Include EXIF metadata in the response.", }, }, ) # { # "type": "object", # "properties": { # "photo": {"type": "string", "description": "Photo to analyze", # "x-blazen-content-ref": {"kind": "image"}}, # "note": {"type": "string", "description": "..."}, # "include_exif": {"type": "boolean", "description": "..."} # }, # "required": ["photo"] # } ``` Only the content reference is added to `required` automatically; mark extra properties as required by adding them to the schema you build around it (or, more often, leave them optional and validate inside the handler). If you need just the property fragment -- because you are already building a larger object schema by hand -- use `content_ref_property` instead: ```python from blazen import ContentKind, content_ref_property photo_prop = content_ref_property(ContentKind.Image, "Photo to analyze") # {"type": "string", "description": "Photo to analyze", # "x-blazen-content-ref": {"kind": "image"}} custom_schema = { "type": "object", "properties": { "photo": photo_prop, "note": {"type": "string"}, }, "required": ["photo", "note"], } ``` `content_ref_property` is what `image_input`, `audio_input`, etc. use internally for the inner property; `content_ref_required_object` is what they use for the outer object. Reach for them when the dedicated helpers do not give you enough room. ## Tool results with multimodal Tool results can carry multimodal content too -- a tool that runs OCR or generates an image returns a `ToolOutput` whose `llm_override` is a multi-part `LlmPayload`. The framework serializes those parts correctly across every provider (Anthropic gets a native multi-part tool result; others receive a follow-up user message). See the cross-cutting [tool multimodal guide](/guides/tool-multimodal/) for the full pattern. ## Pre-resolving handles before sending When a `ContentStore` is wired into an agent, the framework resolves every handle that crosses the wire automatically: tool-call arguments get substituted before your handler runs, and content parts in messages get rendered in whatever form the active provider expects. 
There is no separate Python `resolve_tool_arguments` you have to call -- that machinery lives below the binding and fires implicitly. See the [Python API reference](/api/python/) for the full set of agent-construction signatures.

If you need a wire-form representation by hand (for logging, snapshotting, or custom transport), `await store.resolve(handle)` returns the same serialized `MediaSource` dict the framework uses internally.

## Streaming large content

Blazen's content layer streams chunk-by-chunk in both directions through the Python binding. `fetch_stream` callbacks can return `AsyncIterator[bytes]`, and a streaming `put` body arrives as an `AsyncByteIter` you iterate with `async for`.

**Downloading.** Call `fetch_stream(handle)` on any `ContentStore` wrapper to iterate the bytes:

```python
stream = await store.fetch_stream(handle)
async for chunk in stream:
    process(chunk)
```

When you implement a custom store via `ContentStore.custom(fetch_stream=...)` or override `fetch_stream` on a subclass, you have two options:

1. Return `bytes` for a single buffered chunk -- still supported.
2. Return an async generator (or anything implementing `__aiter__` and yielding `bytes` / `bytearray` / buffer-protocol objects) for chunk-by-chunk delivery:

```python
class S3ContentStore(ContentStore):
    async def fetch_stream(self, handle):
        async for chunk in self.s3.get_object_stream(handle.id):
            yield chunk
```

If you omit `fetch_stream` entirely, the framework falls back to `fetch_bytes`.

**Uploading.** When upstream Rust code hands your custom store a `ContentBody::Stream`, your `put(body, hint)` callback receives a body shaped `{"type": "stream", "stream": AsyncByteIter, "size_hint": int | None}`. Iterate `body["stream"]` to consume chunks without buffering the whole payload:

```python
class S3ContentStore(ContentStore):
    async def put(self, body, hint):
        if body["type"] == "stream":
            async for chunk in body["stream"]:
                self.uploader.append(chunk)
            return self.uploader.finish()
        # bytes / url / local_path / provider_file paths handled below...
```

Backpressure is honored across the FFI boundary via a small bounded channel (4 chunks), so a slow consumer pauses the producer naturally.

**Built-in streaming stores.** These pull bytes from the network or disk chunk-by-chunk so any `fetch_stream` call against them streams end-to-end with no host code involved:

| Store | `fetch_stream` |
|---|---|
| `local_file` | streamed (file read in chunks) |
| `openai_files` | streamed (HTTP response body) |
| `anthropic_files` | streamed (HTTP response body) |
| `fal_storage` | streamed (HTTP response body) |
| `in_memory` | buffered (no underlying source to stream from) |
| `gemini_files` | buffered (Gemini Files exposes no download endpoint) |

## See also

- [Tool multimodal](/guides/tool-multimodal/) -- cross-cutting guide for tool results that return images, audio, etc.
- [Python API reference](/api/python/) -- full signatures for `ContentStore`, `ContentHandle`, and the input helpers.
- [Middleware & composition](/guides/python/middleware/) -- layer retry / cache / fallback around models that consume multimodal content.

---

# Distributed Workflows

Source: https://blazen.dev/docs/guides/distributed
Section: guides

Blazen workflows normally run in a single process. The `blazen-peer` crate extends this to multiple machines: a parent workflow on machine A can delegate a sub-workflow to machine B over gRPC, get the result back, and lazily dereference any session refs that stayed on the remote peer.
## When to use distributed workflows - **Privacy boundaries.** Keep sensitive data on one machine and run the steps that touch it there, while the orchestrating workflow lives elsewhere. - **Cost optimization.** Route GPU-intensive steps (embedding generation, image diffusion) to machines with the right hardware instead of paying for GPU on every node. - **Hardware-specific steps.** Some steps require specialized hardware (TPUs, large-memory instances, local SSDs). Run those steps on the machine that has the hardware. - **Regulatory compliance.** Data residency requirements may mandate that certain processing happens in a specific region or on specific infrastructure. ## How it works 1. The parent builds a `SubWorkflowRequest` containing a workflow name, an ordered list of step IDs, and a JSON input. 2. The request is sent over a tonic gRPC channel (HTTP/2, optional mTLS) to the peer. 3. The peer resolves each step ID against its local step registry, assembles a `Workflow`, and runs it. 4. The peer returns a `SubWorkflowResponse` with the terminal result, exported state values, and `RemoteRefDescriptor` handles for any session refs that could not be serialized inline. 5. The parent can dereference remote refs lazily over the same channel, and release them when done. ## Setup (Rust) ### 1. Register steps on both sides Each machine registers the steps it can execute in the global step registry: ```rust use blazen_core::register_step_builder; // On the peer (machine B) -- register the steps it will run. register_step_builder("my_app::analyze", my_analyze_step_builder); register_step_builder("my_app::summarize", my_summarize_step_builder); ``` ### 2. Start the peer server ```rust use blazen_peer::BlazenPeerServer; let server = BlazenPeerServer::new("node-b"); server.serve("0.0.0.0:50051".parse()?).await?; ``` The server uses an internal `SessionRefRegistry` by default. To share a registry with in-process workflows running on the same machine, call `.with_session_refs(arc_registry)`. ### 3. Connect the client and invoke ```rust use blazen_peer::BlazenPeerClient; use blazen_peer::SubWorkflowRequest; let mut client = BlazenPeerClient::connect("http://peer-b:50051", "node-a").await?; let input = serde_json::json!({ "document": "..." }); let request = SubWorkflowRequest::new( "analyze-pipeline", vec!["my_app::analyze".to_string(), "my_app::summarize".to_string()], &input, Some(60), // timeout in seconds )?; let response = client.invoke_sub_workflow(request).await?; if let Some(err) = &response.error { eprintln!("remote workflow failed: {err}"); } else { let result = response.result_value()?; println!("result: {result:?}"); } ``` ## Python example (planned API) Python bindings for `blazen-peer` are not yet wired. The planned API will expose `PeerClient` and integrate with `Workflow.run_remote`: ```python from blazen import Workflow, step, Event, StopEvent, Context from blazen.peer import PeerClient # Connect to a remote peer. client = await PeerClient.connect("http://peer-b:50051", node_id="node-a") # Define a local workflow that delegates to the peer. 
@step async def orchestrate(ctx: Context, ev: Event): result = await client.invoke_sub_workflow( workflow_name="analyze-pipeline", step_ids=["my_app::analyze", "my_app::summarize"], input={"document": ev.document}, timeout_secs=60, ) return StopEvent(result=result) wf = Workflow("distributed-example", [orchestrate]) handler = await wf.run(document="...") result = await handler.result() ``` Until the bindings land, you can call the Rust API directly from a Rust workflow step and expose the result to Python via the existing `StopEvent.result` bridge. ## Node.js example (planned API) Node.js bindings follow the same pattern: ```typescript import { Workflow, CompletionModel } from "blazen"; import { PeerClient } from "blazen/peer"; const client = await PeerClient.connect("http://peer-b:50051", "node-a"); const wf = new Workflow("distributed-example"); wf.addStep("orchestrate", ["blazen::StartEvent"], async (event, ctx) => { const result = await client.invokeSubWorkflow({ workflowName: "analyze-pipeline", stepIds: ["my_app::analyze", "my_app::summarize"], input: { document: event.document }, timeoutSecs: 60, }); return { type: "blazen::StopEvent", result }; }); const result = await wf.run({ document: "..." }); console.log(result.data); ``` Until the Node bindings land, use the Rust core directly via a native addon or call the gRPC endpoint with any Node gRPC client library (`@grpc/grpc-js`). ## Session refs across machines When a sub-workflow on the peer creates a session ref -- for example, a model weight cache or a GPU-resident tensor -- the value stays on the peer. The parent receives a `RemoteRefDescriptor` containing: | Field | Type | Description | |---|---|---| | `origin_node_id` | `String` | Stable identifier of the node that owns the value. | | `type_tag` | `String` | Type tag from `SessionRefSerializable::blazen_type_tag`. Used by the parent's deserializer to rehydrate the bytes. | | `created_at_epoch_ms` | `u64` | Wall-clock creation time on the origin node. Useful for tracing and TTL bookkeeping. | ### Lazy dereference The parent can fetch the underlying bytes at any time by calling `deref_session_ref` with the ref's `RegistryKey` (the UUID from the `SubWorkflowResponse.remote_refs` map): ```rust use blazen_core::session_ref::RegistryKey; for (uuid, descriptor) in &response.remote_refs { let key = RegistryKey(*uuid); let bytes = client.deref_session_ref(key).await?; // Deserialize `bytes` using the deserializer keyed by `descriptor.type_tag`. } ``` ### Release When the parent no longer needs a remote ref, it should release it so the peer can free the memory: ```rust let was_released = client.release_session_ref(key).await?; ``` Returns `true` if the ref was found and dropped, `false` if it was already gone (expired by lifetime policy or released by another caller). ### RefLifetime interaction The `RefLifetime` policy on each session ref controls when it is automatically purged on the peer: - **`UntilContextDrop`** (default) -- purged when the sub-workflow finishes. The ref must be serialized into the response or it is lost. - **`UntilParentFinish`** -- survives the sub-workflow. Available for lazy `DerefSessionRef` until the parent explicitly releases it. This is the recommended policy for distributed workflows. - **`UntilExplicitDrop`** -- never purged automatically. The parent must call `ReleaseSessionRef`. 
Set the lifetime when inserting a ref in your step: ```rust use blazen_core::session_ref::RefLifetime; ctx.session_refs() .insert_with_lifetime(my_value, RefLifetime::UntilParentFinish) .await; ``` ## mTLS configuration for production In production, always enable mutual TLS between peers. The `blazen_peer::tls` module provides helpers that read PEM files and produce tonic TLS configs: ```rust use std::path::Path; use blazen_peer::tls::{load_server_tls, load_client_tls}; // Server side let server_tls = load_server_tls( Path::new("/certs/server.crt"), Path::new("/certs/server.key"), Path::new("/certs/ca.crt"), )?; // Client side let client_tls = load_client_tls( Path::new("/certs/client.crt"), Path::new("/certs/client.key"), Path::new("/certs/ca.crt"), )?; ``` Both sides must present certificates signed by the same CA. In Kubernetes, use [cert-manager](https://cert-manager.io/) with a shared `Issuer` to automate certificate issuance and rotation for each peer pod. When mTLS is not practical (development, internal networks), peers can authenticate with a shared secret token by setting the `BLAZEN_PEER_TOKEN` environment variable on both sides. See `blazen_peer::auth::resolve_peer_token`. ## Error handling The `PeerError` enum covers all failure modes: | Variant | Meaning | |---|---| | `PeerError::UnknownStep` | The peer does not have a requested step ID registered. | | `PeerError::EnvelopeVersion` | The client sent an envelope version newer than the server supports. | | `PeerError::Workflow` | The remote workflow ran but produced an error. | | `PeerError::Transport` | Network-level failure (connection refused, timeout, TLS handshake failure). | | `PeerError::Encode` | Postcard serialization or deserialization failed. | | `PeerError::Tls` | TLS configuration error (bad PEM, missing key file). | Additionally, the `SubWorkflowResponse.error` field carries workflow-level errors as a string when the sub-workflow itself fails (as opposed to transport or protocol errors). ## Envelope versioning Every wire payload carries an `envelope_version` field (currently `1`). Adding optional fields at the end of a struct is forward-compatible and does not require a version bump. Renaming, reordering, or removing fields is a breaking change that requires incrementing the `ENVELOPE_VERSION` constant. The server rejects payloads with a version newer than it supports (`FAILED_PRECONDITION`) but accepts all older versions. --- # Media Generation Source: https://blazen.dev/docs/guides/media-generation Section: guides Blazen ships a unified compute provider for media generation through [fal.ai](https://fal.ai/) -- 600+ models covering image synthesis, video, TTS, music, 3D, background removal, and upscaling. The same `FalProvider` also acts as an `EmbeddingModel` and `CompletionModel`, so a single handle covers every fal capability. ## Overview The provider implements a family of capability traits (`ImageGeneration`, `VideoGeneration`, `AudioGeneration`, `ThreeDGeneration`, `Transcription`, `BackgroundRemoval`). Each capability takes a typed request (`ImageRequest`, `VideoRequest`, `SpeechRequest`, `MusicRequest`, `ThreeDRequest`) and returns a typed result containing one or more `MediaOutput` objects with a URL, base64 payload, or raw text content. Authentication: pass an API key via options, or set the `FAL_KEY` environment variable. 
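For example, a minimal Python sketch that wires the environment variable through explicitly -- the same pattern the Rust example below uses -- with the provider construction shown throughout this page:

```python
import os

from blazen import FalOptions, FalProvider

# Read FAL_KEY from the environment and pass it through options;
# a literal key string works the same way.
fal = FalProvider(options=FalOptions(api_key=os.environ["FAL_KEY"]))
```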
## Image generation ```python from blazen import FalProvider, FalOptions, ImageRequest fal = FalProvider(options=FalOptions(api_key="fal-...")) result = await fal.generate_image(ImageRequest( prompt="a cat astronaut on Mars, cinematic lighting", width=1024, height=1024, num_images=2, )) for img in result.images: print(img.media.url, img.width, img.height) ``` ```typescript import { FalProvider } from "blazen"; const fal = FalProvider.create({ apiKey: "fal-..." }); const result = await fal.generateImage({ prompt: "a cat astronaut on Mars, cinematic lighting", width: 1024, height: 1024, numImages: 2, }); for (const img of result.images) { console.log(img.media.url, img.width, img.height); } ``` ```rust use blazen_llm::compute::{ImageGeneration, ImageRequest}; use blazen_llm::providers::fal::FalProvider; let fal = FalProvider::new(std::env::var("FAL_KEY")?); let result = fal .generate_image( ImageRequest::new("a cat astronaut on Mars, cinematic lighting") .with_size(1024, 1024) .with_count(2), ) .await?; for img in &result.images { if let Some(url) = &img.media.url { println!("{url}"); } } ``` ## Upscaling and background removal ```python from blazen import UpscaleRequest, BackgroundRemovalRequest upscaled = await fal.upscale_image(UpscaleRequest( image_url="https://example.com/small.png", scale=4.0, )) no_bg = await fal.remove_background(BackgroundRemovalRequest( image_url="https://example.com/product.jpg", )) ``` ```typescript const upscaled = await fal.upscaleImage({ imageUrl: "https://example.com/small.png", scale: 4, }); const noBg = await fal.removeBackground({ imageUrl: "https://example.com/product.jpg", }); ``` `FalProvider` also exposes `upscale_image_aura`, `upscale_image_clarity`, and `upscale_image_creative` for the respective fal upscaler apps. 
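If you want one of those specific upscaler apps, a hedged sketch of the call -- assuming here that the app-specific methods accept the same `UpscaleRequest` as `upscale_image`:

```python
from blazen import UpscaleRequest

# Assumption: the app-specific upscalers take the same UpscaleRequest
# as upscale_image; only the method name differs.
upscaled = await fal.upscale_image_aura(UpscaleRequest(
    image_url="https://example.com/small.png",
    scale=4.0,
))
```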
## Video generation Both text-to-video and image-to-video are supported: ```python from blazen import VideoRequest clip = await fal.text_to_video(VideoRequest( prompt="a drone flying through a sunlit forest", duration_seconds=5.0, width=1920, height=1080, )) print(clip.video.media.url, clip.video.duration_seconds) from_image = await fal.image_to_video(VideoRequest( prompt="animate this painting", image_url="https://example.com/input.png", duration_seconds=4.0, )) ``` ```typescript const clip = await fal.textToVideo({ prompt: "a drone flying through a sunlit forest", durationSeconds: 5, width: 1920, height: 1080, }); const fromImage = await fal.imageToVideo({ prompt: "animate this painting", imageUrl: "https://example.com/input.png", durationSeconds: 4, }); ``` ## Text-to-speech, music, and sound effects ```python from blazen import SpeechRequest, MusicRequest speech = await fal.text_to_speech(SpeechRequest( text="Hello, world!", voice="af_heart", speed=1.0, )) audio_url = speech.audio[0].media.url music = await fal.generate_music(MusicRequest( prompt="upbeat lo-fi hip-hop", duration_seconds=30.0, )) sfx = await fal.generate_sfx(MusicRequest(prompt="thunder clap")) ``` ```typescript const speech = await fal.textToSpeech({ text: "Hello, world!", voice: "af_heart", speed: 1, }); const music = await fal.generateMusic({ prompt: "upbeat lo-fi hip-hop", durationSeconds: 30, }); const sfx = await fal.generateSfx({ prompt: "thunder clap" }); ``` ```rust use blazen_llm::compute::{AudioGeneration, MusicRequest, SpeechRequest}; let speech = fal .text_to_speech( SpeechRequest::new("Hello, world!") .with_voice("af_heart") .with_speed(1.0), ) .await?; let music = fal .generate_music(MusicRequest::new("upbeat lo-fi hip-hop").with_duration(30.0)) .await?; ``` ## 3D generation ```python from blazen import ThreeDRequest mesh = await fal.generate_3d(ThreeDRequest( prompt="a low-poly spaceship", format="glb", )) from_image = await fal.generate_3d(ThreeDRequest.from_image( "https://example.com/photo.png", ).with_format("obj")) ``` ```typescript const mesh = await fal.generate3d({ prompt: "a low-poly spaceship", format: "glb", }); ``` ## Output format Every result wraps one or more `MediaOutput` records. Each output exposes: | Field | Type | Description | |---|---|---| | `url` | `str \| None` | Downloadable URL if the provider returned one. | | `base64` | `str \| None` | Inline base64 payload, when the provider returned raw bytes. | | `raw_content` | `str \| None` | Raw text for text-based formats (SVG, GLTF JSON, OBJ). | | `media_type` | `MediaType` | Format enum plus `mime()` / `extension()` / `is_image()` helpers. | | `file_size` | `int \| None` | Byte count if reported. | | `metadata` | `dict` | Arbitrary provider-specific fields. | ## See also - [Transcription](/guides/transcription/) -- convert audio to text with fal or whisper.cpp - [Custom Providers](/guides/custom-providers/) -- wrap your own image/video/audio backend - [Batch Completions](/guides/batch-processing/) -- run many LLM prompts concurrently --- # Audio Transcription Source: https://blazen.dev/docs/guides/transcription Section: guides Blazen's `Transcription` provider converts audio into text with optional timestamped segments, language detection, and speaker diarization. Two backends ship out of the box: - **fal.ai** -- remote Whisper hosted on fal's compute platform. Accepts URL audio sources. - **whisper.cpp** -- fully local, offline transcription. Accepts local file paths only. 
Both expose the same `Transcription` handle, the same `TranscriptionRequest`, and return the same `TranscriptionResult` shape. ## Overview `TranscriptionRequest` carries either a URL (`audio_url`) or a local path (via `from_file`). The result bundles the raw text, timestamped segments, and detected language along with the usual `timing` / `cost` / `metadata` block. ## fal.ai (remote) ```python from blazen import Transcription, TranscriptionRequest, FalOptions transcriber = Transcription.fal(options=FalOptions(api_key="fal-...")) result = await transcriber.transcribe(TranscriptionRequest( audio_url="https://example.com/interview.mp3", language="en", diarize=True, )) print(result.text) for seg in result.segments: speaker = seg.speaker or "unknown" print(f"[{seg.start_seconds:.2f}-{seg.end_seconds:.2f}] {speaker}: {seg.text}") ``` ```typescript import { Transcription } from "blazen"; const transcriber = Transcription.fal({ apiKey: "fal-..." }); const result = await transcriber.transcribe({ audioUrl: "https://example.com/interview.mp3", language: "en", diarize: true, }); console.log(result.text); for (const seg of result.segments) { console.log(`[${seg.startSeconds}-${seg.endSeconds}]`, seg.speaker ?? "unknown", seg.text); } ``` ```rust use blazen_llm::compute::{Transcription, TranscriptionRequest}; use blazen_llm::providers::fal::FalProvider; let fal = FalProvider::new(std::env::var("FAL_KEY")?); let result = fal .transcribe( TranscriptionRequest::new("https://example.com/interview.mp3") .with_language("en") .with_diarize(true), ) .await?; println!("{}", result.text); for seg in &result.segments { println!("[{:.2}-{:.2}] {}", seg.start_seconds, seg.end_seconds, seg.text); } ``` ## whisper.cpp (local) whisper.cpp runs entirely on-device. The first call downloads the GGML model file (32 MB for `Tiny`, up to 3.1 GB for `LargeV3`) into the cache directory and reuses it afterwards. No API key or network access is required for subsequent runs. Audio input must be **16-bit PCM mono WAV at 16 kHz**. URL sources are not supported -- use `TranscriptionRequest.from_file` with a local path. ```python from blazen import Transcription, TranscriptionRequest, WhisperOptions, WhisperModel transcriber = Transcription.whispercpp(options=WhisperOptions( model=WhisperModel.Base, language="en", cache_dir="/tmp/whisper-models", )) result = await transcriber.transcribe( TranscriptionRequest.from_file("/path/to/audio.wav") ) print(result.text) ``` ```typescript import { Transcription } from "blazen"; // Node.js currently exposes whisper.cpp via the Rust crate feature flag -- // see the Rust example below for full control over model size and device. const transcriber = Transcription.fal(); // for remote ``` ```rust use blazen_audio_whispercpp::{WhisperCppProvider, WhisperOptions, WhisperModel}; use blazen_llm::compute::{Transcription, TranscriptionRequest}; let provider = WhisperCppProvider::new( WhisperOptions::new() .with_model(WhisperModel::Base) .with_language("en"), )?; let result = provider .transcribe(TranscriptionRequest::from_file("/path/to/audio.wav")) .await?; println!("{}", result.text); ``` ### Model sizes | Variant | Parameters | Download size | Relative speed | |---|---|---|---| | `Tiny` | 39 M | ~32 MB | fastest | | `Base` | 74 M | ~74 MB | fast | | `Small` | 244 M | ~244 MB | balanced | | `Medium` | 769 M | ~769 MB | slower | | `LargeV3` | 1550 M | ~3.1 GB | highest quality | Enable GPU acceleration with the `cuda`, `metal`, or `coreml` feature flags on `blazen-audio-whispercpp` at build time. 
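Because whisper.cpp only accepts 16 kHz mono 16-bit WAV, source audio usually needs a conversion step first. A sketch of one common approach, shelling out to the external `ffmpeg` CLI (not part of Blazen) before handing the file to the transcriber constructed above:

```python
import subprocess

from blazen import TranscriptionRequest

def to_whisper_wav(src: str, dst: str) -> str:
    """Convert any ffmpeg-readable audio file to 16 kHz mono 16-bit PCM WAV."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst],
        check=True,
    )
    return dst

# `transcriber` is the whisper.cpp Transcription handle from the snippet above.
wav_path = to_whisper_wav("/path/to/interview.mp3", "/tmp/interview-16k.wav")
result = await transcriber.transcribe(TranscriptionRequest.from_file(wav_path))
```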
## TranscriptionResult shape

| Field | Type | Description |
|---|---|---|
| `text` | `str` | Full transcript concatenated from all segments. |
| `segments` | `list[TranscriptionSegment]` | Timestamped utterances with optional `speaker` labels when diarization is enabled. |
| `language` | `str \| None` | Detected ISO 639-1 language code. |
| `timing` | `RequestTiming` | Queue, execution, and total latency breakdown. |
| `cost` | `float \| None` | USD cost if reported by the provider. |
| `metadata` | `dict` | Raw provider-specific fields. |

## Custom backends

Subclass `Transcription` (Python/Node) or implement the `Transcription` trait (Rust) to plug in your own provider -- AssemblyAI, Deepgram, a self-hosted Whisper endpoint, etc. See [Custom Providers](/guides/custom-providers/) for the full pattern.

## See also

- [Media Generation](/guides/media-generation/) -- TTS, music, image/video generation via the same provider family
- [Local Inference](/guides/local-inference/) -- model loading and VRAM management for on-device backends
- [Custom Providers](/guides/custom-providers/) -- bring your own transcription backend

---

# Memory & Semantic Search

Source: https://blazen.dev/docs/guides/memory
Section: guides

`Memory` is Blazen's document store for retrieval-augmented generation (RAG), chat history, and similarity search. It pairs an optional `EmbeddingModel` with a pluggable `MemoryBackend` and indexes entries using [ELID](https://crates.io/crates/elid) (embedding-based) plus SimHash (local) for fast approximate nearest-neighbor lookup.

## Overview

`Memory` operates in two modes:

- **Full mode** (`Memory(embedder, backend)`) -- an embedding model produces dense vectors. Both semantic `search()` and lightweight `search_local()` are available.
- **Local-only mode** (`Memory.local(backend)`) -- no embedder. Only `search_local()` works, using character-level SimHash. Cheap, fast, and useful for fuzzy string matching when you don't want the cost of an embedding call on every query.

Every entry carries an `id`, `text`, and optional `metadata` dict. Metadata filters let you scope queries to a subset of the store.

## Basic usage

```python
from blazen import Memory, InMemoryBackend, EmbeddingModel, ProviderOptions

embedder = EmbeddingModel.openai(options=ProviderOptions(api_key="sk-..."))
memory = Memory(embedder, InMemoryBackend())

await memory.add("paris", "Paris is the capital of France.", {"category": "geo"})
await memory.add("rome", "Rome is the capital of Italy.", {"category": "geo"})
await memory.add("python", "Python is a programming language.", {"category": "tech"})

results = await memory.search("capital city in Europe", limit=2)
for r in results:
    print(f"{r.score:.3f} {r.id} {r.text}")
```

```typescript
import { Memory, InMemoryBackend, EmbeddingModel } from "blazen";

const embedder = EmbeddingModel.openai({ apiKey: "sk-..." });
const memory = new Memory(embedder, new InMemoryBackend());

await memory.add("paris", "Paris is the capital of France.", { category: "geo" });
await memory.add("rome", "Rome is the capital of Italy.", { category: "geo" });
await memory.add("python", "Python is a programming language.", { category: "tech" });

const results = await memory.search("capital city in Europe", 2);
for (const r of results) {
  console.log(r.score.toFixed(3), r.id, r.text);
}
```

```rust
use blazen_memory::{InMemoryBackend, Memory, MemoryEntry, MemoryStore};
use blazen_llm::EmbeddingModel;
use std::sync::Arc;

let embedder: Arc<dyn EmbeddingModel> = /* ...
*/; let memory = Memory::new(embedder, InMemoryBackend::new()); memory .add(vec![ MemoryEntry::new("Paris is the capital of France.").with_id("paris"), MemoryEntry::new("Rome is the capital of Italy.").with_id("rome"), ]) .await?; let results = memory.search("capital city in Europe", 2, None).await?; for r in results { println!("{:.3} {} {}", r.score, r.id, r.text); } ``` ## Metadata filtering Metadata filters are a "superset" match: entries whose metadata contains every key/value pair in the filter are returned. Other keys are ignored. ```python geo_only = await memory.search( "European city", limit=5, metadata_filter={"category": "geo"}, ) ``` ```typescript const geoOnly = await memory.search("European city", 5, { category: "geo" }); ``` ## Local (SimHash) search When you don't have (or don't want) an embedding model: ```python memory = Memory.local(InMemoryBackend()) await memory.add("greeting", "Hello world!") hits = await memory.search_local("hello", limit=5) ``` ```typescript const memory = Memory.local(new InMemoryBackend()); await memory.add("greeting", "Hello world!"); const hits = await memory.searchLocal("hello", 5); ``` ## Browser / WASM construction The `@blazen/sdk` package ships a standalone `InMemoryBackend` class plus convenience factory methods on `Memory` that take an explicit backend instance. This is the recommended construction path in the browser, since it lets you hold a reference to the backend (for inspection, replacement, or sharing) instead of having `Memory` own it implicitly. ```typescript import { InMemoryBackend, Memory } from "@blazen/sdk"; const backend = new InMemoryBackend(); const memory = Memory.fromBackend(embeddingModel, backend); ``` For SimHash-only local search without an embedder, use `Memory.localFromBackend`: ```typescript import { InMemoryBackend, Memory } from "@blazen/sdk"; const backend = new InMemoryBackend(); const localMemory = Memory.localFromBackend(backend); await localMemory.add("greeting", "Hello world!"); ``` Queries return `MemoryResult` instances -- a typed value class with `id`, `content`, `score`, and metadata fields. Prefer it over loose objects when you want the TypeScript compiler to catch typos in result handling: ```typescript import type { MemoryResult } from "@blazen/sdk"; const results: MemoryResult[] = await memory.query("capital city in Europe", 2); for (const r of results) { console.log(r.score.toFixed(3), r.id, r.content); } ``` The full `MemoryResult` field list is in the generated types at `crates/blazen-wasm-sdk/pkg/blazen_wasm_sdk.d.ts`. This shape is wasm-sdk specific. Node, Python, and Rust use the existing `MemoryBackend` trait/class hierarchy described below, with their own `Memory` constructors. ## Built-in backends | Backend | Storage | Notes | |---|---|---| | `InMemoryBackend` | Process memory | Fastest; vanishes on shutdown. | | `JsonlBackend` | JSONL file on disk | Loads on startup, appends on insert, rewrites on update/delete. | | `ValkeyBackend` | [Valkey](https://valkey.io/) / Redis | Shared across processes; durable when Valkey persists. 
| ```python from blazen import JsonlBackend, ValkeyBackend jsonl = JsonlBackend("./memory.jsonl") valkey = await ValkeyBackend.connect("redis://localhost:6379", namespace="prod:memory") memory_a = Memory(embedder, jsonl) memory_b = Memory(embedder, valkey) ``` ```typescript const jsonl = await JsonlBackend.create("./memory.jsonl"); const memory = Memory.withJsonl(embedder, jsonl); ``` ## Custom backends Subclass `MemoryBackend` to plug in Postgres, DynamoDB, SQLite, or any other store. The backend must implement `put`, `get`, `delete`, `list`, `len`, and `search_by_bands` -- Blazen calls `search_by_bands` with the LSH hashes it needs to resolve and does final similarity ranking in-process. ```python from blazen import MemoryBackend class SqliteBackend(MemoryBackend): def __init__(self, conn): super().__init__() self._conn = conn async def put(self, entry): self._conn.execute( "INSERT OR REPLACE INTO entries(id, text, metadata, bands) VALUES (?,?,?,?)", (entry["id"], entry["text"], json.dumps(entry["metadata"]), json.dumps(entry["bands"])), ) async def get(self, id): row = self._conn.execute("SELECT * FROM entries WHERE id=?", (id,)).fetchone() return None if row is None else row_to_entry(row) async def delete(self, id): ... async def list(self): ... async def len(self): ... async def search_by_bands(self, bands, limit): ... memory = Memory(embedder, SqliteBackend(sqlite3.connect(":memory:"))) ``` See [Custom Providers](/guides/custom-providers/) for the full subclassing pattern, including error handling and lifecycle expectations. ## CRUD operations ```python await memory.add("doc1", "text...", metadata={"tag": "v1"}) entry = await memory.get("doc1") # { id, text, metadata, ... } or None deleted = await memory.delete("doc1") # -> bool count = await memory.count() # -> int ``` ```typescript await memory.add("doc1", "text...", { tag: "v1" }); const entry = await memory.get("doc1"); // JsMemoryEntry | null const deleted = await memory.delete("doc1"); // boolean const count = await memory.count(); // number ``` ## See also - [Embeddings](/guides/python/embeddings/) -- building blocks for the `Memory` embedder - [Custom Providers](/guides/custom-providers/) -- subclassing `MemoryBackend` and friends - [Local Inference](/guides/local-inference/) -- drop `EmbeddingModel.embed()` in for offline semantic search --- # Prompt Templates Source: https://blazen.dev/docs/guides/prompts Section: guides Blazen ships a lightweight prompt templating layer for reusing system prompts and user-message scaffolds across steps, agents, and workflows. Templates use `{{variable}}` placeholders, carry a chat role, and render into `ChatMessage` objects that plug straight into `CompletionModel.complete()`. ## Overview - **`PromptTemplate`** -- a named template with a role (`system` / `user` / `assistant`), a template string, and the set of variables it needs. - **`PromptRegistry`** -- a versioned collection of templates loadable from YAML or JSON files and indexed by name. ## Basic usage ```python from blazen import PromptTemplate, ChatMessage template = PromptTemplate( "Summarise the following {{doc_type}} in {{style}} style.", role="system", name="summariser", ) message = template.render({"doc_type": "article", "style": "concise"}) print(message.content) # -> "Summarise the following article in concise style." 
``` ```typescript import { PromptTemplate } from "blazen"; const template = new PromptTemplate( "Summarise the following {{doc_type}} in {{style}} style.", { role: "system", name: "summariser" }, ); const msg = template.render({ doc_type: "article", style: "concise" }); console.log(msg.content); ``` Pass the rendered `ChatMessage` straight into the model: ```python from blazen import CompletionModel, ChatMessage model = CompletionModel.openai() response = await model.complete([ template.render({"doc_type": "article", "style": "concise"}), ChatMessage.user(document_text), ]) ``` ## Registering templates A `PromptRegistry` indexes named templates and supports version pinning. Use it when your application has more than a handful of prompts or when prompts are maintained by a separate team. ```python from blazen import PromptRegistry, PromptTemplate registry = PromptRegistry() registry.register("summariser", PromptTemplate( "Summarise the following {{doc_type}}.", role="system", )) registry.register("translator", PromptTemplate( "Translate the following text into {{lang}}.", role="system", )) msg = registry.render("summariser", {"doc_type": "meeting transcript"}) ``` ```typescript import { PromptRegistry, PromptTemplate } from "blazen"; const registry = new PromptRegistry(); registry.register( "summariser", new PromptTemplate("Summarise the following {{doc_type}}.", { role: "system" }), ); registry.register( "translator", new PromptTemplate("Translate the following text into {{lang}}.", { role: "system" }), ); const msg = registry.render("summariser", { doc_type: "meeting transcript" }); ``` ## Loading from disk YAML (or JSON) files let you keep prompts out of the codebase: ```yaml # prompts/summariser.yaml name: summariser version: "2" role: system template: | Summarise the following {{doc_type}} in {{style}} style. Limit your response to {{max_words}} words. ``` ```python from blazen import PromptRegistry registry = PromptRegistry.from_file("prompts/summariser.yaml") # Or load every .yaml/.yml/.json file in a directory: registry = PromptRegistry.from_dir("prompts") ``` ```typescript const registry = PromptRegistry.fromFile("prompts/summariser.yaml"); // Or: const dirRegistry = PromptRegistry.fromDir("prompts"); ``` ## Variables Templates extract variable names eagerly at construction time, so you can inspect them and validate inputs up-front: ```python tmpl = PromptTemplate("Hello {{name}}, welcome to {{place}}!") tmpl.variables # -> ["name", "place"] ``` ```typescript const tmpl = new PromptTemplate("Hello {{name}}, welcome to {{place}}!"); console.log(tmpl.variables); // -> ["name", "place"] ``` Rendering fails loudly if a required variable is missing, so typos surface at runtime rather than producing malformed prompts. ## See also - [Chat Window](/guides/chat-window/) -- token-limited conversation history that pairs naturally with rendered system prompts - [Custom Providers](/guides/custom-providers/) -- bring your own completion model while still using the registry - [Batch Completions](/guides/batch-processing/) -- fan rendered prompts across many models in parallel --- # Batch Completions Source: https://blazen.dev/docs/guides/batch-processing Section: guides `complete_batch` (Python) / `completeBatch` (Node) / `blazen_llm::batch::complete_batch` (Rust) drives a `CompletionModel` with a list of independent conversations in parallel, capped by a configurable concurrency limit. 
It preserves input order, reports per-request success/failure, and aggregates token usage and cost across the batch. ## Overview Use batch completion when you have many short, independent prompts -- classification, labelling, scoring, RAG retrievers that fan out across chunks. It is **not** a replacement for the OpenAI "Batch API" (half-price, 24-hour latency); Blazen's batch runs every request live and returns as fast as the slowest request in the flight completes. Key properties: - **Bounded concurrency** -- a semaphore caps in-flight requests. `0` means unlimited. - **Partial failures** -- each request is awaited independently. One failure does not cancel the rest. - **Order-preserving** -- the output list lines up 1:1 with the input list. - **Aggregated usage** -- `total_usage` and `total_cost` sum across successful responses. ## Basic usage ```python from blazen import CompletionModel, ChatMessage, complete_batch model = CompletionModel.openai() conversations = [ [ChatMessage.user("What is 2 + 2?")], [ChatMessage.user("What is the capital of France?")], [ChatMessage.user("Who wrote Hamlet?")], ] result = await complete_batch(model, conversations, concurrency=4) for i, resp in enumerate(result.responses): if resp is not None: print(f"[{i}] {resp.content}") else: print(f"[{i}] ERROR: {result.errors[i]}") print("Total tokens:", result.total_usage) print("Total cost: $", result.total_cost) ``` ```typescript import { CompletionModel, ChatMessage, completeBatch } from "blazen"; const model = CompletionModel.openai({ apiKey: process.env.OPENAI_API_KEY }); const result = await completeBatch( model, [ [ChatMessage.user("What is 2 + 2?")], [ChatMessage.user("What is the capital of France?")], [ChatMessage.user("Who wrote Hamlet?")], ], { concurrency: 4 }, ); for (let i = 0; i < result.responses.length; i++) { const resp = result.responses[i]; if (resp) { console.log(`[${i}]`, resp.content); } else { console.error(`[${i}] ERROR:`, result.errors[i]); } } ``` ```rust use blazen_llm::batch::{complete_batch, BatchConfig}; use blazen_llm::{ChatMessage, CompletionRequest}; use blazen_llm::providers::openai::OpenAiProvider; use blazen_llm::traits::CompletionModel; let model = OpenAiProvider::from_env()?; let requests = vec![ CompletionRequest::new(vec![ChatMessage::user("What is 2 + 2?")]), CompletionRequest::new(vec![ChatMessage::user("What is the capital of France?")]), CompletionRequest::new(vec![ChatMessage::user("Who wrote Hamlet?")]), ]; let result = complete_batch(&model, requests, BatchConfig::new(4)).await; for (i, response) in result.responses.iter().enumerate() { match response { Ok(r) => println!("[{i}] {}", r.content.as_deref().unwrap_or("")), Err(e) => eprintln!("[{i}] ERROR: {e}"), } } ``` ## Applying options to every request Pass a shared `CompletionOptions` / `JsCompletionOptions` to apply temperature, max tokens, or a tool set to every request in the flight: ```python from blazen import CompletionOptions result = await complete_batch( model, conversations, concurrency=8, options=CompletionOptions(temperature=0.2, max_tokens=200), ) ``` ```typescript const result = await completeBatch(model, conversations, { concurrency: 8, temperature: 0.2, maxTokens: 200, }); ``` ## Handling partial failures Each element of `result.responses` is either a completion or `None` / `null`. The matching index in `result.errors` holds the error message when a request failed. This lets you retry only the failing subset or surface a structured error to the caller without losing the successful answers. 
```python
failed_indices = [i for i, r in enumerate(result.responses) if r is None]
print(f"{len(failed_indices)} of {len(conversations)} requests failed")
```

## BatchResult

| Field | Type | Description |
|---|---|---|
| `responses` | `list[CompletionResponse \| None]` | Per-request results in input order. |
| `errors` | `list[str \| None]` | Per-request error messages. `None` when the request succeeded. |
| `total_usage` | `dict \| None` | Summed `prompt_tokens`, `completion_tokens`, and `total_tokens` across successes. |
| `total_cost` | `float \| None` | Summed USD cost across successes (only set when the provider reports pricing). |

The Rust version returns `BatchResult` with `responses: Vec<Result<CompletionResponse, BlazenError>>` instead -- it does not split success and error into separate vectors.

### Node `BatchResult` class

The Node binding returns a typed `BatchResult` class. Field access works through getters, so the snippet above (`result.responses[i]`, `result.errors[i]`) reads exactly like a plain object, but you also get richer summary accessors and a printable form:

| Accessor | Type | Description |
|---|---|---|
| `.responses` | `(CompletionResponse \| null)[]` | Per-request results in input order. |
| `.errors` | `(BlazenError \| null)[]` | Per-request errors. `null` when the request succeeded. |
| `.totalUsage` | `TokenUsage` | Summed `promptTokens`, `completionTokens`, and `totalTokens` across successes. |
| `.totalCost` | `number` | Summed USD cost across successes (zero when no provider in the flight reports pricing). |
| `.successCount` | `number` | Number of requests that produced a `CompletionResponse`. |
| `.failureCount` | `number` | Number of requests that produced a `BlazenError`. |
| `.length` | `number` | Total request count -- always `successCount + failureCount`. |
| `.toString()` | `string` | Human-readable summary, useful for logs. |

```typescript
import { BatchResult, completeBatch } from "blazen";

const result = await completeBatch(model, conversations, { concurrency: 8 });
if (result instanceof BatchResult) {
  console.log(`${result.successCount}/${result.length} succeeded`);
  console.log("usage:", result.totalUsage);
  console.log("cost: $", result.totalCost);
  console.log(result.toString());
}
```

`instanceof BatchResult` narrows the value for TypeScript and is the canonical way to discriminate the result from any wrapping union you build around it.

The Python `BatchResult` mirrors the same shape: `.responses`, `.errors`, `.total_usage`, `.total_cost`, `.success_count`, `.failure_count`, plus `__len__` so `len(result)` returns the total request count.

## Choosing a concurrency level

- `0` (unlimited) is fine for fast providers with generous rate limits and small batches (under 100 requests).
- For rate-limited providers, set `concurrency` to your per-second budget divided by expected per-request latency.
- When combining with [`with_retry`](/api/python/), remember the semaphore slot is held for the full retry chain of a single request -- budget accordingly.

## See also

- [Custom Providers](/guides/custom-providers/) -- batch also works with subclassed `CompletionModel`s
- [Prompt Templates](/guides/prompts/) -- render a templated system prompt once and fan it across many user messages
- [Chat Window](/guides/chat-window/) -- build each conversation within a token budget before batching

---

# Custom Providers (Subclassing)

Source: https://blazen.dev/docs/guides/custom-providers
Section: guides

Blazen's capability types are subclassable from every SDK.
You write an ordinary class that inherits from `CompletionModel`, `EmbeddingModel`, `TTSProvider`, `ImageProvider`, `MemoryBackend`, or any of the other base classes, override the relevant async methods, and Blazen's workflow engine, agents, retry/cache wrappers, `Memory`, and batch helpers consume your class exactly like a built-in. This is the release's flagship integration point: if an API, a local binary, a gRPC service, or a kernel-level SDK can be reached from Python, Node, or Rust, you can wire it into Blazen as a first-class provider. ## Why subclass instead of wrapping? A subclassed provider is indistinguishable to Blazen from a built-in one. That unlocks: - **Workflow step compatibility** -- use your provider with `@step` / `wf.addStep` / `#[step]` without adapter code. - **Agent and tool integrations** -- agents, tool-calling loops, and streaming consumers all dispatch through the same trait. - **`with_retry`, `with_cache`, `with_fallback`** -- transient-error handling and caching compose on top of any `CompletionModel`, built-in or subclassed (with a few current limitations -- see below). - **`Memory` drop-in** -- swap OpenAI embeddings for a self-hosted sentence-transformers endpoint without touching any retrieval code. - **Batch execution** -- `complete_batch` / `completeBatch` works with subclassed models. The alternative -- wrapping your HTTP client inside a workflow step -- forfeits all of the above. ## Available base classes | Capability | Python base class | Node base class | Key method(s) | |---|---|---|---| | Chat completion | `CompletionModel` | `CompletionModel` | `complete`, `stream` | | Embeddings | `EmbeddingModel` | `EmbeddingModel` | `embed` | | Text-to-speech | `TTSProvider` | via `CustomProvider` | `text_to_speech` | | Music / SFX | `MusicProvider` | via `CustomProvider` | `generate_music`, `generate_sfx` | | Image generation | `ImageProvider` | via `CustomProvider` | `generate_image`, `upscale_image` | | Video generation | `VideoProvider` | via `CustomProvider` | `text_to_video`, `image_to_video` | | 3D model generation | `ThreeDProvider` | via `CustomProvider` | `generate_3d` | | Background removal | `BackgroundRemovalProvider` | via `CustomProvider` | `remove_background` | | Transcription | `Transcription` | via `CustomProvider` | `transcribe` | | Voice cloning | `VoiceProvider` | via `CustomProvider` | `clone_voice`, `list_voices`, `delete_voice` | | Memory backend | `MemoryBackend` | `MemoryBackend` | `put`, `get`, `delete`, `list`, `len`, `search_by_bands` | Node.js wraps media providers through a single `CustomProvider` class that dispatches to whichever of the standard method names your host object implements -- the capability is inferred from the methods present, not declared upfront. 
## Custom completion model

```python
from blazen import CompletionModel, CompletionResponse, StreamChunk, ChatMessage
import httpx

class MyLlmProvider(CompletionModel):
    def __init__(self, api_key: str):
        super().__init__(model_id="my-model-v1", context_length=8192)
        self._client = httpx.AsyncClient(
            base_url="https://api.example.com/v1",
            headers={"Authorization": f"Bearer {api_key}"},
        )

    async def complete(self, messages, options=None):
        payload = {
            "model": "my-model-v1",
            "messages": [{"role": m.role, "content": m.content} for m in messages],
            "temperature": options.temperature if options else 0.7,
        }
        resp = await self._client.post("/chat/completions", json=payload)
        resp.raise_for_status()
        data = resp.json()
        return CompletionResponse(
            content=data["choices"][0]["message"]["content"],
            model="my-model-v1",
            finish_reason=data["choices"][0]["finish_reason"],
        )

    async def stream(self, messages, on_chunk=None, options=None):
        async with self._client.stream("POST", "/chat/completions", json={
            "model": "my-model-v1",
            "messages": [{"role": m.role, "content": m.content} for m in messages],
            "stream": True,
        }) as r:
            async for line in r.aiter_lines():
                if line.startswith("data: "):
                    chunk = StreamChunk(delta=line[6:])
                    if on_chunk:
                        on_chunk(chunk)
                    yield chunk

model = MyLlmProvider(api_key="...")
response = await model.complete([ChatMessage.user("Hello!")])
```

```typescript
import { CompletionModel, ChatMessage } from "blazen";

class MyLlmProvider extends CompletionModel {
  constructor(apiKey: string) {
    super({ modelId: "my-model-v1", contextLength: 8192 });
    this._apiKey = apiKey;
  }

  async complete(messages: ChatMessage[], options?: unknown) {
    const resp = await fetch("https://api.example.com/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${this._apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "my-model-v1",
        messages: messages.map((m) => ({ role: m.role, content: m.content })),
      }),
    });
    const data = await resp.json();
    return {
      content: data.choices[0].message.content,
      model: "my-model-v1",
      finishReason: data.choices[0].finish_reason,
    };
  }

  async stream(messages: ChatMessage[], onChunk?: (c: unknown) => void) {
    // ... SSE parsing, invoke onChunk per event ...
  }

  private _apiKey: string;
}

const model = new MyLlmProvider(process.env.MY_API_KEY!);
const response = await model.complete([ChatMessage.user("Hello!")]);
```

```rust
use async_trait::async_trait;
use blazen_llm::{
    BlazenError, ChatMessage, CompletionModel, CompletionRequest, CompletionResponse,
    ProviderConfig, StreamChunk,
};
use futures_util::Stream;
use std::pin::Pin;

pub struct MyLlmProvider {
    config: ProviderConfig,
    api_key: String,
}

#[async_trait]
impl CompletionModel for MyLlmProvider {
    fn model_id(&self) -> &str {
        "my-model-v1"
    }

    fn provider_config(&self) -> Option<&ProviderConfig> {
        Some(&self.config)
    }

    async fn complete(&self, request: CompletionRequest) -> Result<CompletionResponse, BlazenError> {
        // Issue HTTP request, parse response, return CompletionResponse.
        todo!()
    }

    async fn stream(
        &self,
        request: CompletionRequest,
    ) -> Result<Pin<Box<dyn Stream<Item = Result<StreamChunk, BlazenError>> + Send>>, BlazenError> {
        // Issue streaming request, adapt to Stream<Item = Result<StreamChunk, BlazenError>>.
todo!() } } ``` ## Custom embedding model ```python from blazen import EmbeddingModel, EmbeddingResponse class MyEmbeddings(EmbeddingModel): def __init__(self): super().__init__(model_id="mini-lm-v6", dimensions=384) async def embed(self, texts): vectors = await my_http_embed(texts) # your API call return EmbeddingResponse( embeddings=vectors, model="mini-lm-v6", ) model = MyEmbeddings() resp = await model.embed(["hello", "world"]) ``` ```typescript import { EmbeddingModel } from "blazen"; class MyEmbeddings extends EmbeddingModel { constructor() { super({ modelId: "mini-lm-v6", dimensions: 384 }); } async embed(texts: string[]) { const vectors = await myHttpEmbed(texts); return { embeddings: vectors, model: "mini-lm-v6" }; } } ``` The custom embedder plugs straight into `Memory`: ```python from blazen import Memory, InMemoryBackend memory = Memory(MyEmbeddings(), InMemoryBackend()) ``` ## Custom TTS / image / video / 3D providers (Python) Subclass the matching base class and override exactly one method. The dispatcher wires your subclass into the relevant capability trait automatically. ```python from blazen import TTSProvider, AudioResult, GeneratedAudio, MediaOutput class ElevenLabsTTS(TTSProvider): def __init__(self, api_key: str): super().__init__(provider_id="elevenlabs") self._api_key = api_key async def text_to_speech(self, request): audio_bytes = await eleven_labs_tts(self._api_key, request.text, request.voice) return AudioResult( audio=[GeneratedAudio( media=MediaOutput.from_base64( base64.b64encode(audio_bytes).decode(), media_type="mpeg", ), )], ) ``` The same pattern applies to `ImageProvider`, `VideoProvider`, `MusicProvider`, `ThreeDProvider`, `BackgroundRemovalProvider`, and `VoiceProvider`. Each exposes a `provider_id`, optional `base_url`, optional `pricing`, and optional `vram_estimate_bytes` -- fill in only the knobs that make sense for your backend. ## Custom providers (Node.js) Node.js uses a single `CustomProvider` wrapper: write a plain class with `async` methods named after the capabilities you support and wrap it. ```typescript import { CustomProvider } from "blazen"; class ElevenLabsTTS { constructor(private apiKey: string) {} async textToSpeech(request: { text: string; voice?: string }) { const audio = await elevenLabs(this.apiKey, request); return { audio: [{ media: { base64: audio.toString("base64"), mediaType: "mpeg", }, }], timing: { totalMs: 0 }, metadata: {}, }; } async cloneVoice(request) { /* ... */ } async listVoices() { /* ... */ } } const provider = new CustomProvider(new ElevenLabsTTS("..."), { providerId: "elevenlabs", }); const result = await provider.textToSpeech({ text: "hi", voice: "rachel" }); ``` The wrapper inspects which methods exist on the host object; calls to missing methods surface as `UnsupportedError`, matching the behaviour of built-in providers that only implement some of the capability traits. 
## Custom memory backend ```python from blazen import MemoryBackend, Memory, EmbeddingModel, ProviderOptions import json, sqlite3 class SqliteBackend(MemoryBackend): def __init__(self, path: str): super().__init__() self._conn = sqlite3.connect(path) self._conn.execute(""" CREATE TABLE IF NOT EXISTS entries ( id TEXT PRIMARY KEY, text TEXT NOT NULL, metadata TEXT, bands TEXT NOT NULL ) """) async def put(self, entry): self._conn.execute( "INSERT OR REPLACE INTO entries(id, text, metadata, bands) VALUES (?,?,?,?)", (entry["id"], entry["text"], json.dumps(entry.get("metadata", {})), json.dumps(entry["bands"])), ) self._conn.commit() async def get(self, id): row = self._conn.execute( "SELECT id, text, metadata, bands FROM entries WHERE id = ?", (id,) ).fetchone() return None if row is None else { "id": row[0], "text": row[1], "metadata": json.loads(row[2] or "{}"), "bands": json.loads(row[3]), } async def delete(self, id): cur = self._conn.execute("DELETE FROM entries WHERE id = ?", (id,)) self._conn.commit() return cur.rowcount > 0 async def list(self): return [await self.get(r[0]) for r in self._conn.execute("SELECT id FROM entries")] async def len(self): return self._conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0] async def search_by_bands(self, bands, limit): # Return candidate entries whose band set intersects the query bands. # Blazen does the final similarity ranking. ... memory = Memory(EmbeddingModel.openai(), SqliteBackend("./memory.db")) ``` ## Custom progress callback `ProgressCallback` is the subclassable hook for download and load progress reporting -- model weight downloads, GGUF/safetensors loads, and any other long-running fetch path inside Blazen accept an instance and call its progress method as bytes arrive. The base class ships with a no-op default, so subclasses only need to override the one method they care about. ```python from blazen import ProgressCallback class LoggingProgress(ProgressCallback): def on_progress(self, downloaded: int, total: int | None) -> None: if total: pct = downloaded / total * 100 print(f"{pct:.1f}% ({downloaded}/{total})") ``` ```typescript import { ProgressCallback } from "blazen"; class LoggingProgress extends ProgressCallback { override onProgress(downloaded: bigint, total?: bigint): void { if (total) { const pct = (Number(downloaded) / Number(total) * 100).toFixed(1); console.log(`${pct}% (${downloaded}/${total})`); } } } ``` Pass the instance to any API that accepts a `ProgressCallback` -- most commonly the model-cache download paths driven by `ModelManager` and the local-inference loaders. The default `on_progress` / `onProgress` is a no-op, so subclasses only override the method (no `super().__init__(...)` arguments required) and the dispatcher takes care of the rest. `total` is `None` / `undefined` whenever the upstream source does not advertise a `Content-Length`, so always guard on it before computing a percentage. ## Integrating with retry, cache, and agents Subclassed `CompletionModel`s work with Blazen's workflow engine, `complete_batch`, streaming consumers, `ChatWindow`, and `Memory` out of the box. They also serve as the primary model inside an `Agent`. The wrapper decorators currently have the following behaviour on subclassed models: - **`with_fallback`** -- requires every model in the chain to be a built-in provider. Passing a subclass raises immediately with a clear error. If you need fallback for a custom provider, implement retry inside your subclass's `complete` / `stream` methods. 
- **`with_retry`** / **`with_cache`** -- not yet supported on subclassed models. The wrappers raise at construction time. Implement retry logic inside your own `complete` override (it is usually a few lines around `httpx` / `reqwest`) and cache via an external layer such as Redis. These restrictions exist because the wrappers assume a specific internal `Arc` shape that subclassed models do not yet expose. Lifting them is on the roadmap; track the issue in the blazen-llm crate if you need it. Every other integration -- streaming, tool-calling, agents, workflows, batch, memory, prompts, chat window -- works identically to built-in providers. ## Error handling Raise Blazen-compatible exceptions from your override and they surface as `BlazenError` to callers: ```python from blazen import BlazenError class MyProvider(CompletionModel): async def complete(self, messages, options=None): try: return await self._call_api(messages) except httpx.HTTPStatusError as e: if e.response.status_code == 429: raise BlazenError.rate_limited(f"upstream rate limit: {e}") raise BlazenError.provider("my-provider", str(e)) ``` For Node, throw an `Error` with a descriptive message -- it crosses the napi-rs boundary as a `BlazenError` with source attached. For Rust, return `Err(BlazenError::…)` directly; the error taxonomy is the same as for built-in providers. ## When not to subclass If your backend speaks the OpenAI chat-completions wire format already (FastChat, vLLM, Ollama, LM Studio, LiteLLM), use `CompletionModel.openai` with a custom `base_url` via `ProviderOptions` instead. Subclassing is for providers that need genuinely different request/response handling, local execution, or side effects (signing, token exchange, hardware bring-up). ## See also - [Batch Completions](/guides/batch-processing/) -- subclassed models work with the batch helper - [Memory](/guides/memory/) -- drop in custom embedders and backends - [Media Generation](/guides/media-generation/) -- see the built-in capability contracts you are implementing - [Local Inference](/guides/local-inference/) -- for in-process providers, pair subclassing with `ModelManager` for VRAM budgeting --- # Local Inference Source: https://blazen.dev/docs/guides/local-inference Section: guides Blazen can run every major model class entirely on your own hardware -- no API key, no network call, no data leaving the machine. Local backends are opt-in via feature flags on the Rust crates, and the Python/Node packages ship prebuilt wheels that include the most common ones. ## Overview | Capability | Backend | Rust crate | Feature flag | |---|---|---|---| | LLM chat | mistral.rs | `blazen-llm-mistralrs` | `mistralrs` | | LLM chat | llama.cpp | `blazen-llm-llamacpp` | `llamacpp` | | LLM chat | Candle | `blazen-llm-candle` | `candle` | | Embeddings | Blazen embed (fastembed on glibc/mac/windows, tract on musl/wasm) | `blazen-embed` | `embed` | | Embeddings | Candle | `blazen-embed-candle` | `candle-embed` | | Transcription | whisper.cpp | `blazen-audio-whispercpp` | `whispercpp` | | TTS | Piper | `blazen-audio-piper` | `piper` | | Image generation | Stable Diffusion | `blazen-image-diffusion` | `diffusion` | Every local provider implements the same trait as its remote counterpart -- `CompletionModel`, `EmbeddingModel`, `Transcription`, etc. -- so they slot into the exact same workflows and can be swapped with a one-line change. 
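Because every local backend implements the same `CompletionModel` trait as its hosted counterpart, moving a pipeline from a remote provider to an on-device one is just a different constructor call. A minimal Python sketch of the swap, using the factories shown elsewhere in these guides (the `MistralRsOptions` surface is detailed in the next section):

```python
from blazen import ChatMessage, CompletionModel, MistralRsOptions

# Hosted provider.
remote = CompletionModel.openai()

# Local provider -- same trait, so nothing downstream changes.
local = CompletionModel.mistralrs(options=MistralRsOptions(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
))

async def summarise(model, text: str) -> str:
    # Works identically with `remote` or `local`.
    response = await model.complete([ChatMessage.user(f"Summarise: {text}")])
    return response.content or ""
```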
## Local LLM (mistral.rs) ```python from blazen import CompletionModel, MistralRsOptions, Quantization, Device, ChatMessage model = CompletionModel.mistralrs(options=MistralRsOptions( model_id="mistralai/Mistral-7B-Instruct-v0.3", quantization=Quantization.Q4KM, device=Device.Cuda, context_length=8192, )) # First call downloads the GGUF weights (~4 GB for Q4KM Mistral-7B). # Subsequent calls reuse the cached weights. await model.load() response = await model.complete([ChatMessage.user("Hello!")]) print(response.content) # Free VRAM when you need the GPU for something else. await model.unload() ``` ```rust use blazen_llm_mistralrs::{MistralRsOptions, MistralRsProvider}; use blazen_llm::{ChatMessage, CompletionRequest, LocalModel, CompletionModel}; let mut provider = MistralRsProvider::new( MistralRsOptions::new("mistralai/Mistral-7B-Instruct-v0.3") .with_quantization("Q4KM") .with_device("cuda"), )?; provider.load().await?; let response = provider .complete(CompletionRequest::new(vec![ChatMessage::user("Hello!")])) .await?; println!("{}", response.content.unwrap_or_default()); ``` `CompletionModel.mistralrs`, `CompletionModel.llamacpp`, and `CompletionModel.candle` all follow the same shape: required `model_id`, optional quantization and device hints, optional context length and cache directory. ### Typed streaming chunks Streaming local completions yield named classes rather than anonymous objects, so editor autocomplete and type-checkers know exactly what fields are available. - **mistral.rs (un-prefixed)**: `ChatMessageInput`, `InferenceChunk`, `InferenceChunkStream`, `InferenceResult`, `InferenceUsage`. - **llama.cpp (`LlamaCpp` prefix)**: parallel surface -- `LlamaCppChatMessageInput`, `LlamaCppInferenceChunk`, `LlamaCppInferenceChunkStream`, `LlamaCppInferenceResult`, `LlamaCppInferenceUsage`. - **Candle**: `CandleInferenceResult` only -- single-shot, no streaming. ```typescript const stream: InferenceChunkStream = await model.completeStream([ { role: "user", content: "Hello!" } satisfies ChatMessageInput, ]); for await (const chunk of stream) { process.stdout.write(chunk.content ?? ""); } ``` Swap `InferenceChunkStream` for `LlamaCppInferenceChunkStream` when you build the model via `CompletionModel.llamacpp` -- the surface is identical, only the prefix changes. ```python stream = await model.complete_stream([ChatMessage.user("Hello!")]) async for chunk in stream: print(chunk.content or "", end="") ``` ## Local embeddings ```python from blazen import EmbeddingModel, EmbedOptions model = EmbeddingModel.local(options=EmbedOptions( model_name="BGESmallENV15", # 384 dims, ~33 MB download cache_dir="/tmp/blazen-embed", max_batch_size=256, )) resp = await model.embed(["hello", "world"]) print(len(resp.embeddings[0])) # 384 ``` ```typescript import { EmbeddingModel } from "blazen"; const model = EmbeddingModel.embed({ modelName: "BGESmallENV15", cacheDir: "/tmp/blazen-embed", }); const resp = await model.embed(["hello", "world"]); ``` Blazen's embed backend runs through ONNX Runtime on glibc/mac/windows and pure-Rust tract on musl -- CPU-only, no GPU required, no Python ML runtime needed. Models cache locally after the first download. ### Browser embeddings (WASM tract) In the browser there is no `hf-hub` and no filesystem cache, so `TractEmbedModel.create(modelUrl, tokenizerUrl, options)` fetches the ONNX weights and the `tokenizer.json` directly over HTTP via `web_sys::fetch`. 
Host the two files on any CDN (or your own origin with permissive CORS) and pass the URLs in: ```typescript import { TractEmbedModel } from "@blazen/sdk"; const model = await TractEmbedModel.create( "https://example.com/all-MiniLM-L6-v2.onnx", "https://example.com/tokenizer.json", ); const result = await model.embed(["hello", "world"]); ``` Inference runs entirely on the main thread (or a Web Worker if you spawn one) using pure-Rust tract -- no WebGPU, no ONNX Runtime Web, no server round-trip. ## Local transcription (whisper.cpp) ```python from blazen import Transcription, TranscriptionRequest, WhisperOptions, WhisperModel transcriber = Transcription.whispercpp(options=WhisperOptions( model=WhisperModel.Base, language="en", )) result = await transcriber.transcribe( TranscriptionRequest.from_file("/path/to/audio.wav") ) print(result.text) ``` See the dedicated [Transcription guide](/guides/transcription/) for audio format requirements and model-size tradeoffs. ## VRAM budgeting with ModelManager When running several local models side by side, the `ModelManager` tracks VRAM estimates and evicts the least-recently-used model when a new load would exceed the budget. ```python from blazen import ModelManager, CompletionModel, EmbeddingModel, MistralRsOptions manager = ModelManager(budget_gb=24) llm = CompletionModel.mistralrs(options=MistralRsOptions( model_id="mistralai/Mistral-7B-Instruct-v0.3", )) embedder = EmbeddingModel.local() await manager.register("llm", llm, vram_estimate_bytes=6 * 1024**3) await manager.register("embed", embedder, vram_estimate_bytes=100 * 1024**2) await manager.load("llm") await manager.ensure_loaded("embed") # fits, no eviction print(await manager.used_bytes()) # bytes currently loaded print(await manager.available_bytes()) # room left in the budget ``` ```rust use blazen_manager::ModelManager; let manager = ModelManager::new(24 * 1024 * 1024 * 1024); // 24 GB manager.register("llm", llm, Some(6 * 1024 * 1024 * 1024)).await?; manager.load("llm").await?; ``` If a `register` + `load` call would blow the budget, the manager first unloads the least-recently-used model whose removal creates enough headroom. This lets you register far more models than fit in VRAM and rely on the LRU policy to keep the hot set resident. ## Model cache and downloads All local backends download weights lazily on the first call to `load()` (or the first inference if you skip `load`). Weights are cached under the OS default model cache directory unless you override `cache_dir` on the options struct. Typical sizes: - **Tiny Whisper**: ~32 MB - **BGE-Small-en-v1.5**: ~33 MB - **Mistral-7B Q4KM**: ~4.1 GB - **Stable Diffusion XL base**: ~6 GB - **Whisper Large-V3**: ~3.1 GB Set a shared `cache_dir` across providers to keep everything in one place and make disk usage auditable. ## Choosing CPU vs GPU - **CPU-only workloads** -- Blazen embed, whisper.cpp (without `cuda`/`metal`/`coreml`), llama.cpp CPU builds, Piper. Use when the deployment target has no GPU or when latency is not critical. - **GPU-accelerated** -- mistral.rs (`device=Cuda`/`Metal`), llama.cpp (`cuda`/`metal`), Candle, whisper.cpp with the right feature flag, Stable Diffusion. Use when per-token latency matters or the model is too large for CPU throughput. Feature flags are selected at build time. The default Python wheels ship with `embed`, `whispercpp`, and `mistralrs` enabled; enable more by building from source with extra flags. 
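To make the CPU/GPU choice concrete, the same factories shown earlier in this guide can target either profile; a short sketch using only options that appear on this page (`Device.Cuda` for the GPU path, the CPU-only embed backend for the CPU path):

```python
from blazen import (
    CompletionModel, Device, EmbedOptions, EmbeddingModel, MistralRsOptions,
)

# GPU profile: per-token latency matters and the model needs acceleration.
gpu_llm = CompletionModel.mistralrs(options=MistralRsOptions(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    device=Device.Cuda,
))

# CPU profile: no GPU on the target machine -- the embed backend is CPU-only.
cpu_embedder = EmbeddingModel.local(options=EmbedOptions(
    model_name="BGESmallENV15",
))
```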
## See also

- [Embeddings](/guides/python/embeddings/) -- deeper dive on the local embed model variants
- [Transcription](/guides/transcription/) -- whisper.cpp audio format requirements and model sizes
- [Media Generation](/guides/media-generation/) -- for cloud-hosted counterparts to local generation
- [Custom Providers](/guides/custom-providers/) -- wrap a local binary or gRPC service that Blazen does not ship

---

# Chat Window (Token-Limited Conversations)

Source: https://blazen.dev/docs/guides/chat-window
Section: guides

`ChatWindow` is a rolling buffer of `ChatMessage` objects that enforces a token budget. When you append a message that would push the buffer over budget, the oldest **non-system** messages are evicted until the window fits. System messages are never dropped, so persistent instructions stay at the top of the context.

## Overview

Most chat applications accumulate conversation history indefinitely. That works until the token count hits the model's context limit, at which point completions silently truncate or fail. `ChatWindow` handles the bookkeeping for you: set a budget, append messages as they arrive, and hand the buffer to `CompletionModel.complete()` without worrying about overflow.

Token counting uses a characters-per-token heuristic (3.5 chars/token by default, tunable). This is an estimate, not a tokenizer -- budget ~10% headroom below the model's hard context limit for safety.

## Basic usage

```python
from blazen import ChatWindow, ChatMessage, CompletionModel

window = ChatWindow(max_tokens=4000)
window.add(ChatMessage.system("You are a terse, helpful assistant."))

model = CompletionModel.openai()

async def turn(user_input: str) -> str:
    window.add(ChatMessage.user(user_input))
    response = await model.complete(window.messages())
    window.add(ChatMessage.assistant(response.content or ""))
    return response.content or ""

await turn("What is 2 + 2?")
await turn("And 3 + 3?")
# ... many turns later ...
# The system message is still at position 0; the oldest user/assistant pairs
# have been evicted to stay under 4000 tokens.
```

```typescript
import { ChatWindow, ChatMessage, CompletionModel } from "blazen";

const window = new ChatWindow(4000);
window.add(ChatMessage.system("You are a terse, helpful assistant."));

const model = CompletionModel.openai({ apiKey: process.env.OPENAI_API_KEY });

async function turn(userInput: string): Promise<string> {
  window.add(ChatMessage.user(userInput));
  const response = await model.complete(window.messages());
  window.add(ChatMessage.assistant(response.content ?? ""));
  return response.content ?? "";
}

await turn("What is 2 + 2?");
await turn("And 3 + 3?");
```

```rust
use blazen_llm::chat_window::ChatWindow;
use blazen_llm::{ChatMessage, CompletionRequest};
use blazen_llm::providers::openai::OpenAiProvider;
use blazen_llm::traits::CompletionModel;

let mut window = ChatWindow::new(4000);
window.add(ChatMessage::system("You are a terse, helpful assistant."));

let model = OpenAiProvider::from_env()?;

async fn turn(
    window: &mut ChatWindow,
    model: &OpenAiProvider,
    user_input: &str,
) -> anyhow::Result<String> {
    window.add(ChatMessage::user(user_input));
    let request = CompletionRequest::new(window.messages().to_vec());
    let response = model.complete(request).await?;
    let content = response.content.unwrap_or_default();
    window.add(ChatMessage::assistant(&content));
    Ok(content)
}
```

## Inspecting the window

```python
print(window.token_count())       # current estimated token count
print(window.remaining_tokens())  # tokens left in the budget
print(len(window.messages()))     # message count
window.clear()                    # drop everything, including system messages
```

```typescript
console.log(window.tokenCount());
console.log(window.remainingTokens());
console.log(window.length);
window.clear();
```

## Tuning the estimator

The default 3.5 chars/token ratio matches OpenAI BPE tokenization for English text reasonably well. For code-heavy conversations, Chinese/Japanese/Korean text, or custom tokenizers, override the ratio:

```rust
use blazen_llm::chat_window::ChatWindow;

let window = ChatWindow::new(8000).with_chars_per_token(2.5);
```

The Python and Node wrappers currently expose only the default estimator. If you need a provider-exact token count, run `count_message_tokens` from the top-level API or compute tokens on your side and size the window accordingly.

## Eviction policy

- **System messages are never evicted.** Put long system prompts at the top of the window and keep them there.
- **Oldest non-system message first.** Eviction is strictly FIFO across user/assistant turns.
- **No partial eviction.** The oldest message is removed in full; no half-messages appear in the buffer.

If you need summarisation-based compression rather than hard eviction -- asking the model to fold older turns into a short synopsis before dropping them -- implement it as a workflow step that consumes the `ChatWindow`, writes a summary, clears the buffer, and seeds it with a fresh system+summary pair.

## Integration patterns

### With `Memory` for long-term recall

Pair a short `ChatWindow` (recent turns verbatim) with `Memory` (embedded long-term recall). On every turn, query `Memory` for the top-k relevant past exchanges and splice them into the window between the system prompt and the live turns.

```python
from blazen import Memory, InMemoryBackend, EmbeddingModel
import uuid

memory = Memory(EmbeddingModel.openai(), InMemoryBackend())

async def turn(user_input: str) -> str:
    # Recall relevant history from Memory.
    recalls = await memory.search(user_input, limit=3)

    window.clear()
    window.add(ChatMessage.system(SYSTEM_PROMPT))
    for r in recalls:
        window.add(ChatMessage.user(r.text))
    window.add(ChatMessage.user(user_input))

    response = await model.complete(window.messages())
    await memory.add(str(uuid.uuid4()), f"Q: {user_input}\nA: {response.content}")
    return response.content
```

### With tool-calling agents

`Agent` uses a `ChatWindow` internally for its scratchpad. When you need to cap the agent's working memory, pass a pre-configured window rather than relying on the default.
## See also - [Memory](/guides/memory/) -- complementary long-term semantic recall - [Prompt Templates](/guides/prompts/) -- render the system prompt that anchors the window - [Batch Completions](/guides/batch-processing/) -- build independent windows per conversation and fan them out --- # Telemetry & Observability Source: https://blazen.dev/docs/guides/telemetry Section: guides Blazen emits structured tracing data via the standard `tracing` crate ecosystem -- workflow runs, steps, LLM calls, pipeline stages, and provider IO are all instrumented as spans with typed fields. Multiple exporters are available, each behind a Cargo feature on `blazen-telemetry`. Pick the one that matches your destination and runtime. ## Exporter matrix | Exporter | Feature | Transport | Wasm-eligible | Use case | |---|---|---|---|---| | OTLP gRPC | `otlp` | gRPC (tonic) | No (native only) | OpenTelemetry collectors, native services | | OTLP HTTP | `otlp-http` | HTTP/protobuf | Yes | Wasm, restricted-egress networks, Cloudflare Workers | | Langfuse | `langfuse` | HTTP REST | No (native only) | LLM-call observability + evals | | Prometheus | `prometheus` | HTTP scrape | No (native only) | Metric dashboards, alerting | The Python and Node bindings ship prebuilt with `langfuse`, `otlp`, and `prometheus` compiled in. The wasm SDK only exposes the OTLP HTTP transport (gRPC's `tonic` does not compile for `wasm32`). ## OTLP gRPC Best for native services exporting to a local OpenTelemetry Collector, Grafana Tempo, Honeycomb, Datadog, or any other gRPC-compatible OTLP backend. ```rust use blazen_telemetry::{OtlpConfig, init_otlp}; let cfg = OtlpConfig { endpoint: "http://localhost:4317".to_string(), service_name: "my-service".to_string(), }; init_otlp(cfg)?; ``` Python: ```python from blazen import OtlpConfig, init_otlp init_otlp(OtlpConfig(endpoint="http://localhost:4317", service_name="my-service")) ``` `init_otlp` installs a combined `tracing-subscriber` stack (env-filter + OpenTelemetry layer + fmt layer) and registers it as the global subscriber. Call it once at process startup, before any traced work. ## OTLP HTTP (wasm-eligible) Use when gRPC is blocked (corporate proxies, Cloudflare Workers) or when you are running inside the browser / a wasm runtime. The endpoint should point at the collector's HTTP/protobuf traces ingest path (typically `:4318/v1/traces`). ```rust use blazen_telemetry::{OtlpConfig, init_otlp_http}; let cfg = OtlpConfig { endpoint: "https://otel-collector.example.com:4318/v1/traces".to_string(), service_name: "my-worker".to_string(), }; init_otlp_http(cfg)?; ``` Wasm SDK (`@blazen/sdk`): ```typescript import { OtlpConfig, initOtlp } from "@blazen/sdk"; const cfg = new OtlpConfig( "https://otel-collector.example.com:4318/v1/traces", "my-worker", ); initOtlp(cfg); ``` The wasm SDK ships a custom `WasmFetchHttpClient` because `opentelemetry-otlp/grpc-tonic` is not wasm-compatible and `reqwest`'s wasm32 client is `!Send`. The wrapper backs onto `web_sys::fetch` so Workers, browsers, and Deno all work. ## Langfuse Langfuse maps Blazen's span hierarchy onto its trace / span / generation primitives: | Blazen span | Langfuse object | Ingestion event | |---|---|---| | `workflow.run`, `pipeline.run` | Trace | `trace-create` | | `workflow.step`, `pipeline.stage` | Span | `span-create` | | `llm.complete`, `llm.stream` | Generation | `generation-create` | Token usage (`prompt_tokens`, `completion_tokens`, `total_tokens`) and model metadata are extracted into the generation's `usage` and `model` fields. 
```rust use blazen_telemetry::{LangfuseConfig, init_langfuse}; use tracing_subscriber::prelude::*; let cfg = LangfuseConfig::new("pk-lf-...", "sk-lf-...") .with_host("https://cloud.langfuse.com") .with_batch_size(100) .with_flush_interval_ms(5000); let layer = init_langfuse(cfg)?; tracing_subscriber::registry().with(layer).init(); ``` Python: ```python from blazen import LangfuseConfig, init_langfuse init_langfuse(LangfuseConfig( public_key="pk-lf-...", secret_key="sk-lf-...", host="https://cloud.langfuse.com", batch_size=100, flush_interval_ms=5000, )) ``` Node: ```typescript import { LangfuseConfig, initLangfuse } from "blazen"; initLangfuse(new LangfuseConfig( "pk-lf-...", "sk-lf-...", "https://cloud.langfuse.com", 100, 5000, )); ``` In the Python and Node bindings `init_langfuse` / `initLangfuse` install the layer as the global subscriber for you and spawn the background flush task on the napi-rs / pyo3 tokio runtime. In Rust you compose the returned `LangfuseLayer` into your own `Registry`. ## Prometheus Native-only. Installs a global `metrics` recorder and starts an HTTP listener on `0.0.0.0:{port}` that serves the `/metrics` endpoint for Prometheus to scrape. After init, any code using the `metrics` macros (`counter!`, `histogram!`, `gauge!`) is exposed automatically. ```rust use blazen_telemetry::init_prometheus; init_prometheus(9090)?; ``` Python: ```python from blazen import init_prometheus init_prometheus(9090) ``` Then point Prometheus at `http://your-host:9090/metrics`. ## Composing multiple exporters Only one global `tracing` subscriber can be installed per process. The single-call helpers (`init_otlp`, `init_otlp_http`, `init_prometheus`, and the Python/Node `init_langfuse`) each install their own subscriber, so the second call is a soft no-op -- the underlying layer's background dispatcher still runs (events are batched and sent), but the global subscriber stays as it was first installed. To run multiple span exporters off a single subscriber, build the layers manually in Rust and compose them: ```rust use blazen_telemetry::{LangfuseConfig, init_langfuse}; use tracing_subscriber::{EnvFilter, prelude::*}; let langfuse = init_langfuse(LangfuseConfig::new("pk-lf-...", "sk-lf-..."))?; let env_filter = EnvFilter::try_from_default_env() .unwrap_or_else(|_| EnvFilter::new("info")); tracing_subscriber::registry() .with(env_filter) .with(langfuse) .with(tracing_subscriber::fmt::layer()) // .with(otel_layer) // build OTLP layer manually if you want both .init(); ``` Prometheus is independent of the tracing subscriber (it installs a `metrics` recorder, not a `tracing::Subscriber`), so `init_prometheus` composes cleanly with any of the trace exporters. ## Feature flag setup Cargo: ```toml [dependencies] blazen-telemetry = { version = "...", features = ["langfuse", "otlp-http", "prometheus"] } ``` Pick `otlp` instead of `otlp-http` if you want gRPC. Both can be enabled together; they expose `init_otlp` and `init_otlp_http` as separate entry points. Python: ```bash # Wheels ship with langfuse, otlp, and prometheus compiled in. pip install blazen ``` Node: ```bash # Prebuilt binaries include langfuse, otlp, and prometheus. # Node bindings currently expose only LangfuseConfig / initLangfuse; # OTLP and Prometheus are reachable from Rust or Python. npm install blazen ``` Wasm SDK (`@blazen/sdk`) exposes `OtlpConfig` + `initOtlp` (HTTP transport only). 
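Because `init_prometheus` installs a `metrics` recorder rather than a `tracing` subscriber, the Python binding can initialise a metrics exporter and a trace exporter side by side at startup. A minimal sketch using only the calls shown on this page:

```python
from blazen import LangfuseConfig, init_langfuse, init_prometheus

def init_telemetry() -> None:
    # Trace exporter -- installs the global tracing subscriber; call once at startup.
    init_langfuse(LangfuseConfig(
        public_key="pk-lf-...",
        secret_key="sk-lf-...",
        host="https://cloud.langfuse.com",
        batch_size=100,
        flush_interval_ms=5000,
    ))
    # Metrics exporter -- independent of the subscriber, so it composes cleanly.
    init_prometheus(9090)
```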
--- # Multimodal Tools: Inputs and Results Source: https://blazen.dev/docs/guides/tool-multimodal Section: guides Tools in Blazen are no longer text-only at either end. On the **input** side, a tool can declare an image / audio / video / document / 3D / CAD parameter, and the model fills it in by emitting a content-handle id as a JSON string -- the framework substitutes the resolved typed content before your handler runs. On the **output** side, a tool can return text + image / audio / video / file blocks, and Blazen serializes them to the right wire shape for whichever provider is on the other end -- Anthropic, OpenAI Chat, OpenAI Responses, Azure, Gemini, fal.ai, or any OpenAI-compatible backend. This guide is the cross-cutting reference for both halves. The per-language guides in `/guides/rust/multimodal/`, `/guides/python/multimodal/`, `/guides/node/multimodal/`, and `/guides/wasm/multimodal/` cover the binding-specific surface. This page covers the framework-level model -- what the wire actually looks like, how the resolver works, and how the same tool definition behaves identically across every provider. ## The two halves of the problem Tools have always had two boundaries that broke when bytes were involved. **Tool inputs are JSON.** A model emits a tool call as JSON. There is no mechanism for it to attach a 5 MB PNG, or even a URL it has not seen before. The only thing the model can put in a tool argument is a string. Blazen's solution is the **content-handle** indirection: bytes are registered with a `ContentStore` once, the store hands back a `ContentHandle` whose `id` is a short opaque string, and the model passes that id as the argument. The framework -- given a store -- substitutes the resolved typed content before the tool's handler executes. The tool sees a fully-materialized image / audio / file, never a bare id. **Tool results were silently text-only on most providers.** Until recently, only Anthropic's `tool_result.content` natively carried multimodal blocks; every other provider's serializer stripped non-text parts or wrapped them in non-standard envelopes that the model would not interpret. That has been fixed: every provider now serializes a tool result with non-text parts as the right combination of a primary tool-result message plus a follow-up multimodal user message (or, on the Responses API and Gemini, the API-native equivalent). A tool that returns an annotated image is now visible to the model on every backend, with the same Blazen-level API. ## Tool inputs: the read path The lifecycle for a multimodal tool argument has six steps. The contract between them is fixed -- the only thing that changes per binding is the surface syntax. 1. **Register content with a store.** Call `store.put(bytes, ...)` with optional `kind`, `mime_type`, `display_name`, and `byte_size` hints. You receive a `ContentHandle { id, kind, mime_type, byte_size, display_name }`. The `id` is the only thing that needs to travel through the conversation. 2. **Make the model aware of the handle.** A short system note enumerates every handle currently in scope, with its kind, mime, and size. The model reads this and learns it can pass these ids to tools. The note is built by `build_handle_directory_system_note` in Rust and inserted automatically by the agent runner when a content store is wired in. 3. 
**The tool declares a content-typed parameter.** Use one of `image_input`, `audio_input`, `video_input`, `file_input`, `three_d_input`, or `cad_input` to produce a JSON Schema fragment for the parameter. These are sugar over `content_ref_required_object` / `content_ref_property`, which take an arbitrary `ContentKind`. 4. **The model emits a string handle id.** A tool call argument value looks like `{"photo": "blazen_a1b2c3d4..."}`. To the provider this is just a JSON string -- there is no special wire format involved. 5. **The framework resolves the handle.** Before the handler runs, `resolve_tool_arguments` walks the arguments JSON against the schema, finds every property tagged `x-blazen-content-ref`, looks the handle up in the store, and rewrites the value into a typed object: `{ kind, handle_id, mime_type, byte_size, display_name, source }`. The `source` field is the resolved `MediaSource` (URL, base64 inline, or a provider-specific file id) and is the same shape returned by `ContentStore::resolve`. 6. **The handler runs with materialized content.** Read the typed fields directly. If you need raw bytes (for image processing, hashing, etc.), call `store.fetch_bytes(handle)` or `store.fetchBytes(handle)` to pull them back out of whichever backend the store is using. The model never sees the raw bytes, the resolver never goes back to the model, and the tool never has to worry about which provider is on the other end. Each layer only knows what it needs to. ## The `x-blazen-content-ref` schema tag The schema fragment produced by `image_input("photo", "the photo to analyze")` looks like this on the wire: ```json { "type": "object", "properties": { "photo": { "type": "string", "description": "the photo to analyze", "x-blazen-content-ref": { "kind": "image" } } }, "required": ["photo"] } ``` This is a standard JSON Schema object. Every provider that accepts JSON Schema for tool parameters will accept it -- the `x-blazen-content-ref` extension key is invisible to providers that do not know about it (JSON Schema is open to vendor extensions by design). Blazen's resolver reads it on the way back in to identify which properties hold handle ids and what kind to enforce. If the resolved handle's `kind` does not match the expected kind, the resolver returns a `KindMismatch` error and the tool is not invoked. For tools that need both a content reference and additional non-multimodal parameters in the same schema, drop down to `content_ref_required_object("photo", ContentKind::Image, "...", extra_props)` (Rust) and merge in your other properties. The resolver walks nested objects, so the tag also works inside compound shapes. ## Tool results: the write path A tool returns a `ToolOutput` carrying two things: `data`, the typed value the calling code sees, and an optional `llm_override`, an `LlmPayload` that controls what the model sees on the next turn. The `LlmPayload` enum has four variants; the one this guide is about is `Parts`, which carries a `Vec` of text + image / audio / video / file blocks. Returning `Parts` works the same way regardless of which provider backs the agent. Blazen's per-provider serializers translate the parts into the wire shape the destination API expects. | Provider | Wire shape for `LlmPayload::Parts` | |---|---| | Anthropic | Native multimodal `tool_result.content` -- text, image, and document blocks pass through unchanged. 
| OpenAI Chat | `role: "tool"` message carrying the text portion, immediately followed by a `role: "user"` message containing `image_url` / `input_audio` / file content blocks for the non-text parts. |
| OpenAI Responses | `function_call_output` item carrying the text, immediately followed by separate `input_image` / `input_file` items in the input array. |
| Azure OpenAI | Same as OpenAI Chat -- the wire is API-compatible. |
| Gemini | `functionResponse` carrying `{"result": <text>}`, followed by a `Content { role: "user", parts: [...] }` carrying `inlineData` / `fileData` parts. |
| OpenAI-compat (Groq, DeepSeek, Together, Fireworks, Perplexity, xAI, OpenRouter, Cohere, Mistral, Bedrock-Mantle) | Same as OpenAI Chat. |
| fal.ai | Same as OpenAI Chat. |

The follow-up multimodal user message is the same pattern that has always worked for sending a multimodal user turn -- Blazen is just emitting it on the tool's behalf so the model receives the bytes the tool returned. Models that do not accept multimodal user content (text-only chat models on a given provider) will reject the follow-up; that is the same failure mode as sending a multimodal user message to a text-only model directly.

If you need an entirely provider-specific wire shape -- say you want to return Anthropic's experimental search-result content type that no other provider models -- use `LlmPayload::ProviderRaw { provider, value }` instead. The named provider receives `value` verbatim in the tool-result body; every other provider falls back to the default conversion from `ToolOutput::data`.

## Putting it together: a complete example

The example below declares an `analyze_photo` tool. It takes an image as input via a content handle, runs some processing (here, a stub that draws a rectangle), and returns both a typed JSON description for the caller and a multimodal payload with the annotated overlay for the model.

#### Rust

```rust
use blazen_llm::content::{
    tool_input::image_input, ContentHandle, ContentKind, ContentStore,
};
use blazen_llm::types::{
    ContentPart, ImageContent, ImageSource, LlmPayload, ToolOutput,
};
use base64::Engine;
use serde_json::{json, Value};
use std::sync::Arc;

async fn analyze_photo(
    args: Value,
    store: Arc<dyn ContentStore>,
) -> anyhow::Result<ToolOutput> {
    // The resolver has already rewritten args["photo"] from a handle-id
    // string into the typed object: { kind, handle_id, mime_type, ... }.
    let handle_id = args["photo"]["handle_id"]
        .as_str()
        .ok_or_else(|| anyhow::anyhow!("missing handle_id"))?;
    let mime = args["photo"]["mime_type"]
        .as_str()
        .unwrap_or("image/png")
        .to_string();

    // Reconstruct a handle from the id + expected kind, then pull bytes.
    let handle = ContentHandle::new(handle_id, ContentKind::Image);
    let bytes = store.fetch_bytes(&handle).await?;

    // ... run analysis, produce annotated image bytes ...
    let annotated_bytes: Vec<u8> = annotate(&bytes);
    let annotated_b64 = base64::engine::general_purpose::STANDARD.encode(&annotated_bytes);

    // Caller-visible structured data.
    let data = json!({
        "width": 1024,
        "height": 768,
        "objects_detected": ["dog", "frisbee"],
    });

    // Model-visible payload: text + the annotated overlay.
    let parts = vec![
        ContentPart::Text {
            text: "Detected 2 objects. Annotated overlay below:".into(),
        },
        ContentPart::Image(ImageContent {
            source: ImageSource::Base64 { data: annotated_b64 },
            media_type: Some(mime),
        }),
    ];

    Ok(ToolOutput::with_override(data, LlmPayload::Parts { parts }))
}

// Schema for the tool, declaring `photo` as an image input.
fn analyze_photo_schema() -> Value {
    image_input("photo", "the photo to analyze for objects")
}
```

#### Python

```python
from blazen import (
    ContentHandle, ContentKind, ContentStore, LlmPayload, ToolOutput, image_input,
)
import base64

async def analyze_photo(args: dict, store: ContentStore) -> ToolOutput:
    # The resolver has already rewritten args["photo"] from a handle-id
    # string into a typed dict: { kind, handle_id, mime_type, ... }.
    handle_id = args["photo"]["handle_id"]

    # Reconstruct a handle from the id + expected kind, then pull bytes.
    handle = ContentHandle(handle_id, ContentKind.Image)
    raw = await store.fetch_bytes(handle)

    # ... run analysis, produce annotated image bytes ...
    annotated: bytes = annotate(raw)
    annotated_b64 = base64.b64encode(annotated).decode("ascii")

    data = {
        "width": 1024,
        "height": 768,
        "objects_detected": ["dog", "frisbee"],
    }

    # The Python binding currently exposes only the text / json /
    # provider_raw LlmPayload factories -- full multimodal `Parts`
    # construction is Rust-only today. From Python, return a text summary
    # for the model and the structured data for callers; if you need to
    # send the annotated image back to the model, store it via
    # `await store.put(annotated, kind=ContentKind.Image, ...)` and pass
    # the resulting handle id in the text body.
    return ToolOutput(
        data=data,
        llm_override=LlmPayload.text(
            "Detected 2 objects (dog, frisbee). Annotated overlay was "
            "generated and stored -- fetch via the returned handle."
        ),
    )

# Schema for the tool, declaring `photo` as an image input.
analyze_photo_schema = image_input("photo", "the photo to analyze for objects")
```

#### Node

```typescript
import { ContentStore, imageInput } from "blazen";
import type {
  ContentHandle, JsContentPart, JsImageContent, LlmPayload, ToolOutput,
} from "blazen";

async function analyzePhoto(
  args: { photo: { handle_id: string; mime_type?: string } },
  store: ContentStore,
): Promise<ToolOutput> {
  // The resolver has already rewritten args.photo from a handle-id string
  // into a typed object: { kind, handle_id, mime_type, ... }.
  // (Rust resolver emits snake_case keys; they pass through napi as-is.)
  const handleId = args.photo.handle_id;
  const mime = args.photo.mime_type ?? "image/png";

  // Reconstruct a handle from the id + expected kind, then pull bytes.
  const handle: ContentHandle = { id: handleId, kind: "image" };
  const raw: Buffer = await store.fetchBytes(handle);

  // ... run analysis, produce annotated image bytes ...
  const annotated = annotate(raw);
  const annotatedB64 = annotated.toString("base64");

  const data = {
    width: 1024,
    height: 768,
    objectsDetected: ["dog", "frisbee"],
  };

  const annotatedImage: JsImageContent = {
    source: { sourceType: "base64", data: annotatedB64 },
    mediaType: mime,
  };
  const parts: JsContentPart[] = [
    { partType: "text", text: "Detected 2 objects. Annotated overlay below:" },
    { partType: "image", image: annotatedImage },
  ];
  const llmOverride: LlmPayload = { kind: "parts", parts };

  return { data, llmOverride };
}

// Schema for the tool, declaring `photo` as an image input.
const analyzePhotoSchema = imageInput("photo", "the photo to analyze for objects");
```

The same tool, the same schema, the same handler shape. Whether the agent is talking to Anthropic, OpenAI Responses, Gemini, or Groq, the `parts` get serialized into the wire shape that provider understands.

## Choosing a `ContentStore`

The store is the lifecycle manager for content. Pick by where you want bytes to live and which provider's native files API you want to take advantage of.
| Use case | Recommended store | |---|---| | Quick scripts, tests, ephemeral content | `InMemoryContentStore` (`ContentStore.in_memory()` / `ContentStore.inMemory()`) | | Persistence across restarts | `LocalFileContentStore` (`ContentStore.local_file(path)` / `ContentStore.localFile(root)`) -- native targets only, not WASM | | Anthropic-heavy workload, large PDFs | `AnthropicFilesStore` -- uploads to Anthropic's Files API so PDFs and large images are referenced by file id rather than re-sent inline every turn | | OpenAI-heavy workload | `OpenAiFilesStore` -- same idea against OpenAI's Files API | | Gemini-heavy workload | `GeminiFilesStore` -- against Gemini's Files API | | fal.ai compute / hosted media | `FalStorageStore` -- against fal's object storage | | S3 / R2 / your own backend | All four environments now expose user-defined stores: Rust uses `CustomContentStore::builder(...)`, Python uses `ContentStore.custom(...)` or `class S3ContentStore(ContentStore): ...`, Node uses `ContentStore.custom({...})` or `class S3ContentStore extends ContentStore { ... }`, WASM uses the same shape from `@blazen/sdk`. | All stores implement the same contract: `put`, `resolve`, `fetch_bytes`, `metadata`, `delete`. The choice determines where bytes physically live and what shape `resolve` returns -- in-memory hands back base64, the Anthropic / OpenAI / Gemini / fal stores hand back a provider file id, local-file hands back a path or base64 depending on the provider being targeted. A note on WASM: the WASM SDK exposes the same factory names (`ContentStore.inMemory()`, `openaiFiles(...)`, `anthropicFiles(...)`, `geminiFiles(...)`, `falStorage(...)`) but does **not** include `localFile` (no filesystem in the browser) and the `metadata` method is not exposed (use `resolve` for the same metadata fields). The `put` signature is positional rather than options-based: `put(body, kindHint?, mimeType?, displayName?)`. ## Cross-provider portability What happens when content originally registered against one provider's files API is sent to a request that goes to a different provider? For example, you uploaded a PDF via `OpenAiFilesStore` for an OpenAI run, and a follow-up step sends the same conversation -- with the same handle in scope -- to Anthropic. The framework looks at the handle's resolved `MediaSource`. If it is a `ProviderFile` for a provider other than the destination, it needs to **rehost**: download the bytes from the originating store, then either re-upload them via the destination's API or inline them as base64 if the file is small enough to fit. Both halves require a `ContentStore` to be wired into the request path -- the rehost call goes through `fetch_bytes` on the originating store and `put` on the destination store. Without a store wired in, the framework cannot reach the bytes; the part is dropped with a warning and the request proceeds as if it had never been there. The practical implication: if you mix providers in a single agent or workflow, wire a content store in. Use one of the provider-specific stores for the provider you talk to most often, and the framework will handle rehosting for the rest. 
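Whichever store you pick, the lifecycle starts with `put`. A minimal Python sketch against the in-memory store, assuming the optional hints keep the keyword names listed earlier (`kind`, `mime_type`, `display_name`):

```python
from blazen import ContentKind, ContentStore

store = ContentStore.in_memory()

async def register_photo(photo_bytes: bytes):
    # put() hands back a ContentHandle; only its id needs to travel
    # through the conversation.
    handle = await store.put(
        photo_bytes,
        kind=ContentKind.Image,
        mime_type="image/png",
        display_name="photo.png",
    )
    return handle.id
```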
## Pre-resolving handles before the wire call

Inside the agent runner, two things need to happen before a `CompletionRequest` goes out: every `ImageSource::Handle` in the message history needs to be resolved against the store (so the wire payload carries actual base64 / URL / file-id values), and the model needs a system note describing every handle in scope (so it knows which ids it can pass as tool arguments). Both pieces are wrapped into a single helper:

```rust
use blazen_llm::content::visibility::prepare_request_with_store;

let resolved = prepare_request_with_store(&mut request, store.as_ref()).await?;
println!("resolved {resolved} handle(s)");
```

`prepare_request_with_store` snapshots the visible handles via `collect_visible_handles`, calls `request.resolve_handles_with(store)`, and prepends a system message built by `build_handle_directory_system_note`. Use the individual functions if you only need one half -- for example, `resolve_handles_with` alone if you do not want a directory system note (because you embed handle ids elsewhere in your prompt template).

In the Python and Node bindings, the agent runner calls the equivalent code automatically when an agent has a content store wired in -- you do not need to invoke it by hand. Drop down to the Rust helpers when you are running a `CompletionRequest` directly without going through the agent runner, or when you want to inspect the resolved request before dispatch.

## Streaming large content

For large blobs -- multi-hundred-megabyte videos, hour-long audio captures, big PDFs -- buffering the entire payload into a `Vec<u8>` before handing it back to the caller wastes memory and stalls the first byte. Blazen now models streaming as a first-class `ContentBody` variant alongside the existing `Bytes` variant:

```rust
pub enum ContentBody {
    Bytes { data: Vec<u8> },
    // ... Url, LocalPath, ProviderFile ...
    Stream {
        stream: ByteStream,
        size_hint: Option<u64>,
    },
}
```

The companion trait method on `ContentStore` is `fetch_stream`, which returns a `ByteStream`. The default implementation falls back to `fetch_bytes` and wraps the result in a single-chunk stream, so existing stores keep working without changes; backends that can do better override `fetch_stream` directly.

### Per-binding state today

**Rust** has full streaming on both halves -- `put` accepts a `ContentBody::Stream` on the way in, and `fetch_stream` hands back a chunked `ByteStream` on the way out. Native built-in stores that override `fetch_stream` for true chunk-by-chunk delivery:

- `LocalFileContentStore` -- streams from disk via `tokio_util::io::ReaderStream`.
- `OpenAiFilesStore` -- streams from the OpenAI Files API.
- `AnthropicFilesStore` -- streams from the Anthropic Files API.
- `FalStorageStore` -- streams from fal's object storage via `HttpClient::send_streaming`.

`InMemoryContentStore` and `GeminiFilesStore` still use the buffered default (the in-memory case has nothing to stream from; the Gemini case is a follow-up).

**Python**, **Node**, and **WASM** bindings now stream end-to-end across the FFI boundary. The host-language shapes are:

- **Python** -- `fetch_stream` may return either `bytes` (legacy) or an `AsyncIterator[bytes]`; a streaming `put` body arrives as `body["stream"]`, an `AsyncByteIter` you iterate with `async for`.
- **Node** -- `fetchStream` may return `Buffer` / `Uint8Array` / `number[]` / base64 `string` (legacy) or an `AsyncIterable`; a streaming `put` body arrives as `body.stream`, an `AsyncIterable`.
- **WASM** -- `fetchStream` may return `Uint8Array` / `number[]` (legacy) or a `ReadableStream`; a streaming `put` body arrives as `body.stream`, a `ReadableStream` you read with `getReader()`.

Each binding also exposes `fetch_stream(handle)` / `fetchStream(handle)` on the `ContentStore` wrapper itself, so host code can iterate chunks directly off any built-in or custom store without round-tripping through `fetch_bytes`.
### Cross-binding example

Override `fetch_stream` on a custom store. The same pattern works in every environment; only the surface syntax changes.

#### Rust

```rust
use blazen_llm::content::{ByteStream, ContentHandle, CustomContentStore};

let store = CustomContentStore::builder("s3")
    .fetch_stream(|handle: ContentHandle| Box::pin(async move {
        // Open a streaming GET against S3 / R2 / your backend.
        let chunks = my_s3_client.get_streaming(&handle.id).await?;
        // Return a ByteStream so the caller can pull chunk-by-chunk.
        Ok(Box::pin(chunks) as ByteStream)
    }))
    .build()?;
```

#### Python

```python
from blazen import ContentStore, ContentHandle

class S3ContentStore(ContentStore):
    async def fetch_stream(self, handle: ContentHandle) -> bytes:
        # Today the binding drains the underlying Rust stream into bytes
        # before calling this method, and wraps the bytes you return in a
        # single-chunk Rust stream. The override point is in place; true
        # chunked async-iterator bridging is a follow-up.
        return await self._s3.get_object_bytes(handle.id)
```

#### Node

```typescript
import { ContentStore } from "blazen";
import type { ContentHandle } from "blazen";

class S3ContentStore extends ContentStore {
  async fetchStream(handle: ContentHandle): Promise<Buffer> {
    // Same caveat as Python -- Buffer in, single-chunk Rust stream out.
    return await this.s3.getObjectBuffer(handle.id);
  }
}
```

#### WASM

```typescript
import { ContentStore } from "@blazen/sdk";
import type { ContentHandle } from "@blazen/sdk";

const store = ContentStore.custom({
  async fetchStream(handle: ContentHandle): Promise<Uint8Array> {
    // Same caveat -- Uint8Array in, single-chunk Rust stream out.
    const res = await fetch(`/objects/${handle.id}`);
    return new Uint8Array(await res.arrayBuffer());
  },
});
```

When you only need bytes, keep calling `fetch_bytes` -- it now delegates to `fetch_stream` and concatenates, so both APIs stay in sync regardless of which one the store overrides.
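On the download side, a sketch of draining `fetch_stream` straight to disk -- `store` and `handle` are assumed to be in scope, as in the examples above:

```rust
use blazen_llm::content::ContentStore;
use futures_util::StreamExt;
use tokio::io::AsyncWriteExt;

// Pull chunks as they arrive instead of buffering the whole payload.
let mut chunks = store.fetch_stream(&handle).await?;
let mut out = tokio::fs::File::create("capture.mp4").await?;
while let Some(chunk) = chunks.next().await {
    out.write_all(&chunk?).await?;
}
out.flush().await?;
```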
## See also - [Multimodal Content (Rust)](/guides/rust/multimodal/) -- the full Rust API for `ContentStore`, `ContentHandle`, and the typed input helpers - [Multimodal Content (Python)](/guides/python/multimodal/) -- the same surface in Python; multimodal tool *results* are Rust-only today, but content handles, stores, and tool *inputs* work the same - [Multimodal Content (Node)](/guides/node/multimodal/) -- the napi-rs binding's `ContentStore`, `imageInput`, and friends - [Multimodal Content (WASM)](/guides/wasm/multimodal/) -- the browser SDK with its narrower store set - [Custom Providers](/guides/custom-providers/) -- if you are wrapping your own model backend, this is where the `LlmPayload::Parts` -> wire-format translation lives --- # Rust API Reference Source: https://blazen.dev/docs/api/rust Language: rust Section: api ## Feature Flags `blazen-llm` providers: | Feature | Description | |---------|-------------| | `openai` | Enables `OpenAiProvider` and `OpenAiCompatProvider` (covers OpenRouter, Groq, Together, Mistral, DeepSeek, Fireworks, Perplexity, xAI, Cohere, Bedrock) | | `anthropic` | Enables `AnthropicProvider` | | `gemini` | Enables `GeminiProvider` | | `fal` | Enables `FalProvider` (compute: image, video, audio, 3D) | | `azure` | Enables `AzureOpenAiProvider` | | `all-providers` | Enables all provider implementations | `blazen-llm` local-inference backends (each gated behind its own feature, all bundled in the `all-local` umbrella): | Feature | Re-exports from `blazen_llm::*` | |---------|---------------------------------| | `mistralrs` | `MistralRsProvider`, `ChatMessageInput`, `ChatRole`, `InferenceChunk`, `InferenceChunkStream`, `InferenceImage`, `InferenceImageSource`, `InferenceResult`, `InferenceToolCall`, `InferenceUsage`, `MistralRsError`, `MistralRsOptions` | | `llamacpp` | `LlamaCppProvider`, `LlamaCppChatMessageInput`, `LlamaCppChatRole`, `LlamaCppInferenceChunk`, `LlamaCppInferenceChunkStream`, `LlamaCppInferenceResult`, `LlamaCppInferenceUsage`, `LlamaCppError`, `LlamaCppOptions` | | `candle-llm` | `CandleLlmProvider`, `CandleLlmCompletionModel`, `CandleInferenceResult`, `CandleLlmError`, `CandleLlmOptions` | | `candle-embed` | `CandleEmbedModel`, `CandleEmbedOptions`, `CandleEmbedError` | | `embed` | `EmbedModel`, `EmbedOptions`, `EmbedResponse`, `EmbedError` | | `whispercpp` | `WhisperCppProvider`, `WhisperModel`, `WhisperOptions`, `WhisperError` | | `piper` | `PiperProvider`, `PiperOptions`, `PiperError` | | `diffusion` | `DiffusionProvider`, `DiffusionOptions`, `DiffusionScheduler`, `DiffusionError` | `blazen-telemetry` exporters: | Feature | Description | |---------|-------------| | `spans` (default) | Enables `TracingCompletionModel` and per-span instrumentation hooks | | `history` | Enables `WorkflowHistory`, `HistoryEvent`, `HistoryEventKind`, `PauseReason` | | `otlp` | OTLP **gRPC** exporter via `tonic` (`init_otlp` + `OtlpConfig`). Native targets only | | `otlp-http` | OTLP **HTTP/protobuf** exporter via a custom `HttpClient` (`init_otlp_http` + `OtlpConfig`). Works on native **and** wasm32 | | `prometheus` | Enables `init_prometheus` + `MetricsLayer` | | `langfuse` | Enables `LangfuseConfig`, `LangfuseLayer`, `init_langfuse` | | `all` | Enables `spans`, `history`, `otlp`, `otlp-http`, `prometheus`, `langfuse` | --- ## Core LLM Traits ### `CompletionModel` The central trait every LLM provider must implement. Supports both one-shot and streaming completions. 
```rust
#[async_trait]
pub trait CompletionModel: Send + Sync {
    fn model_id(&self) -> &str;

    async fn complete(
        &self,
        request: CompletionRequest,
    ) -> Result<CompletionResponse, BlazenError>;

    async fn stream(
        &self,
        request: CompletionRequest,
    ) -> Result<
        Pin<Box<dyn Stream<Item = Result<StreamChunk, BlazenError>> + Send>>,
        BlazenError,
    >;
}
```

**Usage:**

```rust
use blazen_llm::{CompletionModel, CompletionRequest, ChatMessage};
use blazen_llm::providers::openai::OpenAiProvider;

let model = OpenAiProvider::new("sk-...");

let request = CompletionRequest::new(vec![
    ChatMessage::user("What is 2 + 2?"),
]);

let response = model.complete(request).await?;
println!("{}", response.content.unwrap_or_default());
```

**Streaming:**

```rust
use futures_util::StreamExt;

let request = CompletionRequest::new(vec![
    ChatMessage::user("Tell me a story"),
]);

let mut stream = model.stream(request).await?;
while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    if let Some(delta) = &chunk.delta {
        print!("{delta}");
    }
}
```

---

### `StructuredOutput`

Extract typed data from a model using JSON Schema constraints. This trait has a **blanket implementation** for every `CompletionModel` -- providers do not need to implement it.

```rust
#[async_trait]
pub trait StructuredOutput: CompletionModel {
    async fn extract<T>(
        &self,
        messages: Vec<ChatMessage>,
    ) -> Result<StructuredResponse<T>, BlazenError>
    where
        T: JsonSchema + DeserializeOwned;
}

// Blanket impl: every CompletionModel automatically gets this.
impl<M: CompletionModel> StructuredOutput for M {}
```

`T` must implement `schemars::JsonSchema` and `serde::de::DeserializeOwned`. The schema is derived at call time via `schemars::schema_for!` and injected into the request's `response_format`.

**Usage:**

```rust
use schemars::JsonSchema;
use serde::Deserialize;
use blazen_llm::StructuredOutput;

#[derive(JsonSchema, Deserialize)]
struct Sentiment {
    label: String,
    score: f64,
}

let result = model.extract::<Sentiment>(vec![
    ChatMessage::user("Analyze sentiment: 'I love Rust'"),
]).await?;
println!("{}: {}", result.data.label, result.data.score);
```

---

### `EmbeddingModel`

Produces vector embeddings for text inputs.

```rust
#[async_trait]
pub trait EmbeddingModel: Send + Sync {
    fn model_id(&self) -> &str;
    fn dimensions(&self) -> usize;
    async fn embed(&self, texts: &[String]) -> Result<EmbeddingResponse, BlazenError>;
}
```

**Usage:**

```rust
let texts = vec!["Hello world".into(), "Goodbye world".into()];
let response = embedding_model.embed(&texts).await?;
for (i, vector) in response.embeddings.iter().enumerate() {
    println!("text {i}: {} dimensions", vector.len());
}
```

---

### `Tool`

A callable tool that can be invoked by an LLM during a conversation.

```rust
#[async_trait]
pub trait Tool: Send + Sync {
    fn definition(&self) -> ToolDefinition;

    async fn execute(
        &self,
        arguments: serde_json::Value,
    ) -> Result<ToolOutput<serde_json::Value>, BlazenError>;
}
```

:::note[Migration from earlier versions]
The `Tool::execute` return type changed: it used to return a bare `Result<serde_json::Value, BlazenError>` and now returns `Result<ToolOutput<serde_json::Value>, BlazenError>`. Existing tools that returned a `Value` continue to compile by changing `Ok(value)` to `Ok(value.into())` -- the `From<serde_json::Value>` impl wraps it into a `ToolOutput` with no override. The new [`ChatMessage::tool_result`](#chatmessage) constructor is also a breaking change for callers that passed a `&str`; convert via `serde_json::Value::String(s.into())` or use `serde_json::json!(s)`.
:::

**Usage:**

```rust
use blazen_llm::{Tool, ToolDefinition, ToolOutput, BlazenError};
use async_trait::async_trait;

struct WeatherTool;

#[async_trait]
impl Tool for WeatherTool {
    fn definition(&self) -> ToolDefinition {
        ToolDefinition {
            name: "get_weather".into(),
            description: "Get the weather for a city.".into(),
            parameters: serde_json::json!({
                "type": "object",
                "properties": {
                    "city": { "type": "string" }
                },
                "required": ["city"],
            }),
        }
    }

    async fn execute(
        &self,
        args: serde_json::Value,
    ) -> Result<ToolOutput<serde_json::Value>, BlazenError> {
        let _city = args["city"].as_str().unwrap_or_default();
        // Common case: return a structured value, no override.
        Ok(serde_json::json!({ "temperature_f": 72, "conditions": "clear" }).into())
    }
}
```

**Sending a summary to the LLM while keeping the full payload visible to callers:**

```rust
use blazen_llm::{Tool, ToolDefinition, ToolOutput, LlmPayload, BlazenError};
use async_trait::async_trait;

# struct SearchTool;
# #[async_trait]
# impl Tool for SearchTool {
#     fn definition(&self) -> ToolDefinition { unimplemented!() }
    async fn execute(
        &self,
        args: serde_json::Value,
    ) -> Result<ToolOutput<serde_json::Value>, BlazenError> {
        Ok(ToolOutput::with_override(
            serde_json::json!({ "items": [1, 2, 3], "raw": "..." }),
            LlmPayload::Text { text: "Found 3 items.".into() },
        ))
    }
# }
```

The full `data` payload is preserved in [`ChatMessage.tool_result`](#chatmessage) so application code can inspect the unredacted result, while only the `llm_override` is sent to the model on the next turn.

---

### `TypedTool`

A generic wrapper that turns a typed handler `Fn(Args) -> impl Future<Output = Result<ToolOutput<Output>, BlazenError>>` into an implementation of `Tool`. Handles `serde_json::from_value` of the input and `serde_json::to_value` of the output for you, and auto-derives the JSON Schema in `ToolDefinition::parameters` via `schemars::schema_for!`.

```rust
pub struct TypedTool<Args, Output, F>
where
    Args: DeserializeOwned + JsonSchema + Send + 'static,
    Output: Serialize + Send + 'static,
    F: Fn(Args) -> BoxFut<Output> + Send + Sync + 'static,
{ /* ... */ }

impl<Args, Output, F> TypedTool<Args, Output, F> {
    pub fn new(
        name: impl Into<String>,
        description: impl Into<String>,
        handler: F,
    ) -> Self;
}
```

**Usage:**

```rust
use blazen_llm::{TypedTool, ToolOutput};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(Deserialize, JsonSchema)]
struct AddArgs { a: i64, b: i64 }

#[derive(Serialize)]
struct AddOutput { sum: i64 }

let add_tool = TypedTool::new(
    "add",
    "Add two numbers.",
    |args: AddArgs| {
        Box::pin(async move {
            Ok(ToolOutput::new(AddOutput { sum: args.a + args.b }))
        })
    },
);
```

#### `typed_tool_simple`

Convenience constructor for the no-override common case. The handler returns `Result<Output, BlazenError>` directly; the wrapper applies `ToolOutput::new` for you.

```rust
pub fn typed_tool_simple<Args, Output, Fut, F>(
    name: impl Into<String>,
    description: impl Into<String>,
    handler: F,
) -> impl Tool
where
    Args: DeserializeOwned + JsonSchema + Send + 'static,
    Output: Serialize + Send + 'static,
    Fut: Future<Output = Result<Output, BlazenError>> + Send + 'static,
    F: Fn(Args) -> Fut + Send + Sync + 'static;
```

**Usage:**

```rust
use blazen_llm::{typed_tool_simple, BlazenError};

let add_tool = typed_tool_simple(
    "add",
    "Add two numbers.",
    |args: AddArgs| async move {
        Ok::<_, BlazenError>(AddOutput { sum: args.a + args.b })
    },
);
```

Why it exists: `TypedTool` does the `serde_json::from_value` of the input and `serde_json::to_value` of the output for you, exactly once per call. The JSON Schema in `ToolDefinition::parameters` is auto-derived from `Args` via `schemars::schema_for!`, so you do not have to hand-write the schema or repeat field names.

---

### `ToolOutput`

The return type of `Tool::execute`.
Carries the structured `data` that callers see, plus an optional `llm_override` controlling what the LLM receives on the next turn.

```rust
pub struct ToolOutput<T> {
    pub data: T,
    pub llm_override: Option<LlmPayload>,
}
```

| Field | Type | Description |
|-------|------|-------------|
| `data` | `T` | Structured tool output. Visible to application code via [`ChatMessage.tool_result`](#chatmessage). |
| `llm_override` | `Option<LlmPayload>` | If `Some`, replaces the default representation of `data` when serialised into the next prompt. If `None`, each provider applies a sensible default (see [`LlmPayload`](#llmpayload)). |

**Constructors:**

| Constructor | Signature | Description |
|-------------|-----------|-------------|
| `ToolOutput::new` | `fn(data: T) -> Self` | Wrap structured data with no override. The LLM sees the provider default. |
| `ToolOutput::with_override` | `fn(data: T, override_payload: LlmPayload) -> Self` | Wrap structured data and pin exactly what the LLM receives next turn. |
| `From<serde_json::Value>` | `impl From<serde_json::Value> for ToolOutput<serde_json::Value>` | `value.into()` produces `ToolOutput { data: value, llm_override: None }`. Lets `Tool::execute` keep returning bare `Value`s with `Ok(value.into())`. |

**Usage:**

```rust
use blazen_llm::{ToolOutput, LlmPayload};
use serde_json::json;

let plain = ToolOutput::new(json!({ "items": [1, 2, 3] }));

let with_summary = ToolOutput::with_override(
    json!({ "items": [1, 2, 3], "raw": "..." }),
    LlmPayload::Text { text: "Found 3 items.".into() },
);
```

Why it exists: many tools return large structured payloads that the application wants in full (logs, UI, downstream steps), but feeding all of it back into the next LLM call is wasteful or noisy. `ToolOutput` decouples the two channels so you can return rich `data` to the caller while pinning a compact summary for the model.

---

### `LlmPayload`

The wire-format-agnostic shape of a tool result as it appears to the LLM. Used as `ToolOutput::llm_override` and as the second component of [`ChatMessage::tool_result_view`](#chatmessage).

```rust
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum LlmPayload {
    Text { text: String },
    Json { value: serde_json::Value },
    Parts { parts: Vec<ContentPart> },
    ProviderRaw { provider: ProviderId, value: serde_json::Value },
}
```

| Variant | Description |
|---------|-------------|
| `Text { text }` | Plain text. Sent as-is to providers that accept string tool results, or wrapped in `[{type: "text", text}]` for providers that require parts. |
| `Json { value }` | Structured JSON. Each provider serialises this in its native shape (see per-provider behaviour below). |
| `Parts { parts }` | A `Vec<ContentPart>` for multimodal tool results (text + images + files). Used together with [`ChatMessage::tool_result_parts`](#chatmessage). |
| `ProviderRaw { provider, value }` | An exact, provider-specific payload. Bypasses Blazen's translation layer and is forwarded verbatim only when the active provider matches `provider`; other providers fall back to the default representation of `data`. |
**Usage:**

```rust
use blazen_llm::{LlmPayload, ProviderId};
use serde_json::json;

LlmPayload::Text { text: "Found 3 results.".into() };
LlmPayload::Json { value: json!({ "items": [1, 2, 3] }) };
LlmPayload::ProviderRaw {
    provider: ProviderId::Anthropic,
    value: json!([{"type": "text", "text": "..."}]),
};
```

**Per-provider behaviour for the default (no `llm_override`) case:**

When a tool returns structured `data` and no `llm_override`, each provider sends a sensible default to the LLM:

- **OpenAI / OpenAI-compat / Azure / Responses / Fal**: the data is JSON-stringified into the `content` field of the tool message.
- **Anthropic**: structured data becomes `[{type: "text", text: <stringified data>}]` inside `tool_result.content`.
- **Gemini**: structured object data is passed natively as `functionResponse.response`. Scalars are wrapped as `{result: <value>}`.

---

### `ProviderId`

Tags an [`LlmPayload::ProviderRaw`](#llmpayload) variant with the provider whose wire format the value follows. The runtime uses this to decide whether to forward the raw payload or fall back to the default representation.

```rust
pub enum ProviderId {
    OpenAi,
    OpenAiCompat,
    Azure,
    Anthropic,
    Gemini,
    Responses,
    Fal,
}
```

| Variant | Provider |
|---------|----------|
| `OpenAi` | `OpenAiProvider` |
| `OpenAiCompat` | `OpenAiCompatProvider` (OpenRouter, Groq, Together, etc.) |
| `Azure` | `AzureOpenAiProvider` |
| `Anthropic` | `AnthropicProvider` |
| `Gemini` | `GeminiProvider` |
| `Responses` | OpenAI Responses API provider |
| `Fal` | `FalProvider` |

---

### `ModelRegistry`

Allows providers to advertise their available models.

```rust
#[async_trait]
pub trait ModelRegistry: Send + Sync {
    async fn list_models(&self) -> Result<Vec<ModelInfo>, BlazenError>;
    async fn get_model(&self, model_id: &str) -> Result<Option<ModelInfo>, BlazenError>;
}
```

#### `ModelInfo`

| Field | Type | Description |
|-------|------|-------------|
| `id` | `String` | Model identifier used in API requests (e.g. `"gpt-4o"`) |
| `name` | `Option<String>` | Human-readable display name |
| `provider` | `String` | Provider that serves this model |
| `context_length` | `Option<u32>` | Maximum context window in tokens |
| `pricing` | `Option<ModelPricing>` | Pricing information |
| `capabilities` | `ModelCapabilities` | What this model can do |

#### `ModelPricing`

| Field | Type | Description |
|-------|------|-------------|
| `input_per_million` | `Option<f64>` | Cost per million input tokens (USD) |
| `output_per_million` | `Option<f64>` | Cost per million output tokens (USD) |
| `per_image` | `Option<f64>` | Cost per image (image generation models) |
| `per_second` | `Option<f64>` | Cost per second of compute |

#### `ModelCapabilities`

| Field | Type | Description |
|-------|------|-------------|
| `chat` | `bool` | Supports chat completions |
| `streaming` | `bool` | Supports streaming responses |
| `tool_use` | `bool` | Supports tool/function calling |
| `structured_output` | `bool` | Supports JSON schema constraints |
| `vision` | `bool` | Supports image inputs |
| `image_generation` | `bool` | Supports image generation |
| `embeddings` | `bool` | Supports text embeddings |
| `video_generation` | `bool` | Video generation support |
| `text_to_speech` | `bool` | Text-to-speech synthesis |
| `speech_to_text` | `bool` | Speech-to-text transcription |
| `audio_generation` | `bool` | Audio generation (music, SFX) |
| `three_d_generation` | `bool` | 3D model generation |

---
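A short sketch of querying a registry-capable provider -- `registry` is assumed to be a provider that implements `ModelRegistry`, and the import path follows the re-export convention used elsewhere in this reference:

```rust
use blazen_llm::ModelRegistry;

// List what the provider serves and inspect capabilities.
let models = registry.list_models().await?;
for m in &models {
    println!(
        "{} via {} -- ctx: {:?}, tools: {}",
        m.id, m.provider, m.context_length, m.capabilities.tool_use
    );
}

// Look up a single model and read its pricing, if advertised.
if let Some(info) = registry.get_model("gpt-4o").await? {
    if let Some(pricing) = &info.pricing {
        println!("input $/1M tokens: {:?}", pricing.input_per_million);
    }
}
```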
## Types

### `ChatMessage`

A single message in a chat conversation.

| Field | Type | Description |
|-------|------|-------------|
| `role` | `Role` | Who produced this message |
| `content` | `MessageContent` | The message payload. For tool-result messages with a plain string return, the string lives here as `MessageContent::Text(s)`. |
| `tool_calls` | `Vec<ToolCall>` | Tool invocations requested by the assistant on this message (empty for non-assistant roles). |
| `tool_result` | `Option<ToolOutput<serde_json::Value>>` | Structured tool-result payload for tool-role messages. `Some` only when the tool returned non-string `data` or supplied an `llm_override`. Plain-string results live in `content` instead. |
| `name` | `Option<String>` | Tool name (set for tool-role messages). |
| `tool_call_id` | `Option<String>` | Provider-assigned `id` from the originating [`ToolCall`](#toolcall). |

**Constructors:**

```rust
// Text messages
ChatMessage::system("You are a helpful assistant")
ChatMessage::user("Hello!")
ChatMessage::assistant("Hi there!")
ChatMessage::tool("{ \"result\": 42 }")

// Multimodal messages
ChatMessage::user_image_url("Describe this", "https://img.com/a.png", Some("image/png"))
ChatMessage::user_image_base64("What is this?", "iVBORw0K...", "image/jpeg")
ChatMessage::user_parts(vec![
    ContentPart::Text { text: "Look at this:".into() },
    ContentPart::Image(ImageContent {
        source: ImageSource::Url { url: "https://...".into() },
        media_type: Some("image/png".into()),
    }),
    ContentPart::File(FileContent {
        source: ImageSource::Url { url: "https://...".into() },
        media_type: "application/pdf".into(),
        filename: Some("doc.pdf".into()),
    }),
])
```

#### `ChatMessage::tool_result`

Build a tool-role message that closes a prior [`ToolCall`](#toolcall). Routes plain strings into `content` and structured payloads (or anything with an `llm_override`) onto the new `tool_result` sibling field.

```rust
pub fn tool_result(
    call_id: impl Into<String>,
    name: impl Into<String>,
    output: impl Into<ToolOutput<serde_json::Value>>,
) -> Self
```

| Argument | Description |
|----------|-------------|
| `call_id` | The `id` from the originating [`ToolCall`](#toolcall). Stored on `tool_call_id`. |
| `name` | The tool name. Stored on `name`. |
| `output` | A `ToolOutput` (or anything that converts via `From`, e.g. `serde_json::Value` directly). |

**Routing rules:**

- If `output.data == Value::String(s)` **and** `output.llm_override.is_none()`, the string is moved into `content` as `MessageContent::Text(s)` and `tool_result` stays `None`.
- Otherwise, `output` is stored verbatim on `tool_result` and `content` is set to `MessageContent::Text(String::new())`.

**Usage:**

```rust
use blazen_llm::{ChatMessage, ToolOutput, LlmPayload};
use serde_json::json;

// Plain string -- lives in content as a regular text message.
ChatMessage::tool_result("call_1", "search", json!("hello"));

// Structured -- lives in the tool_result sibling field.
ChatMessage::tool_result("call_1", "search", json!({"items": [1, 2, 3]}));

// With override -- full data preserved on the message,
// summary sent to the LLM next turn.
ChatMessage::tool_result(
    "call_1",
    "search",
    ToolOutput::with_override(
        json!({"items": [1, 2, 3], "raw": "..."}),
        LlmPayload::Text { text: "Found 3 items.".into() },
    ),
);
```

:::caution[Breaking change]
Earlier versions accepted `content: impl Into<String>` and stored the string verbatim. The new signature accepts `output: impl Into<ToolOutput<serde_json::Value>>`. Callers that passed a `&str` should switch to `serde_json::Value::String(s.into())` or `serde_json::json!(s)`.
:::

#### `ChatMessage::tool_result_parts`

Build a tool-role message whose result carries multimodal content (text + images + files).
The parts ride on `tool_result` as `LlmPayload::Parts { parts }` so providers that support multimodal tool results (Anthropic, Gemini) can forward them natively.

```rust
pub fn tool_result_parts(
    call_id: impl Into<String>,
    name: impl Into<String>,
    parts: Vec<ContentPart>,
) -> Self
```

**Usage:**

```rust
use blazen_llm::{ChatMessage, ContentPart, ImageContent, ImageSource};

ChatMessage::tool_result_parts(
    "call_1",
    "render_chart",
    vec![
        ContentPart::Text { text: "Rendered the requested chart.".into() },
        ContentPart::Image(ImageContent {
            source: ImageSource::Url { url: "https://example.com/chart.png".into() },
            media_type: Some("image/png".into()),
        }),
    ],
);
```

#### `ChatMessage::tool_result_view`

Accessor returning both channels of a tool-result message in a single call. Used internally by provider implementations that need to choose between the structured `data` and an explicit `llm_override` when serialising to the wire format.

```rust
pub fn tool_result_view(&self) -> Option<(serde_json::Value, Option<&LlmPayload>)>
```

Returns `None` for non-tool-role messages. For tool-role messages it returns `Some((data, override))` where `data` is the raw `serde_json::Value` payload (drawn from `tool_result.data` when present, otherwise reconstructed from the plain-string `content`) and `override` is the optional `&LlmPayload` from `tool_result.llm_override`.

---

### `Role`

```rust
pub enum Role {
    System,
    User,
    Assistant,
    Tool,
}
```

---

### `MessageContent`

```rust
pub enum MessageContent {
    Text(String),
    Image(ImageContent),
    Parts(Vec<ContentPart>),
}
```

| Method | Signature | Description |
|--------|-----------|-------------|
| `as_text()` | `&self -> Option<&str>` | Return the text if this is a `Text` variant |
| `as_parts()` | `&self -> Vec<ContentPart>` | Convert any variant into a `Vec<ContentPart>` |
| `text_content()` | `&self -> Option<String>` | Extract and concatenate all text content |

`MessageContent` implements `From<&str>` and `From<String>`.

---

### `ContentPart`

```rust
pub enum ContentPart {
    Text { text: String },
    Image(ImageContent),
    File(FileContent),
}
```

### `ImageContent`

| Field | Type | Description |
|-------|------|-------------|
| `source` | `ImageSource` | URL or base64 data |
| `media_type` | `Option<String>` | MIME type (e.g. `"image/png"`) |

### `ImageSource`

The source of an image, file, or any other media payload. Marked `#[non_exhaustive]` so new variants can be added without breaking callers -- always pattern-match with a wildcard arm. `MediaSource` is a type alias for `ImageSource` and is the preferred name when the value is not specifically an image.

```rust
#[non_exhaustive]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ImageSource {
    Url { url: String },
    Base64 { data: String },
    File { path: PathBuf },
    ProviderFile { provider: ProviderId, id: String },
    Handle { handle: ContentHandle },
}

pub type MediaSource = ImageSource;
```

| Variant | Description |
|---------|-------------|
| `Url { url }` | Public URL the provider can fetch directly. |
| `Base64 { data }` | Inline base64-encoded bytes. Pair with `media_type` on the enclosing [`ImageContent`](#imagecontent) / [`FileContent`](#filecontent). |
| `File { path }` | Local filesystem path. Use `ImageSource::file(path)` to construct. Resolved at request time by a [`ContentStore`](#contentstore) or by the provider adapter. |
| `ProviderFile { provider, id }` | Reference to a file already uploaded to a provider's Files API (e.g. an OpenAI `file-xxx` id). Forwarded verbatim only when the active provider matches `provider`. |
| `Handle { handle }` | Reference to a [`ContentHandle`](#contenthandle) registered with a [`ContentStore`](#contentstore). Resolved into a concrete `Url` / `Base64` / `ProviderFile` by [`CompletionRequest::resolve_handles_with`](#completionrequestresolve_handles_with) before the request hits a provider. |
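A sketch of consuming a resolved `MediaSource`, with the wildcard arm the `#[non_exhaustive]` attribute calls for (the `describe` helper is illustrative, not part of the API):

```rust
use blazen_llm::{ImageSource, MediaSource};

// Illustrative helper: summarise where a media payload lives.
fn describe(source: &MediaSource) -> String {
    match source {
        ImageSource::Url { url } => format!("remote url: {url}"),
        ImageSource::Base64 { data } => format!("{} base64 chars inline", data.len()),
        ImageSource::File { path } => format!("local file: {}", path.display()),
        ImageSource::ProviderFile { id, .. } => format!("provider file: {id}"),
        ImageSource::Handle { handle } => format!("unresolved handle: {}", handle.id),
        // #[non_exhaustive]: keep a wildcard arm for future variants.
        _ => "unknown media source".to_string(),
    }
}
```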
### `FileContent`

| Field | Type | Description |
|-------|------|-------------|
| `source` | `ImageSource` | URL or base64 data |
| `media_type` | `String` | MIME type (e.g. `"application/pdf"`) |
| `filename` | `Option<String>` | Optional filename for display |

---

### `CompletionRequest`

A provider-agnostic request for a chat completion.

| Field | Type | Description |
|-------|------|-------------|
| `messages` | `Vec<ChatMessage>` | The conversation history |
| `tools` | `Vec<ToolDefinition>` | Tools available for the model to invoke |
| `temperature` | `Option<f32>` | Sampling temperature (0.0 = deterministic, 2.0 = very random) |
| `max_tokens` | `Option<u32>` | Maximum number of tokens to generate |
| `top_p` | `Option<f32>` | Nucleus sampling parameter |
| `response_format` | `Option<serde_json::Value>` | JSON Schema for structured output |
| `model` | `Option<String>` | Override the provider's default model |
| `modalities` | `Option<Vec<String>>` | Output modalities (e.g. `["text"]`, `["image", "text"]`) |
| `image_config` | `Option<serde_json::Value>` | Image generation configuration (model-specific) |
| `audio_config` | `Option<serde_json::Value>` | Audio output configuration (voice, format, etc.) |

**Builder pattern:**

```rust
let request = CompletionRequest::new(vec![ChatMessage::user("Hello")])
    .with_tools(tool_defs)
    .with_temperature(0.7)
    .with_max_tokens(1024)
    .with_top_p(0.9)
    .with_response_format(schema_json)
    .with_model("gpt-4o")
    .with_modalities(vec!["text".into(), "image".into()])
    .with_image_config(serde_json::json!({ "size": "1024x1024" }))
    .with_audio_config(serde_json::json!({ "voice": "alloy" }));
```

---

### `CompletionResponse`

The result of a non-streaming chat completion.

| Field | Type | Description |
|-------|------|-------------|
| `content` | `Option<String>` | Text content of the assistant's reply |
| `tool_calls` | `Vec<ToolCall>` | Tool invocations requested by the model |
| `usage` | `Option<TokenUsage>` | Token usage statistics |
| `model` | `String` | The model that produced this response |
| `finish_reason` | `Option<String>` | Why the model stopped (e.g. `"stop"`, `"tool_use"`) |
| `cost` | `Option<f64>` | Estimated cost in USD |
| `timing` | `Option<RequestTiming>` | Request timing breakdown |
| `images` | `Vec` | Generated images (multimodal models) |
| `audio` | `Vec` | Generated audio (TTS / multimodal) |
| `videos` | `Vec` | Generated videos |
| `metadata` | `serde_json::Value` | Provider-specific metadata |

---

### `StructuredResponse`

Response from structured output extraction, preserving metadata.

| Field | Type | Description |
|-------|------|-------------|
| `data` | `T` | The extracted structured data |
| `usage` | `Option<TokenUsage>` | Token usage statistics |
| `model` | `String` | The model that produced this response |
| `cost` | `Option<f64>` | Estimated cost in USD |
| `timing` | `Option<RequestTiming>` | Request timing |
| `metadata` | `serde_json::Value` | Provider-specific metadata |

---

### `EmbeddingResponse`

Response from an embedding operation.
| Field | Type | Description |
|-------|------|-------------|
| `embeddings` | `Vec<Vec<f32>>` | The embedding vectors (one per input text) |
| `model` | `String` | The model used |
| `usage` | `Option<TokenUsage>` | Token usage statistics |
| `cost` | `Option<f64>` | Estimated cost in USD |
| `timing` | `Option<RequestTiming>` | Request timing |
| `metadata` | `serde_json::Value` | Provider-specific metadata |

---

### `RequestTiming`

Timing metadata for a request.

| Field | Type | Description |
|-------|------|-------------|
| `queue_ms` | `Option<u64>` | Time spent waiting in queue (ms) |
| `execution_ms` | `Option<u64>` | Time spent executing the request (ms) |
| `total_ms` | `Option<u64>` | Total wall-clock time from submit to response (ms) |

---

### `TokenUsage`

Token usage statistics for a completion request.

| Field | Type | Description |
|-------|------|-------------|
| `prompt_tokens` | `u32` | Tokens in the prompt / input |
| `completion_tokens` | `u32` | Tokens in the completion / output |
| `total_tokens` | `u32` | Total tokens consumed (prompt + completion) |

---

### `ToolDefinition`

Describes a tool that the model may invoke.

| Field | Type | Description |
|-------|------|-------------|
| `name` | `String` | Unique name of the tool |
| `description` | `String` | Human-readable description |
| `parameters` | `serde_json::Value` | JSON Schema describing the tool's input parameters |

---

### `ToolCall`

A tool invocation requested by the model.

| Field | Type | Description |
|-------|------|-------------|
| `id` | `String` | Provider-assigned identifier for this invocation |
| `name` | `String` | Name of the tool to invoke |
| `arguments` | `serde_json::Value` | Arguments to pass, as JSON |

---

### `StreamChunk`

A single chunk from a streaming completion response.

| Field | Type | Description |
|-------|------|-------------|
| `delta` | `Option<String>` | Incremental text content |
| `tool_calls` | `Vec<ToolCall>` | Tool invocations completed in this chunk |
| `finish_reason` | `Option<String>` | Present in the final chunk to indicate why generation stopped |

---

## Content Subsystem

A provider-agnostic layer for handing media (images, audio, video, documents, 3D, CAD, archives, fonts, code, generic data) to and from models.

The core idea: instead of inlining bytes or URLs in every message, callers register payloads with a [`ContentStore`](#contentstore), receive a small [`ContentHandle`](#contenthandle), and reference the handle from messages, tool inputs, and tool outputs. Just before a request is sent, the runtime resolves every handle into the concrete representation the active provider expects -- a URL, a base64 blob, or a `ProviderFile` reference for providers with native Files APIs.

Everything in this section is re-exported from `blazen_llm::content`.

### `ContentKind`

Coarse classification for a piece of content. Marked `#[non_exhaustive]`. Serde uses `rename_all = "snake_case"`, so `ThreeDModel` round-trips as `"three_d_model"`. Implements `Display`.

```rust
#[non_exhaustive]
pub enum ContentKind {
    Image,
    Audio,
    Video,
    Document,
    ThreeDModel,
    Cad,
    Archive,
    Font,
    Code,
    Data,
    Other,
}
```

| Method | Signature | Description |
|--------|-----------|-------------|
| `from_mime` | `fn(&str) -> Self` | Best-effort classification from a MIME string. |
| `from_extension` | `fn(&str) -> Self` | Best-effort classification from a filename extension (no leading dot required). |
| `as_str` | `fn(self) -> &'static str` | Snake-case string form (matches the serde representation). |
```rust
use blazen_llm::content::ContentKind;

assert_eq!(ContentKind::from_mime("image/png"), ContentKind::Image);
assert_eq!(ContentKind::from_extension("glb"), ContentKind::ThreeDModel);
assert_eq!(ContentKind::ThreeDModel.as_str(), "three_d_model");
```

---

### `ContentHandle`

An opaque pointer to a payload owned by a [`ContentStore`](#contentstore). Cheap to clone, safe to embed in messages and tool arguments, and resolvable on demand.

```rust
pub struct ContentHandle {
    pub id: String,
    pub kind: ContentKind,
    pub mime_type: Option<String>,
    pub byte_size: Option<u64>,
    pub display_name: Option<String>,
}
```

| Method | Signature | Description |
|--------|-----------|-------------|
| `new` | `fn(id: impl Into<String>, kind: ContentKind) -> Self` | Construct a handle with no metadata. |
| `with_mime_type` | `fn(self, mime: impl Into<String>) -> Self` | Builder: attach a MIME type. |
| `with_byte_size` | `fn(self, bytes: u64) -> Self` | Builder: attach a byte size. |
| `with_display_name` | `fn(self, name: impl Into<String>) -> Self` | Builder: attach a human-readable name. |

```rust
use blazen_llm::content::{ContentHandle, ContentKind};

let handle = ContentHandle::new("blob_abc123", ContentKind::Image)
    .with_mime_type("image/png")
    .with_byte_size(48_213)
    .with_display_name("chart.png");
```

---

### `ContentStore`

The async trait every store implements. Stores own the bytes (or the URL, or the provider-side file id) and translate handles into something the provider can consume on demand.

```rust
#[async_trait]
pub trait ContentStore: Send + Sync + std::fmt::Debug {
    async fn put(&self, body: ContentBody, hint: ContentHint) -> Result<ContentHandle, BlazenError>;
    async fn resolve(&self, handle: &ContentHandle) -> Result<MediaSource, BlazenError>;
    async fn fetch_bytes(&self, handle: &ContentHandle) -> Result<Vec<u8>, BlazenError>;
    async fn metadata(&self, handle: &ContentHandle) -> Result<ContentMetadata, BlazenError> { /* default */ }
    async fn fetch_stream(&self, handle: &ContentHandle) -> Result<ByteStream, BlazenError> { /* default */ }
    async fn delete(&self, _handle: &ContentHandle) -> Result<(), BlazenError> { Ok(()) }
}
```

| Method | Description |
|--------|-------------|
| `put` | Ingest a [`ContentBody`](#contentbody) under a [`ContentHint`](#contenthint) and return a fresh handle. |
| `resolve` | Produce the concrete [`MediaSource`](#imagesource) the provider will see (typically `Url`, `Base64`, or `ProviderFile`). Called by [`CompletionRequest::resolve_handles_with`](#completionrequestresolve_handles_with). |
| `fetch_bytes` | Materialise the underlying bytes. Used when a tool needs to read the payload directly. |
| `metadata` | Return [`ContentMetadata`](#contentmetadata). The default impl synthesises this from the handle's own fields; stores with richer indices should override. |
| `fetch_stream` | Stream raw bytes back as a [`ByteStream`](#bytestream). The default impl buffers `fetch_bytes` into a single chunk via `futures::stream::once`, so existing impls keep working unchanged. Stores backed by HTTP / disk / object storage should override for true incremental streaming. Built-in overrides today: [`LocalFileContentStore`](#built-in-stores) (uses `tokio_util::io::ReaderStream`), [`OpenAiFilesStore`](#built-in-stores), [`AnthropicFilesStore`](#built-in-stores), [`FalStorageStore`](#built-in-stores) (all via the `HttpClient` trait's `send_streaming` method). [`InMemoryContentStore`](#built-in-stores) and [`GeminiFilesStore`](#built-in-stores) use the buffered default impl. |
| `delete` | Best-effort deletion. Default is a no-op so read-only stores can leave it unimplemented. |
```rust
use std::sync::Arc;
use blazen_llm::content::{ContentBody, ContentHint, ContentKind, InMemoryContentStore, ContentStore};

let store = Arc::new(InMemoryContentStore::new());

let handle = store.put(
    ContentBody::Bytes { data: std::fs::read("chart.png")? },
    ContentHint::default()
        .with_mime_type("image/png")
        .with_kind(ContentKind::Image)
        .with_display_name("chart.png"),
).await?;

let source = store.resolve(&handle).await?;
```

---

### `ContentBody`

The five ways a caller can hand bytes (or a pointer to bytes) to a [`ContentStore`](#contentstore) via `put`. Variants are struct-form (named fields), and serde uses an internally-tagged representation (`tag = "type"`, `rename_all = "snake_case"`).

```rust
#[derive(Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ContentBody {
    Bytes { data: Vec<u8> },
    Url { url: String },
    LocalPath { path: PathBuf },
    ProviderFile { provider: ProviderId, id: String },
    /// Streaming byte source. The store consumes the stream during `put`
    /// and is free to spool to disk, forward as a chunked upload, or
    /// drain into bytes.
    #[serde(skip)]
    Stream {
        stream: ByteStream,
        size_hint: Option<u64>,
    },
}
```

| Variant | Description |
|---------|-------------|
| `Bytes { data }` | In-memory payload. The store decides whether to keep it in RAM, spill to disk, or upload to a provider. |
| `Url { url }` | Remote URL the store may fetch lazily or pass through verbatim on resolve. |
| `LocalPath { path }` | Local filesystem path. Native-only stores (e.g. [`LocalFileContentStore`](#built-in-stores)) can index it without copying. |
| `ProviderFile { provider, id }` | A pre-existing file id on a provider's Files API. Lets you wrap an externally uploaded asset in a handle without re-uploading. |
| `Stream { stream, size_hint }` | Streaming byte source built on [`ByteStream`](#bytestream). Stores with a true streaming upload path (filesystem, S3, HTTP multipart) consume the stream incrementally; memory-bound stores buffer. `size_hint` carries the total length when known up front (e.g. from a `Content-Length` header) so stores can pre-allocate or pick between simple and resumable upload paths. |

`ContentBody::Stream` is **not** `Clone` (a [`ByteStream`](#bytestream) is single-use) and **not** `Serialize` -- the variant is annotated `#[serde(skip)]` because [`ByteStream`](#bytestream) implements neither `Serialize` nor `Deserialize`. The manual `Clone` impl panics with `unreachable!` if you clone a `Stream` variant; consume streaming bodies by value. Bindings that route `ContentBody` through `serde_json` must check for `Stream` first and handle it on a separate path.

```rust
use blazen_llm::content::ContentBody;

let from_memory = ContentBody::Bytes { data: b"hello".to_vec() };
let from_disk = ContentBody::LocalPath { path: "./report.pdf".into() };
let from_url = ContentBody::Url { url: "https://example.com/a.png".into() };
```

---

### `ByteStream`

A pinned, boxed, fallible stream of byte chunks. Used by [`ContentBody::Stream`](#contentbody) for streaming uploads and by [`ContentStore::fetch_stream`](#contentstore) for streaming downloads.

```rust
pub type ByteStream = std::pin::Pin<
    Box<dyn Stream<Item = Result<Bytes, BlazenError>> + Send>
>;
```

Stores backed by HTTP, S3, or the filesystem should produce / consume these incrementally; memory-bound stores may buffer.

---

### `ContentHint`

Optional metadata callers pass alongside a [`ContentBody`](#contentbody) so the store can pick a sensible MIME type, classification, and display name without re-sniffing. Implements `Default`.
```rust
pub struct ContentHint {
    pub mime_type: Option<String>,
    pub kind_hint: Option<ContentKind>,
    pub display_name: Option<String>,
    pub byte_size: Option<u64>,
}
```

| Method | Signature | Description |
|--------|-----------|-------------|
| `with_mime_type` | `fn(self, mime: impl Into<String>) -> Self` | Pin the MIME type. |
| `with_kind` | `fn(self, kind: ContentKind) -> Self` | Pin the [`ContentKind`](#contentkind). |
| `with_display_name` | `fn(self, name: impl Into<String>) -> Self` | Pin a human-readable name. |
| `with_byte_size` | `fn(self, bytes: u64) -> Self` | Pin the byte size when known up front (e.g. from a `Content-Length` header). |

```rust
use blazen_llm::content::{ContentHint, ContentKind};

let hint = ContentHint::default()
    .with_mime_type("audio/wav")
    .with_kind(ContentKind::Audio)
    .with_display_name("note.wav");
```

---

### `ContentMetadata`

The non-id fields of a [`ContentHandle`](#contenthandle), returned by `ContentStore::metadata`. Useful for cheap introspection without materialising bytes.

```rust
pub struct ContentMetadata {
    pub kind: ContentKind,
    pub mime_type: Option<String>,
    pub byte_size: Option<u64>,
    pub display_name: Option<String>,
}
```

```rust
use blazen_llm::content::ContentStore;

let meta = store.metadata(&handle).await?;
println!("{} ({} bytes)", meta.kind, meta.byte_size.unwrap_or(0));
```

---

### `DynContentStore`

Convenience alias for the shared, thread-safe form most code passes around.

```rust
pub type DynContentStore = std::sync::Arc<dyn ContentStore>;
```

```rust
use std::sync::Arc;
use blazen_llm::content::{DynContentStore, InMemoryContentStore};

let store: DynContentStore = Arc::new(InMemoryContentStore::new());
```

---

### Built-in stores

| Store | Constructor | Notes |
|-------|-------------|-------|
| [`InMemoryContentStore`](#built-in-stores) | `InMemoryContentStore::new()` (also `Default`) | Bytes / URL / provider-file refs held in a `RwLock`-guarded map. Great for tests and short-lived sessions. |
| [`LocalFileContentStore`](#built-in-stores) | `LocalFileContentStore::new(root: impl Into<PathBuf>) -> Result<Self, BlazenError>` | Native-only (`not(target_arch = "wasm32")`). Persists payloads under `root`; assigns each entry a UUID-derived filename and tracks the index in memory. |
| [`OpenAiFilesStore`](#built-in-stores) | `OpenAiFilesStore::new(api_key)`, `.with_base_url(url)`, `.with_purpose(p)` | Uploads via OpenAI's `/v1/files` API. `purpose` defaults to `"user_data"` (override for assistants / fine-tuning / batch). |
| [`AnthropicFilesStore`](#built-in-stores) | `AnthropicFilesStore::new(api_key)`, `.with_base_url(url)`, `.with_beta_header(h)` | Uploads via Anthropic's Files API. `beta_header` is forwarded as `anthropic-beta`. |
| [`GeminiFilesStore`](#built-in-stores) | `GeminiFilesStore::new(api_key)`, `.with_base_url(url)` | Uploads via Google's Files API and resolves handles to `gs://`/file-uri refs. |
| [`FalStorageStore`](#built-in-stores) | `FalStorageStore::new(api_key)`, `.with_base_url(url)` | Uploads to Fal's storage endpoint. |
| [`CustomContentStore`](#built-in-stores) | `CustomContentStore::builder(name)` -> [`CustomContentStoreBuilder`](#built-in-stores) | Build a store from closures: `.put(...)`, `.resolve(...)`, `.fetch_bytes(...)`, `.fetch_stream(...)` (optional), `.delete(...)` (optional), `.build()`. The `.fetch_stream` callback is optional -- when omitted, the trait's default impl buffers `fetch_bytes` into one chunk via `stream::once`. Lets callers integrate their own blob backend without writing a new trait impl. |
```rust
use std::sync::Arc;
use blazen_llm::content::{
    AnthropicFilesStore, CustomContentStore, InMemoryContentStore,
    LocalFileContentStore, OpenAiFilesStore,
};

let in_mem = Arc::new(InMemoryContentStore::new());

let on_disk = Arc::new(LocalFileContentStore::new("/var/cache/blazen")?);

let openai = Arc::new(OpenAiFilesStore::new(std::env::var("OPENAI_API_KEY")?)
    .with_purpose("user_data"));

let anthropic = Arc::new(AnthropicFilesStore::new(std::env::var("ANTHROPIC_API_KEY")?)
    .with_beta_header("files-api-2025-04-14"));

let custom = Arc::new(
    CustomContentStore::builder("s3-store")
        .put(|body, hint| async move { /* upload, return ContentHandle */ todo!() })
        .resolve(|handle| async move { /* return MediaSource */ todo!() })
        .fetch_bytes(|handle| async move { /* return Vec<u8> */ todo!() })
        .fetch_stream(|handle| Box::pin(async move {
            // OPTIONAL: stream bytes back chunk-by-chunk for large content.
            // When omitted, the default impl buffers fetch_bytes into one chunk.
            use bytes::Bytes;
            use futures_util::stream;
            let chunks: Vec<Result<Bytes, BlazenError>> = vec![Ok(Bytes::from_static(b"hello"))];
            Ok(Box::pin(stream::iter(chunks)) as blazen_llm::content::ByteStream)
        }))
        .delete(|handle| async move { /* delete blob */ Ok(()) })
        .build()?,
);
```

---

### Tool-input helpers

Helpers that produce JSON Schema fragments for tool parameters that should accept a [`ContentHandle`](#contenthandle). Each fragment is tagged with `x-blazen-content-ref` so [`resolve_tool_arguments`](#tool-input-helpers) knows where to substitute resolved [`MediaSource`](#imagesource) values before the tool runs.

```rust
pub fn image_input(name: &str, description: &str) -> serde_json::Value;
pub fn audio_input(name: &str, description: &str) -> serde_json::Value;
pub fn video_input(name: &str, description: &str) -> serde_json::Value;
pub fn file_input(name: &str, description: &str) -> serde_json::Value;
pub fn three_d_input(name: &str, description: &str) -> serde_json::Value;
pub fn cad_input(name: &str, description: &str) -> serde_json::Value;

pub fn content_ref_property(
    kind: ContentKind,
    description: &str,
) -> serde_json::Value;

pub fn content_ref_required_object(
    name: &str,
    kind: ContentKind,
    description: &str,
    extra_properties: serde_json::Map<String, serde_json::Value>,
) -> serde_json::Value;

pub async fn resolve_tool_arguments(
    arguments: &mut serde_json::Value,
    schema: &serde_json::Value,
    store: &dyn ContentStore,
) -> Result<usize, BlazenError>;

pub struct KindMismatch {
    pub property: String,
    pub expected: ContentKind,
    pub actual: ContentKind,
}
```

| Helper | Description |
|--------|-------------|
| `image_input` / `audio_input` / `video_input` / `file_input` / `three_d_input` / `cad_input` | Top-level convenience: returns a single-property required object schema for a typed content reference. |
| `content_ref_property` | Schema for one property accepting a [`ContentHandle`](#contenthandle) of the given [`ContentKind`](#contentkind). Use when assembling a custom multi-property schema. |
| `content_ref_required_object` | Build a required object schema mixing one content ref with `extra_properties` (other primitives, enums, etc.). |
| `resolve_tool_arguments` | Walk `arguments` against `schema`, replace every `x-blazen-content-ref` site with the [`MediaSource`](#imagesource) returned by `store.resolve`. Returns the number of substitutions made. |
| `KindMismatch` | Error variant returned when a handle's [`ContentKind`](#contentkind) does not match the schema's declared `expected` kind. |
```rust
use blazen_llm::content::tool_input::{image_input, resolve_tool_arguments};
use serde_json::{json, Map};

let schema = json!({
    "type": "object",
    "properties": {
        "image": image_input("image", "The image to caption"),
        "max_words": { "type": "integer" },
    },
    "required": ["image", "max_words"],
});

let mut args = json!({
    "image": { "id": "blob_abc123", "kind": "image" },
    "max_words": 20,
});

let n = resolve_tool_arguments(&mut args, &schema, store.as_ref()).await?;
// `args["image"]` is now a concrete MediaSource (Url / Base64 / ProviderFile).
println!("rewrote {n} content refs");
```

---

### Visibility helpers

Helpers for the runtime's "what handles is the model actually allowed to see right now?" pass.

```rust
pub fn collect_visible_handles(messages: &[ChatMessage]) -> Vec<ContentHandle>;

pub fn build_handle_directory_system_note(
    handles: &[ContentHandle],
) -> Option<String>;

pub async fn prepare_request_with_store(
    request: &mut CompletionRequest,
    store: &dyn ContentStore,
) -> Result<usize, BlazenError>;
```

| Helper | Description |
|--------|-------------|
| `collect_visible_handles` | Walks `messages` and returns every distinct [`ContentHandle`](#contenthandle) referenced from user/assistant/tool content, deduped first-seen. |
| `build_handle_directory_system_note` | Returns a system-note string listing the visible handles (id, kind, MIME, name) so the model can name them when calling tools. Returns `None` when `handles` is empty -- callers should not append an empty note. |
| `prepare_request_with_store` | One-call convenience: runs [`CompletionRequest::resolve_handles_with`](#completionrequestresolve_handles_with), then builds and prepends the system note. Returns the number of resolved handles. This is what the agent loop calls before dispatching a request. |

```rust
use blazen_llm::content::visibility::{
    collect_visible_handles,
    build_handle_directory_system_note,
    prepare_request_with_store,
};

let visible = collect_visible_handles(&request.messages);
if let Some(note) = build_handle_directory_system_note(&visible) {
    println!("would inject system note:\n{note}");
}

let n = prepare_request_with_store(&mut request, store.as_ref()).await?;
println!("resolved {n} handles before dispatch");
```

---

### Magic-number detection

Lightweight content sniffing backed by `infer`. Gated by the default-on `content-detect` Cargo feature -- disabling the feature drops the `infer` dependency entirely (the functions remain but return `(ContentKind::Other, None)`).

```rust
pub fn detect_from_bytes(bytes: &[u8]) -> (ContentKind, Option<String>);

#[cfg(not(target_arch = "wasm32"))]
pub fn detect_from_path(path: &std::path::Path) -> (ContentKind, Option<String>);

pub fn detect(
    bytes: Option<&[u8]>,
    mime_hint: Option<&str>,
    filename: Option<&str>,
) -> (ContentKind, Option<String>);
```

| Function | Description |
|----------|-------------|
| `detect_from_bytes` | Sniff the leading bytes and return the inferred [`ContentKind`](#contentkind) plus the matching MIME string if any. |
| `detect_from_path` | Native-only. Reads the head of the file, then falls back to the extension when bytes are inconclusive. |
| `detect` | Combined entry point: prefers byte sniffing when `bytes` is `Some`, then `mime_hint`, then `filename` extension. Returns `(ContentKind::Other, None)` when nothing matches. |
```rust
use blazen_llm::content::{detect, detect_from_bytes};

let (kind, mime) = detect_from_bytes(&[0x89, b'P', b'N', b'G', 0x0d, 0x0a, 0x1a, 0x0a]);
assert_eq!(mime.as_deref(), Some("image/png"));

let (kind2, mime2) = detect(None, Some("application/pdf"), Some("report.pdf"));
```

---

### `CompletionRequest::resolve_handles_with`

The lower-level half of [`prepare_request_with_store`](#visibility-helpers). Walks every message in the request, finds every [`ImageSource::Handle`](#imagesource), and replaces it with the concrete [`MediaSource`](#imagesource) returned by `store.resolve`. Does **not** inject a system note -- prefer `prepare_request_with_store` from the agent loop unless you specifically want to skip the note.

```rust
impl CompletionRequest {
    pub async fn resolve_handles_with(
        &mut self,
        store: &dyn ContentStore,
    ) -> Result<usize, BlazenError>;
}
```

Returns the number of `Handle` variants that were rewritten. Errors propagate from `store.resolve`.

```rust
let resolved = request.resolve_handles_with(store.as_ref()).await?;
println!("rewrote {resolved} handles in place");
```

---

## Agent System

The agent system implements the standard LLM + tool calling loop: send messages with tool definitions, execute any tool calls the model makes, feed results back, and repeat until the model stops or `max_iterations` is reached.

### `run_agent()`

Run the agent loop without event callbacks.

```rust
pub async fn run_agent(
    model: &dyn CompletionModel,
    messages: Vec<ChatMessage>,
    config: AgentConfig,
) -> Result<AgentResult, BlazenError>
```

### `run_agent_with_callback()`

Run the agent loop, emitting `AgentEvent`s to the supplied callback.

```rust
pub async fn run_agent_with_callback(
    model: &dyn CompletionModel,
    messages: Vec<ChatMessage>,
    config: AgentConfig,
    on_event: impl Fn(AgentEvent) + Send + Sync,
) -> Result<AgentResult, BlazenError>
```

**The loop works as follows:**

1. Build a `CompletionRequest` with the full message history and all tool definitions.
2. Call the model.
3. If the model responds with no tool calls, return immediately.
4. If the model invoked the built-in "finish" tool (when enabled), extract the answer and return.
5. Otherwise, execute each tool call, append results to messages, go back to step 1.
6. If `max_iterations` is reached, make one final call **without** tools to force a text answer.

**Usage:**

```rust
use std::sync::Arc;
use blazen_llm::{run_agent, AgentConfig, ChatMessage};

let config = AgentConfig::new(vec![Arc::new(WeatherTool)])
    .with_system_prompt("You are a helpful assistant with weather tools.")
    .with_max_iterations(5)
    .with_finish_tool()
    .with_temperature(0.7)
    .with_max_tokens(2048);

let result = run_agent(
    &model,
    vec![ChatMessage::user("What's the weather in Paris?")],
    config,
).await?;

println!("Answer: {}", result.response.content.unwrap_or_default());
println!("Iterations: {}", result.iterations);
println!("Total cost: ${:.4}", result.total_cost.unwrap_or(0.0));
```

**With callback:**

```rust
use blazen_llm::{run_agent_with_callback, AgentEvent};

let result = run_agent_with_callback(
    &model,
    vec![ChatMessage::user("What's the weather?")],
    config,
    |event| match &event {
        AgentEvent::ToolCalled { iteration, tool_call } => {
            println!("[iter {iteration}] calling tool: {}", tool_call.name);
        }
        AgentEvent::ToolResult { tool_name, result, .. } => {
            println!("  {tool_name} -> {result}");
        }
        AgentEvent::IterationComplete { iteration, had_tool_calls } => {
            println!("[iter {iteration}] done (tools: {had_tool_calls})");
        }
    },
).await?;
```

---

### `AgentConfig`

Configuration for the agentic tool execution loop.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `max_iterations` | `u32` | `10` | Maximum tool call rounds before forcing a stop |
| `tools` | `Vec<Arc<dyn Tool>>` | required | Tools available to the agent |
| `add_finish_tool` | `bool` | `false` | Add an implicit "finish" tool the model can call to exit early |
| `system_prompt` | `Option<String>` | `None` | System prompt prepended to messages |
| `temperature` | `Option<f32>` | `None` | Sampling temperature |
| `max_tokens` | `Option<u32>` | `None` | Maximum tokens per completion call |

**Builder pattern:**

```rust
AgentConfig::new(tools)
    .with_max_iterations(5)
    .with_system_prompt("You are helpful.")
    .with_finish_tool()
    .with_temperature(0.7)
    .with_max_tokens(2048)
```

---

### `AgentResult`

Result of an agent run.

| Field | Type | Description |
|-------|------|-------------|
| `response` | `CompletionResponse` | The final completion response |
| `messages` | `Vec<ChatMessage>` | Full message history including all tool calls and results |
| `iterations` | `u32` | Number of tool call rounds that occurred |
| `total_usage` | `Option<TokenUsage>` | Aggregated token usage across all rounds |
| `total_cost` | `Option<f64>` | Aggregated cost across all rounds |
| `timing` | `Option<RequestTiming>` | Total wall-clock time for the entire agent run |

---

### `AgentEvent`

Events emitted during agent execution (passed to the callback in `run_agent_with_callback`).

```rust
pub enum AgentEvent {
    ToolCalled {
        iteration: u32,
        tool_call: ToolCall,
    },
    ToolResult {
        iteration: u32,
        tool_name: String,
        result: serde_json::Value,
    },
    IterationComplete {
        iteration: u32,
        had_tool_calls: bool,
    },
}
```

---

## Context

The `Context` object is a shared key-value store available in every workflow step. It provides three storage tiers and methods for event routing, streaming, and state management.

### State Storage

#### Typed JSON: `set()` / `get()`

Store and retrieve any `Serialize` / `DeserializeOwned` type. Values are held internally as `StateValue::Json`.

```rust
// Store a typed value (anything implementing Serialize)
ctx.set("user_id", serde_json::json!("user_123"));
ctx.set("doc_count", serde_json::json!(5));

// Retrieve with type inference
let user_id: String = serde_json::from_value(ctx.get("user_id").unwrap()).unwrap();
let doc_count: i64 = serde_json::from_value(ctx.get("doc_count").unwrap()).unwrap();
```

#### Binary: `set_bytes()` / `get_bytes()`

Store raw `Vec<u8>` data. Values are held as `StateValue::Bytes`. No serialization requirement -- useful for model weights, protobuf, bincode, or any binary format.

```rust
ctx.set_bytes("weights", vec![0x01, 0x02, 0x03]);
let bytes: Vec<u8> = ctx.get_bytes("weights").unwrap();
```

#### Raw StateValue: `set_value()` / `get_value()`

Work with the `StateValue` enum directly for full control over the storage variant, including the `Native` variant used by language bindings.
```rust use blazen::context::StateValue; ctx.set_value("config", StateValue::Json(serde_json::json!({"retries": 3}))); ctx.set_value("blob", StateValue::Bytes(vec![0xDE, 0xAD].into())); ctx.set_value("py_obj", StateValue::Native(pickle_bytes.into())); match ctx.get_value("config") { Some(StateValue::Json(v)) => { /* structured data */ } Some(StateValue::Bytes(b)) => { /* raw bytes */ } Some(StateValue::Native(b)) => { /* platform-serialized opaque bytes */ } None => { /* key not found */ } } ``` ### `StateValue` ```rust pub enum StateValue { Json(serde_json::Value), Bytes(BytesWrapper), Native(BytesWrapper), } ``` | Variant | Description | |---------|-------------| | `Json(serde_json::Value)` | Structured, serializable data. Used by `ctx.set()` / `ctx.get()`. | | `Bytes(BytesWrapper)` | Raw binary data. Used by `ctx.set_bytes()` / `ctx.get_bytes()`. | | `Native(BytesWrapper)` | Platform-serialized opaque objects (e.g., Python pickle bytes). Preserved across language boundaries without deserialization. | ### Run Identity ```rust ctx.run_id() -> &str ``` Returns the unique identifier for the current workflow run. ### Event Routing ```rust ctx.send_event(event: impl Event) ``` Programmatically route an event into the workflow. Use this when a step needs to emit multiple events or decide at runtime which path to take. When using `send_event`, the step returns `()` instead of an event type. ```rust ctx.write_event_to_stream(event: impl Event) ``` Publish an event to the workflow's external event stream, observable by callers via `stream_events()`. Useful for progress reporting and live updates. ### Session References ```rust async fn session_refs_arc(&self) -> Arc async fn clear_session_refs(&self) -> usize async fn session_pause_policy(&self) -> SessionPausePolicy ``` | Method | Description | |--------|-------------| | `session_refs_arc()` | Get a clone of the session-ref registry handle for use by language bindings. Bindings install it as a task-local for the duration of a step so platform-native objects (`Py`, `napi::Ref`, etc.) carried via event payloads can be resolved by UUID. | | `clear_session_refs()` | Drain the session-ref registry. Called on workflow termination by the language bindings to release platform-specific live refs back to their respective garbage collectors. Returns the number of entries removed. | | `session_pause_policy()` | Get the configured [`SessionPausePolicy`](#sessionpausepolicy). The policy is set by [`WorkflowBuilder::session_pause_policy`](#workflowbuildersession_pause_policy); there is no public setter on `Context`. | See the dedicated [Session Reference Registry](#session-reference-registry) section for background, key types, and the pause-time policy matrix. ### State Snapshot and Restore ```rust ctx.collect_events() -> Vec> ctx.snapshot_state() -> ContextSnapshot ctx.restore_state(snapshot: ContextSnapshot) ``` | Method | Description | |--------|-------------| | `collect_events()` | Drain all pending events from the context. | | `snapshot_state()` | Capture the entire context state as a serializable snapshot (for checkpointing / pause-resume). | | `restore_state(snapshot)` | Restore context from a previously captured snapshot. | :::caution[Session refs are not snapshotted] `Context::snapshot_state` intentionally excludes both the opaque `objects` map and the `session_refs` registry. Live in-process references cannot survive a cross-process snapshot round-trip. 
If your workflow may `pause()` and your bindings use session refs, configure [`WorkflowBuilder::session_pause_policy`](#workflowbuildersession_pause_policy) to control what happens (default: `PickleOrError`). ::: --- ## BlazenState `BlazenState` is a binding-layer concept for Python, Node.js, and WASM. In those languages, extending a `BlazenState` base class gives you automatic per-field persistence in the workflow context. Rust has no equivalent base class. In Rust, per-field storage is achieved manually by calling `set_value()` / `get_value()` with the `StateValue` enum (see the [StateValue](#statevalue) section above). Each field is stored under an explicit key, giving you full control over serialization format and storage variant. The `Native(BytesWrapper)` variant exists specifically to support bindings: it lets platform-serialized objects (e.g., Python pickle bytes, Node.js `v8.serialize` output) round-trip through Rust steps without deserialization. Binding authors use `StateValue::Native` to store opaque platform objects, and Rust code can forward those values without interpreting their contents. --- ## Session Reference Registry `blazen_core::session_ref` provides a per-`Context` registry of **live in-process references** — values that cannot or should not be JSON-serialized, such as DB connections, file handles, large in-memory tensors, lambdas, locks, or platform-native objects like `Py` and `napi::Ref`. Each `Context` owns its own `SessionRefRegistry` with a lifetime tied to the workflow run. Event payloads carry only a JSON marker containing the key; the actual object lives in the registry until workflow completion. Bindings detect the marker and resolve it through the active registry to preserve object identity across step boundaries without serialisation. The JSON marker format is: ```json {"__blazen_session_ref__": ""} ``` The tag string is exposed as a constant: ```rust pub const SESSION_REF_TAG: &str = "__blazen_session_ref__"; ``` A defensive cap protects against runaway loops exhausting memory: ```rust pub const MAX_SESSION_REFS_PER_RUN: usize = 10_000; ``` `insert_arc` / `insert` return [`SessionRefError::CapacityExceeded`](#sessionreferror) once the registry reaches this cap. ### `RegistryKey` Strongly-typed wrapper around `Uuid` used as the registry key. ```rust #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Serialize, Deserialize)] #[serde(transparent)] pub struct RegistryKey(pub Uuid); ``` | Method | Signature | Description | |--------|-----------|-------------| | `new()` | `fn() -> Self` | Mint a fresh random key. | | `parse(s)` | `fn(&str) -> Result` | Parse a key from a UUID string. | | `Display` | -- | Formats as the underlying UUID. | ### `SessionRefRegistry` Per-context registry of live session references. Internally `Arc` keyed by `RegistryKey` and guarded by a `tokio::sync::RwLock`. ```rust pub struct SessionRefRegistry { /* ... */ } ``` | Method | Signature | Description | |--------|-----------|-------------| | `new()` | `fn() -> Self` | Create an empty registry. | | `insert_arc()` | `async fn(&self, Arc) -> Result` | Insert a type-erased `Arc` directly. Returns the freshly minted key or `CapacityExceeded`. | | `insert::()` | `async fn(&self, T) -> Result` | Insert any `Any + Send + Sync + 'static` value, wrapping it in an `Arc` for you. | | `get_any()` | `async fn(&self, RegistryKey) -> Option>` | Look up the type-erased entry. Bindings call this and downcast. 
| | `get::()` | `async fn(&self, RegistryKey) -> Option>` | Look up and downcast to a concrete `Arc`. | | `remove()` | `async fn(&self, RegistryKey) -> Option>` | Remove a single entry, returning the removed value if present. | | `drain()` | `async fn(&self) -> usize` | Drain all entries, returning the number removed. | | `len()` | `async fn(&self) -> usize` | Number of currently live entries. | | `is_empty()` | `async fn(&self) -> bool` | Whether the registry has any live entries. | | `keys()` | `async fn(&self) -> Vec` | Iterate every key currently in the registry. Used by the snapshot walker to apply [`SessionPausePolicy`](#sessionpausepolicy) uniformly. | ### `SessionPausePolicy` Controls what happens to live session references when a workflow is paused or snapshotted. ```rust #[derive(Debug, Clone, Copy, Default, PartialEq, Eq, Serialize, Deserialize)] #[serde(rename_all = "snake_case")] pub enum SessionPausePolicy { #[default] PickleOrError, WarnDrop, HardError, } ``` | Variant | Behaviour | |---------|-----------| | `PickleOrError` (**default**) | Attempt to pickle each live ref into the snapshot. On any failure, raise [`WorkflowError::SessionRefsNotSerializable`](#workflowerror-variants) and abort the snapshot. Recommended. | | `WarnDrop` | Drop live refs from the snapshot, emit `tracing::warn!` per drop, and store a diagnostic report in snapshot metadata. On resume, accessing a dropped field raises a clear runtime error from the binding. | | `HardError` | Refuse to pause if any live refs are in flight. Raises `WorkflowError::SessionRefsNotSerializable` immediately. | ### `SessionRefError` Error type for registry operations. ```rust #[derive(Debug, thiserror::Error)] pub enum SessionRefError { #[error("session ref registry capacity exceeded ({cap} entries) — too many live references in this workflow run")] CapacityExceeded { cap: usize }, } ``` | Variant | Description | |---------|-------------| | `CapacityExceeded { cap: usize }` | Returned when `SessionRefRegistry::insert_arc` is called while the registry already holds `MAX_SESSION_REFS_PER_RUN` entries. | ### Snapshot exclusion Session-ref entries are **deliberately excluded from `Context::snapshot_state()`**, mirroring the existing `objects` exclusion. Live in-process references cannot meaningfully round-trip through a serialized snapshot, so the snapshot walker applies [`SessionPausePolicy`](#sessionpausepolicy) at pause time instead. See [State Snapshot and Restore](#state-snapshot-and-restore) for the callout. --- ## Workflow ### `WorkflowBuilder` The builder exposes a fluent API for configuring a workflow before `build()`. The full set of builder methods is documented in the guides; the entry below covers the session-ref configuration knob added alongside the session reference registry. #### `WorkflowBuilder::session_pause_policy` ```rust pub fn session_pause_policy(mut self, policy: SessionPausePolicy) -> Self ``` Configures the policy applied to live session references when the workflow is paused or snapshotted. Defaults to `PickleOrError`. See [`SessionPausePolicy`](#sessionpausepolicy) for the full variant matrix. **Usage:** ```rust use blazen_core::{WorkflowBuilder, session_ref::SessionPausePolicy}; let workflow = WorkflowBuilder::new("my-workflow") .step(my_step) .session_pause_policy(SessionPausePolicy::WarnDrop) .build()?; ``` ### `WorkflowHandler` The handle returned after starting a workflow. Provides await/stream/pause modes for consuming workflow results. 
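Before the individual methods, a minimal sketch of the pause / snapshot / resume flow. This is illustrative only: it assumes `handler` is the `WorkflowHandler` obtained when the workflow was started and that `WorkflowHandler` is importable from `blazen_core` alongside `WorkflowBuilder`; the methods themselves are documented below.

```rust
use blazen_core::WorkflowHandler; // assumption: exported from blazen_core like WorkflowBuilder

async fn checkpoint_then_finish(handler: &WorkflowHandler) -> Result<(), Box<dyn std::error::Error>> {
    // Signal the pause (synchronous), then capture the paused state as a JSON string.
    handler.pause()?;
    let snapshot = handler.snapshot().await?;
    std::fs::write("run.snapshot.json", &snapshot)?;

    // Continue in the same process; alternatively the persisted snapshot can be
    // fed to Workflow::resume() later, as described under snapshot().
    handler.resume_in_place();

    // Await the final result; `.event` is typically the StopEvent.
    let result = handler.result().await?;
    let _final_json = result.event.to_json();
    Ok(())
}
```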
#### `WorkflowHandler::session_refs` ```rust #[must_use] pub fn session_refs(&self) -> Arc ``` Returns a clone of the session-ref registry handle. Bindings call this after [`result()`](#workflowhandler) to resolve any `__blazen_session_ref__` markers carried by the final event, ensuring identity-preserving access to live Python / JS objects passed via event payloads. The returned `Arc` keeps the registry alive past the event loop's exit so the final result event can still resolve live-ref markers even after the original `Context` has been dropped. #### `WorkflowHandler::result` ```rust pub async fn result(&self) -> Result ``` Await the final workflow result. The returned `WorkflowResult` has an `.event` field containing the final event (typically a `StopEvent`). Use `result.event.downcast_ref::()` to access typed data, or `result.event.to_json()` for a JSON representation. #### `WorkflowHandler::pause` ```rust pub fn pause(&self) -> Result<()> ``` Signal the workflow to pause. This is a synchronous, non-consuming call -- it does not return a snapshot. After pausing, call [`snapshot()`](#workflowhandlersnapshot) to obtain the serialized state. #### `WorkflowHandler::snapshot` ```rust pub async fn snapshot(&self) -> Result ``` Capture the workflow's current state as a JSON string. Typically called after [`pause()`](#workflowhandlerpause). The snapshot can be persisted and later passed to `Workflow::resume()`. #### `WorkflowHandler::resume_in_place` ```rust pub fn resume_in_place(&self) ``` Resume a paused workflow in-place, continuing execution from where it left off. #### `WorkflowHandler::respond_to_input` ```rust pub fn respond_to_input(&self, request_id: String, response: serde_json::Value) ``` Supply a response to a pending `InputRequestEvent`. The `request_id` must match the ID from the original request. The workflow will route the response to the appropriate step and continue. #### `WorkflowHandler::abort` ```rust pub async fn abort(&self) -> Result<()> ``` Abort the running workflow. Any pending steps are cancelled and the workflow terminates with an error. ### `WorkflowError` variants `blazen_core::WorkflowError` is the unified workflow error type. The session-ref subsystem introduces one new variant; other variants (e.g. `Paused`, `InputRequired`, `Other`) are documented in the workflow guides. #### `SessionRefsNotSerializable` ```rust #[error("session refs cannot be serialized for snapshot: {keys:?}")] SessionRefsNotSerializable { /// String-formatted UUIDs of the live session refs that could not /// be persisted. keys: Vec, } ``` One or more live session references could not be serialized for a snapshot. The `keys` vector contains the string-formatted UUIDs of the offending entries. Produced by the default `PickleOrError` pause policy when a live ref is not picklable, and by `HardError` whenever any live refs are in flight at pause time. --- ## Compute Platform The compute module provides a unified trait system for async, job-based media generation providers (fal.ai, Replicate, RunPod, etc.) that model a submit-poll-retrieve workflow for GPU workloads. ### `ComputeProvider` The base trait for compute providers. 
```rust #[async_trait] pub trait ComputeProvider: Send + Sync { fn provider_id(&self) -> &str; async fn submit(&self, request: ComputeRequest) -> Result; async fn status(&self, job: &JobHandle) -> Result; async fn result(&self, job: JobHandle) -> Result; async fn cancel(&self, job: &JobHandle) -> Result<(), BlazenError>; // Default: submit then wait for result async fn run(&self, request: ComputeRequest) -> Result { let job = self.submit(request).await?; self.result(job).await } } ``` --- ### `ImageGeneration` Image generation and upscaling. Requires `ComputeProvider` as a supertrait. ```rust #[async_trait] pub trait ImageGeneration: ComputeProvider { async fn generate_image(&self, request: ImageRequest) -> Result; async fn upscale_image(&self, request: UpscaleRequest) -> Result; } ``` **Usage:** ```rust use blazen_llm::compute::{ImageGeneration, ImageRequest}; let result = provider.generate_image( ImageRequest::new("a cat in space") .with_size(1024, 1024) .with_count(2) .with_negative_prompt("blurry") .with_model("flux-dev"), ).await?; for image in &result.images { println!("url: {:?}, {}x{}", image.media.url, image.width.unwrap_or(0), image.height.unwrap_or(0)); } ``` --- ### `VideoGeneration` Video synthesis from text or images. Requires `ComputeProvider` as a supertrait. ```rust #[async_trait] pub trait VideoGeneration: ComputeProvider { async fn text_to_video(&self, request: VideoRequest) -> Result; async fn image_to_video(&self, request: VideoRequest) -> Result; } ``` **Usage:** ```rust use blazen_llm::compute::{VideoGeneration, VideoRequest}; // Text-to-video let result = provider.text_to_video( VideoRequest::new("a sunset timelapse") .with_duration(5.0) .with_size(1920, 1080) .with_model("kling"), ).await?; // Image-to-video let result = provider.image_to_video( VideoRequest::for_image("https://example.com/img.png", "animate this scene") .with_duration(3.0), ).await?; ``` --- ### `AudioGeneration` Audio synthesis including TTS, music, and sound effects. Requires `ComputeProvider` as a supertrait. ```rust #[async_trait] pub trait AudioGeneration: ComputeProvider { async fn text_to_speech(&self, request: SpeechRequest) -> Result; // Default: returns BlazenError::Unsupported async fn generate_music(&self, request: MusicRequest) -> Result; // Default: returns BlazenError::Unsupported async fn generate_sfx(&self, request: MusicRequest) -> Result; } ``` `generate_music()` and `generate_sfx()` have default implementations that return `BlazenError::Unsupported`. Providers override only the methods they support. **Usage:** ```rust use blazen_llm::compute::{AudioGeneration, SpeechRequest, MusicRequest}; let speech = provider.text_to_speech( SpeechRequest::new("Hello world") .with_voice("alloy") .with_language("en") .with_speed(1.0) .with_voice_url("https://example.com/voice.wav") // voice cloning .with_model("tts-1"), ).await?; let music = provider.generate_music( MusicRequest::new("upbeat jazz") .with_duration(30.0) .with_model("musicgen"), ).await?; ``` --- ### `Transcription` Audio transcription (speech-to-text). Requires `ComputeProvider` as a supertrait. 
```rust #[async_trait] pub trait Transcription: ComputeProvider { async fn transcribe( &self, request: TranscriptionRequest, ) -> Result; } ``` **Usage:** ```rust use blazen_llm::compute::{Transcription, TranscriptionRequest}; let result = provider.transcribe( TranscriptionRequest::new("https://example.com/audio.mp3") .with_language("en") .with_diarize(true) .with_model("whisper-v3"), ).await?; println!("Full text: {}", result.text); for segment in &result.segments { println!("[{:.1}s - {:.1}s] {}: {}", segment.start, segment.end, segment.speaker.as_deref().unwrap_or("?"), segment.text, ); } ``` --- ### `ThreeDGeneration` 3D model generation from text or images. Requires `ComputeProvider` as a supertrait. ```rust #[async_trait] pub trait ThreeDGeneration: ComputeProvider { async fn generate_3d(&self, request: ThreeDRequest) -> Result; } ``` **Usage:** ```rust use blazen_llm::compute::{ThreeDGeneration, ThreeDRequest}; // Text-to-3D let result = provider.generate_3d( ThreeDRequest::new("a 3D cat") .with_format("glb") .with_model("triposr"), ).await?; // Image-to-3D let result = provider.generate_3d( ThreeDRequest::from_image("https://example.com/cat.png") .with_format("obj"), ).await?; for model_3d in &result.models { println!("vertices: {:?}, faces: {:?}, textures: {}, animations: {}", model_3d.vertex_count, model_3d.face_count, model_3d.has_textures, model_3d.has_animations, ); } ``` --- ### Compute Request Types #### `ImageRequest` | Field | Type | Description | |-------|------|-------------| | `prompt` | `String` | Text prompt describing the desired image | | `negative_prompt` | `Option` | Things to avoid in the image | | `width` | `Option` | Desired width in pixels | | `height` | `Option` | Desired height in pixels | | `num_images` | `Option` | Number of images to generate | | `model` | `Option` | Model override | | `parameters` | `serde_json::Value` | Additional provider-specific parameters | Builder: `ImageRequest::new(prompt).with_size(w, h).with_count(n).with_negative_prompt(p).with_model(m)` #### `UpscaleRequest` | Field | Type | Description | |-------|------|-------------| | `image_url` | `String` | URL of the image to upscale | | `scale` | `f32` | Scale factor (e.g. 2.0, 4.0) | | `model` | `Option` | Model override | | `parameters` | `serde_json::Value` | Additional provider-specific parameters | Builder: `UpscaleRequest::new(url, scale).with_model(m)` #### `VideoRequest` | Field | Type | Description | |-------|------|-------------| | `prompt` | `String` | Text prompt | | `image_url` | `Option` | Source image for image-to-video | | `duration_seconds` | `Option` | Desired duration in seconds | | `negative_prompt` | `Option` | Things to avoid | | `width` | `Option` | Desired width in pixels | | `height` | `Option` | Desired height in pixels | | `model` | `Option` | Model override | | `parameters` | `serde_json::Value` | Additional provider-specific parameters | Builder: `VideoRequest::new(prompt)` or `VideoRequest::for_image(url, prompt)`, then `.with_duration(s).with_size(w, h).with_model(m)` #### `SpeechRequest` | Field | Type | Description | |-------|------|-------------| | `text` | `String` | Text to synthesize | | `voice` | `Option` | Voice identifier (provider-specific) | | `voice_url` | `Option` | Reference voice URL for voice cloning | | `language` | `Option` | Language code (e.g. 
`"en"`, `"fr"`) | | `speed` | `Option` | Speed multiplier (1.0 = normal) | | `model` | `Option` | Model override | | `parameters` | `serde_json::Value` | Additional provider-specific parameters | Builder: `SpeechRequest::new(text).with_voice(v).with_voice_url(url).with_language(l).with_speed(s).with_model(m)` #### `MusicRequest` | Field | Type | Description | |-------|------|-------------| | `prompt` | `String` | Text prompt | | `duration_seconds` | `Option` | Desired duration in seconds | | `model` | `Option` | Model override | | `parameters` | `serde_json::Value` | Additional provider-specific parameters | Builder: `MusicRequest::new(prompt).with_duration(s).with_model(m)` #### `TranscriptionRequest` | Field | Type | Description | |-------|------|-------------| | `audio_url` | `String` | URL of the audio file | | `language` | `Option` | Language hint | | `diarize` | `bool` | Enable speaker diarization (default: `false`) | | `model` | `Option` | Model override | | `parameters` | `serde_json::Value` | Additional provider-specific parameters | Builder: `TranscriptionRequest::new(url).with_language(l).with_diarize(true).with_model(m)` #### `ThreeDRequest` | Field | Type | Description | |-------|------|-------------| | `prompt` | `String` | Text prompt | | `image_url` | `Option` | Source image for image-to-3D | | `format` | `Option` | Output format (e.g. `"glb"`, `"obj"`, `"usdz"`) | | `model` | `Option` | Model override | | `parameters` | `serde_json::Value` | Additional provider-specific parameters | Builder: `ThreeDRequest::new(prompt)` or `ThreeDRequest::from_image(url)`, then `.with_format(f).with_model(m)` --- ### Compute Result Types #### `ImageResult` | Field | Type | Description | |-------|------|-------------| | `images` | `Vec` | The generated/upscaled images | | `timing` | `RequestTiming` | Request timing breakdown | | `cost` | `Option` | Cost in USD | | `metadata` | `serde_json::Value` | Provider-specific metadata | #### `VideoResult` | Field | Type | Description | |-------|------|-------------| | `videos` | `Vec` | The generated videos | | `timing` | `RequestTiming` | Request timing breakdown | | `cost` | `Option` | Cost in USD | | `metadata` | `serde_json::Value` | Provider-specific metadata | #### `AudioResult` | Field | Type | Description | |-------|------|-------------| | `audio` | `Vec` | The generated audio clips | | `timing` | `RequestTiming` | Request timing breakdown | | `cost` | `Option` | Cost in USD | | `metadata` | `serde_json::Value` | Provider-specific metadata | #### `ThreeDResult` | Field | Type | Description | |-------|------|-------------| | `models` | `Vec` | The generated 3D models | | `timing` | `RequestTiming` | Request timing breakdown | | `cost` | `Option` | Cost in USD | | `metadata` | `serde_json::Value` | Provider-specific metadata | #### `TranscriptionResult` | Field | Type | Description | |-------|------|-------------| | `text` | `String` | Full transcribed text | | `segments` | `Vec` | Time-aligned segments | | `language` | `Option` | Detected/specified language code | | `timing` | `RequestTiming` | Request timing breakdown | | `cost` | `Option` | Cost in USD | | `metadata` | `serde_json::Value` | Provider-specific metadata | #### `TranscriptionSegment` | Field | Type | Description | |-------|------|-------------| | `text` | `String` | Transcribed text for this segment | | `start` | `f64` | Start time in seconds | | `end` | `f64` | End time in seconds | | `speaker` | `Option` | Speaker label (if diarization was enabled) | --- ### Compute Job Types 
#### `ComputeRequest` | Field | Type | Description | |-------|------|-------------| | `model` | `String` | Model/endpoint to run (e.g. `"fal-ai/flux/dev"`) | | `input` | `serde_json::Value` | Input parameters as JSON (model-specific) | | `webhook` | `Option` | Webhook URL for async completion notification | #### `ComputeResult` | Field | Type | Description | |-------|------|-------------| | `job` | `Option` | The job handle that produced this result | | `output` | `serde_json::Value` | Output data (model-specific JSON) | | `timing` | `RequestTiming` | Request timing breakdown | | `cost` | `Option` | Cost in USD | | `metadata` | `serde_json::Value` | Provider-specific metadata | #### `JobHandle` | Field | Type | Description | |-------|------|-------------| | `id` | `String` | Provider-assigned job identifier | | `provider` | `String` | Provider name (e.g. `"fal"`) | | `model` | `String` | Model/endpoint that was invoked | | `submitted_at` | `DateTime` | When the job was submitted | #### `JobStatus` ```rust pub enum JobStatus { Queued, Running, Completed, Failed { error: String }, Cancelled, } ``` --- ## Media ### `MediaType` Exhaustive enumeration of media formats with detection support. Covers images, video, audio, 3D models, documents, and a catch-all `Other` variant. **Variants:** | Category | Variants | |----------|----------| | Image | `Png`, `Jpeg`, `WebP`, `Gif`, `Svg`, `Bmp`, `Tiff`, `Avif`, `Ico` | | Video | `Mp4`, `WebM`, `Mov`, `Avi`, `Mkv` | | Audio | `Mp3`, `Wav`, `Ogg`, `Flac`, `Aac`, `M4a`, `WebmAudio` | | 3D | `Glb`, `Gltf`, `Obj`, `Fbx`, `Usdz`, `Stl`, `Ply` | | Document | `Pdf` | | Catch-all | `Other { mime: String }` | **Methods:** | Method | Signature | Description | |--------|-----------|-------------| | `mime()` | `&self -> &str` | Return the MIME type string | | `extension()` | `&self -> &str` | Return the canonical file extension (no dot) | | `magic_bytes()` | `&self -> Option<&'static [u8]>` | Return the magic byte signature, if any | | `detect(bytes)` | `fn(&[u8]) -> Option` | Detect media type from file header bytes | | `from_mime(mime)` | `fn(&str) -> Self` | Parse a MIME string (unknown = `Other`) | | `from_extension(ext)` | `fn(&str) -> Self` | Parse a file extension (unknown = `Other`) | | `is_image()` | `&self -> bool` | Is this an image format? | | `is_video()` | `&self -> bool` | Is this a video format? | | `is_audio()` | `&self -> bool` | Is this an audio format? | | `is_3d()` | `&self -> bool` | Is this a 3D model format? | | `is_vector()` | `&self -> bool` | Is this a text-based format (SVG, GLTF, OBJ)? | `MediaType` implements `Display` (outputs the MIME string). **Example:** ```rust use blazen_llm::MediaType; let mt = MediaType::from_extension("png"); assert_eq!(mt.mime(), "image/png"); assert!(mt.is_image()); // Detect from raw bytes let bytes = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]; assert_eq!(MediaType::detect(&bytes), Some(MediaType::Png)); ``` --- ### `MediaOutput` A single piece of generated media content. At least one of `url`, `base64`, or `raw_content` will be populated. 
| Field | Type | Description | |-------|------|-------------| | `url` | `Option` | URL where the media can be downloaded | | `base64` | `Option` | Base64-encoded media data | | `raw_content` | `Option` | Raw text content (SVG, OBJ, GLTF JSON) | | `media_type` | `MediaType` | Format of the media | | `file_size` | `Option` | File size in bytes | | `metadata` | `serde_json::Value` | Provider-specific metadata | **Constructors:** ```rust let output = MediaOutput::from_url("https://example.com/img.png", MediaType::Png); let output = MediaOutput::from_base64("iVBORw0KGgo=", MediaType::Png); ``` --- ### `GeneratedImage` | Field | Type | Description | |-------|------|-------------| | `media` | `MediaOutput` | The image media output | | `width` | `Option` | Width in pixels | | `height` | `Option` | Height in pixels | ### `GeneratedVideo` | Field | Type | Description | |-------|------|-------------| | `media` | `MediaOutput` | The video media output | | `width` | `Option` | Width in pixels | | `height` | `Option` | Height in pixels | | `duration_seconds` | `Option` | Duration in seconds | | `fps` | `Option` | Frames per second | ### `GeneratedAudio` | Field | Type | Description | |-------|------|-------------| | `media` | `MediaOutput` | The audio media output | | `duration_seconds` | `Option` | Duration in seconds | | `sample_rate` | `Option` | Sample rate in Hz | | `channels` | `Option` | Number of audio channels | ### `Generated3DModel` | Field | Type | Description | |-------|------|-------------| | `media` | `MediaOutput` | The 3D model media output | | `vertex_count` | `Option` | Total vertex count | | `face_count` | `Option` | Total face/triangle count | | `has_textures` | `bool` | Whether the model includes textures | | `has_animations` | `bool` | Whether the model includes animations | --- ## LocalModel Trait The `LocalModel` trait provides explicit load/unload lifecycle management for models running in-process (llama.cpp, whisper.cpp, etc.). Remote API providers do not implement this trait. ```rust #[async_trait] pub trait LocalModel: Send + Sync { async fn load(&self) -> Result<(), BlazenError>; async fn unload(&self) -> Result<(), BlazenError>; async fn is_loaded(&self) -> bool; async fn vram_bytes(&self) -> Option; } ``` | Method | Description | |--------|-------------| | `load()` | Load the model into memory/VRAM. Idempotent. | | `unload()` | Free the model's memory/VRAM. Idempotent. | | `is_loaded()` | Whether the model is currently loaded. | | `vram_bytes()` | Approximate memory footprint in bytes, or `None` if unknown. | A type can implement both `CompletionModel` and `LocalModel`: ```rust struct MyLocalLLM { /* ... */ } #[async_trait::async_trait] impl CompletionModel for MyLocalLLM { fn model_id(&self) -> &str { "my-local-llm" } async fn complete(&self, request: CompletionRequest) -> Result { self.load().await?; // auto-load on first call // inference logic todo!() } async fn stream(&self, request: CompletionRequest) -> Result> + Send>>, BlazenError> { todo!() } } #[async_trait::async_trait] impl LocalModel for MyLocalLLM { async fn load(&self) -> Result<(), BlazenError> { /* load weights */ Ok(()) } async fn unload(&self) -> Result<(), BlazenError> { /* free VRAM */ Ok(()) } async fn is_loaded(&self) -> bool { true } async fn vram_bytes(&self) -> Option { Some(4_000_000_000) } } ``` --- ## ModelManager VRAM budget-aware model manager with LRU eviction. Tracks registered `LocalModel` instances and their estimated VRAM footprint. 
When loading a model that would exceed the budget, the least-recently-used loaded model is unloaded first. ```rust use blazen_manager::ModelManager; let manager = ModelManager::new(24 * 1_073_741_824); // 24 GiB // or let manager = ModelManager::with_budget_gb(24.0); ``` ### Methods | Method | Signature | Description | |--------|-----------|-------------| | `new` | `fn(budget_bytes: u64) -> Self` | Create a manager with a byte budget. | | `with_budget_gb` | `fn(gb: f64) -> Self` | Create a manager with a GiB budget. | | `register` | `async fn(&self, id, model: Arc, vram_estimate: u64)` | Register a model. Starts unloaded. | | `load` | `async fn(&self, id: &str) -> Result<()>` | Load a model, evicting LRU models if needed. | | `unload` | `async fn(&self, id: &str) -> Result<()>` | Unload a model and free its VRAM. | | `is_loaded` | `async fn(&self, id: &str) -> bool` | Check if a model is currently loaded. | | `ensure_loaded` | `async fn(&self, id: &str) -> Result<()>` | Alias for `load()`. | | `used_bytes` | `async fn(&self) -> u64` | Total VRAM used by loaded models. | | `available_bytes` | `async fn(&self) -> u64` | Available VRAM within the budget. | | `status` | `async fn(&self) -> Vec` | Status of all registered models. | ### ModelStatus | Field | Type | Description | |-------|------|-------------| | `id` | `String` | Model identifier. | | `loaded` | `bool` | Whether the model is currently loaded. | | `vram_estimate` | `u64` | Estimated VRAM footprint in bytes. | **Usage:** ```rust use std::sync::Arc; use blazen_manager::ModelManager; let manager = ModelManager::with_budget_gb(24.0); manager.register("llama", Arc::new(my_llama_model), 8 * 1_073_741_824).await; manager.register("whisper", Arc::new(my_whisper_model), 2 * 1_073_741_824).await; manager.load("llama").await?; assert!(manager.is_loaded("llama").await); println!("Used: {} bytes", manager.used_bytes().await); println!("Available: {} bytes", manager.available_bytes().await); // Loading whisper may evict llama if budget is tight manager.load("whisper").await?; manager.unload("llama").await?; ``` --- ## Pricing The pricing module provides a global thread-safe registry of per-model pricing data, pre-seeded with defaults for well-known models. ### `PricingEntry` | Field | Type | Description | |-------|------|-------------| | `input_per_million` | `f64` | Cost per million input tokens (USD). | | `output_per_million` | `f64` | Cost per million output tokens (USD). | ### `register_pricing()` Register or override pricing for a model. Model IDs are normalized before storage. ```rust use blazen_llm::{register_pricing, PricingEntry}; register_pricing("my-model", PricingEntry { input_per_million: 1.0, output_per_million: 2.0, }); ``` ### `lookup_pricing()` Look up pricing for a model by ID. Returns `None` if the model is unknown. ```rust use blazen_llm::lookup_pricing; if let Some(entry) = lookup_pricing("gpt-4o") { println!("Input: ${}/M tokens", entry.input_per_million); } ``` ### `compute_cost()` Compute the cost of a request given a model ID and token usage. ```rust use blazen_llm::{compute_cost, TokenUsage}; let usage = TokenUsage { prompt_tokens: 1000, completion_tokens: 500, total_tokens: 1500 }; if let Some(cost) = compute_cost("gpt-4o", &usage) { println!("Cost: ${:.4}", cost); } ``` --- ## MemoryBackend Trait Low-level storage backend used by `Memory`. Backends are responsible for persistence and band-based candidate retrieval. They do not perform embedding or ELID encoding. 
```rust
#[async_trait]
pub trait MemoryBackend: Send + Sync {
    async fn put(&self, entry: StoredEntry) -> Result<()>;
    async fn get(&self, id: &str) -> Result<Option<StoredEntry>>;
    async fn delete(&self, id: &str) -> Result<bool>;
    async fn list(&self) -> Result<Vec<StoredEntry>>;
    async fn len(&self) -> Result<usize>;
    async fn is_empty(&self) -> Result<bool>; // default: self.len() == 0
    async fn search_by_bands(
        &self,
        bands: &[u64],
        limit: usize,
    ) -> Result<Vec<StoredEntry>>;
}
```

### Implementing a Custom Backend

```rust
use blazen_memory::store::{MemoryBackend, StoredEntry};
use anyhow::Result;

struct PostgresBackend {
    pool: sqlx::PgPool,
}

#[async_trait::async_trait]
impl MemoryBackend for PostgresBackend {
    async fn put(&self, entry: StoredEntry) -> Result<()> {
        // INSERT or UPDATE in Postgres
        todo!()
    }

    async fn get(&self, id: &str) -> Result<Option<StoredEntry>> {
        // SELECT by id
        todo!()
    }

    async fn delete(&self, id: &str) -> Result<bool> {
        // DELETE by id, return true if row existed
        todo!()
    }

    async fn list(&self) -> Result<Vec<StoredEntry>> {
        // SELECT all
        todo!()
    }

    async fn len(&self) -> Result<usize> {
        // SELECT COUNT(*)
        todo!()
    }

    async fn search_by_bands(
        &self,
        bands: &[u64],
        limit: usize,
    ) -> Result<Vec<StoredEntry>> {
        // Query entries sharing at least one band
        todo!()
    }
}
```

### Built-in Backends

| Backend | Description |
|---------|-------------|
| `InMemoryBackend` | In-process `HashMap` storage. Fast, no persistence. |
| `JsonlBackend` | Append-only JSONL file storage. |
| `ValkeyBackend` | Redis/Valkey-backed storage. |

---

## Error Handling

### `BlazenError`

The unified error type for all Blazen LLM and compute operations.

| Variant | Fields | Description |
|---------|--------|-------------|
| `Auth` | `message: String` | Authentication failed |
| `RateLimit` | `retry_after_ms: Option<u64>` | Rate limited by the provider |
| `Timeout` | `elapsed_ms: u64` | Request timed out |
| `Provider` | `provider: String, message: String, status_code: Option<u16>` | Provider-specific error |
| `Validation` | `field: Option<String>, message: String` | Invalid input |
| `ContentPolicy` | `message: String` | Content policy violation |
| `Unsupported` | `message: String` | Requested capability is not supported |
| `Serialization` | `String` | JSON serialization/deserialization error |
| `Request` | `message: String, source: Option<Box<dyn std::error::Error + Send + Sync>>` | Network or request-level failure |
| `Completion` | `CompletionErrorKind` | LLM completion-specific error |
| `Compute` | `ComputeErrorKind` | Compute job-specific error |
| `Media` | `MediaErrorKind` | Media-specific error |
| `Tool` | `name: Option<String>, message: String` | Tool execution error |

### `CompletionErrorKind`

| Variant | Description |
|---------|-------------|
| `NoContent` | Model returned no content |
| `ModelNotFound(String)` | Model not found |
| `InvalidResponse(String)` | Invalid response from the model |
| `Stream(String)` | Streaming error |

### `ComputeErrorKind`

| Variant | Fields | Description |
|---------|--------|-------------|
| `JobFailed` | `message: String, error_type: Option<String>, retryable: bool` | Compute job failed |
| `Cancelled` | -- | Job was cancelled |
| `QuotaExceeded` | `message: String` | Provider quota exceeded |

### `MediaErrorKind`

| Variant | Fields | Description |
|---------|--------|-------------|
| `Invalid` | `media_type: Option<MediaType>, message: String` | Invalid media |
| `TooLarge` | `size_bytes: u64, max_bytes: u64` | Media exceeds size limit |

### `is_retryable()`

```rust
impl BlazenError {
    pub fn is_retryable(&self) -> bool;
}
```

Returns `true` for `RateLimit`, `Timeout`, `Request`, provider errors with status >= 500, and `ComputeErrorKind::JobFailed` where `retryable` is `true`.
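Where `is_retryable()` typically gets used: a minimal retry sketch. This is not a Blazen API -- the helper name and backoff values are illustrative, the `RateLimit { retry_after_ms }` match assumes the variant shape shown in the table above, and a Tokio runtime is assumed for the sleep.

```rust
use std::time::Duration;
use blazen_llm::BlazenError;

/// Illustrative helper: retry an async operation while the error is retryable.
/// `op` produces a fresh future per attempt (e.g. a completion or compute call).
async fn with_retries<T, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, BlazenError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, BlazenError>>,
{
    let mut attempt = 0;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if err.is_retryable() && attempt + 1 < max_attempts => {
                // Honour the provider's hint when rate limited; otherwise back off exponentially.
                let delay_ms = match &err {
                    BlazenError::RateLimit { retry_after_ms } => retry_after_ms.unwrap_or(1_000),
                    _ => 500 * 2u64.pow(attempt),
                };
                tokio::time::sleep(Duration::from_millis(delay_ms)).await;
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```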
### Convenience Constructors

```rust
BlazenError::auth("invalid api key")
BlazenError::timeout(5000)
BlazenError::timeout_from_duration(elapsed)
BlazenError::request("connection reset")
BlazenError::unsupported("music generation not available")
BlazenError::provider("openai", "internal server error")
BlazenError::validation("prompt must not be empty")
BlazenError::tool_error("unknown tool: foo")
BlazenError::no_content()
BlazenError::model_not_found("gpt-5")
BlazenError::invalid_response("missing content field")
BlazenError::stream_error("unexpected EOF")
BlazenError::job_failed("GPU out of memory")
BlazenError::cancelled()
```

`BlazenError` also implements `From` for automatic conversion from underlying error types.

---

## Custom Providers

### Implementing `CompletionModel`

```rust
use blazen_llm::{
    CompletionModel, CompletionRequest, CompletionResponse, StreamChunk, BlazenError,
};
use std::pin::Pin;
use futures_util::Stream;

struct MyProvider {
    api_key: String,
}

#[async_trait::async_trait]
impl CompletionModel for MyProvider {
    fn model_id(&self) -> &str {
        "my-custom-model"
    }

    async fn complete(
        &self,
        request: CompletionRequest,
    ) -> Result<CompletionResponse, BlazenError> {
        // Your HTTP/gRPC/local inference logic here
        todo!()
    }

    async fn stream(
        &self,
        request: CompletionRequest,
    ) -> Result<
        Pin<Box<dyn Stream<Item = Result<StreamChunk, BlazenError>> + Send>>,
        BlazenError,
    > {
        // Your streaming implementation here
        todo!()
    }
}
```

Once implemented, `MyProvider` automatically gets `StructuredOutput` via the blanket impl, so `model.extract::<T>(messages)` works out of the box.

### Implementing `ComputeProvider` + `ImageGeneration`

```rust
use blazen_llm::compute::*;
use blazen_llm::BlazenError;

struct MyImageProvider {
    api_key: String,
}

#[async_trait::async_trait]
impl ComputeProvider for MyImageProvider {
    fn provider_id(&self) -> &str {
        "my-image-provider"
    }

    async fn submit(&self, request: ComputeRequest) -> Result<JobHandle, BlazenError> {
        todo!()
    }

    async fn status(&self, job: &JobHandle) -> Result<JobStatus, BlazenError> {
        todo!()
    }

    async fn result(&self, job: JobHandle) -> Result<ComputeResult, BlazenError> {
        todo!()
    }

    async fn cancel(&self, job: &JobHandle) -> Result<(), BlazenError> {
        todo!()
    }
}

#[async_trait::async_trait]
impl ImageGeneration for MyImageProvider {
    async fn generate_image(
        &self,
        request: ImageRequest,
    ) -> Result<ImageResult, BlazenError> {
        // Convert ImageRequest to your provider's format and call the API
        todo!()
    }

    async fn upscale_image(
        &self,
        request: UpscaleRequest,
    ) -> Result<ImageResult, BlazenError> {
        todo!()
    }
}
```

---

## Built-in Providers

| Provider | Feature | Traits Implemented |
|----------|---------|-------------------|
| `OpenAiProvider` | `openai` | `CompletionModel`, `StructuredOutput` |
| `OpenAiCompatProvider` | `openai` | `CompletionModel`, `StructuredOutput`, `ModelRegistry` |
| `AnthropicProvider` | `anthropic` | `CompletionModel`, `StructuredOutput` |
| `GeminiProvider` | `gemini` | `CompletionModel`, `StructuredOutput`, `ModelRegistry` |
| `AzureOpenAiProvider` | `azure` | `CompletionModel`, `StructuredOutput` |
| `FalProvider` | `fal` | `CompletionModel`, `StructuredOutput`, `ComputeProvider`, `ImageGeneration`, `VideoGeneration`, `AudioGeneration`, `Transcription` |

### `OpenAiCompatProvider` Presets

`OpenAiCompatProvider` works with any OpenAI-compatible endpoint.
Named constructors are provided for popular services: ```rust use blazen_llm::providers::openai_compat::OpenAiCompatProvider; let groq = OpenAiCompatProvider::groq("gsk-..."); let openrouter = OpenAiCompatProvider::openrouter("sk-or-..."); let together = OpenAiCompatProvider::together("..."); let mistral = OpenAiCompatProvider::mistral("..."); let deepseek = OpenAiCompatProvider::deepseek("..."); let fireworks = OpenAiCompatProvider::fireworks("..."); let perplexity = OpenAiCompatProvider::perplexity("..."); let xai = OpenAiCompatProvider::xai("..."); let cohere = OpenAiCompatProvider::cohere("..."); let bedrock = OpenAiCompatProvider::bedrock("...", "us-east-1"); ``` --- ## Telemetry Exporters Re-exported from `blazen_telemetry` at the crate root and gated by the corresponding Cargo features (see [Feature Flags](#feature-flags)). All exporters return a `tracing_subscriber::Layer` (or install one globally) that is composed into a `tracing_subscriber::registry()`. ### `LangfuseConfig` Builder-style configuration for the Langfuse exporter. Shipping `langfuse` pulls in `reqwest` for the native ingestion client; on wasm32 the dispatcher is a no-op (events are dropped) because Langfuse export is a native-target feature. | Method | Signature | Description | |--------|-----------|-------------| | `new` | `fn new(public_key: impl Into, secret_key: impl Into) -> Self` | Construct with the required Langfuse public + secret keys. Defaults host to `https://cloud.langfuse.com`, batch size to `100`, flush interval to `5000` ms | | `with_host` | `fn with_host(self, host: impl Into) -> Self` | Override the Langfuse host URL (e.g. `https://eu.langfuse.com`) | | `with_batch_size` | `fn with_batch_size(self, batch_size: usize) -> Self` | Maximum number of envelopes buffered before an automatic flush | | `with_flush_interval_ms` | `fn with_flush_interval_ms(self, ms: u64) -> Self` | Background flush cadence in milliseconds | `LangfuseConfig` derives `Debug`, `Clone`, `Serialize`, `Deserialize`, and `Default` so it can be loaded from configuration files. Public/secret-key fields are required; host, `batch_size`, and `flush_interval_ms` are populated with defaults during deserialization. ### `LangfuseLayer` A `tracing_subscriber::Layer` (where `S: Subscriber + for<'a> LookupSpan<'a>`) that maps Blazen spans to Langfuse ingestion events: | Blazen span name | Langfuse concept | Ingestion event | |------------------|------------------|-----------------| | `workflow.run`, `pipeline.run` | Trace | `trace-create` | | `workflow.step`, `pipeline.stage`, `pipeline.stage.sequential`, `pipeline.stage.parallel` | Span | `span-create` | | `llm.complete`, `llm.stream` | Generation | `generation-create` | Construct via [`init_langfuse`](#init_langfuse). The layer owns an unbounded `mpsc` sender into a background dispatcher that batches and POSTs to `{host}/api/public/ingestion` with HTTP basic auth (`public_key:secret_key`). ### `init_langfuse` ```rust pub fn init_langfuse(config: LangfuseConfig) -> Result; ``` Builds the HTTP client and spawns the background dispatcher on the current Tokio runtime, then returns a `LangfuseLayer` ready to compose into a subscriber registry. Returns `TelemetryError::Langfuse` if no Tokio runtime is available (native targets) or if the underlying `reqwest::Client` cannot be built. 
```rust use blazen_telemetry::{LangfuseConfig, init_langfuse}; use tracing_subscriber::prelude::*; #[tokio::main] async fn main() -> Result<(), Box> { let config = LangfuseConfig::new("pk-lf-...", "sk-lf-...") .with_host("https://cloud.langfuse.com") .with_batch_size(50) .with_flush_interval_ms(2_500); let layer = init_langfuse(config)?; tracing_subscriber::registry().with(layer).init(); Ok(()) } ``` ### `OtlpConfig` Shared configuration for both OTLP transports. | Field | Type | Description | |-------|------|-------------| | `endpoint` | `String` | OTLP endpoint URL. For `otlp` (gRPC): `http://localhost:4317`. For `otlp-http`: `http://localhost:4318/v1/traces` | | `service_name` | `String` | Reported as the `service.name` resource attribute on every span | Derives `Debug`, `Clone`, `Serialize`, `Deserialize`. ### `init_otlp` (gRPC, `otlp` feature, native only) ```rust pub fn init_otlp(config: OtlpConfig) -> Result<(), Box>; ``` Builds a tonic-backed `SpanExporter`, installs it into a global `SdkTracerProvider`, and registers a `tracing_subscriber::registry()` with `EnvFilter`, the OTel layer, and a `fmt` layer in one call. Native targets only — `opentelemetry-otlp/grpc-tonic` does not compile to `wasm32-unknown-unknown`. ### `init_otlp_http` (HTTP/protobuf, `otlp-http` feature) ```rust pub fn init_otlp_http(config: OtlpConfig) -> Result<(), Box>; ``` Same shape as [`init_otlp`](#init_otlp-grpc-otlp-feature-native-only) but speaks OTLP/HTTP with the binary protobuf encoding. Works on **both** native and wasm32: - **Native**: registers an internal `ReqwestHttpClient` (a thin `reqwest::Client` wrapper) via `with_http_client`. - **wasm32**: registers `WasmFetchHttpClient`, a `web_sys::fetch`-backed `opentelemetry_http::HttpClient` impl that ships in the `blazen_telemetry::exporters::wasm_otlp_client` module. The reason for the indirection: `opentelemetry-otlp`'s built-in `reqwest-client` / `reqwest-blocking-client` features compile a wasm32 `reqwest::Client` whose `send` future is `!Send`, which violates the `HttpClient: Send + Sync` bound and breaks the build. Pinning our own `HttpClient` impls per target keeps the trait satisfied without enabling those upstream features. ```rust use blazen_telemetry::{OtlpConfig, init_otlp_http}; #[tokio::main] async fn main() -> Result<(), Box> { init_otlp_http(OtlpConfig { endpoint: "http://localhost:4318/v1/traces".to_string(), service_name: "my-blazen-app".to_string(), })?; // ... run workflow; spans flush via OTLP/HTTP ... Ok(()) } ``` ### `TelemetryError` Returned by `init_langfuse`. Variants include `Langfuse(String)` (HTTP-client construction or runtime-lookup failure). Re-exported from `blazen_telemetry::TelemetryError`. --- ## WASM Embeddings (`blazen-embed-tract`) The `blazen-embed-tract` crate ships **two** ONNX-runtime-backed embedding providers built on `tract`: | Type | Target | Source of weights | |------|--------|-------------------| | `TractEmbedModel` | native | `hf-hub` (Hugging Face download cache, filesystem-backed) | | `WasmTractEmbedModel` | `wasm32` only | `web_sys::fetch` (URLs supplied by the caller) | The wasm variant exists because `hf-hub` requires a filesystem and Tokio, neither of which is available in `wasm32-unknown-unknown`. Both providers share the same [`TractOptions`](https://docs.rs/blazen-embed-tract) and pooling logic so callers generic over `EmbeddingModel` work unchanged. 
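The wasm provider's `embed` (documented below) returns one L2-normalized vector per input, so cosine similarity between two embeddings reduces to a plain dot product. A small standalone helper in ordinary Rust -- no Blazen types involved:

```rust
/// Cosine similarity for unit-length (L2-normalized) embeddings: just the dot product.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Index of the document embedding most similar to the query embedding.
fn best_match(query: &[f32], docs: &[Vec<f32>]) -> Option<usize> {
    docs.iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| {
            cosine_similarity(query, a)
                .partial_cmp(&cosine_similarity(query, b))
                .unwrap_or(std::cmp::Ordering::Equal)
        })
        .map(|(i, _)| i)
}
```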
### `WasmTractEmbedModel` ```rust #[cfg(target_arch = "wasm32")] use blazen_embed_tract::wasm_provider::{WasmTractEmbedModel, WasmTractError, WasmTractResponse}; use blazen_embed_tract::options::TractOptions; ``` | Method | Signature | Description | |--------|-----------|-------------| | `create` | `async fn create(model_url: &str, tokenizer_url: &str, options: TractOptions) -> Result` | Fetch ONNX weights and a HuggingFace `tokenizer.json` from the supplied URLs and build a runnable model. `options.cache_dir` and `options.show_download_progress` are ignored on wasm32 | | `embed` | `async fn embed(&self, texts: &[String]) -> Result` | Returns one L2-normalized vector per input. Inference runs synchronously on the wasm main thread | | `model_id` | `fn model_id(&self) -> &str` | Hugging Face model id resolved from the `TractOptions::model_name` registry entry | | `dimensions` | `fn dimensions(&self) -> usize` | Output embedding dimensionality | `WasmTractResponse` exposes `embeddings: Vec>` and `model: String`. `WasmTractError` variants: `UnknownModel(String)`, `Fetch { url, message }`, `Init(String)`, `Embed(String)`. ```rust #[cfg(target_arch = "wasm32")] { use blazen_embed_tract::options::TractOptions; use blazen_embed_tract::wasm_provider::WasmTractEmbedModel; let opts = TractOptions { model_name: Some("bge-small-en-v1.5".to_string()), ..Default::default() }; let model = WasmTractEmbedModel::create( "https://huggingface.co/Xenova/bge-small-en-v1.5/resolve/main/onnx/model.onnx", "https://huggingface.co/Xenova/bge-small-en-v1.5/resolve/main/tokenizer.json", opts, ).await?; let out = model.embed(&["hello world".into()]).await?; } ``` The fetch loop resolves either `window.fetch` (browser) or `globalThis.fetch` (Workers, Deno, Node) so the same provider runs in every wasm host. `Send + Sync` are implemented vacuously (`unsafe impl`) because `wasm32-unknown-unknown` is single-threaded; this lets `WasmTractEmbedModel` sit behind `Arc` in target-generic code. --- ## Local Inference Backend Re-exports When you enable a local-inference feature on `blazen-llm`, the per-backend crate's public types are re-exported at the `blazen_llm` crate root so callers do not need to depend on the backend crate directly. Each group of types is gated by its respective feature. ### `mistralrs` feature Re-exported **un-prefixed** (the mistralrs backend was the first local provider added and owns the canonical names): ```rust #[cfg(feature = "mistralrs")] use blazen_llm::{ MistralRsProvider, MistralRsOptions, MistralRsError, ChatMessageInput, ChatRole, InferenceChunk, InferenceChunkStream, InferenceImage, InferenceImageSource, InferenceResult, InferenceToolCall, InferenceUsage, }; ``` ### `llamacpp` feature Re-exported with a `LlamaCpp` prefix to avoid colliding with the mistralrs names when both features are enabled simultaneously: ```rust #[cfg(feature = "llamacpp")] use blazen_llm::{ LlamaCppProvider, LlamaCppOptions, LlamaCppError, LlamaCppChatMessageInput, LlamaCppChatRole, LlamaCppInferenceChunk, LlamaCppInferenceChunkStream, LlamaCppInferenceResult, LlamaCppInferenceUsage, }; ``` ### `candle-llm` feature Candle exposes its result type as `CandleInferenceResult` (already prefixed upstream). 
The `CandleLlmCompletionModel` adapter wraps the raw provider in the `CompletionModel` trait: ```rust #[cfg(feature = "candle-llm")] use blazen_llm::{ CandleLlmProvider, CandleLlmCompletionModel, CandleLlmOptions, CandleLlmError, CandleInferenceResult, }; ``` ### Other local backends | Feature | Re-exports | |---------|-----------| | `candle-embed` | `CandleEmbedModel`, `CandleEmbedOptions`, `CandleEmbedError` | | `embed` | `EmbedModel`, `EmbedOptions`, `EmbedResponse`, `EmbedError` (from `blazen-embed`) | | `whispercpp` | `WhisperCppProvider`, `WhisperModel`, `WhisperOptions`, `WhisperError` | | `piper` | `PiperProvider`, `PiperOptions`, `PiperError` | | `diffusion` | `DiffusionProvider`, `DiffusionOptions`, `DiffusionScheduler`, `DiffusionError` | All five additional backends follow the same convention — enable the feature on `blazen-llm`, then import directly from `blazen_llm::*`. --- # Python API Reference Source: https://blazen.dev/docs/api/python Language: python Section: api ## Event The preferred way to define events is by subclassing `Event`. The `event_type` is automatically set to the class name. ```python class AnalyzeEvent(Event): text: str score: float ev = AnalyzeEvent(text="hello", score=0.9) ev.event_type # "AnalyzeEvent" ev.text # "hello" ``` You can also construct events inline without a subclass: ```python Event(event_type: str, **kwargs) ``` ```python ev = Event("AnalyzeEvent", text="hello", score=0.9) ``` | Member | Type | Description | |---|---|---| | `.event_type` | `str` | The event type string. Auto-set to the class name for subclasses. | | `.to_dict()` | `-> dict` | Serialize the event data to a plain dictionary. | | `.field_name` | `Any` | Attribute access for any keyword argument supplied at construction. | --- ## StartEvent ```python StartEvent(**kwargs) ``` Built-in event whose `event_type` is `"blazen::StartEvent"`. All keyword arguments are available as attributes. --- ## StopEvent ```python StopEvent(result=dict) ``` Built-in event whose `event_type` is `"blazen::StopEvent"`. | Member | Type | Description | |---|---|---| | `.result` | `Any` | The value passed via the `result` keyword argument. | `StopEvent(result=x)` preserves `is`-identity for non-JSON values. Pass class instances, Pydantic models, DB connections, or lambdas as the `result` and `await handler.result()` returns an event whose `.result` attribute is the *same* Python object — non-JSON values are routed through a per-`Context` session-ref registry and `__getattr__` on the returned event resolves the marker transparently. --- ## step decorator The `@step` decorator reads the type hint of the `ev` parameter to automatically determine which events the step accepts. ```python class AnalyzeEvent(Event): text: str @step async def analyze(ctx: Context, ev: AnalyzeEvent) -> Event | None: ... # Equivalent to @step(accepts=["AnalyzeEvent"]) ``` When the annotation is the base `Event` class or absent, the step defaults to accepting `StartEvent`: ```python @step async def start(ctx: Context, ev: Event) -> Event | None: ... # Equivalent to @step(accepts=["blazen::StartEvent"]) ``` Explicit overrides still work: | Variant | Description | |---|---| | `@step` | Infers `accepts` from the `ev` type hint. Defaults to `StartEvent` when the hint is `Event` or missing. | | `@step(accepts=["EventType"])` | Explicitly sets accepted event types, overriding type-hint inference. | | `@step(emits=["EventType"])` | Declares the event types this step may produce. 
| | `@step(max_concurrency=N)` | Limits how many instances of this step may run concurrently. `0` means unlimited. | **Step signature** ```python async def name(ctx: Context, ev: MyEvent) -> Event | list[Event] | None ``` Return an `Event` to emit it, a `list[Event]` to emit several, or `None` to emit nothing. Steps can be sync or async. --- ## Workflow ```python Workflow(name: str, steps: list, timeout: float = None) ``` Create a workflow from a name and an ordered list of steps. The optional `timeout` is in seconds. | Method | Signature | Description | |---|---|---| | `run` | `await wf.run(**kwargs) -> WorkflowHandler` | Start the workflow. Keyword arguments become fields on the initial `StartEvent`. | --- ## WorkflowHandler Returned by `Workflow.run()`. Provides control over a running workflow instance. | Method | Signature | Description | |---|---|---| | `result` | `await handler.result() -> Event` | Block until the workflow emits a `StopEvent` and return it. | | `stream_events` | `handler.stream_events() -> AsyncIterator[Event]` | Async iterator yielding events written to the stream. | ```python handler = await wf.run(prompt="Hello") # Stream intermediate events while waiting for the result async for event in handler.stream_events(): print(event.event_type, event.to_dict()) result = await handler.result() ``` --- ## Context Available as the first parameter of every step function. All methods are synchronous. Two explicit namespaces are exposed alongside the smart-routing shortcuts on `ctx` itself: | Field | Type | Description | |---|---|---| | `state` | `StateNamespace` | Persistable workflow state. Routes through the same 4-tier dispatch as `ctx.set` (bytes / JSON / pickle / live-ref). Survives `pause()` / `resume()` and checkpoint stores. | | `session` | `SessionNamespace` | Live in-process references. Identity is preserved within a single workflow run. Values are deliberately excluded from snapshots. | | Method | Signature | Description | |---|---|---| | `set` | `ctx.set(key: str, value: StateValue) -> None` | Store any Python value via 4-tier dispatch: `bytes`/`bytearray` are stored as raw binary; JSON-serializable types (`dict`, `list`, `str`, `int`, `float`, `bool`, `None`) are stored as JSON; picklable objects (Pydantic models, dataclasses, custom classes) are pickled automatically; unpicklable objects (DB connections, file handles, lambdas) are stored as a live in-process reference (same-process only, excluded from snapshots). | | `get` | `ctx.get(key: str) -> StateValue \| None` | Retrieve a value by key, or `None` if absent. Returns the original type transparently: JSON values come back as their Python type, bytes come back as `bytes`, and both pickled and live-reference values round-trip to the original Python object. | | `set_bytes` | `ctx.set_bytes(key: str, data: bytes) -> None` | Convenience alias for storing raw binary data. Equivalent to `ctx.set(key, data)` when `data` is `bytes`. | | `get_bytes` | `ctx.get_bytes(key: str) -> bytes \| None` | Convenience alias for retrieving raw binary data, or `None` if absent. | | `run_id` | `ctx.run_id() -> str` | Return the UUID of the current workflow run. | | `send_event` | `ctx.send_event(event: Event) -> None` | Route an event to matching steps manually. | | `write_event_to_stream` | `ctx.write_event_to_stream(event: Event) -> None` | Publish an event to the stream visible via `WorkflowHandler.stream_events()`. | `StateValue = Any` — a type alias defined in the `.pyi` stubs indicating that any Python value is accepted. 
The first three storage tiers (bytes / JSON / pickle) persist through pause/resume/checkpoint; the fourth tier (live in-process reference) is same-process only. --- ## StateNamespace Namespace for persistable workflow state, accessed via `ctx.state`. Routes values through the same 4-tier dispatch as `Context.set`: bytes → JSON → pickle → live-object. The first three tiers survive `pause()` / `resume()` and checkpoint stores; the fourth is in-process only. | Method | Signature | Description | |---|---|---| | `set` | `ctx.state.set(key: str, value: StateValue) -> None` | Store a value under `key` using 4-tier dispatch. | | `get` | `ctx.state.get(key: str) -> StateValue \| None` | Retrieve the value under `key`, deserialized to its original Python type, or `None` if absent. | | `set_bytes` | `ctx.state.set_bytes(key: str, data: bytes) -> None` | Store raw binary data under `key`. | | `get_bytes` | `ctx.state.get_bytes(key: str) -> bytes \| None` | Retrieve raw binary data under `key`, or `None` if absent. | **Dict protocol** Also supports `__setitem__`, `__getitem__`, and `__contains__` so it can be used like a `dict`: ```python ctx.state.set("counter", 5) ctx.state["counter"] = 5 # equivalent count = ctx.state.get("counter") print("counter" in ctx.state) # True ``` --- ## SessionNamespace Namespace for live in-process references, accessed via `ctx.session`. Identity is preserved within a single workflow run; values are deliberately excluded from snapshots, so they are the right place to stash database connections, open file handles, ML model objects, or any other resource that cannot (or should not) be serialized. | Method | Signature | Description | |---|---|---| | `set` | `ctx.session.set(key: str, value: Any) -> None` | Store a live reference to `value` under `key`. | | `get` | `ctx.session.get(key: str) -> Any \| None` | Retrieve the live reference under `key`, or `None` if absent. The returned object is the *same* Python object that was stored. | | `has` | `ctx.session.has(key: str) -> bool` | Return whether `key` is currently set. | | `remove` | `ctx.session.remove(key: str) -> None` | Drop the entry under `key`, if any. | **Dict protocol** Also supports `__setitem__`, `__getitem__`, and `__contains__`: ```python import sqlite3 conn = sqlite3.connect(":memory:") ctx.session.set("db", conn) assert ctx.session.get("db") is conn # same object, always ``` **Pause/resume behavior** Session values are not serialized into workflow snapshots. What happens to them at pause time is governed by the workflow's `session_pause_policy` (default `"pickle_or_error"`). The policy exists at the workflow level — see the workflow configuration for details. --- ## BlazenState Base class for typed workflow state with per-field context storage. Subclass with `@dataclass` to get automatic per-field serialization where each field is stored individually using its optimal tier. ```python from dataclasses import dataclass from blazen import BlazenState @dataclass class MyState(BlazenState): input_path: str = "" conn: sqlite3.Connection | None = None class Meta: transient = {"conn"} store_by = {} def restore(self): if self.input_path: self.conn = sqlite3.connect(self.input_path) ``` Store and retrieve via context: ```python ctx.set("state", my_state) # Stores each field individually state = ctx.get("state") # Reconstructs object, then calls restore() ``` ### Meta inner class | Attribute | Type | Description | |---|---|---| | `transient` | `ClassVar[set[str]]` | Field names excluded from serialization. 
These fields are set to `None` in snapshots and recreated by `restore()`. | | `store_by` | `ClassVar[dict[str, FieldStore]]` | Custom persistence strategy per field. Fields not listed use the default automatic tier (JSON / bytes / pickle). | ### Methods | Method | Signature | Description | |---|---|---| | `restore` | `def restore(self) -> None` | Override to recreate transient fields after deserialization. Called automatically by `ctx.get()` once all serializable fields are populated. Transient fields are `None` at call time. | --- ## FieldStore **Structural protocol** for custom per-field persistence. There is no `FieldStore` class to import — any object whose shape matches the two methods below can be used as a value in `BlazenState.Meta.store_by` to route specific fields through custom storage (e.g., S3, a database, Redis). Implement the shape directly with your own class, or use [`CallbackFieldStore`](#callbackfieldstore) below for the common case. | Method | Signature | Description | |---|---|---| | `save` | `def save(self, key: str, value: Any, ctx: Context) -> None` | Persist the field value under the given key. | | `load` | `def load(self, key: str, ctx: Context) -> Any` | Load and return the field value for the given key. | --- ## CallbackFieldStore Convenience implementation of the `FieldStore` structural protocol that delegates to plain callables. Importable directly from `blazen`. ```python from blazen import CallbackFieldStore CallbackFieldStore( save_fn: Callable[[str, Any], None], load_fn: Callable[[str], Any], ) ``` | Parameter | Type | Description | |---|---|---| | `save_fn` | `Callable[[str, Any], None]` | Called with `(key, value)` to persist a field. | | `load_fn` | `Callable[[str], Any]` | Called with `(key)` to load a field value. | Also exposes `save(key, value, ctx)` and `load(key, ctx)` methods matching the `FieldStore` protocol — the `ctx` argument is accepted but not forwarded to your callbacks. ```python from blazen import CallbackFieldStore store = CallbackFieldStore( save_fn=lambda k, v: s3.put_object(Bucket="b", Key=k, Body=v), load_fn=lambda k: s3.get_object(Bucket="b", Key=k)["Body"].read(), ) ``` > This is a tiny ~15-line wrapper. If your store needs richer behavior (the `ctx` argument, async I/O, batching), implement the structural protocol directly with your own class. --- ## CompletionModel Use static constructor methods to create a model for a specific provider, then call `complete()` or `stream()` to generate responses. API keys are resolved from `ProviderOptions(api_key=...)` when passed explicitly, or from the provider's standard environment variable (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`) when `options` is omitted. 
```python from blazen import CompletionModel, ProviderOptions # Read key from OPENAI_API_KEY env var model = CompletionModel.openai() # Pass an explicit key model = CompletionModel.anthropic(options=ProviderOptions(api_key="sk-ant-...")) # Override the default model model = CompletionModel.openrouter( options=ProviderOptions(api_key="sk-or-...", model="meta-llama/llama-3-70b") ) ``` **Provider constructors** | Constructor | Signature | |---|---| | `openai` | `CompletionModel.openai(*, options: ProviderOptions = None)` | | `anthropic` | `CompletionModel.anthropic(*, options: ProviderOptions = None)` | | `gemini` | `CompletionModel.gemini(*, options: ProviderOptions = None)` | | `azure` | `CompletionModel.azure(*, options: AzureOptions)` | | `openrouter` | `CompletionModel.openrouter(*, options: ProviderOptions = None)` | | `groq` | `CompletionModel.groq(*, options: ProviderOptions = None)` | | `together` | `CompletionModel.together(*, options: ProviderOptions = None)` | | `mistral` | `CompletionModel.mistral(*, options: ProviderOptions = None)` | | `deepseek` | `CompletionModel.deepseek(*, options: ProviderOptions = None)` | | `fireworks` | `CompletionModel.fireworks(*, options: ProviderOptions = None)` | | `perplexity` | `CompletionModel.perplexity(*, options: ProviderOptions = None)` | | `xai` | `CompletionModel.xai(*, options: ProviderOptions = None)` | | `cohere` | `CompletionModel.cohere(*, options: ProviderOptions = None)` | | `bedrock` | `CompletionModel.bedrock(*, options: BedrockOptions)` | | `fal` | `CompletionModel.fal(*, options: FalOptions = None)` | **Properties** | Property | Type | Description | |---|---|---| | `.model_id` | `str` | The string identifier of the active model. | **complete()** ```python response: CompletionResponse = await model.complete( messages: list[ChatMessage], options: CompletionOptions = None, ) ``` Returns a typed `CompletionResponse` (see below). Also supports dict-style access for backwards compatibility: `response["content"]`. ```python opts = CompletionOptions(temperature=0.7, max_tokens=1024) response = await model.complete(messages, opts) ``` **stream()** ```python await model.stream( messages: list[ChatMessage], on_chunk: Callable[[dict], Any], options: CompletionOptions = None, ) ``` Streams a chat completion, calling `on_chunk` for each chunk received. Each chunk is a dict with the following keys: | Key | Type | Description | |---|---|---| | `delta` | `str \| None` | The incremental text content for this chunk. | | `finish_reason` | `str \| None` | Set on the final chunk (e.g. `"stop"`, `"tool_calls"`). | | `tool_calls` | `list[dict]` | Tool call fragments, if any. | ```python def handle(chunk): if chunk["delta"]: print(chunk["delta"], end="") await model.stream([ChatMessage.user("Tell me a story")], handle) ``` **CompletionOptions** Options object for `complete()` and `stream()`. All fields are optional keyword arguments. ```python opts = CompletionOptions( temperature=0.7, max_tokens=1024, top_p=0.9, model="gpt-4o", tools=[{"name": "search", "description": "...", "parameters": {...}}], response_format={"type": "json_schema", "json_schema": {...}}, ) ``` | Field | Type | Description | |---|---|---| | `temperature` | `float \| None` | Sampling temperature (0.0-2.0). | | `max_tokens` | `int \| None` | Maximum tokens to generate. | | `top_p` | `float \| None` | Nucleus sampling parameter (0.0-1.0). | | `model` | `str \| None` | Model override for this request. | | `tools` | `Any \| None` | Tool definitions for function calling. 
| | `response_format` | `dict \| None` | JSON schema dict for structured output. | **Middleware decorators** Each decorator returns a new `CompletionModel` wrapping the original with additional behaviour. | Method | Signature | Description | |---|---|---| | `with_retry` | `.with_retry(config: RetryConfig \| None = None)` | Automatic retry with exponential backoff on transient failures. | | `with_cache` | `.with_cache(config: CacheConfig \| None = None)` | In-memory response cache for identical non-streaming requests. | | `with_fallback` | `CompletionModel.with_fallback(models: list[CompletionModel])` | Static method. Tries providers in order; falls back on transient errors. | `RetryConfig` and `CacheConfig` are typed configuration objects: ```python RetryConfig(*, max_retries=3, initial_delay_ms=1000, max_delay_ms=30000, honor_retry_after=True, jitter=True) CacheConfig(*, strategy: CacheStrategy | None = None, ttl_seconds=300, max_entries=1000) ``` ```python # Chain decorators with typed config objects model = ( CompletionModel.openai() .with_cache(CacheConfig(ttl_seconds=600)) .with_retry(RetryConfig(max_retries=5)) ) # Or use defaults (both config args are optional) model = CompletionModel.openai().with_cache().with_retry() # Fallback across providers primary = CompletionModel.openai() backup = CompletionModel.anthropic() model = CompletionModel.with_fallback([primary, backup]) ``` --- ## CompletionResponse Returned by `model.complete()`. Supports both attribute access and dict-style access. | Property | Type | Description | |---|---|---| | `.content` | `str \| None` | The generated text. | | `.model` | `str` | Model name used for the completion. | | `.finish_reason` | `str \| None` | Why generation stopped (`"stop"`, `"tool_calls"`, etc.). | | `.tool_calls` | `list[ToolCall]` | Tool calls requested by the model. | | `.usage` | `TokenUsage \| None` | Token usage statistics. | | `.cost` | `float \| None` | Estimated cost in USD for this request. | | `.timing` | `RequestTiming \| None` | Timing metadata for the request. | | `.images` | `list[dict]` | Image outputs (provider-dependent). | | `.audio` | `list[dict]` | Audio outputs (provider-dependent). | | `.videos` | `list[dict]` | Video outputs (provider-dependent). | ```python response = await model.complete([ChatMessage.user("Hello")]) print(response.content) # attribute access print(response["content"]) # dict-style access (backwards compatible) print(response.cost) # e.g. 0.0023 print(response.timing) # RequestTiming or None print(response.keys()) # list of available keys ``` --- ## RequestTiming Timing metadata attached to a `CompletionResponse`. All fields are optional since not every provider reports timing data. | Property | Type | Description | |---|---|---| | `.queue_ms` | `int \| None` | Time spent waiting in the provider's queue. | | `.execution_ms` | `int \| None` | Time spent executing the request. | | `.total_ms` | `int \| None` | Total round-trip time. | ```python response = await model.complete([ChatMessage.user("Hello")]) if response.timing: print(f"Total: {response.timing.total_ms}ms") print(f"Queue: {response.timing.queue_ms}ms") print(f"Execution: {response.timing.execution_ms}ms") ``` --- ## ChatMessage A single message in a chat conversation. ```python msg = ChatMessage(role="user", content="Hello, world!") # role is optional, defaults to "user" msg = ChatMessage(content="Hello!") ``` **Static constructors** | Method | Description | |---|---| | `ChatMessage.system(content: str)` | Create a system message. 
| | `ChatMessage.user(content: str)` | Create a user message. | | `ChatMessage.assistant(content: str)` | Create an assistant message. | | `ChatMessage.tool(content: str)` | Create a tool result message. | | `ChatMessage.user_image_url(*, text, url, media_type=None)` | Create a user message with text and an image URL. | | `ChatMessage.user_image_base64(*, text, data, media_type)` | Create a user message with text and a base64 image. | | `ChatMessage.user_parts(*, parts: list[ContentPart])` | Create a user message with multiple content parts. | **Properties** | Property | Type | Description | |---|---|---| | `.role` | `str` | One of `"system"`, `"user"`, `"assistant"`, `"tool"`. | | `.content` | `str \| None` | The message text. | | `.tool_call_id` | `str \| None` | For `role="tool"` messages, the call ID this message responds to. `None` otherwise. | | `.name` | `str \| None` | For `role="tool"` messages, the tool/function name (set by some providers). `None` otherwise. | | `.tool_result` | `ToolOutput \| None` | The structured tool-result payload. `None` for non-tool messages or when the tool returned a plain string (in which case the string lives in `.content` instead). See [`ToolOutput`](#tooloutput). | --- ## Role Constants for message roles. ```python from blazen import Role Role.SYSTEM # "system" Role.USER # "user" Role.ASSISTANT # "assistant" Role.TOOL # "tool" ``` --- ## ContentPart Build multimodal content parts for use with `ChatMessage.user_parts()`. | Factory Method | Description | |---|---| | `ContentPart.text(*, text=...)` | Create a text content part. | | `ContentPart.image_url(*, url=..., media_type=...)` | Create an image URL content part. | | `ContentPart.image_base64(*, data=..., media_type=...)` | Create a base64 image content part. | ```python msg = ChatMessage.user_parts(parts=[ ContentPart.text(text="What's in this image?"), ContentPart.image_url(url="https://example.com/photo.jpg", media_type=MediaType.JPEG), ]) ``` --- ## ToolCall A tool invocation requested by the model. | Property | Type | Description | |---|---|---| | `.id` | `str` | Unique identifier for the tool call. | | `.name` | `str` | Name of the tool to invoke. | | `.arguments` | `dict[str, Any]` | Parsed arguments for the tool call. | Supports dict-style access: `tool_call["name"]`. --- ## ToolOutput Two-channel return value from a tool handler. Tool results have two distinct audiences. The caller (your Python code) wants the full structured data; the LLM, on the next turn, may need a different shape — sometimes shorter, sometimes provider-specific. `ToolOutput` carries both channels: `data` is what the caller sees, `llm_override` is what the LLM sees. ```python import blazen out = blazen.ToolOutput(data={"items": [1, 2, 3]}) out.data # {"items": [1, 2, 3]} out.llm_override # None # Explicit override: the caller still gets the full dict, but the LLM # sees a short summary on its next turn. out2 = blazen.ToolOutput( data={"items": [1, 2, 3], "_debug": "..."}, llm_override=blazen.LlmPayload.text("Found 3 items."), ) ``` **Constructor** | Argument | Type | Description | |---|---|---| | `data` | `Any` | The structured value the caller sees programmatically. Dict, list, scalar, or string — anything JSON-serializable. | | `llm_override` | `LlmPayload \| None` | Optional override for what the LLM sees on the next turn. `None` means each provider applies its default conversion from `data` (see [Per-provider behavior](#per-provider-behavior)). 
|

**Properties**

| Property | Type | Description |
|---|---|---|
| `.data` | `Any` | The user-visible structured payload. Re-materialized as a Python value on each access. |
| `.llm_override` | `LlmPayload \| None` | The LLM-side override, if set. |

A tool handler can return a bare `dict`/`list`/`str`/scalar, in which case Blazen auto-wraps it as `ToolOutput(data=value, llm_override=None)`. Returning a `ToolOutput` instance directly is only required when you want to set `llm_override`.

---

## LlmPayload

Provider-aware override for what the LLM sees as a tool result. Constructed only via the classmethod factories — there is no public `__init__`.

```python
import blazen

blazen.LlmPayload.text("Found 3 results.")
blazen.LlmPayload.json({"items": [1, 2, 3]})
blazen.LlmPayload.provider_raw(
    provider="anthropic",
    value=[{"type": "text", "text": "..."}],
)
```

**Variants**

| Variant `kind` | Constructor | Behavior |
|---|---|---|
| `"text"` | `LlmPayload.text(text: str)` | Plain text. Works on every provider. |
| `"json"` | `LlmPayload.json(value: Any)` | Structured JSON. Anthropic and Gemini natively consume the structure; OpenAI-family stringifies once at the wire boundary. |
| `"provider_raw"` | `LlmPayload.provider_raw(*, provider: str, value: Any)` | Provider-specific escape hatch. Only the named provider sees `value`; every other provider falls back to converting `ToolOutput.data` with its default. `provider` is one of `"openai"`, `"openai_compat"`, `"azure"`, `"anthropic"`, `"gemini"`, `"responses"`, `"fal"`. |

**Properties**

| Property | Type | Description |
|---|---|---|
| `.kind` | `str` | The variant tag: `"text"`, `"json"`, `"parts"`, or `"provider_raw"`. |
| `.text_value` | `str \| None` | The text body for `Text` payloads. `None` for other variants. |
| `.value` | `Any \| None` | The structured value for `Json` and `ProviderRaw` payloads. `None` otherwise. |
| `.provider` | `str \| None` | The provider name for `ProviderRaw` payloads. `None` otherwise. |

### Per-provider behavior

When a tool returns structured `data` and no `llm_override`, each provider sends a sensible default to the LLM:

- **OpenAI / OpenAI-compat / Azure / Responses / Fal**: `data` is JSON-stringified into the `content` field of the tool message.
- **Anthropic**: structured `data` becomes `[{"type": "text", "text": <stringified data>}]` inside `tool_result.content`.
- **Gemini**: structured object `data` is passed natively as `functionResponse.response`. Scalar values (numbers, booleans, strings) are wrapped as `{"result": <value>}`.

Plain strings always pass through unchanged on every provider — a tool returning `"hello"` results in the LLM seeing `hello`, not `"hello"`.

When an `llm_override` is set:

- `LlmPayload.text(...)` is universally accepted.
- `LlmPayload.json(...)` is consumed natively by Anthropic and Gemini and stringified by the OpenAI family.
- `LlmPayload.provider_raw(provider="X", value=V)` only takes effect when the active provider matches `X`. For every other provider the dispatcher falls back to converting `ToolOutput.data` with the provider's default rule above.

---

## TokenUsage

Token usage statistics for a completion.

| Property | Type | Description |
|---|---|---|
| `.prompt_tokens` | `int` | Tokens in the prompt. |
| `.completion_tokens` | `int` | Tokens in the completion. |
| `.total_tokens` | `int` | Total tokens used. |

Supports dict-style access: `usage["total_tokens"]`.

---

## Agent System

The agent system provides an agentic tool-execution loop on top of `CompletionModel`.
Define tools with `ToolDef`, then call `run_agent` to let the model iteratively call tools until it produces a final answer. ### ToolDef Define a tool that the model can invoke during an agent run. ```python ToolDef( *, name: str, description: str, parameters: dict[str, Any], handler: Callable | AsyncCallable, ) ``` | Parameter | Type | Description | |---|---|---| | `name` | `str` | Unique tool name exposed to the model. | | `description` | `str` | Description the model uses to decide when to call this tool. | | `parameters` | `dict` | JSON Schema describing the tool's input parameters. | | `handler` | `Callable` | Function called when the model invokes the tool. Can be sync or async. Receives a `dict[str, Any]` of arguments and returns either a JSON-serializable value (auto-wrapped into `ToolOutput(data=value)`) or an explicit [`ToolOutput`](#tooloutput) when you want to override what the LLM sees. | A handler may return: 1. **A bare `dict`, `list`, scalar, or `str`** — Blazen wraps it as `ToolOutput(data=value, llm_override=None)` and each provider applies its default conversion. 2. **A `ToolOutput`** — full control. Set `llm_override` to send the LLM a different shape than the structured `data` your code reads back from `messages[-1].tool_result`. ```python import blazen from blazen import ToolDef # 1. Sync handler — return a bare dict, auto-wrapped. tool = ToolDef( name="search", description="Search the web for a query", parameters={ "type": "object", "properties": { "query": {"type": "string", "description": "Search query"} }, "required": ["query"], }, handler=lambda args: {"results": ["result1", "result2"]}, ) # 2. Async handler — return a bare dict, auto-wrapped. async def fetch_weather(args): data = await weather_api(args["city"]) return {"temperature": data.temp, "conditions": data.conditions} weather_tool = ToolDef( name="weather", description="Get current weather for a city", parameters={ "type": "object", "properties": { "city": {"type": "string"} }, "required": ["city"], }, handler=fetch_weather, ) # 3. Explicit ToolOutput — LLM sees a short summary, but the caller's # `messages[-1].tool_result.data` still has the full structured payload. def search_with_summary(args): return blazen.ToolOutput( data={"items": [1, 2, 3], "raw_response": "..."}, llm_override=blazen.LlmPayload.text("Found 3 items."), ) search_tool = ToolDef( name="search", description="Search for items.", parameters={ "type": "object", "properties": {"q": {"type": "string"}}, "required": ["q"], }, handler=search_with_summary, ) ``` After a tool runs, the result message in the conversation history exposes both channels: ```python result = await run_agent(model, messages, tools=[search_tool]) last = result.messages[-1] last.role # "tool" last.tool_call_id # the call ID this responded to last.tool_result # ToolOutput | None — None if the handler returned a plain string last.tool_result.data # full caller-visible payload last.tool_result.llm_override # the LlmPayload sent to the LLM, if any ``` ### run_agent Run an agentic tool-execution loop. The model is called repeatedly, executing any requested tool calls and feeding results back, until the model stops calling tools or `max_iterations` is reached. 
```python result: AgentResult = await run_agent( model: CompletionModel, messages: list[ChatMessage], *, tools: list[ToolDef], max_iterations: int = 10, system_prompt: str = None, temperature: float = None, max_tokens: int = None, add_finish_tool: bool = False, ) ``` | Parameter | Type | Default | Description | |---|---|---|---| | `model` | `CompletionModel` | required | The model to use for completions. | | `messages` | `list[ChatMessage]` | required | Initial conversation messages. | | `tools` | `list[ToolDef]` | required | Tools available to the model. | | `max_iterations` | `int` | `10` | Maximum number of tool-call rounds before stopping. | | `system_prompt` | `str \| None` | `None` | Optional system prompt prepended to messages. | | `temperature` | `float \| None` | `None` | Sampling temperature override. | | `max_tokens` | `int \| None` | `None` | Max tokens per completion call. | | `add_finish_tool` | `bool` | `False` | If `True`, adds a built-in "finish" tool the model can call to explicitly end the loop. | ```python model = CompletionModel.openai() messages = [ChatMessage.user("What's the weather in Paris and London?")] result = await run_agent(model, messages, tools=[weather_tool]) print(result.response.content) # Final answer print(result.iterations) # Number of tool-call rounds print(result.total_cost) # Accumulated cost across all iterations ``` ### AgentResult Returned by `run_agent`. | Property | Type | Description | |---|---|---| | `.response` | `CompletionResponse` | The final completion response from the model. | | `.messages` | `list[ChatMessage]` | The full conversation history including all tool calls and results. | | `.iterations` | `int` | Number of tool-call iterations executed. | | `.total_cost` | `float \| None` | Total cost in USD accumulated across all iterations. | --- ## MediaType Constants for common MIME types. Useful when constructing `ContentPart` or compute requests. 
```python from blazen import MediaType MediaType.PNG # "image/png" MediaType.MP4 # "video/mp4" MediaType.MP3 # "audio/mpeg" MediaType.GLB # "model/gltf-binary" ``` **Image types** | Constant | MIME Type | |---|---| | `MediaType.PNG` | `image/png` | | `MediaType.JPEG` | `image/jpeg` | | `MediaType.WEBP` | `image/webp` | | `MediaType.GIF` | `image/gif` | | `MediaType.SVG` | `image/svg+xml` | | `MediaType.BMP` | `image/bmp` | | `MediaType.TIFF` | `image/tiff` | | `MediaType.AVIF` | `image/avif` | **Video types** | Constant | MIME Type | |---|---| | `MediaType.MP4` | `video/mp4` | | `MediaType.WEBM` | `video/webm` | | `MediaType.MOV` | `video/quicktime` | **Audio types** | Constant | MIME Type | |---|---| | `MediaType.MP3` | `audio/mpeg` | | `MediaType.WAV` | `audio/wav` | | `MediaType.OGG` | `audio/ogg` | | `MediaType.FLAC` | `audio/flac` | | `MediaType.AAC` | `audio/aac` | | `MediaType.M4A` | `audio/mp4` | **3D model types** | Constant | MIME Type | |---|---| | `MediaType.GLB` | `model/gltf-binary` | | `MediaType.GLTF` | `model/gltf+json` | | `MediaType.OBJ` | `model/obj` | | `MediaType.USDZ` | `model/vnd.usdz+zip` | | `MediaType.FBX` | `model/fbx` | | `MediaType.STL` | `model/stl` | **Document types** | Constant | MIME Type | |---|---| | `MediaType.PDF` | `application/pdf` | ### MediaSource alias `MediaSource` is a module-level alias for `ImageSource`, exported from `blazen` so callers writing media-source-typed code can use the more descriptive name: ```python from blazen import MediaSource # alias of ImageSource src = MediaSource.from_path("./photo.png") ``` The two names refer to the same class -- `isinstance(src, ImageSource)` is `True`. --- ## Content Subsystem A pluggable registry for multimodal blobs (images, audio, video, documents, 3D, CAD) that lets tool handlers emit and accept opaque handles instead of inline bytes. Handles are resolved to a wire-renderable `MediaSource` only when the model actually needs the content. ### ContentKind Taxonomy enum of multimodal content kinds. Used by tool-input declarations and `ContentStore` routing. ```python from blazen import ContentKind ContentKind.Image ContentKind.from_str("three_d_model") ContentKind.from_mime("image/png") ContentKind.from_extension("glb") ``` **Members** | Member | Wire name (`name_str`) | |---|---| | `ContentKind.Image` | `"image"` | | `ContentKind.Audio` | `"audio"` | | `ContentKind.Video` | `"video"` | | `ContentKind.Document` | `"document"` | | `ContentKind.ThreeDModel` | `"three_d_model"` | | `ContentKind.Cad` | `"cad"` | | `ContentKind.Archive` | `"archive"` | | `ContentKind.Font` | `"font"` | | `ContentKind.Code` | `"code"` | | `ContentKind.Data` | `"data"` | | `ContentKind.Other` | `"other"` | **Methods** | Method | Description | |---|---| | `kind.name_str` | Property returning the canonical wire name (matches the JSON / serde tag). | | `ContentKind.from_str(value: str)` | Parse a kind from its canonical wire name. Unknown names raise `ValueError`. | | `ContentKind.from_mime(mime: str)` | Map a MIME type to a kind. Unknown MIME types resolve to `ContentKind.Other`. | | `ContentKind.from_extension(ext: str)` | Map a filename extension (no leading dot, case-insensitive) to a kind. Unknown extensions resolve to `ContentKind.Other`. | --- ### ContentHandle Stable, opaque reference to content registered with a `ContentStore`. Most callers obtain handles from `ContentStore.put` rather than constructing them directly. 
```python ContentHandle( id: str, kind: ContentKind, *, mime_type: str | None = None, byte_size: int | None = None, display_name: str | None = None, ) ``` **Properties** (all read-only) | Property | Type | Description | |---|---|---| | `.id` | `str` | Opaque, store-defined identifier. Treat as a black box. | | `.kind` | `ContentKind` | What kind of content this handle refers to. | | `.mime_type` | `str \| None` | MIME type if known. | | `.byte_size` | `int \| None` | Byte size if known. | | `.display_name` | `str \| None` | Human-readable display name (e.g. original filename) if known. | --- ### ContentStore A pluggable content registry that backs handle resolution. Construct via one of the static factories. All instance methods return awaitables. ```python from blazen import ContentStore, ContentKind store = ContentStore.in_memory() handle = await store.put(b"...", kind=ContentKind.Image, mime_type="image/png") source = await store.resolve(handle) # {"type": "base64", "data": "..."} ``` **Static factories** | Factory | Returns | |---|---| | `ContentStore.in_memory()` | `ContentStore` | | `ContentStore.local_file(path: str \| os.PathLike \| pathlib.Path)` | `ContentStore` | | `ContentStore.openai_files(api_key: str, *, base_url: str \| None = None)` | `ContentStore` | | `ContentStore.anthropic_files(api_key: str, *, base_url: str \| None = None)` | `ContentStore` | | `ContentStore.gemini_files(api_key: str, *, base_url: str \| None = None)` | `ContentStore` | | `ContentStore.fal_storage(api_key: str, *, base_url: str \| None = None)` | `ContentStore` | | `ContentStore.custom(*, put, resolve, fetch_bytes, fetch_stream=None, delete=None, name="custom")` | `ContentStore` | **Instance methods** (all `await`able) | Method | Description | |---|---| | `await store.put(body, *, kind=None, mime_type=None, display_name=None, byte_size=None)` | Persist content and return a freshly-issued `ContentHandle`. `body` may be `bytes`, a URL `str`, or a `pathlib.Path`. Keyword arguments are optional hints; the store may auto-detect kind / MIME from the bytes. | | `await store.resolve(handle)` | Resolve a handle to a wire-renderable media source. Returns a serialized `MediaSource` dict, e.g. `{"type": "url", "url": "..."}` or `{"type": "base64", "data": "..."}`. | | `await store.fetch_bytes(handle)` | Fetch raw bytes for a handle. Stores that hold only references (URL / provider-file / local-path) may raise `UnsupportedError`. | | `await store.metadata(handle)` | Cheap metadata lookup with no byte materialization. Returns `{"kind": ..., "mime_type": ..., "byte_size": ..., "display_name": ...}`. | | `await store.delete(handle)` | Optional cleanup hook. Default backends drop the entry; provider backends issue a delete call to the upstream API. | #### Subclassing `ContentStore` `ContentStore` is subclassable from Python. Override the methods your backend needs; the framework wraps your subclass in a Rust adapter (`PyHostContentStore`) that dispatches into your Python coroutines. ```python from blazen import ContentStore, ContentHandle, ContentKind class S3ContentStore(ContentStore): def __init__(self, bucket: str): super().__init__() self.bucket = bucket async def put(self, body, hint) -> ContentHandle: ... async def resolve(self, handle) -> dict: ... async def fetch_bytes(self, handle) -> bytes: ... # Optional overrides: async def fetch_stream(self, handle): ... async def delete(self, handle) -> None: ... ``` Subclasses MUST override `put`, `resolve`, `fetch_bytes`. 
The base-class default impls raise `NotImplementedError` so any missing override is an immediate clear error rather than silent infinite recursion via `super()`.

#### `ContentStore.custom(...)`

Callback-based factory. Direct Python mirror of Rust `CustomContentStore::builder`.

```python
ContentStore.custom(
    *,
    put: Callable[..., Awaitable[ContentHandle]],
    resolve: Callable[[ContentHandle], Awaitable[dict]],
    fetch_bytes: Callable[[ContentHandle], Awaitable[bytes]],
    fetch_stream: Callable[[ContentHandle], Awaitable[bytes]] | None = None,
    delete: Callable[[ContentHandle], Awaitable[None]] | None = None,
    name: str = "custom",
) -> ContentStore
```

`put`, `resolve`, `fetch_bytes` are required. `fetch_stream` and `delete` are optional.

The `body` argument to `put` arrives as a dict shaped like `{"type": "bytes", "data": [...]}` / `{"type": "url", "url": "..."}` / `{"type": "local_path", "path": "..."}` / `{"type": "provider_file", "provider": "openai", "id": "..."}` / `{"type": "stream", "stream": <stream object>, "size_hint": int | None}`. The `hint` is a dict with optional `mime_type` / `kind_hint` / `display_name` / `byte_size`. `put` must return a `ContentHandle`.

`resolve` returns a serialized `MediaSource` dict (e.g. `{"type": "url", "url": "..."}`). `fetch_bytes` returns raw `bytes`. `fetch_stream` may return either `bytes` (legacy, single-chunk) or an `AsyncIterator[bytes]` for true chunk-by-chunk streaming.

---

### Built-in stores

| Factory | Purpose | Notes |
|---|---|---|
| `ContentStore.in_memory()` | Ephemeral / test workloads. | Holds bytes in process memory; resolves to `base64`. |
| `ContentStore.local_file(path)` | Filesystem-backed persistence rooted at `path`. | Directory is created recursively if missing. Resolves to a local path / URL. |
| `ContentStore.openai_files(api_key, *, base_url=None)` | OpenAI Files API. | Resolves to a provider file reference; `fetch_bytes` may not be supported. |
| `ContentStore.anthropic_files(api_key, *, base_url=None)` | Anthropic Files API (beta). | Provider-file backed. |
| `ContentStore.gemini_files(api_key, *, base_url=None)` | Google Gemini Files API. | Provider-file backed. |
| `ContentStore.fal_storage(api_key, *, base_url=None)` | fal.ai object storage. | Resolves to URL references hosted on fal. |
| `ContentStore.custom(...)` | User-defined backend via async callables (see above). | Required: `put`, `resolve`, `fetch_bytes`. Optional: `fetch_stream`, `delete`, `name`. |

---

### Tool-input schema helpers

Top-level functions that return JSON-Schema-shaped `dict` fragments declaring a single required content-reference input. Each fragment carries an `x-blazen-content-ref` extension keyed to the appropriate `ContentKind`.

| Helper | Description |
|---|---|
| `image_input(name, description)` | Schema declaring a single required image input. |
| `audio_input(name, description)` | Schema declaring a single required audio input. |
| `video_input(name, description)` | Schema declaring a single required video input. |
| `file_input(name, description)` | Schema declaring a single required document / file input. |
| `three_d_input(name, description)` | Schema declaring a single required 3D-model input. |
| `cad_input(name, description)` | Schema declaring a single required CAD-file input.
| ```python from blazen import image_input schema = image_input("photo", "The photo to analyze") # { # "type": "object", # "properties": { # "photo": { # "type": "string", # "description": "The photo to analyze", # "x-blazen-content-ref": {"kind": "image"} # } # }, # "required": ["photo"] # } ``` --- ### Generic schema builders For schemas that need extra companion fields alongside the content reference, or for embedding a content-ref property inside a larger object schema you assemble yourself. | Helper | Description | |---|---| | `content_ref_property(kind: ContentKind, description: str)` | Build a single-property fragment of the form `{"type": "string", "description": ..., "x-blazen-content-ref": {"kind": ...}}`, ready to embed inside an object schema's `properties` map. | | `content_ref_required_object(name: str, kind: ContentKind, description: str, *, extra_properties: dict \| None = None)` | Build a complete object-typed schema declaring a single required content-reference input plus optional companion fields merged from `extra_properties`. | ```python from blazen import ContentKind, content_ref_property, content_ref_required_object # Single property fragment for manual assembly. prop = content_ref_property(ContentKind.Image, "Source frame") schema = { "type": "object", "properties": {"frame": prop, "threshold": {"type": "number"}}, "required": ["frame"], } # Or the full-object form with companion fields merged in. schema = content_ref_required_object( "frame", ContentKind.Image, "Source frame", extra_properties={"threshold": {"type": "number"}}, ) ``` --- ### How resolution works The `x-blazen-content-ref` JSON Schema extension is invisible to the model and to the upstream provider -- they see a plain `string` property. When the model emits `{"photo": "blazen_a1b2c3..."}`, Blazen's resolver intercepts the tool-call arguments before invoking the handler, looks up each content-ref string in the active `ContentStore`, and substitutes the bare id with a typed content dict of the form: ```python { "kind": "image", "handle_id": "blazen_a1b2c3...", "mime_type": "image/png", "byte_size": 48213, "display_name": "frame.png", "source": {"type": "url", "url": "..."}, # serialized MediaSource } ``` The tool handler receives the substituted shape and never has to call `store.resolve` itself. Handles produced by a tool's return value flow back through the same store, so subsequent turns can reference them by id without re-uploading bytes. --- ## Compute Request Types Compute requests define jobs for media generation and processing. All requests are plain `dict` objects. ### ImageRequest Generate images from a text prompt. Passed as a plain `dict`. ```python { "prompt": str, # required "negative_prompt": str, # optional "width": int, # optional "height": int, # optional "num_images": int, # optional "model": str, # optional } ``` | Key | Type | Required | Description | |---|---|---|---| | `"prompt"` | `str` | yes | Text description of the image to generate. | | `"negative_prompt"` | `str` | no | What to avoid in the generated image. | | `"width"` | `int` | no | Image width in pixels. | | `"height"` | `int` | no | Image height in pixels. | | `"num_images"` | `int` | no | Number of images to generate. | | `"model"` | `str` | no | Specific model to use (provider-dependent). | ```python req = {"prompt": "a cat in space", "width": 1024, "height": 1024, "num_images": 2} ``` ### UpscaleRequest Upscale an existing image to a higher resolution. Passed as a plain `dict`. 
```python { "image_url": str, # required "scale": float, # required "model": str, # optional } ``` | Key | Type | Required | Description | |---|---|---|---| | `"image_url"` | `str` | yes | URL of the image to upscale. | | `"scale"` | `float` | yes | Upscale factor (e.g. `2.0`, `4.0`). | | `"model"` | `str` | no | Specific model to use. | ```python req = {"image_url": "https://example.com/photo.jpg", "scale": 4.0} ``` ### VideoRequest Generate a video from a text prompt, optionally with an input image. Passed as a plain `dict`. ```python { "prompt": str, # required "image_url": str, # optional "duration_seconds": float, # optional "negative_prompt": str, # optional "width": int, # optional "height": int, # optional "model": str, # optional } ``` | Key | Type | Required | Description | |---|---|---|---| | `"prompt"` | `str` | yes | Text description of the video to generate. | | `"image_url"` | `str` | no | Optional starting image to animate. | | `"duration_seconds"` | `float` | no | Desired video length in seconds. | | `"negative_prompt"` | `str` | no | What to avoid in the generated video. | | `"width"` | `int` | no | Video width in pixels. | | `"height"` | `int` | no | Video height in pixels. | | `"model"` | `str` | no | Specific model to use. | ```python req = {"prompt": "a sunset timelapse", "duration_seconds": 5.0} req = {"prompt": "animate this scene", "image_url": "https://example.com/frame.jpg"} ``` ### SpeechRequest Generate speech audio from text. Passed as a plain `dict`. ```python { "text": str, # required "voice": str, # optional "voice_url": str, # optional "language": str, # optional "speed": float, # optional "model": str, # optional } ``` | Key | Type | Required | Description | |---|---|---|---| | `"text"` | `str` | yes | The text to convert to speech. | | `"voice"` | `str` | no | Voice preset name (e.g. `"alloy"`, `"nova"`). | | `"voice_url"` | `str` | no | URL to a custom voice sample for cloning. | | `"language"` | `str` | no | Language code (e.g. `"en"`, `"fr"`). | | `"speed"` | `float` | no | Playback speed multiplier (e.g. `1.2` for 20% faster). | | `"model"` | `str` | no | Specific model to use. | ```python req = {"text": "Hello world", "voice": "alloy", "speed": 1.2} ``` ### MusicRequest Generate music or sound effects from a text prompt. Passed as a plain `dict`. ```python { "prompt": str, # required "duration_seconds": float, # optional "model": str, # optional } ``` | Key | Type | Required | Description | |---|---|---|---| | `"prompt"` | `str` | yes | Description of the music to generate. | | `"duration_seconds"` | `float` | no | Desired duration in seconds. | | `"model"` | `str` | no | Specific model to use. | ```python req = {"prompt": "upbeat jazz", "duration_seconds": 30.0} ``` ### TranscriptionRequest Transcribe audio to text. Passed as a plain `dict`. ```python { "audio_url": str, # required "language": str, # optional "diarize": bool, # optional "model": str, # optional } ``` | Key | Type | Required | Description | |---|---|---|---| | `"audio_url"` | `str` | yes | URL of the audio file to transcribe. | | `"language"` | `str` | no | Language hint (e.g. `"en"`). | | `"diarize"` | `bool` | no | If `True`, identify and label different speakers. | | `"model"` | `str` | no | Specific model to use. | ```python req = {"audio_url": "https://example.com/audio.mp3", "language": "en", "diarize": True} ``` ### ThreeDRequest Generate a 3D model from a text prompt or image. Passed as a plain `dict`. 
```python { "prompt": str, # optional (provide at least one of prompt or image_url) "image_url": str, # optional "format": str, # optional "model": str, # optional } ``` | Key | Type | Required | Description | |---|---|---|---| | `"prompt"` | `str` | no | Text description of the 3D object to generate. | | `"image_url"` | `str` | no | Image to use as reference for 3D generation. | | `"format"` | `str` | no | Output format (e.g. `"glb"`, `"obj"`, `"usdz"`). | | `"model"` | `str` | no | Specific model to use. | Provide at least one of `"prompt"` or `"image_url"`. ```python req = {"prompt": "a 3D cat", "format": "glb"} req = {"image_url": "https://example.com/photo.jpg", "format": "obj"} ``` --- ## StreamChunk A typed object received by the `on_chunk` callback during streaming. Replaces the raw dict interface while remaining backwards-compatible via `chunk["key"]` access. | Property | Type | Description | |---|---|---| | `.delta` | `str \| None` | Incremental text content. | | `.finish_reason` | `str \| None` | Present only on the final chunk (`"stop"`, `"tool_calls"`, etc.). | | `.tool_calls` | `list[ToolCall]` | Tool invocations completed in this chunk. | ```python async def on_chunk(chunk): # Attribute access (preferred) if chunk.delta: print(chunk.delta, end="") # Dict-style access (backwards compatible) if chunk["finish_reason"]: print(f"\n[done: {chunk['finish_reason']}]") ``` --- ## EmbeddingModel Generate vector embeddings from text. Created via static constructor methods, similar to `CompletionModel`. Keys are read from environment variables (`OPENAI_API_KEY`, etc.) when `options` is omitted, or can be passed explicitly via `ProviderOptions(api_key=...)`. ```python from blazen import EmbeddingModel, ProviderOptions model = EmbeddingModel.openai() model = EmbeddingModel.openai(model="text-embedding-3-large", dimensions=3072) model = EmbeddingModel.openai(options=ProviderOptions(api_key="sk-...")) model = EmbeddingModel.together() model = EmbeddingModel.cohere() model = EmbeddingModel.fireworks() ``` **Provider constructors** | Constructor | Signature | |---|---| | `openai` | `EmbeddingModel.openai(*, options: ProviderOptions = None, model: str = None, dimensions: int = None)` | | `together` | `EmbeddingModel.together(*, options: ProviderOptions = None)` | | `cohere` | `EmbeddingModel.cohere(*, options: ProviderOptions = None)` | | `fireworks` | `EmbeddingModel.fireworks(*, options: ProviderOptions = None)` | **Properties** | Property | Type | Description | |---|---|---| | `.model_id` | `str` | The model identifier. | | `.dimensions` | `int` | Output vector dimensionality. | **embed()** ```python response: EmbeddingResponse = await model.embed(texts: list[str]) ``` Returns an `EmbeddingResponse` with one vector per input text. --- ## EmbeddingResponse Returned by `EmbeddingModel.embed()`. | Property | Type | Description | |---|---|---| | `.embeddings` | `list[list[float]]` | One vector per input text. | | `.model` | `str` | Model that produced the embeddings. | | `.usage` | `TokenUsage \| None` | Token usage statistics. | | `.cost` | `float \| None` | Estimated cost in USD. | | `.timing` | `RequestTiming \| None` | Request timing breakdown. | ```python response = await model.embed(["Hello", "World"]) print(len(response.embeddings)) # 2 print(len(response.embeddings[0])) # 1536 print(response.model) # "text-embedding-3-small" print(response.cost) # e.g. 0.0001 ``` --- ## Token Estimation Lightweight token counting functions that work without external data files. 
Uses a heuristic (~3.5 characters per token) suitable for budget checks. **estimate_tokens()** ```python from blazen import estimate_tokens count = estimate_tokens("Hello, world!") # 4 count = estimate_tokens("Hello, world!", 32000) # same, with custom context size ``` | Parameter | Type | Default | Description | |---|---|---|---| | `text` | `str` | required | The text to estimate. | | `context_size` | `int` | `128000` | Context window size hint. | **count_message_tokens()** ```python from blazen import count_message_tokens, ChatMessage count = count_message_tokens([ ChatMessage.system("You are helpful."), ChatMessage.user("Hello!"), ]) ``` Includes per-message overhead (role markers, separators) in addition to content tokens. | Parameter | Type | Default | Description | |---|---|---|---| | `messages` | `list[ChatMessage]` | required | Messages to count. | | `context_size` | `int` | `128000` | Context window size hint. | --- ## Subclassable Providers `CompletionModel`, `EmbeddingModel`, and `Transcription` can be subclassed to implement custom providers. Override the relevant methods and the framework will dispatch to your implementation. ### CompletionModel ```python from blazen import CompletionModel, ChatMessage class MyLLM(CompletionModel): def __init__(self): super().__init__(model_id="my-llm") async def complete(self, messages, options=None): # Your inference logic here return {"content": "Hello from my custom model"} async def stream(self, messages, on_chunk, options=None): on_chunk({"delta": "Hello", "finish_reason": None, "tool_calls": []}) on_chunk({"delta": None, "finish_reason": "stop", "tool_calls": []}) model = MyLLM() response = await model.complete([ChatMessage.user("Hi")]) ``` ### EmbeddingModel ```python from blazen import EmbeddingModel class MyEmbedder(EmbeddingModel): def __init__(self): super().__init__(model_id="my-embedder", dimensions=128) async def embed(self, texts): return {"embeddings": [[0.1] * 128 for _ in texts], "model": "my-embedder"} ``` ### Transcription ```python from blazen import Transcription class MyTranscriber(Transcription): def __init__(self): super().__init__(provider_id="my-stt") async def transcribe(self, request): return {"text": "transcribed text", "segments": []} ``` --- ## Per-Capability Provider Classes Seven provider base classes let you implement a single compute capability without dealing with the full `ComputeProvider` interface. Subclass and override the relevant methods. | Class | Methods to Override | Rust Trait | |---|---|---| | `TTSProvider` | `text_to_speech(request)` | `AudioGeneration` | | `MusicProvider` | `generate_music(request)`, `generate_sfx(request)` | `AudioGeneration` | | `ImageProvider` | `generate_image(request)`, `upscale_image(request)` | `ImageGeneration` | | `VideoProvider` | `text_to_video(request)`, `image_to_video(request)` | `VideoGeneration` | | `ThreeDProvider` | `generate_3d(request)` | `ThreeDGeneration` | | `BackgroundRemovalProvider` | `remove_background(request)` | `BackgroundRemoval` | | `VoiceProvider` | `clone_voice(request)`, `list_voices()`, `delete_voice(voice)` | `VoiceCloning` | ### Constructor All provider classes share the same constructor signature: ```python TTSProvider( *, provider_id: str, base_url: str | None = None, pricing: ModelPricing | None = None, vram_estimate_bytes: int | None = None, ) ``` | Parameter | Type | Description | |---|---|---| | `provider_id` | `str` | Identifier for the provider instance. | | `base_url` | `str \| None` | Optional base URL for the provider API. 
| | `pricing` | `ModelPricing \| None` | Optional pricing info for cost tracking. | | `vram_estimate_bytes` | `int \| None` | Optional VRAM estimate for `ModelManager` integration. | ### Example ```python from blazen import TTSProvider class ElevenLabsTTS(TTSProvider): def __init__(self, api_key: str): super().__init__(provider_id="elevenlabs") self.api_key = api_key async def text_to_speech(self, request): # Call ElevenLabs API with self.api_key return {"audio": audio_bytes, "format": "mp3"} tts = ElevenLabsTTS(api_key="sk-...") result = await tts.text_to_speech({"text": "Hello world", "voice": "alice"}) ``` --- ## MemoryBackend Base class for custom memory storage backends. Subclass to implement persistence backed by Postgres, SQLite, DynamoDB, or any other store. ```python from blazen import MemoryBackend class PostgresBackend(MemoryBackend): async def put(self, entry): # Insert or update entry in Postgres ... async def get(self, id): # Retrieve entry by id ... async def delete(self, id): # Delete entry, return True if it existed ... async def list(self): # Return all entries ... async def len(self): # Return count of entries ... async def search_by_bands(self, bands, limit): # Return candidates sharing LSH bands with the query ... ``` ### Methods to Override | Method | Signature | Description | |---|---|---| | `put` | `async def put(self, entry) -> None` | Insert or update a stored entry. | | `get` | `async def get(self, id: str) -> dict \| None` | Retrieve a stored entry by id. | | `delete` | `async def delete(self, id: str) -> bool` | Delete an entry by id. Returns `True` if it existed. | | `list` | `async def list(self) -> list[dict]` | Return all stored entries. | | `len` | `async def len(self) -> int` | Return the number of stored entries. | | `search_by_bands` | `async def search_by_bands(self, bands, limit) -> list[dict]` | Return candidate entries sharing at least one LSH band. | --- ## ModelManager VRAM budget-aware model manager with LRU eviction. Tracks registered local models and their estimated VRAM footprint. When loading a model that would exceed the budget, the least-recently-used loaded model is unloaded first. ### Constructor ```python from blazen import ModelManager manager = ModelManager(budget_gb=24) # or manager = ModelManager(budget_bytes=24 * 1_073_741_824) ``` | Parameter | Type | Description | |---|---|---| | `budget_gb` | `float \| None` | VRAM budget in gigabytes. Provide exactly one of `budget_gb` or `budget_bytes`. | | `budget_bytes` | `int \| None` | VRAM budget in bytes. | ### Methods | Method | Signature | Description | |---|---|---| | `register` | `await manager.register(id, model, vram_estimate)` | Register a model with its estimated VRAM footprint. Starts unloaded. | | `load` | `await manager.load(id)` | Load a model, evicting LRU models if needed. | | `unload` | `await manager.unload(id)` | Unload a model and free its VRAM. | | `is_loaded` | `await manager.is_loaded(id) -> bool` | Check if a model is currently loaded. | | `ensure_loaded` | `await manager.ensure_loaded(id)` | Alias for `load()`. | | `used_bytes` | `await manager.used_bytes() -> int` | Total VRAM currently used by loaded models. | | `available_bytes` | `await manager.available_bytes() -> int` | Available VRAM within the budget. | | `status` | `await manager.status() -> list[ModelStatus]` | Status of all registered models. | ### ModelStatus | Property | Type | Description | |---|---|---| | `.id` | `str` | Model identifier. | | `.loaded` | `bool` | Whether the model is currently loaded. 
| | `.vram_estimate` | `int` | Estimated VRAM footprint in bytes. | --- ## ModelRegistry An ABC for advertising a model catalog. Subclass to plug a custom catalog (a static manifest, a remote control-plane lookup, a cache of `/v1/models` results) into Blazen's model-info surface. Mirrors the Rust trait `blazen_llm::traits::ModelRegistry` and the equivalent ABCs in the Node and WASM SDKs. Both methods are abstract -- the default implementations raise `NotImplementedError`, so a subclass must override both. ```python class ModelRegistry: async def list_models(self) -> list[ModelInfo]: ... async def get_model(self, model_id: str) -> ModelInfo | None: ... ``` ### Subclass example ```python from blazen import ModelRegistry, ModelInfo class StaticRegistry(ModelRegistry): def __init__(self, models: list[ModelInfo]): super().__init__() self._models = {m.id: m for m in models} async def list_models(self) -> list[ModelInfo]: return list(self._models.values()) async def get_model(self, model_id: str) -> ModelInfo | None: return self._models.get(model_id) ``` ### Methods | Method | Signature | Description | |---|---|---| | `list_models` | `async def list_models(self) -> list[ModelInfo]` | List every model the registry advertises. | | `get_model` | `async def get_model(self, model_id: str) -> ModelInfo \| None` | Look up a single model. Return `None` if unknown. | See `ModelInfo` for the dataclass shape returned by these methods. --- ## ModelPricing and Pricing Functions ### ModelPricing Pricing metadata for cost tracking. ```python from blazen import ModelPricing pricing = ModelPricing( input_per_million=1.0, output_per_million=2.0, per_image=0.02, per_second=0.001, ) ``` | Property | Type | Description | |---|---|---| | `.input_per_million` | `float \| None` | Cost per million input tokens (USD). | | `.output_per_million` | `float \| None` | Cost per million output tokens (USD). | | `.per_image` | `float \| None` | Cost per generated image (USD). | | `.per_second` | `float \| None` | Cost per second of compute (USD). | ### register_pricing() Register custom pricing for a model. Overrides any existing pricing for the same model ID. ```python from blazen import register_pricing, ModelPricing register_pricing("my-model", ModelPricing(input_per_million=1.0, output_per_million=2.0)) ``` ### lookup_pricing() Look up pricing for a model by ID. Returns `None` if the model is unknown. ```python from blazen import lookup_pricing pricing = lookup_pricing("gpt-4o") if pricing: print(f"Input: ${pricing.input_per_million}/M tokens") ``` --- ## LocalModel Methods on CompletionModel `CompletionModel` instances backed by local inference (not remote APIs) support explicit load/unload lifecycle management. | Method | Signature | Description | |---|---|---| | `load` | `await model.load()` | Load the model into memory/VRAM. Idempotent. | | `unload` | `await model.unload()` | Free the model's memory/VRAM. Idempotent. | | `is_loaded` | `await model.is_loaded() -> bool` | Whether the model is currently loaded. | | `vram_bytes` | `await model.vram_bytes() -> int \| None` | Approximate memory footprint in bytes, or `None` if unknown. | ```python model = CompletionModel.openai() # Remote -- these methods are no-ops # For a local model: await model.load() print(await model.is_loaded()) # True print(await model.vram_bytes()) # e.g. 
4_000_000_000 await model.unload() ``` --- ## ProgressCallback `ProgressCallback` is a subclassable abstract base class for receiving model-download progress notifications from `ModelCache.download(...)`. Subclass it and override `on_progress`; the default implementation is a no-op so subclasses can ignore total-byte updates they do not care about. | Method | Signature | Description | |---|---|---| | `__new__` | `ProgressCallback()` | Construct an instance. Subclass to override `on_progress`. | | `on_progress` | `on_progress(downloaded: int, total: int \| None) -> None` | Called repeatedly during a download with current byte counts. `total` is `None` if the size is unknown. No-op default. | ```python from blazen import ProgressCallback class MyProgress(ProgressCallback): def on_progress(self, downloaded: int, total: int | None) -> None: pct = (downloaded / total * 100) if total else 0 print(f"{pct:.1f}%") cb = MyProgress() await cache.download("bert-base-uncased", "config.json", cb) ``` You may also pass a plain `Callable[[int, int | None], None]` to `download(progress=...)` for the same effect; this ABC simply gives type-checked callers a typed base to inherit from. --- ## Local Inference Types The local-inference backends expose typed message, chunk, image, and result classes for use with the inherent `infer` / `infer_stream` APIs on each `*Provider`. There are three parallel families: the canonical un-prefixed `Inference*` types for **mistral.rs**, the `LlamaCpp*`-prefixed family for **llama.cpp**, and a single `CandleInferenceResult` for **Candle** (single-shot only -- no streaming). Streaming on both mistral.rs and llama.cpp returns an async iterator (`InferenceChunkStream` and `LlamaCppInferenceChunkStream` respectively) that you consume with `async for chunk in stream: ...`. Both implement the `__aiter__` / `__anext__` protocol and terminate with `StopAsyncIteration` once the underlying engine stream is exhausted. ### mistral.rs inference types | Class | Purpose | |---|---| | `ChatMessageInput` | A chat message for mistral.rs inference, optionally carrying image attachments. | | `ChatRole` | Enum: `System`, `User`, `Assistant`, `Tool`. | | `InferenceImage` | Image payload attached to a `ChatMessageInput`. Build with `from_bytes(...)` or `from_path(...)`. | | `InferenceImageSource` | Underlying image source. Variants: `bytes` and `path`. Inspect via `kind` plus the per-variant getters. | | `InferenceResult` | Result of a single non-streaming call: `content`, `reasoning_content`, `tool_calls`, `finish_reason`, `model`, `usage`. | | `InferenceChunk` | Streaming delta: `delta`, `reasoning_delta`, `tool_calls`, `finish_reason`. | | `InferenceChunkStream` | Async iterator over `InferenceChunk` items. Implements `__aiter__` / `__anext__`. | | `InferenceToolCall` | A tool call returned by the engine: `id`, `name`, `arguments` (JSON string). | | `InferenceUsage` | Token usage stats: `prompt_tokens`, `completion_tokens`, `total_tokens`, `total_time_sec`. | ```python from blazen import ChatMessageInput, ChatRole, InferenceImage msg = ChatMessageInput.with_images( role=ChatRole.User, text="Describe this image.", images=[InferenceImage.from_path("./photo.png")], ) stream = await provider.infer_stream([msg]) async for chunk in stream: if chunk.delta: print(chunk.delta, end="") ``` ### llama.cpp inference types llama.cpp messages are text-only; multimodal inputs are not supported by this backend. 
| Class | Purpose | |---|---| | `LlamaCppChatMessageInput` | A text-only chat message for llama.cpp inference. | | `LlamaCppChatRole` | Enum: `System`, `User`, `Assistant`, `Tool`. | | `LlamaCppInferenceResult` | Result of a single non-streaming call: `content`, `finish_reason`, `model`, `usage`. | | `LlamaCppInferenceChunk` | Streaming delta: `delta`, `finish_reason`. | | `LlamaCppInferenceChunkStream` | Async iterator over `LlamaCppInferenceChunk` items. Implements `__aiter__` / `__anext__`. | | `LlamaCppInferenceUsage` | Token usage stats: `prompt_tokens`, `completion_tokens`, `total_tokens`, `total_time_sec`. | ```python from blazen import LlamaCppChatMessageInput, LlamaCppChatRole msg = LlamaCppChatMessageInput(role=LlamaCppChatRole.User, text="Hello") stream = await llama_provider.infer_stream([msg]) async for chunk in stream: if chunk.delta: print(chunk.delta, end="") ``` ### Candle inference types Candle exposes a single non-streaming result type. Streaming is not currently supported on this backend. | Class | Purpose | |---|---| | `CandleInferenceResult` | Result of a non-streaming candle call: `content`, `prompt_tokens`, `completion_tokens`, `total_time_secs`. | ```python from blazen import CandleInferenceResult result: CandleInferenceResult = await candle_provider.infer(messages) print(result.content, result.prompt_tokens, result.completion_tokens) ``` --- ## Telemetry Blazen ships three optional exporters for tracing and metrics: `init_langfuse` (LLM observability), `init_otlp` (generic OpenTelemetry), and `init_prometheus` (HTTP-scraped metrics). Each is gated behind a Cargo feature and installs a global subscriber on first call -- invoke once at process startup, before any traced work. Calling more than one is allowed only in the documented order; install Langfuse before any other exporter if you want both. | Function | Config type | Cargo feature | Purpose | |---|---|---|---| | `init_langfuse(config)` | `LangfuseConfig` | `langfuse` | LLM call traces, token usage, latency to Langfuse. | | `init_otlp(config)` | `OtlpConfig` | `otlp` | Generic OpenTelemetry OTLP gRPC span exporter. | | `init_prometheus(port)` | `int` (no config object) | `prometheus` | Prometheus metrics over HTTP `/metrics`. | Initialization functions raise a `BlazenError` subclass on failure (e.g. an invalid endpoint or port already in use). ### LangfuseConfig LLM-observability exporter for [Langfuse](https://langfuse.com). Behind the `langfuse` Cargo feature. | Field | Type | Default | Description | |---|---|---|---| | `public_key` | `str` | required | Langfuse public API key (Basic-auth username). | | `secret_key` | `str` | required | Langfuse secret API key (Basic-auth password). | | `host` | `str \| None` | `None` | Langfuse host URL. Defaults to `https://cloud.langfuse.com`. | | `batch_size` | `int` | `100` | Maximum events buffered before an automatic flush. | | `flush_interval_ms` | `int` | `5000` | Background flush interval in milliseconds. | ```python from blazen import LangfuseConfig, init_langfuse init_langfuse(LangfuseConfig( public_key="pk-lf-...", secret_key="sk-lf-...", host="https://cloud.langfuse.com", )) ``` `init_langfuse(config)` spawns a background tokio task that periodically flushes buffered LLM call traces, token usage, and latency data to the Langfuse ingestion API. If a global tracing subscriber is already installed, it is a soft failure: the underlying dispatcher is constructed (so background flushing still runs) and the function returns without overwriting the existing subscriber. 
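A minimal startup sketch, combining two exporters in the documented order (Langfuse first, then OTLP) and catching initialization failures as `BlazenError`; it assumes a wheel built with both the `langfuse` and `otlp` features, and the keys and collector endpoint below are placeholders:

```python
from blazen import BlazenError, LangfuseConfig, OtlpConfig, init_langfuse, init_otlp

def init_telemetry() -> None:
    # Invoke once at process startup, before any traced work.
    try:
        # Langfuse must be installed before any other exporter.
        init_langfuse(LangfuseConfig(
            public_key="pk-lf-...",   # placeholder key
            secret_key="sk-lf-...",   # placeholder key
        ))
        init_otlp(OtlpConfig(
            endpoint="http://localhost:4317",  # placeholder collector endpoint
            service_name="my-app",
        ))
    except BlazenError as exc:
        # e.g. an invalid endpoint or a port already in use
        print(f"telemetry init failed: {exc}")
```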
### OtlpConfig Generic OpenTelemetry OTLP exporter. Behind the `otlp` Cargo feature. | Field | Type | Default | Description | |---|---|---|---| | `endpoint` | `str` | required | OTLP gRPC endpoint URL (e.g. `"http://localhost:4317"`). | | `service_name` | `str` | required | Service name reported to the backend. | ```python from blazen import OtlpConfig, init_otlp init_otlp(OtlpConfig(endpoint="http://localhost:4317", service_name="my-app")) ``` `init_otlp(config)` sets up an OTLP gRPC span exporter and installs a combined tracing subscriber (env-filter + OTel layer + fmt layer). ### Prometheus metrics Behind the `prometheus` Cargo feature. ```python from blazen import init_prometheus init_prometheus(9090) # serves /metrics on 0.0.0.0:9090 ``` `init_prometheus(port)` installs a global `metrics` recorder backed by Prometheus and starts an HTTP listener on `0.0.0.0:{port}` serving the `/metrics` endpoint. After calling this, any code using the `metrics` macros (`counter!`, `histogram!`, `gauge!`) is exposed at the Prometheus endpoint. --- ## Error Handling All Blazen failures are raised as subclasses of `BlazenError`, a typed exception hierarchy rooted at `builtins.Exception`. Catching `BlazenError` will catch every error raised from any Blazen API; catch a specific subclass to react to one category. ```python from blazen import BlazenError, RateLimitError, ProviderError, AuthError try: response = await model.complete([ChatMessage.user("Hello")]) except RateLimitError as e: # Provider rate-limited the request -- back off and retry. print(f"slow down: {e}") except AuthError as e: print(f"bad credentials: {e}") except BlazenError as e: print(f"blazen failed: {e}") ``` ### Base hierarchy Every class below derives directly from `BlazenError`, which itself derives from `builtins.Exception`. | Class | Description | |---|---| | `BlazenError` | Base class for all Blazen runtime errors. Catches everything. | | `AuthError` | Authentication failed (invalid or missing API key). | | `RateLimitError` | Provider rate-limited the request. | | `TimeoutError` | The operation exceeded its time limit. Distinct from the builtin `TimeoutError` -- import from `blazen`. | | `ValidationError` | Invalid input rejected before the provider round-trip. | | `ContentPolicyError` | Provider rejected the request for policy reasons. | | `ProviderError` | Provider-side error. Carries structured HTTP attributes (see below). | | `UnsupportedError` | Requested capability is not supported by this provider or backend. | | `ComputeError` | Compute job error (cancelled, quota exceeded, etc). | | `MediaError` | Media handling error (invalid input, size exceeded, etc). | ### ProviderError attributes For HTTP-attributable failures, `ProviderError` instances are populated with the following typed attributes (set via `setattr` on the instance, mirrored in the stub for static type-checking): | Attribute | Type | Description | |---|---|---| | `provider` | `str` | Provider tag (e.g. `"fal"`, `"openrouter"`). | | `status` | `int \| None` | HTTP status code; `None` for non-HTTP provider errors. | | `endpoint` | `str \| None` | Request URL. | | `request_id` | `str \| None` | `x-fal-request-id` / `x-request-id` header if present. | | `detail` | `str \| None` | Parsed error message extracted from the JSON body. | | `raw_body` | `str \| None` | Response body, capped at 4 KiB. | | `retry_after_ms` | `int \| None` | Parsed `Retry-After` header in milliseconds. 
| ```python from blazen import ProviderError try: await model.complete([ChatMessage.user("Hello")]) except ProviderError as e: print(f"{e.provider} {e.status} on {e.endpoint}") if e.retry_after_ms: await asyncio.sleep(e.retry_after_ms / 1000) ``` ### Per-backend ProviderError subclasses Each local-inference backend raises its own `ProviderError` subclass so callers can route errors per-backend. The classes are always declared in the type stub, but only registered at runtime when the corresponding Cargo feature is enabled in your wheel build. | Class | Cargo feature | Backend | |---|---|---| | `LlamaCppError` | `llamacpp` | llama.cpp local LLM inference. | | `CandleLlmError` | `candle-llm` | Candle local LLM inference. | | `CandleEmbedError` | `candle-embed` | Candle local embedding backend. | | `MistralRsError` | `mistralrs` | mistral.rs local LLM inference. | | `WhisperError` | `whispercpp` | whisper.cpp transcription. | | `PiperError` | `piper` | Piper text-to-speech. | | `DiffusionError` | `diffusion` | Diffusion image generation. | | `FastEmbedError` | `embed` | fastembed embedding (non-musl only). | | `TractError` | `tract` | Tract ONNX embedding. | Because they all derive from `ProviderError`, `except ProviderError` catches every backend-attributable error, including these subclasses. Use `except LlamaCppError` (etc.) when you need to react to a single backend. ```python from blazen import LlamaCppError, ProviderError try: await llama_provider.infer(messages) except LlamaCppError as e: print(f"llama.cpp blew up: {e}") except ProviderError as e: # Any other backend print(f"{e.provider}: {e}") ``` --- # Node.js API Reference Source: https://blazen.dev/docs/api/node Language: node Section: api ## ChatMessage A class for building typed chat messages. Supports text, multimodal (image) content, and content parts. ### Constructor ```typescript new ChatMessage({ role?: string, content?: string, parts?: ContentPart[] }) ``` Create a message from an options object. `role` defaults to `"user"` if omitted. Supply either `content` (text) or `parts` (multimodal), not both. ```typescript // Text message with explicit role new ChatMessage({ role: "user", content: "Hello" }) // Using the Role enum new ChatMessage({ role: Role.User, content: "Hello" }) // System message new ChatMessage({ role: "system", content: "You are a helpful assistant." 
}) // Multimodal message with content parts new ChatMessage({ role: "user", parts: [ { partType: "text", text: "Describe this image:" }, { partType: "image", image: { source: { sourceType: "url", url: "https://example.com/photo.jpg" } } } ] }) ``` ### Static Factory Methods | Method | Description | |---|---| | `ChatMessage.system(content: string)` | Create a system message | | `ChatMessage.user(content: string)` | Create a user message | | `ChatMessage.assistant(content: string)` | Create an assistant message | | `ChatMessage.tool(content: string)` | Create a tool result message | | `ChatMessage.userImageUrl(text: string, url: string, mediaType?: string)` | Create a user message with text and an image URL | | `ChatMessage.userImageBase64(text: string, data: string, mediaType: string)` | Create a user message with text and a base64-encoded image | | `ChatMessage.userParts(parts: ContentPart[])` | Create a user message from an explicit list of content parts | ```typescript const msg = ChatMessage.user("What is 2 + 2?"); const imgMsg = ChatMessage.userImageUrl( "What's in this image?", "https://example.com/photo.jpg", "image/jpeg" ); const b64Msg = ChatMessage.userImageBase64( "Describe this:", base64Data, "image/png" ); ``` ### Properties | Property | Type | Description | |---|---|---| | `.role` | `string` | The message role: `"system"`, `"user"`, `"assistant"`, or `"tool"` | | `.content` | `string \| null` | The text content of the message, if any. For tool results that returned a plain string, the string lives here (and `.toolResult` is `null`). | | `.toolResult` | `ToolOutput \| null` | Structured tool-result payload. `null` for non-tool messages or when the tool returned a plain string. When non-null, `.toolResult.data` is the full structured value the caller should consume; the LLM-facing wire form is derived from `.toolResult.llmOverride` (if set) or from `.toolResult.data` via the provider's default conversion. | > **Note on `toolCallId`**: The Rust core stores a `tool_call_id` on tool messages so providers can correlate a tool result with the originating `ToolCall.id`. As of this writing the Node binding does not expose a public `.toolCallId` getter on `ChatMessage` -- correlation is handled internally by the agent loop and on the wire by each provider. If you need explicit access, file an issue and we can surface it. --- ## Role String enum for message roles. ```typescript Role.System // "system" Role.User // "user" Role.Assistant // "assistant" Role.Tool // "tool" ``` Can be used interchangeably with plain strings in the `ChatMessage` constructor. --- ## ContentPart Types for multimodal message content, used in the `parts` field of the `ChatMessage` constructor and in `ChatMessage.userParts()`. ```typescript interface ContentPart { partType: "text" | "image"; text?: string; // Required when partType is "text" image?: ImageContent; // Required when partType is "image" } interface ImageContent { source: ImageSource; mediaType?: string; // MIME type, e.g. "image/png" } interface ImageSource { sourceType: "url" | "base64"; url?: string; // Required when sourceType is "url" data?: string; // Required when sourceType is "base64" } // `MediaSource` is a type alias for `ImageSource` re-exported for compute APIs // that accept any media (image / video frame / audio cover-art) using the same shape. // All compute requests that take a `MediaSource` accept exactly the same value // you would pass to `ImageSource` — `MediaSource` exists only as an aliasing affordance. 
export type MediaSource = ImageSource; ``` > **Note on `MediaSource`:** prefer `MediaSource` in compute / generation APIs (image upscaling, video frames, audio cover art) and `ImageSource` in chat-message APIs. The two are interchangeable at the type level — `MediaSource` is just the structurally identical alias for `ImageSource`. ```typescript // Text part { partType: "text", text: "Describe this image:" } // Image from URL { partType: "image", image: { source: { sourceType: "url", url: "https://example.com/photo.jpg" }, mediaType: "image/jpeg" } } // Image from base64 { partType: "image", image: { source: { sourceType: "base64", data: "iVBORw0KGgo..." }, mediaType: "image/png" } } ``` --- ## CompletionModel A chat completion model. Created via static factory methods for each provider. ### Provider Factory Methods All providers accept an optional options object containing an `apiKey` (and other provider-specific fields). If `options` is omitted -- or `apiKey` is not set within it -- the key is read from the provider's standard environment variable (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, etc.). If `model` is not set, the provider's default model is used. | Method | Signature | |---|---| | `CompletionModel.openai` | `(options?: ProviderOptions)` | | `CompletionModel.anthropic` | `(options?: ProviderOptions)` | | `CompletionModel.gemini` | `(options?: ProviderOptions)` | | `CompletionModel.azure` | `(options: AzureOptions)` | | `CompletionModel.fal` | `(options?: FalOptions)` | | `CompletionModel.openrouter` | `(options?: ProviderOptions)` | | `CompletionModel.groq` | `(options?: ProviderOptions)` | | `CompletionModel.together` | `(options?: ProviderOptions)` | | `CompletionModel.mistral` | `(options?: ProviderOptions)` | | `CompletionModel.deepseek` | `(options?: ProviderOptions)` | | `CompletionModel.fireworks` | `(options?: ProviderOptions)` | | `CompletionModel.perplexity` | `(options?: ProviderOptions)` | | `CompletionModel.xai` | `(options?: ProviderOptions)` | | `CompletionModel.cohere` | `(options?: ProviderOptions)` | | `CompletionModel.bedrock` | `(options: BedrockOptions)` | ```typescript // Read key from OPENAI_API_KEY env var const model = CompletionModel.openai(); // Pass an explicit key and override the model const claude = CompletionModel.anthropic({ apiKey: "sk-ant-...", model: "claude-sonnet-4-20250514" }); const gemini = CompletionModel.gemini({ model: "gemini-2.5-flash" }); ``` ### Properties | Property | Type | Description | |---|---|---| | `.modelId` | `string` | The model identifier string | ### `await model.complete(messages: ChatMessage[]): CompletionResponse` Perform a chat completion. ```typescript const response = await model.complete([ ChatMessage.system("You are a helpful assistant."), ChatMessage.user("What is 2 + 2?"), ]); console.log(response.content); // "4" ``` ### `await model.completeWithOptions(messages: ChatMessage[], options: CompletionOptions): CompletionResponse` Perform a chat completion with additional options. ```typescript const response = await model.completeWithOptions( [ChatMessage.user("Write a haiku about Rust.")], { temperature: 0.7, maxTokens: 100 } ); ``` ### `await model.stream(messages: ChatMessage[], onChunk: (chunk) => void): void` Stream a chat completion. The callback receives each chunk as it arrives. 
```typescript await model.stream( [ChatMessage.user("Tell me a story")], (chunk) => { if (chunk.delta) process.stdout.write(chunk.delta); } ); ``` Each chunk has the shape: ```typescript { delta?: string; // Text content delta finishReason?: string; // Set on the final chunk toolCalls: ToolCall[]; // Tool calls, if any } ``` ### `await model.streamWithOptions(messages: ChatMessage[], onChunk: (chunk) => void, options: CompletionOptions): void` Stream a chat completion with additional options. ```typescript await model.streamWithOptions( [ChatMessage.user("Explain quantum computing")], (chunk) => { if (chunk.delta) process.stdout.write(chunk.delta); }, { temperature: 0.5, maxTokens: 500 } ); ``` ### Middleware Decorators Each decorator returns a new `CompletionModel` wrapping the original with additional behaviour. #### `model.withRetry(config?: RetryConfig): CompletionModel` Automatic retry with exponential backoff on transient failures. ```typescript const resilient = model.withRetry({ maxRetries: 5, initialDelayMs: 500, maxDelayMs: 15000 }); ``` | Field | Type | Default | Description | |---|---|---|---| | `maxRetries` | `number` | `3` | Maximum retry attempts. | | `initialDelayMs` | `number` | `1000` | Delay before first retry (ms). | | `maxDelayMs` | `number` | `30000` | Upper bound on backoff delay (ms). | #### `model.withCache(config?: CacheConfig): CompletionModel` In-memory response cache for identical non-streaming requests. ```typescript const cached = model.withCache({ ttlSeconds: 600, maxEntries: 500 }); ``` | Field | Type | Default | Description | |---|---|---|---| | `ttlSeconds` | `number` | `300` | Cache entry TTL in seconds. | | `maxEntries` | `number` | `1000` | Maximum entries before eviction. | #### `CompletionModel.withFallback(models: CompletionModel[]): CompletionModel` Static factory method. Tries providers in order; falls back on transient errors. ```typescript const model = CompletionModel.withFallback([ CompletionModel.openai(), CompletionModel.anthropic(), ]); ``` --- ## CompletionOptions Options object for `completeWithOptions()` and `streamWithOptions()`. ```typescript interface CompletionOptions { temperature?: number; // Sampling temperature (0.0 - 2.0) maxTokens?: number; // Maximum tokens to generate topP?: number; // Nucleus sampling parameter model?: string; // Override the default model ID tools?: ToolDefinition[]; // Tool definitions for function calling } ``` --- ## CompletionResponse Returned by `model.complete()` and `model.completeWithOptions()`. ```typescript interface CompletionResponse { content?: string; // The generated text toolCalls: ToolCall[]; // Tool calls requested by the model usage?: TokenUsage; // Token usage statistics model: string; // Model name used for the completion finishReason?: string; // Why generation stopped ("stop", "tool_calls", etc.) cost?: number; // Cost in USD, if reported by the provider timing?: RequestTiming; // Request timing breakdown images: object[]; // Generated images, if any (provider-specific) audio: object[]; // Generated audio, if any (provider-specific) videos: object[]; // Generated videos, if any (provider-specific) metadata: object; // Raw provider-specific metadata } ``` --- ## ToolCall A tool invocation requested by the model. | Property | Type | Description | |---|---|---| | `.id` | `string` | Unique identifier for the tool call | | `.name` | `string` | Name of the tool to invoke | | `.arguments` | `object` | Parsed JSON arguments | --- ## ToolOutput Two-channel return shape for tool handlers. 
Tool results have two distinct audiences. The caller (your TypeScript code) wants the full structured data; the LLM, on the next turn, may need a different shape -- sometimes shorter, sometimes provider-specific. `ToolOutput` carries both channels. ```typescript import type { ToolOutput, LlmPayload } from "blazen"; const out: ToolOutput = { data: { items: [1, 2, 3] }, }; ``` ### Properties | Member | Type | Description | |---|---|---| | `data` | `any` | The structured value the caller sees programmatically. Dict, array, scalar, or string. | | `llmOverride` | `LlmPayload \| undefined` | Optional override for what the LLM sees on the next turn. `undefined` means each provider applies its default conversion from `data`. | Both `llmOverride` (camelCase) and `llm_override` (snake_case) are accepted on input from a JS tool handler, so a hand-written object using either casing will deserialize correctly. When the agent loop appends a tool result to the conversation, the resulting `ChatMessage` exposes the structured output via `.toolResult`: ```typescript const last = result.messages[result.messages.length - 1]; // last.toolResult?.data is the full structured payload from your handler. // last.toolResult?.llmOverride is the override (if any) the LLM saw this turn. ``` For tool handlers that returned a plain string, `last.toolResult` is `null` and the string lives on `last.content` instead. --- ## LlmPayload A tagged union describing what the LLM sees for a tool result on the next turn. Used as the `llmOverride` field of [`ToolOutput`](#tooloutput). ```typescript import type { LlmPayload } from "blazen"; const text: LlmPayload = { kind: "text", text: "Found 3 results." }; const json: LlmPayload = { kind: "json", value: { items: [1, 2, 3] } }; const parts: LlmPayload = { kind: "parts", parts: [ { partType: "text", text: "Here is the table:" }, { partType: "text", text: "| col |\n| --- |\n| 1 |" }, ], }; const raw: LlmPayload = { kind: "provider_raw", provider: "anthropic", value: [{ type: "text", text: "Custom Anthropic-shaped payload." }], }; ``` ### Variants | `kind` | Required fields | Behavior | |---|---|---| | `"text"` | `text` | Plain text. Works on every provider. | | `"json"` | `value` | Structured JSON. Anthropic and Gemini natively consume the structure; OpenAI-family stringifies once at the wire boundary. | | `"parts"` | `parts` (`ContentPart[]`) | Multimodal content blocks; Anthropic supports natively, OpenAI falls back to text concatenation, Gemini wraps as `{ parts: [...] }`. | | `"provider_raw"` | `provider`, `value` | Provider-specific escape hatch. Only the named provider sees `value`; every other provider falls back to converting `ToolOutput.data` with its default. `provider` is one of `"openai"`, `"openai_compat"`, `"azure"`, `"anthropic"`, `"gemini"`, `"responses"`, `"fal"`. | ### Per-provider behavior When a tool returns structured `data` and no `llmOverride`, each provider sends a sensible default to the LLM: - **OpenAI / OpenAI-compat / Azure / Responses / Fal**: the data is JSON-stringified into the `content` field of the tool message. - **Anthropic**: structured data becomes `[{ type: "text", text: }]` inside `tool_result.content`. - **Gemini**: structured object data is passed natively as `functionResponse.response`. Scalars wrap as `{ result: }`. 
When `llmOverride` is set, that override always wins for the variants the provider understands; `kind: "provider_raw"` is the strictest -- it's only honoured when `provider` matches the active provider, otherwise the provider falls back to converting `data` with its default. --- ## ToolDefinition Describes a tool that the model may invoke. ```typescript interface ToolDefinition { name: string; // Unique tool name description: string; // Human-readable description parameters: object; // JSON Schema for the tool's parameters } ``` ```typescript const tools: ToolDefinition[] = [ { name: "getWeather", description: "Get the current weather for a city", parameters: { type: "object", properties: { city: { type: "string" } }, required: ["city"] } } ]; ``` --- ## Content Subsystem Pluggable storage and handle plumbing for multimodal payloads (images, audio, video, documents, 3D models, CAD files). A `ContentStore` issues opaque handles; tools accept handle-id strings as arguments and Blazen substitutes the resolved typed content before the handler runs. ```typescript import { ContentStore, imageInput, audioInput, videoInput, fileInput, threeDInput, cadInput, } from "blazen"; import type { ContentHandle, ContentKind, JsContentMetadata, PutOptions, } from "blazen"; ``` ### ContentKind String enum tagging what a handle refers to. Values match the serde-tag form so they round-trip across any Blazen API that takes a kind string. ```typescript const enum ContentKind { Image = "image", Audio = "audio", Video = "video", Document = "document", ThreeDModel = "three_d_model", Cad = "cad", Archive = "archive", Font = "font", Code = "code", Data = "data", Other = "other", } ``` ### ContentHandle Opaque reference returned by `ContentStore.put()` and consumed by every other store method. Treat `id` as a black box -- store-defined. ```typescript interface ContentHandle { id: string; // Opaque, store-defined identifier kind: ContentKind; // What kind of content this handle refers to mimeType?: string; // MIME type if known byteSize?: number; // Byte size if known (i64 -- napi has no u64) displayName?: string; // Human-readable display name (e.g. original filename) } ``` `ContentHandle` is a type alias for the underlying `JsContentHandle` interface. ### ContentStore Pluggable registry for multimodal content. Construct via the static factories; instances are cheap to clone (internally an `Arc`), so reusing one store across multiple agents and requests is fine. ```typescript class ContentStore { // Factories static inMemory(): ContentStore; static localFile(root: string): ContentStore; static openaiFiles(apiKey: string, baseUrl?: string | null): ContentStore; static anthropicFiles(apiKey: string, baseUrl?: string | null): ContentStore; static geminiFiles(apiKey: string, baseUrl?: string | null): ContentStore; static falStorage(apiKey: string, baseUrl?: string | null): ContentStore; static custom(options: CustomContentStoreOptions): ContentStore; // Operations put(body: Buffer | string, options: PutOptions): Promise; resolve(handle: ContentHandle): Promise; // serialized MediaSource fetchBytes(handle: ContentHandle): Promise; metadata(handle: ContentHandle): Promise; delete(handle: ContentHandle): Promise; } ``` `put()` accepts either a `Buffer` (inline bytes uploaded to the store) or a `string` -- interpreted as a URL when it contains `"://"` (the store records the reference) and as a local filesystem path otherwise (the store reads or copies the file as needed). 
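A short sketch of the three `put()` body forms described above -- inline bytes, a URL reference, and a local file path; the URL, path, and byte literal are placeholders:

```typescript
import { ContentStore, ContentKind } from "blazen";

const store = ContentStore.inMemory();

// Inline bytes: uploaded into the store.
const fromBytes = await store.put(Buffer.from([0x89, 0x50, 0x4e, 0x47]), {
  mimeType: "image/png",
  kind: ContentKind.Image,
});

// String containing "://": treated as a URL; the store records the reference.
const fromUrl = await store.put("https://example.com/photo.jpg", {
  kind: ContentKind.Image,
});

// Any other string: treated as a local filesystem path.
const fromPath = await store.put("./assets/diagram.png", {
  kind: ContentKind.Image,
  displayName: "diagram.png",
});
```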
`resolve()` returns a serialized `MediaSource` -- the same JSON shape Blazen's request builders accept. `fetchBytes()` is for tools that need to operate on the raw payload (parse a PDF, transcribe audio); most tools reason over the handle and let `resolve()` produce the wire form. `delete()` is best-effort -- default implementations on most stores are no-ops. ```typescript const store = ContentStore.inMemory(); const handle = await store.put(Buffer.from(pngBytes), { mimeType: "image/png", kind: ContentKind.Image, displayName: "diagram.png", }); const meta = await store.metadata(handle); const bytes = await store.fetchBytes(handle); ``` ### Subclassing `ContentStore` `ContentStore` is subclassable from JavaScript / TypeScript. Override the methods your backend needs; napi-rs wraps your subclass in a Rust adapter that dispatches into your JS async functions via `ThreadsafeFunction`. ```typescript import { ContentStore } from "blazen"; import type { ContentHandle, ContentKind } from "blazen"; class S3ContentStore extends ContentStore { constructor(bucket: string) { super(); this.bucket = bucket; } async put(body, hint) { // ... return { id: "...", kind: "image" }; } async resolve(handle) { return { sourceType: "url", url: "..." }; } async fetchBytes(handle) { return Buffer.from("..."); } // Optional: async fetchStream(handle) { return Buffer.from("..."); } async delete(handle) { /* no-op */ } } ``` Subclasses MUST override `put`, `resolve`, `fetchBytes`. The base-class default impls throw an error so any missing override is a clear failure rather than silent recursion via `super()`. ### `ContentStore.custom({...})` Callback-based factory. Direct JS mirror of Rust `CustomContentStore::builder`. ```typescript ContentStore.custom(options: { put: (body: any, hint: any) => Promise; resolve: (handle: ContentHandle) => Promise; // serialized MediaSource fetchBytes: (handle: ContentHandle) => Promise; fetchStream?: (handle: ContentHandle) => Promise; // single-chunk for now delete?: (handle: ContentHandle) => Promise; name?: string; }): ContentStore ``` `put`, `resolve`, `fetchBytes` are required. `fetchStream` and `delete` are optional. The `body` argument arrives as a JS object shaped like `{type: "bytes", data: [...]}` / `{type: "url", url}` / `{type: "local_path", path}` / `{type: "provider_file", provider, id}` / `{type: "stream", stream: AsyncIterable, sizeHint: number | null}`. The `hint` has optional `mimeType` / `kindHint` / `displayName` / `byteSize`. `resolve` returns a serialized `MediaSource` JS object (e.g. `{sourceType: "url", url: "..."}`). `fetchBytes` returns a `Buffer`. `fetchStream` may return either `Buffer` / `Uint8Array` / `number[]` / base64 `string` (legacy, single-chunk) or an `AsyncIterable` for true chunk-by-chunk streaming -- a Node `Readable` qualifies since it implements `[Symbol.asyncIterator]`. ### PutOptions Optional hints attached to a `put()` call. Every field is optional; the store may auto-detect from the bytes when a hint is missing. ```typescript interface PutOptions { mimeType?: string; // MIME type, if known kind?: ContentKind; // Caller's preferred classification -- overrides any auto-detection displayName?: string; // Human-readable display name (filename, caption) byteSize?: number; // Byte size, if known up-front (i64 since napi has no u64) } ``` ### ContentMetadata Cheap metadata summary returned by `ContentStore.metadata()`. No bytes are materialized. 
```typescript interface ContentMetadata { kind: ContentKind; mimeType?: string; byteSize?: number; displayName?: string; } ``` `ContentMetadata` is a type alias for the underlying `JsContentMetadata` interface. ### Built-in stores | Factory | Purpose | |---|---| | `ContentStore.inMemory()` | Ephemeral in-memory store. Default for tests and short-lived sessions. | | `ContentStore.localFile(root)` | Filesystem-backed store rooted at `root`. Directory is created if missing. | | `ContentStore.openaiFiles(apiKey, baseUrl?)` | Backed by the OpenAI Files API. | | `ContentStore.anthropicFiles(apiKey, baseUrl?)` | Backed by the Anthropic Files API. | | `ContentStore.geminiFiles(apiKey, baseUrl?)` | Backed by the Gemini Files API. | | `ContentStore.falStorage(apiKey, baseUrl?)` | Backed by fal.ai's storage API. | | `ContentStore.custom({...})` | User-defined backend via async callbacks (see above). | ### Tool-input schema helpers Each helper builds a JSON Schema fragment declaring a single required handle-id input. The model emits a handle-id string; Blazen swaps it for the resolved typed content before your tool runs. | Helper | Declares an input expecting a handle of kind | |---|---| | `imageInput(name, description)` | `image` | | `audioInput(name, description)` | `audio` | | `videoInput(name, description)` | `video` | | `fileInput(name, description)` | `document` | | `threeDInput(name, description)` | `three_d_model` | | `cadInput(name, description)` | `cad` | Each call returns the same shape (kind tag varies): ```typescript imageInput("photo", "The image to describe"); // => // { // type: "object", // properties: { // photo: { // type: "string", // description: "The image to describe", // "x-blazen-content-ref": { kind: "image" } // } // }, // required: ["photo"] // } ``` The `x-blazen-content-ref` extension is invisible to providers -- they see a plain string parameter -- but Blazen's resolver uses it as a marker for handle substitution. ### How resolution works When the model emits a tool call like `{"photo": "blazen_xxx"}`, Blazen scans the tool's parameter schema for `x-blazen-content-ref` markers. For each marked field, it looks up the handle id in the active `ContentStore` and replaces the bare string with a typed object before invoking your handler: ```typescript { kind: "image", handleId: "blazen_xxx", mimeType: "image/png", byteSize: 24576, displayName: "diagram.png", source: { /* resolved MediaSource -- url, base64, or provider-native ref */ } } ``` `source` is the same wire form `ContentStore.resolve()` returns, so your handler can forward it straight into a downstream provider request, or call `store.fetchBytes(handle)` if it needs the raw payload. If the handle is unknown to the store the call fails before the handler runs. --- ## TokenUsage Token usage statistics for a completion request. | Property | Type | Description | |---|---|---| | `.promptTokens` | `number` | Tokens in the prompt | | `.completionTokens` | `number` | Tokens in the completion | | `.totalTokens` | `number` | Total tokens used | --- ## RequestTiming Timing metadata for a completion request. | Property | Type | Description | |---|---|---| | `.queueMs` | `number \| undefined` | Time spent waiting in queue (ms) | | `.executionMs` | `number \| undefined` | Time spent executing (ms) | | `.totalMs` | `number \| undefined` | Total wall-clock time (ms) | --- ## runAgent Run an agentic tool execution loop. 
The agent repeatedly calls the model, executes tool calls via the handler callback, feeds results back, and repeats until the model stops calling tools or `maxIterations` is reached. ```typescript const result = await runAgent(model, messages, tools, toolHandler, options?); ``` ### Parameters | Parameter | Type | Description | |---|---|---| | `model` | `CompletionModel` | The completion model to use | | `messages` | `ChatMessage[]` | Initial conversation messages | | `tools` | `ToolDef[]` | Tool definitions the agent can invoke | | `toolHandler` | `(toolName: string, args: object) => Promise` | Callback that executes tool calls. May return a bare value (auto-wrapped) or an explicit [`ToolOutput`](#tooloutput). | | `options` | `AgentRunOptions?` | Optional configuration | ### Example ```typescript import { CompletionModel, ChatMessage, runAgent } from "blazen"; const model = CompletionModel.openai(); const result = await runAgent( model, [ChatMessage.user("What is the weather in NYC?")], [{ name: "getWeather", description: "Get weather for a city", parameters: { type: "object", properties: { city: { type: "string" } }, required: ["city"] } }], async (toolName, args) => { if (toolName === "getWeather") { return { temp: 72, condition: "sunny" }; } throw new Error(`Unknown tool: ${toolName}`); }, { maxIterations: 5 } ); console.log(result.response.content); console.log(`Took ${result.iterations} iterations`); ``` ### Tool handler return shapes The handler's return value is checked for an explicit `data` key. If present and the value deserializes as a [`ToolOutput`](#tooloutput), it's used directly. Otherwise the bare value is wrapped via `ToolOutput { data: , llmOverride: undefined }`. This means an arbitrary user dict like `{ items: [1, 2, 3] }` is treated as plain `data`, not as a `ToolOutput`. Only objects with a top-level `data` field are unpacked. ```typescript import { runAgent, type ToolDef } from "blazen"; // Simplest: return a value directly, auto-wrapped. const search: ToolDef = { name: "search", description: "Search for items.", parameters: { type: "object", properties: { q: { type: "string" } } }, }; async function handlerSimple(toolName: string, args: any) { if (toolName === "search") { return { items: [1, 2, 3] }; // wrapped as ToolOutput { data: { items: [1,2,3] } } } throw new Error(`Unknown tool: ${toolName}`); } // With override: structured ToolOutput so the LLM sees a summary, // but the caller's `messages[messages.length-1].toolResult.data` // still has the full list. async function handlerWithOverride(toolName: string, args: any) { if (toolName === "search") { return { data: { items: [1, 2, 3], rawResponse: "..." }, llmOverride: { kind: "text", text: "Found 3 items." }, }; } throw new Error(`Unknown tool: ${toolName}`); } ``` To return a string to the caller (so it lives on `ChatMessage.content` and `toolResult` is `null`), simply return a string from the handler: ```typescript async function handlerString(toolName: string, args: any) { return "ok"; // appears as ChatMessage.content; toolResult is null } ``` See [ToolOutput](#tooloutput) and [LlmPayload](#llmpayload) for the full shape and per-provider wire behaviour. --- ## ToolDef Describes a tool that the agent may invoke. ```typescript interface ToolDef { name: string; // Unique tool name description: string; // Human-readable description parameters: object; // JSON Schema for the tool's parameters } ``` --- ## AgentRunOptions Options for configuring an agent run. 
```typescript interface AgentRunOptions { maxIterations?: number; // Max tool-calling iterations (default: 10) systemPrompt?: string; // System prompt prepended to the conversation temperature?: number; // Sampling temperature (0.0 - 2.0) maxTokens?: number; // Maximum tokens per completion call addFinishTool?: boolean; // Add a built-in "finish" tool the model can call to signal completion } ``` --- ## AgentResult The result of an agent run, returned by `runAgent()`. `AgentResult` is a typed class with getter properties (not a plain object) so it carries identity across the FFI boundary and supports `instanceof` checks. ### Properties | Property | Type | Description | |---|---|---| | `.response` | `string` | Final assistant text from the last completion. | | `.messages` | `ChatMessage[]` | Full message history (all tool calls and results). | | `.iterations` | `number` | Number of tool-calling iterations that occurred. | | `.totalCost` | `number \| null` | Aggregated cost in USD across all iterations, or `null` if pricing is unknown. | | `.toString()` | `string` | Human-readable summary, mirrors the Python `__repr__`. | ```typescript const result = await runAgent(model, [ChatMessage.user("Hi")], tools); console.log(result.response); // "Hello!" console.log(result.messages.length); // e.g. 3 console.log(result.iterations); // e.g. 1 console.log(result.totalCost); // e.g. 0.00012 or null ``` --- ## BatchResult Returned by `completeBatch()` / `completeBatchConfig()`. A typed class wrapping per-request outcomes plus aggregates. ### Properties | Property | Type | Description | |---|---|---| | `.responses` | `(CompletionResponse \| null)[]` | One entry per input request. `null` for failed requests. | | `.errors` | `(string \| null)[]` | Per-request error message, or `null` for successful requests. | | `.totalUsage` | `TokenUsage \| null` | Aggregated token usage across all successful responses. | | `.totalCost` | `number \| null` | Aggregated cost in USD across all successful responses. | | `.successCount` | `number` | Number of requests that succeeded. | | `.failureCount` | `number` | Number of requests that failed. | | `.length` | `number` | Total number of requests in the batch (= responses.length = errors.length). | | `.toString()` | `string` | Human-readable batch summary. | ```typescript const batch = await completeBatch(model, [ [ChatMessage.user("What's 2+2?")], [ChatMessage.user("Capital of France?")], ]); console.log(`${batch.successCount}/${batch.length} succeeded`); for (let i = 0; i < batch.length; i++) { if (batch.errors[i]) console.error(`req ${i}:`, batch.errors[i]); else console.log(`req ${i}:`, batch.responses[i]?.content); } ``` --- ## Pipeline A `Pipeline` is a sequence of named `Stage`s built with `PipelineBuilder`. Each stage runs as its own workflow; on completion, an optional persist callback fires with a typed snapshot so the caller can durably store progress for later resumption. ### `new PipelineBuilder(name: string)` ```typescript import { PipelineBuilder } from "blazen"; const pipeline = new PipelineBuilder("ingest") .stage(stageA) .stage(stageB) .timeoutPerStage(120) .build(); ``` ### `.onPersist(callback)` Register a TSFN-based persist callback that receives a typed `PipelineSnapshot` after each stage completes. The callback must return `Promise` (or be `async`). A rejected promise aborts the pipeline with a `PipelineError`. 
```typescript import { PipelineBuilder, PipelineSnapshot } from "blazen"; const pipeline = new PipelineBuilder("ingest") .stage(stage) .onPersist(async (snapshot: PipelineSnapshot) => { await db.put(`pipeline:${snapshot.runId}`, snapshot.toJsonPretty()); }) .build(); ``` ### `.onPersistJson(callback)` Same as `onPersist`, but the callback receives the snapshot pre-serialized as a JSON string. Useful for backends that store opaque blobs (IndexedDB, Redis, S3). ```typescript const pipeline = new PipelineBuilder("ingest") .stage(stage) .onPersistJson(async (json: string) => { await idb.put("snapshots", json, runId); }) .build(); ``` The snapshot can later be replayed via `pipeline.resume(PipelineSnapshot.fromJson(json))`. If your `onPersist` (or `onPersistJson`) callback throws or returns a rejected promise, the rejection is wrapped as a `PersistError` (a `BlazenError` subclass) and propagated to the running pipeline, aborting it. Catch the error from `await handler.result()` and inspect with `instanceof PersistError` to distinguish persistence failures from stage failures. ```typescript import { PersistError } from "blazen"; try { await handler.result(); } catch (e) { if (e instanceof PersistError) { console.error("snapshot persistence failed:", e.message); } } ``` --- ## Workflow ### `new Workflow(name: string)` Create a new workflow instance. Default timeout is 300 seconds (5 minutes). ```typescript const wf = new Workflow("my-workflow"); ``` ### `.addStep(name: string, eventTypes: string[], handler: StepHandler)` Register a step that listens for one or more event types. ```typescript wf.addStep("process", ["MyEvent"], async (event, ctx) => { return { type: "blazen::StopEvent", result: { done: true } }; }); ``` ### `.setTimeout(seconds: number)` Set the maximum execution time for the workflow in seconds. Set to 0 or negative to disable. ```typescript wf.setTimeout(30); ``` ### `await wf.run(input: object): WorkflowResult` Run the workflow to completion with the given input. ```typescript const result = await wf.run({ prompt: "Hello" }); ``` ### `await wf.runStreaming(input: object, callback: (event) => void): WorkflowResult` Run the workflow with a streaming callback invoked for each event published via `ctx.writeEventToStream()`. ```typescript const result = await wf.runStreaming({ prompt: "Hello" }, (event) => { console.log("stream:", event); }); ``` ### `await wf.runWithHandler(input: object): WorkflowHandler` Run the workflow and return a handler for pause/resume and streaming control. ```typescript const handler = await wf.runWithHandler({ prompt: "Hello" }); ``` ### `await wf.resume(snapshotJson: string): WorkflowHandler` Resume a previously paused workflow from a snapshot JSON string. ```typescript const snapshot = fs.readFileSync("snapshot.json", "utf-8"); const handler = await wf.resume(snapshot); const result = await handler.result(); ``` --- ## WorkflowResult ```typescript interface WorkflowResult { type: string; // Event type of the final result (e.g. "blazen::StopEvent") data: object; // Result data extracted from the StopEvent's result field } ``` --- ## StepHandler ```typescript async (event: object, ctx: Context) => object | object[] | null ``` A step handler receives an event and a context. It can return: - A single event object to emit one event. - An array of event objects to fan-out multiple events. - `null` for side-effect-only steps that emit no events. --- ## WorkflowHandler Returned by `Workflow.runWithHandler()` and `Workflow.resume()`. 
Provides control over a running workflow. **Important:** `result()` consumes the handler internally -- you can only call it once. The other control methods (`pause`, `resumeInPlace`, `abort`, `respondToInput`, `snapshot`) borrow the handler and can be called multiple times. ### `await handler.result(): WorkflowResult` Await the final workflow result. ### `await handler.pause(): void` Signal the running workflow to pause. After pausing, use `snapshot()` to get a serializable snapshot, or `resumeInPlace()` to continue execution. ### `await handler.snapshot(): string` Get a serializable snapshot of the paused workflow as a JSON string. Save this to resume later with `Workflow.resume()`. ### `await handler.resumeInPlace(): void` Resume a paused workflow in place without creating a new handler. ### `await handler.streamEvents(callback: (event) => void): void` Subscribe to intermediate events published via `ctx.writeEventToStream()`. Must be called **before** `result()` or `pause()`. ```typescript const handler = await wf.runWithHandler({ prompt: "Hello" }); // Subscribe to stream events await handler.streamEvents((event) => console.log(event)); // Then await the result const result = await handler.result(); ``` --- ## Events Events are plain objects with a `type` field. ```typescript { type: "MyEvent", payload: "data" } ``` ### Start Event ```typescript { type: "blazen::StartEvent", ...input } ``` The workflow begins by emitting a `StartEvent` containing the input data. ### Stop Event ```typescript { type: "blazen::StopEvent", result: { ... } } ``` Returning a `StopEvent` from a step handler completes the workflow. --- ## Context Shared workflow context accessible by all steps. All methods are async. ### StateValue All context values conform to the `StateValue` type: ```typescript type StateValue = string | number | boolean | null | Buffer | StateValue[] | { [key: string]: StateValue }; ``` ### `await ctx.set(key: string, value: Exclude): void` Store a JSON-serializable value in the workflow context. Accepts strings, numbers, booleans, null, arrays, and nested objects. For binary data, use `ctx.setBytes()` instead. The legacy `ctx.set` / `ctx.get` shortcuts still work and route values through the same 4-tier dispatch. For new code, prefer the explicit `ctx.state` / `ctx.session` namespaces documented below. ### `await ctx.get(key: string): Promise` Retrieve a value from the workflow context. Returns `null` if not found. Returns data for **all** `StateValue` variants -- strings, numbers, booleans, arrays, objects, and `Buffer` (if the key was stored via `setBytes`). No data is silently dropped. ### `await ctx.setBytes(key: string, buffer: Buffer): void` Store raw binary data in the workflow context. Use this for explicit binary storage (e.g., MessagePack, protobuf, raw buffers). Binary data persists through pause/resume/checkpoint. ### `await ctx.getBytes(key: string): Buffer | null` Retrieve raw binary data from the workflow context. Returns `null` if not found. Note that `ctx.get()` also returns binary data now, so `getBytes` is mainly useful when you want to assert that a key holds binary content. ### `await ctx.runId(): string` Get the unique run UUID for the current workflow execution. ### `await ctx.sendEvent(event: object): void` Manually route an event into the workflow event bus. The event will be delivered to any step whose `eventTypes` list includes its type. ### `await ctx.writeEventToStream(event: object): void` Publish an event to the external broadcast stream. 
Consumers that subscribed via `runStreaming` or `handler.streamEvents()` will receive this event. Unlike `sendEvent`, this does **not** route the event through the internal step registry. ### `get state(): StateNamespace` Persistable workflow state. Survives `pause()` / `resume()`, checkpoints, and durable storage. See [StateNamespace](#statenamespace) below. ### `get session(): SessionNamespace` In-process-only values, excluded from snapshots. Use this for things that should not survive `pause()` / `resume()`. **JS object identity is NOT preserved on Node** -- see the [SessionNamespace](#sessionnamespace) caveat below. --- ## StateNamespace Namespace for persistable workflow state. Values stored via `state.set` / `state.setBytes` go into the underlying `ContextInner.state` map and survive snapshots, `pause()` / `resume()`, and checkpoint stores. ### `await state.set(key: string, value: Exclude): Promise` Store a JSON-serializable value under the given key. ### `await state.get(key: string): Promise` Retrieve a value previously stored under the given key. Returns `null` if not found. ### `await state.setBytes(key: string, data: Buffer): Promise` Store raw binary data under the given key. ### `await state.getBytes(key: string): Promise` Retrieve raw binary data previously stored under the given key. Returns `null` if not found. ```typescript workflow.addStep("step", ["blazen::StartEvent"], async (event, ctx) => { await ctx.state.set("counter", 5); const count = await ctx.state.get("counter"); return { type: "blazen::StopEvent", result: { count } }; }); ``` --- ## SessionNamespace Namespace for in-process-only workflow values. Values stored via `session.set` are kept in the `ContextInner.objects` side-channel and are **excluded** from snapshots. Use this for state that should not survive a `pause()` / `resume()` round-trip (request IDs, rate-limit counters, ephemeral caches, ...). **Important -- napi-rs identity caveat:** **JS object identity is NOT preserved** through `ctx.session` on the Node bindings. Values are routed through `serde_json::Value` because napi-rs's `Reference` is `!Send` -- its `Drop` must run on the v8 main thread, and tokio worker threads cannot safely cross the napi boundary with live JS object references. `await ctx.session.get(key)` returns a plain object equal to the one you passed in, **not** the same object instance. For true JS class identity preservation, use the Python or WASM bindings, or keep the work inside a single Rust step. Full identity through events is tracked as a follow-up architectural refactor. The `session` namespace is still functionally distinct from `state`: session values are excluded from snapshots, state values are not. ### `await session.set(key: string, value: unknown): Promise` Store a JSON-serializable value under the given key. The value is excluded from snapshots. ### `await session.get(key: string): Promise` Retrieve a value previously stored under the given key. Returns `null` if the key does not exist. ### `await session.has(key: string): Promise` Check whether a value exists under the given key. ### `await session.remove(key: string): Promise` Remove the value stored under the given key. 
```typescript workflow.addStep("step", ["blazen::StartEvent"], async (event, ctx) => { await ctx.session.set("reqId", "abc123"); if (await ctx.session.has("reqId")) { const id = await ctx.session.get("reqId"); console.log("request id:", id); } return { type: "blazen::StopEvent", result: {} }; }); ``` `state` and `session` use independent keyspaces -- the same key can exist in both namespaces without colliding: ```typescript await ctx.state.set("k", "state-value"); await ctx.session.set("k", "session-value"); // Both are accessible; they don't collide. ``` --- ## BlazenState Base class for typed state objects with per-field context storage. Extend this class to define structured workflow state that is automatically serialized and deserialized field by field. ### `static meta?: BlazenStateMeta` Optional static metadata that controls how fields are stored and which fields are transient. ```typescript interface BlazenStateMeta { transient?: Set | string[]; } ``` | Field | Type | Description | |---|---|---| | `transient` | `Set \| string[]` | Field names to exclude from persistence. These fields are not saved by `saveTo()` and will not be present after `loadFrom()`. | ### `restore?(): void | Promise` Optional instance method called by `loadFrom()` after all persisted fields have been restored. Use this to recreate transient fields (caches, connections, derived data) that were excluded from persistence. ### `await state.saveTo(ctx: Context, key: string): Promise` Persist every non-transient field of the state instance into the workflow context. Each field is stored individually under a namespaced key derived from `key`. ### `static loadFrom(ctx: Context, key: string): Promise` Restore a state instance from the workflow context. Reads each persisted field, constructs a new instance, and calls `restore()` if defined. ```typescript const state = await AgentState.loadFrom(ctx, "state"); ``` --- ## Compute Request Types Typed request interfaces for compute operations (image generation, video, speech, music, transcription, 3D models). ### ImageRequest ```typescript interface ImageRequest { prompt: string; // Text prompt describing the desired image negativePrompt?: string; // Things to avoid in the image width?: number; // Desired image width in pixels height?: number; // Desired image height in pixels numImages?: number; // Number of images to generate model?: string; // Model override (provider-specific) parameters?: object; // Additional provider-specific parameters } ``` ### UpscaleRequest ```typescript interface UpscaleRequest { imageUrl: string; // URL of the image to upscale scale: number; // Scale factor (e.g. 2.0, 4.0) model?: string; parameters?: object; } ``` ### VideoRequest ```typescript interface VideoRequest { prompt: string; // Text prompt describing the desired video imageUrl?: string; // Source image URL for image-to-video durationSeconds?: number; // Desired duration in seconds negativePrompt?: string; // Things to avoid width?: number; // Video width in pixels height?: number; // Video height in pixels model?: string; parameters?: object; } ``` ### SpeechRequest ```typescript interface SpeechRequest { text: string; // Text to synthesize into speech voice?: string; // Voice identifier (provider-specific) voiceUrl?: string; // Reference voice sample URL for cloning language?: string; // Language code (e.g. 
"en", "fr", "ja") speed?: number; // Speech speed multiplier (1.0 = normal) model?: string; parameters?: object; } ``` ### MusicRequest ```typescript interface MusicRequest { prompt: string; // Text prompt describing the desired audio durationSeconds?: number; // Desired duration in seconds model?: string; parameters?: object; } ``` ### TranscriptionRequest ```typescript interface TranscriptionRequest { audioUrl: string; // URL of the audio file to transcribe language?: string; // Language hint (e.g. "en", "fr") diarize?: boolean; // Whether to perform speaker diarization model?: string; parameters?: object; } ``` ### ThreeDRequest ```typescript interface ThreeDRequest { prompt?: string; // Text prompt describing the desired 3D model imageUrl?: string; // Source image URL for image-to-3D format?: string; // Output format (e.g. "glb", "obj", "usdz") model?: string; parameters?: object; } ``` --- ## Compute Result Types ### ImageResult ```typescript interface ImageResult { images: GeneratedImage[]; // Generated or upscaled images timing?: ComputeTiming; // Request timing breakdown cost?: number; // Cost in USD metadata: object; // Provider-specific metadata } ``` ### VideoResult ```typescript interface VideoResult { videos: GeneratedVideo[]; timing?: ComputeTiming; cost?: number; metadata: object; } ``` ### AudioResult ```typescript interface AudioResult { audio: GeneratedAudio[]; timing?: ComputeTiming; cost?: number; metadata: object; } ``` ### TranscriptionResult ```typescript interface TranscriptionResult { text: string; // Full transcribed text segments: TranscriptionSegment[]; // Time-aligned segments language?: string; // Detected or specified language code timing?: ComputeTiming; cost?: number; metadata: object; } ``` ### TranscriptionSegment ```typescript interface TranscriptionSegment { text: string; // Transcribed text for this segment start: number; // Start time in seconds end: number; // End time in seconds speaker?: string; // Speaker label (if diarization enabled) } ``` ### ThreeDResult ```typescript interface ThreeDResult { models: Generated3DModel[]; timing?: ComputeTiming; cost?: number; metadata: object; } ``` --- ## Compute Job Types Low-level types for generic compute jobs. ### ComputeRequest ```typescript interface ComputeRequest { model: string; // Model/endpoint to run (e.g. "fal-ai/flux/dev") input: object; // Input parameters (model-specific) webhook?: string; // Webhook URL for async completion notification } ``` ### JobHandle ```typescript interface JobHandle { id: string; // Provider-assigned job identifier provider: string; // Provider name (e.g. "fal", "replicate", "runpod") model: string; // Model/endpoint that was invoked submittedAt: string; // ISO 8601 timestamp } ``` ### JobStatus String enum for compute job status. 
```typescript JobStatus.Queued // "queued" JobStatus.Running // "running" JobStatus.Completed // "completed" JobStatus.Failed // "failed" JobStatus.Cancelled // "cancelled" ``` ### ComputeResult ```typescript interface ComputeResult { job?: JobHandle; // Job handle that produced this result output: object; // Output data (model-specific) timing?: ComputeTiming; // Request timing breakdown cost?: number; // Cost in USD metadata: object; // Raw provider-specific metadata } ``` ### ComputeTiming ```typescript interface ComputeTiming { queueMs?: number; // Time spent waiting in queue (ms) executionMs?: number; // Time spent executing (ms) totalMs?: number; // Total wall-clock time (ms) } ``` --- ## Media Output Types ### MediaOutput A single piece of generated media content. ```typescript interface MediaOutput { url?: string; // URL where the media can be downloaded base64?: string; // Base64-encoded media data rawContent?: string; // Raw text content (SVG, OBJ, GLTF JSON) mediaType: string; // MIME type (e.g. "image/png", "video/mp4") fileSize?: number; // File size in bytes metadata: object; // Provider-specific metadata } ``` ### GeneratedImage ```typescript interface GeneratedImage { media: MediaOutput; width?: number; // Image width in pixels height?: number; // Image height in pixels } ``` ### GeneratedVideo ```typescript interface GeneratedVideo { media: MediaOutput; width?: number; // Video width in pixels height?: number; // Video height in pixels durationSeconds?: number; // Duration in seconds fps?: number; // Frames per second } ``` ### GeneratedAudio ```typescript interface GeneratedAudio { media: MediaOutput; durationSeconds?: number; // Duration in seconds sampleRate?: number; // Sample rate in Hz channels?: number; // Number of audio channels } ``` ### Generated3DModel ```typescript interface Generated3DModel { media: MediaOutput; vertexCount?: number; // Total vertex count faceCount?: number; // Total face/triangle count hasTextures: boolean; // Whether the model includes textures hasAnimations: boolean; // Whether the model includes animations } ``` --- ## EmbeddingModel Generate vector embeddings from text. Created via static factory methods. Keys are read from environment variables (`OPENAI_API_KEY`, `TOGETHER_API_KEY`, etc.) when `options` is omitted, or can be passed explicitly via `{ apiKey: "..." }`. ```typescript import { EmbeddingModel } from "blazen"; const model = EmbeddingModel.openai(); const together = EmbeddingModel.together(); const cohere = EmbeddingModel.cohere({ apiKey: "co-..." }); const fireworks = EmbeddingModel.fireworks(); ``` ### Provider Factory Methods | Method | Default Model | Default Dimensions | |---|---|---| | `EmbeddingModel.openai(options?)` | `text-embedding-3-small` | 1536 | | `EmbeddingModel.together(options?)` | `togethercomputer/m2-bert-80M-8k-retrieval` | 768 | | `EmbeddingModel.cohere(options?)` | `embed-v4.0` | 1024 | | `EmbeddingModel.fireworks(options?)` | `nomic-ai/nomic-embed-text-v1.5` | 768 | ### Properties | Property | Type | Description | |---|---|---| | `.modelId` | `string` | The model identifier. | | `.dimensions` | `number` | Output vector dimensionality. | ### `await model.embed(texts: string[]): EmbeddingResponse` Embed one or more texts, returning one vector per input. ```typescript const response = await model.embed(["Hello", "World"]); console.log(response.embeddings.length); // 2 console.log(response.embeddings[0].length); // 1536 ``` --- ## EmbeddingResponse Returned by `EmbeddingModel.embed()`. 
```typescript interface EmbeddingResponse { embeddings: number[][]; // One vector per input text model: string; // Model that produced the embeddings usage?: TokenUsage; // Token usage statistics cost?: number; // Estimated cost in USD timing?: RequestTiming; // Request timing breakdown metadata: object; // Provider-specific metadata } ``` --- ## Token Estimation Lightweight token counting functions. Uses a heuristic (~3.5 characters per token) suitable for budget checks without external data files. ### `estimateTokens(text: string, contextSize?: number): number` Estimate token count for a text string. ```typescript import { estimateTokens } from "blazen"; const count = estimateTokens("Hello, world!"); // 4 ``` ### `countMessageTokens(messages: ChatMessage[], contextSize?: number): number` Estimate total tokens for an array of chat messages, including per-message overhead. ```typescript import { countMessageTokens, ChatMessage } from "blazen"; const count = countMessageTokens([ ChatMessage.system("You are helpful."), ChatMessage.user("Hello!"), ]); ``` `contextSize` defaults to `128000` if omitted. --- ## Subclassable Providers `CompletionModel`, `EmbeddingModel`, and `Transcription` can be subclassed to implement custom providers. Override the relevant methods and the framework will dispatch to your implementation. ### CompletionModel ```typescript import { CompletionModel, ChatMessage } from "blazen"; class MyLLM extends CompletionModel { constructor() { super({ modelId: "my-llm" }); } async complete(messages: ChatMessage[]) { // Your inference logic here return { content: "Hello from my custom model" }; } async stream(messages: ChatMessage[], onChunk: (chunk: any) => void) { onChunk({ delta: "Hello", finishReason: null, toolCalls: [] }); onChunk({ delta: null, finishReason: "stop", toolCalls: [] }); } } const model = new MyLLM(); const response = await model.complete([ChatMessage.user("Hi")]); ``` ### EmbeddingModel ```typescript import { EmbeddingModel } from "blazen"; class MyEmbedder extends EmbeddingModel { constructor() { super({ modelId: "my-embedder", dimensions: 128 }); } async embed(texts: string[]) { return { embeddings: texts.map(() => new Array(128).fill(0.1)), model: "my-embedder", }; } } ``` ### Transcription ```typescript import { Transcription } from "blazen"; class MyTranscriber extends Transcription { constructor() { super({ providerId: "my-stt" }); } async transcribe(request: any) { return { text: "transcribed text", segments: [] }; } } ``` --- ## Per-Capability Provider Classes Seven provider base classes let you implement a single compute capability without dealing with the full `ComputeProvider` interface. Subclass and override the relevant methods. 
| Class | Methods to Override | Rust Trait |
|---|---|---|
| `TTSProvider` | `textToSpeech(request)` | `AudioGeneration` |
| `MusicProvider` | `generateMusic(request)`, `generateSfx(request)` | `AudioGeneration` |
| `ImageProvider` | `generateImage(request)`, `upscaleImage(request)` | `ImageGeneration` |
| `VideoProvider` | `textToVideo(request)`, `imageToVideo(request)` | `VideoGeneration` |
| `ThreeDProvider` | `generate3d(request)` | `ThreeDGeneration` |
| `BackgroundRemovalProvider` | `removeBackground(request)` | `BackgroundRemoval` |
| `VoiceProvider` | `cloneVoice(request)`, `listVoices()`, `deleteVoice(voice)` | `VoiceCloning` |

### Constructor

All provider classes share the same constructor config:

```typescript
new TTSProvider({
  providerId: string,
  baseUrl?: string,
  pricing?: ModelPricing,
  vramEstimateBytes?: number,
})
```

| Field | Type | Description |
|---|---|---|
| `providerId` | `string` | Identifier for the provider instance. |
| `baseUrl` | `string?` | Optional base URL for the provider API. |
| `pricing` | `ModelPricing?` | Optional pricing info for cost tracking. |
| `vramEstimateBytes` | `number?` | Optional VRAM estimate for `ModelManager` integration. |

### Example

```typescript
import { TTSProvider } from "blazen";

class ElevenLabsTTS extends TTSProvider {
  private apiKey: string;

  constructor(apiKey: string) {
    super({ providerId: "elevenlabs" });
    this.apiKey = apiKey;
  }

  async textToSpeech(request: any) {
    // Call the ElevenLabs API with this.apiKey here; audioBuffer stands in
    // for the synthesized audio bytes returned by that call.
    const audioBuffer = new Uint8Array();
    return { audio: audioBuffer, format: "mp3" };
  }
}

const tts = new ElevenLabsTTS("sk-...");
const result = await tts.textToSpeech({ text: "Hello world", voice: "alice" });
```

---

## MemoryBackend

Base class for custom memory storage backends. Subclass to implement persistence backed by Postgres, SQLite, DynamoDB, or any other store.

```typescript
import { MemoryBackend } from "blazen";

class PostgresBackend extends MemoryBackend {
  async put(entry: any): Promise<void> {
    // Insert or update entry in Postgres
  }
  async get(id: string): Promise<any> {
    // Retrieve entry by id
  }
  async delete(id: string): Promise<boolean> {
    // Delete entry, return true if it existed
  }
  async list(): Promise<any[]> {
    // Return all entries
  }
  async len(): Promise<number> {
    // Return count of entries
  }
  async searchByBands(bands: any, limit: number): Promise<any[]> {
    // Return candidates sharing LSH bands with the query
  }
}
```

### Methods to Override

| Method | Signature | Description |
|---|---|---|
| `put` | `async put(entry): Promise<void>` | Insert or update a stored entry. |
| `get` | `async get(id: string): Promise<any>` | Retrieve a stored entry by id. |
| `delete` | `async delete(id: string): Promise<boolean>` | Delete an entry by id. Returns `true` if it existed. |
| `list` | `async list(): Promise<any[]>` | Return all stored entries. |
| `len` | `async len(): Promise<number>` | Return the number of stored entries. |
| `searchByBands` | `async searchByBands(bands, limit): Promise<any[]>` | Return candidate entries sharing at least one LSH band. |

---

## ProgressCallback

Subclassable base for download progress callbacks. Pass an instance to `ModelCache.download()` (and other download-capable APIs) to receive byte-count progress updates. `onProgress` takes `bigint` byte counts so multi-gigabyte downloads keep full precision. `total` is `null` when the server does not send `Content-Length`.
```typescript import { ModelCache, ProgressCallback } from "blazen"; class MyProgress extends ProgressCallback { onProgress(downloaded: bigint, total?: bigint | null): void { if (total != null) { const pct = Number((downloaded * 100n) / total); console.log(`${pct}%`); } else { console.log(`${downloaded} bytes`); } } } const cache = ModelCache.create(); await cache.download("bert-base-uncased", "config.json", new MyProgress()); ``` The base `onProgress` always throws — overriding it is mandatory. `super()` must be called from the subclass constructor. `ProgressCallback` instances are accepted anywhere the SDK exposes a download hook — `ModelCache.download()`, the local-inference `Provider.create()` paths that pull weights from HuggingFace, and the `ProgressCallback`-aware variants of dataset loaders. Pass the same `ProgressCallback` instance to multiple downloads to centralise reporting (e.g. for a TUI progress bar). --- ## ModelManager VRAM budget-aware model manager with LRU eviction. Tracks registered local models and their estimated VRAM footprint. When loading a model that would exceed the budget, the least-recently-used loaded model is unloaded first. ### Constructor ```typescript import { ModelManager } from "blazen"; const manager = new ModelManager({ budgetGb: 24 }); // or — bigint is required so budgets above 4 GiB don't silently truncate const manager = new ModelManager({ budgetBytes: 24n * 1_073_741_824n }); ``` | Field | Type | Description | |---|---|---| | `budgetGb` | `number?` | VRAM budget in gigabytes. Provide exactly one of `budgetGb` or `budgetBytes`. | | `budgetBytes` | `bigint?` | VRAM budget in bytes. Use `bigint` literals (e.g. `24n * 1_073_741_824n`) so budgets above 4 GiB don't truncate. | ### Methods | Method | Signature | Description | |---|---|---| | `register` | `await manager.register(id, model, vramEstimate?: bigint)` | Register a model with its estimated VRAM footprint (in bytes, as a `bigint`). Starts unloaded. | | `load` | `await manager.load(id)` | Load a model, evicting LRU models if needed. | | `unload` | `await manager.unload(id)` | Unload a model and free its VRAM. | | `isLoaded` | `await manager.isLoaded(id): boolean` | Check if a model is currently loaded. | | `ensureLoaded` | `await manager.ensureLoaded(id)` | Alias for `load()`. | | `usedBytes` | `await manager.usedBytes(): bigint` | Total VRAM currently used by loaded models, in bytes. | | `availableBytes` | `await manager.availableBytes(): bigint` | Available VRAM within the budget, in bytes. | | `status` | `await manager.status(): ModelStatus[]` | Status of all registered models. | ### ModelStatus ```typescript interface ModelStatus { id: string; // Model identifier loaded: boolean; // Whether the model is currently loaded vramEstimate: bigint; // Estimated VRAM footprint in bytes } ``` > **Why `bigint`?** The byte-budget surface (`budgetBytes`, `register`'s `vramEstimate`, `usedBytes()`, `availableBytes()`, `ModelStatus.vramEstimate`) used to be `number` (`u32` on the Rust side), which capped budgets at ~4 GiB and silently truncated larger inputs — a real footgun for 7B+ local models that need 8 GiB+ of VRAM. Pass values as `BigInt` literals (`8n * 1_073_741_824n`) or via `BigInt(8 * 1024 ** 3)`. The `budgetGb: number` constructor path is unchanged for users who prefer plain numbers and gigabyte granularity. --- ## ModelRegistry An ABC for model catalogs. 
Subclass it to advertise the models your code knows about — used by capability-discovery code, dynamic model menus, and parity with the Rust `blazen_llm::traits::ModelRegistry` trait.

```typescript
export declare class ModelRegistry {
  listModels(): Promise<ModelInfo[]>;
  getModel(modelId: string): Promise<ModelInfo | null>;
}
```

| Method | Signature | Description |
|---|---|---|
| `listModels` | `async listModels(): Promise<ModelInfo[]>` | Return every model the registry advertises. |
| `getModel` | `async getModel(modelId: string): Promise<ModelInfo \| null>` | Look up a single model by id, or return `null` if unknown. |

The base implementations of both methods throw — overriding them is mandatory. `super()` must be called from the subclass constructor.

```typescript
import { ModelRegistry } from "blazen";
import type { ModelInfo } from "blazen";

class MyRegistry extends ModelRegistry {
  async listModels(): Promise<ModelInfo[]> {
    return [
      { id: "gpt-4o", provider: "openai" /* ...other ModelInfo fields */ },
      { id: "claude-sonnet-4", provider: "anthropic" /* ... */ },
    ];
  }

  async getModel(modelId: string): Promise<ModelInfo | null> {
    const all = await this.listModels();
    return all.find((m) => m.id === modelId) ?? null;
  }
}
```

See the `ModelInfo` reference for the full set of fields each entry must populate (id, provider, capabilities, context window, pricing, etc.).

Mirrors `PyModelRegistry` (Python) and `WasmModelRegistry` exposed as `ModelRegistry` (WASM SDK) — subclassing `ModelRegistry` in any binding produces the same Rust-side `blazen_llm::traits::ModelRegistry` implementation.

---

## ModelPricing and Pricing Functions

### ModelPricing

Pricing metadata for cost tracking.

```typescript
interface ModelPricing {
  inputPerMillion?: number;   // Cost per million input tokens (USD)
  outputPerMillion?: number;  // Cost per million output tokens (USD)
  perImage?: number;          // Cost per generated image (USD)
  perSecond?: number;         // Cost per second of compute (USD)
}
```

### registerPricing()

Register custom pricing for a model. Overrides any existing pricing for the same model ID.

```typescript
import { registerPricing } from "blazen";

registerPricing("my-model", { inputPerMillion: 1.0, outputPerMillion: 2.0 });
```

### lookupPricing()

Look up pricing for a model by ID. Returns `null` if the model is unknown.

```typescript
import { lookupPricing } from "blazen";

const pricing = lookupPricing("gpt-4o");
if (pricing) {
  console.log(`Input: $${pricing.inputPerMillion}/M tokens`);
}
```

---

## LocalModel Methods on CompletionModel

`CompletionModel` instances backed by local inference (not remote APIs) support explicit load/unload lifecycle management.

| Method | Signature | Description |
|---|---|---|
| `load` | `await model.load(): void` | Load the model into memory/VRAM. Idempotent. |
| `unload` | `await model.unload(): void` | Free the model's memory/VRAM. Idempotent. |
| `isLoaded` | `await model.isLoaded(): boolean` | Whether the model is currently loaded. |
| `vramBytes` | `await model.vramBytes(): number \| null` | Approximate memory footprint in bytes, or `null` if unknown. |

```typescript
// For a local model:
await model.load();
console.log(await model.isLoaded()); // true
console.log(await model.vramBytes()); // e.g. 4000000000
await model.unload();
```

---

## Error Handling

Errors thrown across the FFI boundary are surfaced as instances of typed `BlazenError` subclasses. Every error class extends the base `BlazenError`, which extends the standard JavaScript `Error`, so existing `instanceof Error` checks keep working while gaining structural classification.
The `BlazenError` hierarchy is what makes typed error routing possible — any caught value can be matched against `BlazenError` (catch-all for anything from the SDK), against the direct subclass (broad category like `RateLimitError`), or against a leaf class (specific failure like `LlamaCppModelLoadError`). Use whichever level of specificity your handler needs.

### Root and direct subclasses

`BlazenError` is the root. The following 19 classes extend it directly:

| Class | When it's thrown |
|---|---|
| `AuthError` | Invalid or expired API key. |
| `RateLimitError` | Provider rate limit reached. |
| `TimeoutError` | Request exceeded its deadline. |
| `ValidationError` | Invalid request parameters or option set. |
| `ContentPolicyError` | Content moderated by the provider. |
| `ProviderError` | Provider-specific error (HTTP status / endpoint detail attached — see below). |
| `UnsupportedError` | Feature not supported by the chosen provider. |
| `ComputeError` | Compute job failure (cancelled, quota exhausted, runtime failure). |
| `MediaError` | Invalid or oversized media content. |
| `PeerEncodeError` | Failed to encode a peer envelope. |
| `PeerTransportError` | Network failure between peers. |
| `PeerEnvelopeVersionError` | Peer protocol version mismatch. |
| `PeerWorkflowError` | Remote peer workflow failed. |
| `PeerTlsError` | TLS handshake or cert validation failed for a peer connection. |
| `PeerUnknownStepError` | Peer requested an unknown workflow step. |
| `PersistError` | Snapshot persistence backend failure. |
| `PromptError` | Prompt registry / template failure. |
| `MemoryError` | Memory store / embedder failure. |
| `CacheError` | Model cache / download failure. |

### `ProviderError` structured fields

`ProviderError` carries structured context in addition to the message string. All fields are nullable.

| Field | Type | Description |
|---|---|---|
| `.provider` | `string \| null` | Provider name (e.g. `"openai"`, `"anthropic"`). |
| `.status` | `number \| null` | HTTP status code, when the call reached the provider. |
| `.endpoint` | `string \| null` | The endpoint that returned the error. |
| `.requestId` | `string \| null` | Provider-assigned request id (use this when filing support tickets). |
| `.detail` | `string \| null` | Provider-supplied error detail / body. |
| `.retryAfterMs` | `number \| null` | Suggested back-off when the provider returned a `Retry-After` hint. |

### Per-backend `ProviderError` subclasses

Each local-inference and provider-side backend has its own `ProviderError` subclass with narrower variants. Use `instanceof` to route to backend-specific handling.
| Class | Backend | Representative narrower subclasses | |---|---|---| | `LlamaCppError` | llama.cpp | `LlamaCppInvalidOptionsError`, `LlamaCppModelLoadError`, `LlamaCppInferenceError`, `LlamaCppEngineNotAvailableError` | | `MistralRsError` | mistral.rs | `MistralRsInvalidOptionsError`, `MistralRsInitError`, `MistralRsInferenceError`, `MistralRsEngineNotAvailableError` | | `CandleLlmError` | candle (LLM) | `CandleLlmInvalidOptionsError`, `CandleLlmModelLoadError`, `CandleLlmInferenceError`, `CandleLlmEngineNotAvailableError` | | `CandleEmbedError` | candle (embeddings) | `CandleEmbedModelLoadError`, `CandleEmbedEmbeddingError`, `CandleEmbedEngineNotAvailableError`, `CandleEmbedTaskPanickedError` | | `WhisperError` | whisper.cpp | `WhisperModelLoadError`, `WhisperTranscriptionError`, `WhisperEngineNotAvailableError`, `WhisperIoError` | | `PiperError` | Piper TTS | `PiperModelLoadError`, `PiperSynthesisError`, `PiperEngineNotAvailableError` | | `DiffusionError` | diffusion image gen | `DiffusionModelLoadError`, `DiffusionGenerationError` | | `FastEmbedError` | fastembed | `EmbedUnknownModelError`, `EmbedInitError`, `EmbedEmbedError`, `EmbedMutexPoisonedError`, `EmbedTaskPanickedError` | | `TractError` | tract ONNX runtime | (no narrower variants) | `PromptError` similarly has narrower variants like `PromptMissingVariableError`, `PromptNotFoundError`, `PromptVersionNotFoundError`, `PromptIoError`, `PromptYamlError`, `PromptJsonError`, `PromptValidationError`. `MemoryError` exposes `MemoryNoEmbedderError`, `MemoryEmbeddingError`, `MemoryNotFoundError`, `MemorySerializationError`, `MemoryIoError`, `MemoryBackendError`. `CacheError` exposes `DownloadError`, `CacheDirError`, `IoError`. There are around 80 narrower subclasses in total — every public Rust error variant gets its own JS class. ### `enrichError(err: unknown): unknown` Plain `Error` instances thrown across the FFI boundary lose their original Rust type. Pass any caught value through `enrichError` to re-classify it into the proper `BlazenError` subclass before further inspection. It's a no-op when the error is already typed. ```typescript import { enrichError, BlazenError } from "blazen"; try { await model.complete([ChatMessage.user("Hi")]); } catch (raw) { const err = enrichError(raw); if (err instanceof BlazenError) { // typed handling } throw err; } ``` ### Example: routing on the typed hierarchy ```typescript import { RateLimitError, AuthError, TimeoutError, ProviderError, LlamaCppEngineNotAvailableError, } from "blazen"; try { const response = await model.complete([ChatMessage.user("Hello")]); } catch (e) { if (e instanceof RateLimitError) { await sleep(e instanceof ProviderError && e.retryAfterMs ? e.retryAfterMs : 1000); } else if (e instanceof AuthError) { refreshApiKey(); } else if (e instanceof TimeoutError) { // safe to retry once } else if (e instanceof LlamaCppEngineNotAvailableError) { console.error("Build was compiled without llama.cpp support"); } else if (e instanceof ProviderError) { console.error(`[${e.provider} ${e.status ?? "?"}] ${e.detail ?? e.message}`); } else { throw e; } } ``` `RateLimitError`, `TimeoutError`, transient `ProviderError`s with `status >= 500`, and most `PeerTransportError`s are safe to retry. `AuthError`, `ValidationError`, `ContentPolicyError`, and `UnsupportedError` are not. 
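That classification lends itself to a small retry wrapper. The sketch below is illustrative rather than part of the SDK: `isRetryable` and `callWithBackoff` are hypothetical helper names and the backoff schedule is arbitrary; only the error classes, `enrichError`, `.status`, and `.retryAfterMs` come from the reference above.

```typescript
import {
  enrichError,
  RateLimitError,
  TimeoutError,
  ProviderError,
  PeerTransportError,
} from "blazen";

// Hypothetical helper: apply the retryability rules described above.
function isRetryable(err: unknown): boolean {
  if (err instanceof RateLimitError || err instanceof TimeoutError) return true;
  if (err instanceof ProviderError) return (err.status ?? 0) >= 500; // transient server-side failures
  if (err instanceof PeerTransportError) return true; // most peer transport failures are transient
  return false; // AuthError, ValidationError, ContentPolicyError, UnsupportedError, etc.
}

// Hypothetical wrapper: retry with exponential backoff, honoring a
// ProviderError.retryAfterMs hint when the provider supplies one.
async function callWithBackoff<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (raw) {
      const err = enrichError(raw);
      if (!isRetryable(err)) throw err;
      lastErr = err;
      const hint = err instanceof ProviderError ? err.retryAfterMs : null;
      await new Promise((r) => setTimeout(r, hint ?? 1000 * 2 ** i));
    }
  }
  throw lastErr;
}
```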
### When to call `enrichError`

The Rust core always emits typed `BlazenError` subclasses, but errors that originate in JS callbacks (tool handlers, persist callbacks, custom providers) and bubble back through Rust come out as plain `Error` instances. If you want uniform `instanceof BlazenError` matching everywhere, run every caught value through `enrichError` at the catch site:

```typescript
import { enrichError, BlazenError, ProviderError } from "blazen";

async function safeCall<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (raw) {
    const err = enrichError(raw);
    if (err instanceof ProviderError && err.retryAfterMs) {
      await new Promise(r => setTimeout(r, err.retryAfterMs!));
    }
    throw err;
  }
}
```

`enrichError` is idempotent — passing it an already-typed `BlazenError` returns the same value unchanged, so it's safe to layer.

---

## Local Inference Types

Local backends (`MistralRsProvider`, `LlamaCppProvider`, `CandleLlmProvider`) expose typed input/output classes for direct use without going through the generic `CompletionModel` surface. Two parallel families exist — un-prefixed `*` for mistral.rs (the canonical surface) and `LlamaCpp*` for llama.cpp — plus a single `CandleInferenceResult` for the candle backend.

### mistral.rs (canonical, un-prefixed)

| Class / enum | Purpose |
|---|---|
| `ChatMessageInput` | Inference-side chat message. Constructor: `new ChatMessageInput(role, text, images?)`; static `ChatMessageInput.fromText(role, text)`. Getters: `.role`, `.text`, `.images`, `.hasImages`. |
| `ChatRole` | Const enum: `System`, `User`, `Assistant`, `Tool`. |
| `InferenceImage` | Image attachment for vision-capable models. Static factories: `fromBytes(buf)`, `fromPath(path)`, `fromSource(src)`. |
| `InferenceImageSource` | Tagged-union source. Static factories: `bytes(buf)`, `path(p)`. Getters: `.kind` (`"bytes"` or `"path"`), `.data`, `.filePath`. |
| `InferenceResult` | Non-streaming result. Getters: `.content`, `.reasoningContent`, `.toolCalls`, `.finishReason`, `.model`, `.usage`. |
| `InferenceChunk` | Streaming chunk. Getters: `.delta`, `.reasoningDelta`, `.toolCalls`, `.finishReason`. |
| `InferenceChunkStream` | Async chunk source. Pull with `await stream.next()`; returns `null` when exhausted. |
| `InferenceToolCall` | Tool call requested by the model. Constructor `new InferenceToolCall(id, name, arguments)`; getters `.id`, `.name`, `.arguments` (JSON string). |
| `InferenceUsage` | Token usage. Getters: `.promptTokens`, `.completionTokens`, `.totalTokens`, `.totalTimeSec`. |

```typescript
import { ChatMessageInput, ChatRole, InferenceImage, InferenceChunkStream, MistralRsProvider } from "blazen";

const provider = await MistralRsProvider.create({ modelId: "..." });

const stream: InferenceChunkStream = await provider.inferStream([
  ChatMessageInput.fromText(ChatRole.User, "Describe this image"),
]);

for (let chunk = await stream.next(); chunk !== null; chunk = await stream.next()) {
  process.stdout.write(chunk.delta ?? "");
}
```

`InferenceChunkStream` is single-pass — once you've reached the terminating `null`, the stream is exhausted. Errors raised from inside `InferenceChunkStream.next()` are typed `BlazenError` subclasses (`MistralRsInferenceError`, etc.) so they can be matched alongside the rest of the error hierarchy.

### llama.cpp (`LlamaCpp` prefix)

The llama.cpp surface mirrors the mistral.rs one with a narrower feature set (no reasoning content, no images on the message input itself).
| Class / enum | Purpose | |---|---| | `LlamaCppChatMessageInput` | Constructor: `new LlamaCppChatMessageInput(role, text)`. Getters: `.role`, `.text`. | | `LlamaCppChatRole` | Const enum: `System`, `User`, `Assistant`, `Tool` (capitalised — distinct from `ChatRole`). | | `LlamaCppInferenceResult` | Non-streaming result. Getters: `.content`, `.finishReason`, `.model`, `.usage`. | | `LlamaCppInferenceChunk` | Streaming chunk. Getters: `.delta`, `.finishReason`. | | `LlamaCppInferenceChunkStream` | Async chunk source. Same `await stream.next()` pattern. | | `LlamaCppInferenceUsage` | Getters: `.promptTokens`, `.completionTokens`, `.totalTokens`, `.totalTimeSec`. | ```typescript import { LlamaCppChatMessageInput, LlamaCppChatRole, LlamaCppInferenceChunkStream, LlamaCppProvider, } from "blazen"; const provider = await LlamaCppProvider.create({ modelPath: "/models/llama.gguf" }); const stream: LlamaCppInferenceChunkStream = await provider.inferStream([ new LlamaCppChatMessageInput(LlamaCppChatRole.User, "What is 2+2?"), ]); for (let chunk = await stream.next(); chunk !== null; chunk = await stream.next()) { process.stdout.write(chunk.delta ?? ""); } ``` Like the mistral.rs `InferenceChunkStream`, `LlamaCppInferenceChunkStream` is single-pass; mid-stream failures throw a typed `LlamaCppInferenceError`. ### candle The candle backend exposes a single non-streaming result class. | Class | Purpose | |---|---| | `CandleInferenceResult` | Constructor: `new CandleInferenceResult(content, promptTokens, completionTokens, totalTimeSecs)`. Getters: `.content`, `.promptTokens`, `.completionTokens`, `.totalTimeSecs`. | The candle backend has no streaming counterpart to `InferenceChunkStream` / `LlamaCppInferenceChunkStream` — pull `CandleInferenceResult` once per call. If you need token-by-token streaming on candle, swap the provider to mistral.rs or llama.cpp. ### Errors raised from local inference All three families propagate errors as typed `BlazenError` subclasses. The mapping is documented in the [Error Handling](#error-handling) section above. As elsewhere, run callbacks through `enrichError` to re-classify any plain `Error` that bubbles back through Rust from JS-side code (custom samplers, custom token decoders, etc.). --- ## Telemetry OpenTelemetry-compatible tracing flows through the standard `tracing` subscriber. Blazen ships an optional Langfuse exporter that ships span batches to the Langfuse ingestion API. This is gated by the `langfuse` Cargo feature on the underlying crate, so it's only available in builds that opted in at compile time. ### `LangfuseConfig` ```typescript import { LangfuseConfig } from "blazen"; const config = new LangfuseConfig( process.env.LANGFUSE_PUBLIC_KEY!, process.env.LANGFUSE_SECRET_KEY!, "https://cloud.langfuse.com", // host (optional, defaults to cloud) 100, // batchSize (optional, default 100) 5000, // flushIntervalMs (optional, default 5000) ); ``` | Property | Type | Description | |---|---|---| | `.publicKey` | `string` | The Langfuse public API key. | | `.secretKey` | `string` | The Langfuse secret API key. | | `.host` | `string \| null` | The configured host URL, or `null` when defaulted. | | `.batchSize` | `number` | Maximum events buffered before an automatic flush. | | `.flushIntervalMs` | `number` | Background flush interval in milliseconds. | ### `initLangfuse(config: LangfuseConfig): void` Install the global tracing subscriber. Spawns a background tokio task that flushes buffered span envelopes to Langfuse on the configured interval. 
Calling this more than once per process is safe — subsequent calls no-op because the global subscriber is already registered. ```typescript import { initLangfuse, LangfuseConfig } from "blazen"; initLangfuse(new LangfuseConfig( process.env.LANGFUSE_PUBLIC_KEY!, process.env.LANGFUSE_SECRET_KEY!, )); ``` Available only when the host build was compiled with the `langfuse` feature on the underlying telemetry crate. In builds without it, the symbol is still exported but the configured exporter is a no-op. ### Wiring `LangfuseConfig` from environment Most deployments construct `LangfuseConfig` directly from environment variables at startup. Tune `batchSize` and `flushIntervalMs` to balance ingestion latency against the per-request overhead of HTTP flushes: ```typescript import { initLangfuse, LangfuseConfig } from "blazen"; const cfg = new LangfuseConfig( process.env.LANGFUSE_PUBLIC_KEY!, process.env.LANGFUSE_SECRET_KEY!, process.env.LANGFUSE_HOST, // null → cloud default Number(process.env.LANGFUSE_BATCH_SIZE ?? 100), Number(process.env.LANGFUSE_FLUSH_INTERVAL_MS ?? 5000), ); initLangfuse(cfg); ``` A `LangfuseConfig` instance is purely declarative — it carries no IO state — so it's safe to construct, inspect, log (with secrets redacted), and pass to `initLangfuse` independently. You can keep multiple `LangfuseConfig` objects around (e.g. one per environment) and choose which one to install at startup; only the first `initLangfuse` call wins per process. --- ## version() Returns the Blazen library version string. ```typescript import { version } from "blazen"; console.log(version()); // "0.1.0" ``` --- # WASM API Reference Source: https://blazen.dev/docs/api/wasm Language: wasm Section: api ## init() Initialize the WASM module. Must be called once before using any other export. ```typescript import init from '@blazen/sdk'; await init(); ``` Returns a `Promise`. Subsequent calls are no-ops. --- ## CompletionModel A chat completion model. Created via static factory methods for each provider. ### Provider Factory Methods All factory methods take **no arguments** -- API keys are read from environment variables at runtime (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `FAL_KEY`, `OPENROUTER_API_KEY`, etc.). To override the default model, chain `.withModel(...)` on the returned instance. Azure additionally requires `resourceName` and `deploymentName` as arguments (there's no single global endpoint). Bedrock requires `region`. 
| Method | Signature | |---|---| | `CompletionModel.openai` | `()` | | `CompletionModel.anthropic` | `()` | | `CompletionModel.gemini` | `()` | | `CompletionModel.azure` | `(resourceName: string, deploymentName: string)` | | `CompletionModel.fal` | `()` | | `CompletionModel.openrouter` | `()` | | `CompletionModel.groq` | `()` | | `CompletionModel.together` | `()` | | `CompletionModel.mistral` | `()` | | `CompletionModel.deepseek` | `()` | | `CompletionModel.fireworks` | `()` | | `CompletionModel.perplexity` | `()` | | `CompletionModel.xai` | `()` | | `CompletionModel.cohere` | `()` | | `CompletionModel.bedrock` | `(region: string)` | ```typescript const model = CompletionModel.openai(); const claude = CompletionModel.anthropic(); const gemini = CompletionModel.gemini(); const azure = CompletionModel.azure('my-resource', 'my-deployment'); const fal = CompletionModel.fal(); const groq = CompletionModel.groq().withModel('llama-3.3-70b-versatile'); const bedrock = CompletionModel.bedrock('us-east-1'); ``` #### `model.withModel(modelId: string): CompletionModel` Override the default model ID for this provider instance. Returns a new `CompletionModel` (WASM does not mutate in place). ```typescript const model = CompletionModel.openai().withModel('gpt-4o-mini'); ``` ### Properties | Property | Type | Description | |---|---|---| | `.modelId` | `string` | The model identifier string | ### `await model.complete(messages: ChatMessage[]): CompletionResponse` Perform a chat completion. ```typescript const response = await model.complete([ ChatMessage.system('You are a helpful assistant.'), ChatMessage.user('What is 2 + 2?'), ]); console.log(response.content); ``` ### `await model.completeWithOptions(messages: ChatMessage[], options: CompletionOptions): CompletionResponse` Perform a chat completion with additional options. ```typescript const response = await model.completeWithOptions( [ChatMessage.user('Write a haiku about WASM.')], { temperature: 0.7, maxTokens: 100 } ); ``` ### `await model.stream(messages: ChatMessage[], onChunk: (chunk) => void): void` Stream a chat completion. The callback receives each chunk as it arrives. ```typescript await model.stream( [ChatMessage.user('Tell me a story')], (chunk) => { if (chunk.delta) process.stdout.write(chunk.delta); } ); ``` Each chunk has the shape: ```typescript { delta?: string; // Text content delta finishReason?: string; // Set on the final chunk toolCalls: ToolCall[]; // Tool calls, if any } ``` ### Middleware Decorators Each decorator returns a new `CompletionModel` wrapping the original. #### `model.withRetry(maxRetries?: number): CompletionModel` Automatic retry with exponential backoff on transient failures. Defaults to 3 retries. ```typescript const resilient = model.withRetry(5); ``` #### `model.withCache(ttlSeconds?: number, maxEntries?: number): CompletionModel` In-memory response cache. Streaming requests bypass the cache. ```typescript const cached = model.withCache(600, 500); ``` | Parameter | Default | Description | |---|---|---| | `ttlSeconds` | `300` | Cache entry TTL in seconds. | | `maxEntries` | `1000` | Maximum entries before eviction. | #### `CompletionModel.withFallback(models: CompletionModel[]): CompletionModel` Static method. Tries providers in order; falls back on transient errors. ```typescript const model = CompletionModel.withFallback([ CompletionModel.openai(), CompletionModel.groq(), ]); ``` --- ## ChatMessage A class for building typed chat messages. 
### Static Factory Methods | Method | Description | |---|---| | `ChatMessage.system(content: string)` | Create a system message | | `ChatMessage.user(content: string)` | Create a user message | | `ChatMessage.assistant(content: string)` | Create an assistant message | | `ChatMessage.tool(content: string)` | Create a tool result message | | `ChatMessage.toolResultMessage(callId: string, name: string, content: string)` | Create a tool result message with a tool-call ID and function name. (Named `toolResultMessage` to avoid colliding with the `.toolResult` instance getter that surfaces the structured payload of an existing message.) | | `ChatMessage.userImageUrl(text: string, url: string, mediaType?: string)` | User message with text and an image URL | | `ChatMessage.userImageBase64(text: string, data: string, mediaType: string)` | User message with text and a base64-encoded image | ```typescript const msg = ChatMessage.user('Hello'); const sys = ChatMessage.system('You are a helpful assistant.'); const img = ChatMessage.userImageUrl('Describe this:', 'https://example.com/photo.jpg'); ``` ### Constructor ```typescript new ChatMessage({ role?: string, content?: string, parts?: ContentPart[] }) ``` ### Properties | Property | Type | Description | |---|---|---| | `.role` | `string` | `"system"`, `"user"`, `"assistant"`, or `"tool"` | | `.content` | `string \| undefined` | The text content of the message | | `.toolCallId` | `string \| undefined` | The tool-call ID this message is responding to (only set for tool-result messages) | | `.name` | `string \| undefined` | The function name of the tool that produced this result (only set for tool-result messages) | ### JSON Shape `ChatMessage.toJSON()` (and the entries in the `messages` array returned by `runAgent`) match the tsify-generated `ChatMessage` interface: ```typescript interface ChatMessage { role: "system" | "user" | "assistant" | "tool"; content: MessageContent; tool_call_id?: string; name?: string; tool_calls?: ToolCall[]; tool_result?: ToolOutput; // structured tool-result payload (see below) } ``` The `tool_result` field is populated when a tool handler returns a non-string value or supplies an `llm_override`. Plain-string tool results live in `content` as `MessageContent::Text` instead. The field name is `tool_result` (snake_case) because tsify preserves Rust field naming. > **Tip.** The tsify-generated interface lives in `crates/blazen-wasm-sdk/pkg/blazen_wasm_sdk.d.ts` and is regenerated on every `pnpm build` (or `wasm-pack build --target bundler`) inside `crates/blazen-wasm-sdk/`. See the [WASM Quickstart](/docs/guides/wasm/quickstart) for the build flow. --- ## ToolOutput The two-channel tool result emitted by JS tool handlers and surfaced on `ChatMessage.tool_result`. ```typescript interface ToolOutput { /** * The full structured payload the caller sees programmatically. Any * JSON-serializable value (object, array, string, number, etc). */ data: any; /** * Optional override for the body sent back to the model on the next * turn. When `null` / absent, each provider applies its default * conversion from `data`. */ llm_override?: LlmPayload | null; } ``` The two channels exist because what the rest of your application wants to consume from a tool (full structured data, large blobs, internal IDs) is rarely the best thing to feed back to the LLM (token-heavy, leaks internal shape). 
Set `data` to the rich payload your code consumes, and use `llm_override` when you want to send the model a trimmed summary or a provider-specific shape instead. The WASM dispatcher accepts either `llm_override` (snake) or `llmOverride` (camel) when a JS handler returns a structured object — both spellings are normalized before the value is parsed. This means the spelling you write in JS is up to you; both work. ### Tool handler return shapes The WASM tool dispatcher (`js_to_tool_output` in `crates/blazen-wasm-sdk/src/agent.rs`) accepts two shapes from a handler: ```javascript // 1. Bare value: wrapped automatically as { data: , llm_override: null }. const tool = { name: 'getWeather', description: 'Get the current weather for a city', parameters: { type: 'object', properties: { city: { type: 'string' } }, required: ['city'] }, handler: async (args) => ({ temp: 22, condition: 'cloudy', city: args.city }), }; // 2. Structured ToolOutput: object literal with a `data` key. const structuredTool = { name: 'fetchProfile', description: 'Fetch a full user profile', parameters: { type: 'object', properties: { userId: { type: 'string' } }, required: ['userId'] }, handler: async (args) => { const profile = await db.users.findById(args.userId); // huge blob return { data: profile, // caller sees full record llmOverride: { // model sees compact summary kind: 'text', text: `User ${profile.name} (id=${profile.id})`, }, }; }, }; ``` If a handler returns a string and that string happens to parse as JSON describing a `ToolOutput`, the dispatcher unpacks it. Otherwise the string is preserved as plain text. If `ToolOutput` deserialization fails (for instance, a malformed `llm_override`), the dispatcher silently falls back to wrapping the raw value as `{ data, llm_override: null }` rather than throwing. --- ## LlmPayload The override sent to the LLM on the next turn (the optional `llm_override` field of `ToolOutput`). Discriminated union keyed by `kind`: | `kind` | Shape | Description | |---|---|---| | `"text"` | `{ kind: "text"; text: string }` | Plain text. Works on every provider universally. | | `"json"` | `{ kind: "json"; value: any }` | Structured JSON. Anthropic and Gemini consume it natively; OpenAI / Responses / Azure / Fal stringify at the wire boundary. | | `"parts"` | `{ kind: "parts"; parts: ContentPart[] }` | Multimodal content blocks. Anthropic supports natively as `tool_result.content` blocks; OpenAI falls back to text; Gemini falls back to a JSON object. | | `"provider_raw"` | `{ kind: "provider_raw"; provider: ProviderId; value: any }` | Provider-specific escape hatch. Only the named provider sees `value`; every other provider falls back to the default conversion from `ToolOutput.data`. | `ProviderId` is the snake_case enum: ```typescript type ProviderId = | "openai" | "openai_compat" | "azure" | "anthropic" | "gemini" | "responses" | "fal"; ``` ### Per-provider behavior for `data` (when `llm_override` is null) When no override is set, each provider applies its default conversion to `ToolOutput.data`: - **OpenAI / OpenAI-compat / Azure / Responses / Fal:** the value is JSON-stringified once and sent as the tool-result string. - **Anthropic:** structured data becomes `[{ type: "text", text: }]` so it lands inside `tool_result.content` blocks. - **Gemini:** a structured object passes through as the `response` field of the function-response part; scalar values (numbers, booleans, strings) are wrapped as `{ result: scalar }` to satisfy Gemini's object-only contract. 
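For instance, with a hypothetical `countItems` tool that returns a structured object and sets no override, the comments below restate how each provider family presents the result on the next turn (the serialized forms are shorthand for the rules above):

```typescript
// A handler that returns a structured object and no llm_override.
const countTool = {
  name: 'countItems', // hypothetical example tool
  description: 'Count items in the current batch',
  parameters: { type: 'object', properties: {}, required: [] },
  handler: async () => ({ ok: true, count: 17 }), // wrapped as { data: {...}, llm_override: null }
};

// Default conversion of `data` on the next turn:
//   OpenAI / OpenAI-compat / Azure / Responses / Fal -> '{"ok":true,"count":17}' (JSON-stringified once)
//   Anthropic -> [{ type: "text", text: "<serialized data>" }] inside tool_result.content blocks
//   Gemini    -> the object passes through as the function-response `response` field;
//                a bare scalar return (e.g. 17) would instead be wrapped as { result: 17 }
```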
### Examples ```typescript // Send a compact text summary while keeping the full record in `data`. return { data: { id: 42, items: bigArray, internal: '...' }, llmOverride: { kind: 'text', text: '42 items processed' }, }; // Force JSON semantics regardless of provider default. return { data: { ok: true, count: 17 }, llmOverride: { kind: 'json', value: { ok: true, count: 17 } }, }; // Multimodal: send an image back to the model (Anthropic native; falls // back to text on OpenAI; falls back to a JSON object on Gemini). return { data: { url: 'https://example.com/photo.png' }, llmOverride: { kind: 'parts', parts: [ { type: 'text', text: 'Here is the photo:' }, { type: 'image', source: { url: 'https://example.com/photo.png' } }, ], }, }; // Anthropic-only escape hatch: send a raw provider blob. return { data: { ok: true }, llmOverride: { kind: 'provider_raw', provider: 'anthropic', value: { type: 'text', text: 'anthropic-only payload' }, }, }; ``` --- ## CompletionResponse Returned by `model.complete()` and `model.completeWithOptions()`. ```typescript interface CompletionResponse { content?: string; // The generated text toolCalls: ToolCall[]; // Tool calls requested by the model usage?: TokenUsage; // Token usage statistics model: string; // Model name used finishReason?: string; // "stop", "tool_calls", etc. cost?: number; // Cost in USD timing?: RequestTiming; // Request timing breakdown metadata: object; // Raw provider-specific metadata } ``` --- ## CompletionOptions Options for `completeWithOptions()`. ```typescript interface CompletionOptions { temperature?: number; // Sampling temperature (0.0 - 2.0) maxTokens?: number; // Maximum tokens to generate topP?: number; // Nucleus sampling parameter model?: string; // Override the default model ID tools?: ToolDefinition[]; // Tool definitions for function calling } ``` --- ## ToolCall A tool invocation requested by the model. | Property | Type | Description | |---|---|---| | `.id` | `string` | Unique identifier for the tool call | | `.name` | `string` | Name of the tool to invoke | | `.arguments` | `object` | Parsed JSON arguments | --- ## ToolDefinition Describes a tool that the model may invoke. ```typescript interface ToolDefinition { name: string; // Unique tool name description: string; // Human-readable description parameters: object; // JSON Schema for the tool's parameters } ``` --- ## Content Subsystem Browser/edge-friendly subset of Blazen's multimodal content store. The WASM surface omits filesystem-bound APIs available in the native crate -- there is no `localFile()` factory, and `ContentStore` does not expose a `metadata()` method. ### ContentKind String union describing the taxonomy of stored content. Treat as `#[non_exhaustive]` -- new kinds may be added in future releases. | Value | Description | |---|---| | `"image"` | Raster or vector image | | `"audio"` | Audio clip | | `"video"` | Video clip | | `"document"` | Document (PDF, text, etc.) | | `"three_d_model"` | 3D model (glTF, OBJ, etc.) | | `"cad"` | CAD file (STEP, IGES, etc.) | | `"archive"` | Archive (zip, tar, etc.) | | `"font"` | Font file | | `"code"` | Source code | | `"data"` | Structured data (CSV, JSON, etc.) | | `"other"` | Catch-all for anything else | ### ContentHandle Opaque, store-issued reference to a piece of content. Field names mirror the wire format (snake_case). 
```typescript interface ContentHandle { id: string; // Opaque store-defined identifier kind: ContentKind; // Used for type-checking at the tool-input boundary mime_type?: string; // MIME type if known byte_size?: number; // Byte size if known display_name?: string; // Human-readable name (e.g. original filename) } ``` ### ImageSource Discriminated union of every way an image can be supplied to a `ChatMessage`. ```typescript type ImageSource = | { type: "url"; url: string } | { type: "base64"; data: string } | { type: "file"; path: string } | { type: "provider_file"; provider: ProviderId; id: string } | { type: "handle"; handle: ContentHandle }; ``` The `handle` variant defers resolution to the active `ContentStore`, which substitutes one of the other variants at request-build time. ### ContentStore Abstract handle-issuing store. Build built-in instances via the static factories below, supply user-defined backends via `ContentStore.custom({...})`, or `extends ContentStore` from JS / TypeScript and override the methods you need. Resources are released by `free()` or via the explicit-resource-management protocol (`[Symbol.dispose]`). ```typescript class ContentStore { // Subclass-friendly base constructor (call from `super()`) constructor(); // Factories static inMemory(): ContentStore; static openaiFiles(apiKey: string): ContentStore; static anthropicFiles(apiKey: string): ContentStore; static geminiFiles(apiKey: string): ContentStore; static falStorage(apiKey: string): ContentStore; static custom(options: CustomContentStoreOptions): ContentStore; // Instance methods put( body: Uint8Array | string, kindHint?: string | null, mimeType?: string | null, displayName?: string | null, ): Promise; resolve(handle: ContentHandle): Promise; // MediaSource-shaped JS object fetchBytes(handle: ContentHandle): Promise; delete(handle: ContentHandle): Promise; // Lifecycle free(): void; [Symbol.dispose](): void; } ``` `put` accepts either a `Uint8Array` of bytes or a `string` URL (URL support depends on the backing store). `kindHint` is the wire string -- for example `"image"` or `"three_d_model"` -- and overrides auto-detection. ```typescript import { ContentStore } from '@blazen/sdk'; // Explicit-resource-management form (preferred) using store = ContentStore.inMemory(); const handle = await store.put(bytes, 'image', 'image/png', 'logo.png'); const media = await store.resolve(handle); // Manual lifecycle const fallback = ContentStore.openaiFiles(apiKey); try { const h = await fallback.put(bytes, 'document', 'application/pdf'); } finally { fallback.free(); } ``` ### Subclassing `ContentStore` `ContentStore` is subclassable from JavaScript / TypeScript via wasm-bindgen. Override the methods your backend needs; the SDK wraps your subclass in a Rust adapter that dispatches into your JS async functions via `js_sys::Function::call` + `wasm_bindgen_futures::JsFuture`. ```typescript import { ContentStore } from "@blazen/sdk"; import type { ContentHandle } from "@blazen/sdk"; class IndexedDBContentStore extends ContentStore { constructor() { super(); } async put(body, hint) { // ... persist to IndexedDB / OPFS / fetch+rehost ... return { id: "...", kind: "image" }; } async resolve(handle) { return { sourceType: "url", url: "..." }; } async fetchBytes(handle) { return new Uint8Array([...]); } // Optional: async fetchStream(handle) { return new Uint8Array([...]); } async delete(handle) { /* no-op */ } } ``` Subclasses MUST override `put`, `resolve`, `fetchBytes`. 
The base-class default impls throw a `JsError` so any missing override fails clearly rather than silently recursing via `super()`. ### `ContentStore.custom({...})` Callback-based factory. Direct JS mirror of Rust `CustomContentStore::builder`. ```typescript ContentStore.custom(options: { put: (body: any, hint: any) => Promise; resolve: (handle: ContentHandle) => Promise; // serialized MediaSource fetchBytes: (handle: ContentHandle) => Promise; fetchStream?: (handle: ContentHandle) => Promise; // single-chunk for now delete?: (handle: ContentHandle) => Promise; }): ContentStore ``` `put`, `resolve`, `fetchBytes` are required. `fetchStream` and `delete` are optional. The `body` arrives as a JS object shaped like `{type: "bytes", data: [...]}` / `{type: "url", url}` / `{type: "provider_file", provider, id}` / `{type: "stream", stream: ReadableStream, sizeHint: number | null}` (no `local_path` in WASM since there's no filesystem). `resolve` returns a serialized `MediaSource` JS object. `fetchBytes` returns a `Uint8Array`. `fetchStream` may return either a `Uint8Array` / `number[]` (legacy, single-chunk) or a `ReadableStream` for true chunk-by-chunk streaming. ### Built-in stores | Factory | Purpose | |---|---| | `ContentStore.inMemory()` | Ephemeral in-WASM-memory store. Bytes live in WASM heap; URL/provider-file inputs are recorded by reference. | | `ContentStore.openaiFiles(apiKey)` | Uploads to the OpenAI Files API. `apiKey` is sent as `Bearer `. | | `ContentStore.anthropicFiles(apiKey)` | Uploads to the Anthropic Files API. `apiKey` is sent as `x-api-key`. | | `ContentStore.geminiFiles(apiKey)` | Uploads to the Google AI / Gemini Files API. | | `ContentStore.falStorage(apiKey)` | Uploads to fal.ai storage. | | `ContentStore.custom({...})` | User-defined backend via async callbacks (see above). | ### Tool-input schema helpers Each helper returns a JSON Schema fragment with a `x-blazen-content-ref` extension that tells Blazen's resolver which content kind the model is expected to pass. | Helper | Kind | |---|---| | `imageInput(name, description)` | `image` | | `audioInput(name, description)` | `audio` | | `videoInput(name, description)` | `video` | | `fileInput(name, description)` | `document` | | `threeDInput(name, description)` | `three_d_model` | | `cadInput(name, description)` | `cad` | ```typescript import { imageInput } from '@blazen/sdk'; const params = imageInput('photo', 'The user-supplied photograph to analyze'); // { // type: "object", // properties: { // photo: { // type: "string", // description: "The user-supplied photograph to analyze", // "x-blazen-content-ref": { kind: "image" } // } // }, // required: ["photo"] // } ``` ### How resolution works The model only ever sees the schema's `string` type and passes the handle id back as a plain string. The `x-blazen-content-ref` extension is invisible to providers. Before invoking the tool handler, Blazen's resolver looks up the id in the active `ContentStore`, fetches a typed content shape (e.g. an `ImageSource`), and substitutes it into the tool arguments. Handlers therefore receive resolved content rather than raw ids. 
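A minimal sketch of that flow, assuming an in-memory store and a hypothetical `describePhoto` tool; the exact resolved value the handler sees depends on the active store and the declared content kind:

```typescript
import { ContentStore, imageInput } from '@blazen/sdk';

// Hypothetical tool: the parameters come from imageInput(), so the model only
// ever sees a string parameter and passes the handle id back as plain text.
const describePhoto = {
  name: 'describePhoto',
  description: 'Describe the supplied photograph',
  parameters: imageInput('photo', 'The photograph to describe'),
  handler: async (args: any) => {
    // By the time the handler runs, the resolver has swapped the handle id
    // for resolved content (an ImageSource-like value whose variant depends
    // on the backing store), not the raw id string the model produced.
    console.log(args.photo);
    return { described: true };
  },
};

// Issue a handle from the active store; handle.id is the string the model
// passes back as the `photo` argument.
const imageBytes = new Uint8Array([/* ...png bytes... */]);
using store = ContentStore.inMemory();
const handle = await store.put(imageBytes, 'image', 'image/png', 'photo.png');
console.log(describePhoto.name, handle.id);
```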
--- ## TokenUsage | Property | Type | Description | |---|---|---| | `.promptTokens` | `number` | Tokens in the prompt | | `.completionTokens` | `number` | Tokens in the completion | | `.totalTokens` | `number` | Total tokens used | --- ## RequestTiming | Property | Type | Description | |---|---|---| | `.queueMs` | `number \| undefined` | Time in queue (ms) | | `.executionMs` | `number \| undefined` | Execution time (ms) | | `.totalMs` | `number \| undefined` | Total wall-clock time (ms) | --- ## runAgent Run an agentic tool-calling loop. ```typescript const result = await runAgent(model, messages, tools, options?); ``` ### Parameters | Parameter | Type | Description | |---|---|---| | `model` | `CompletionModel` | The completion model to use | | `messages` | `ChatMessage[]` | Initial conversation messages | | `tools` | `ToolDef[]` | Tool definitions. Each tool object has `name`, `description`, `parameters` (JSON Schema), and a `handler(args) => any \| Promise` that returns either a bare JSON-serializable value or a structured [`ToolOutput`](#tooloutput) (an object with a `data` key, plus an optional `llm_override` / `llmOverride`). See the [tool handler return shapes](#tool-handler-return-shapes) above. | | `options` | `AgentRunOptions?` | Optional configuration | ### AgentRunOptions ```typescript interface AgentRunOptions { toolConcurrency?: number; // Max concurrent tool calls per round (default: 0 = unlimited) maxIterations?: number; // Max tool-calling iterations (default: 10) systemPrompt?: string; // System prompt prepended to conversation temperature?: number; // Sampling temperature maxTokens?: number; // Max tokens per call addFinishTool?: boolean; // Add a built-in "finish" tool } ``` ### AgentResult ```typescript interface AgentResult { content?: string; // Final text response messages: ChatMessage[]; // Full message history (each entry matches the tsify ChatMessage interface) iterations: number; // Number of iterations totalUsage?: TokenUsage; // Aggregated token usage totalCost?: number; // Aggregated cost in USD } ``` Tool-result messages in `messages` carry a `tool_result?: ToolOutput` field whenever the handler returned a non-string `data` or supplied an `llm_override`. Plain string returns appear as `content: { Text: "..." }` on a `role: "tool"` message with no `tool_result` field. --- ## Workflow ### `new Workflow(name: string)` Create a new workflow instance. ### `.addStep(name: string, eventTypes: string[], handler: StepHandler)` Register a step that listens for one or more event types. ```typescript wf.addStep('process', ['MyEvent'], async (event, ctx) => { return { type: 'blazen::StopEvent', result: { done: true } }; }); ``` ### `await wf.run(input: object): any` Run the workflow to completion. The input is passed as the `data` field of a synthetic `StartEvent`. Returns a `Promise` that resolves to the `result` field of the final `StopEvent`. ### `await wf.runStreaming(input: any, callback: (event: any) => void): Promise` Run the workflow and forward each lifecycle event to a JS callback as it occurs, resolving with the terminal payload once the workflow completes. Mirrors the Node binding's `runStreaming(input, onEvent)`. Stream events are subscribed *before* the engine begins dispatching, so no events are missed between dispatch and subscription. ```typescript await wf.runStreaming({ topic: 'TS' }, (event) => { console.log(event.event_type, event.data); }); ``` The callback is invoked with `{ event_type, data }` per event. 
Errors raised synchronously by the listener are swallowed so a misbehaving callback does not abort the run. ### `await wf.runWithHandler(input: any): Promise` Build and dispatch the workflow, returning the live [`WorkflowHandler`](#workflowhandler) instead of awaiting the terminal event. Use this when you want to drive the run yourself: call `awaitResult()` for the final payload, `pause()` / `snapshot()` for mid-flight state, `nextEvent()` / `streamEvents()` for events, or `cancel()` / `abort()` to tear it down. Functionally equivalent to `runHandler(input)` — the JS-side name `runWithHandler` exists for parity with the Node binding. ```typescript const handler = await wf.runWithHandler({ topic: 'TS' }); await handler.streamEvents((ev) => console.log(ev)); const result = await handler.awaitResult(); ``` ### `wf.setSessionPausePolicy(policy: string): void` Configure how live session refs are treated when the workflow is paused or snapshotted. Mirrors the Node binding's `setSessionPausePolicy`. The policy is applied when the workflow is dispatched via `run` / `runHandler` / `runStreaming` / `runWithHandler` / `resumeFromSnapshot` / `resumeWithSerializableRefs`. | Policy | Behavior | |---|---| | `"pickle_or_error"` (default) | Pickle live refs if the binding supports it; error otherwise. | | `"pickle_or_serialize"` | Pickle if possible; fall back to user-supplied byte serialization. | | `"warn_drop"` | Log a warning and drop live refs from the snapshot. | | `"hard_error"` | Always error if any live refs are present. | PascalCase spellings (`PickleOrError`, `PickleOrSerialize`, `WarnDrop`, `HardError`) are also accepted. `WorkflowBuilder` exposes the same method (chainable, returns `WorkflowBuilder`). ### `await wf.resumeWithSerializableRefs(snapshot: any, deserializers: Record unknown>): Promise` Resume a workflow from a snapshot whose `__blazen_serialized_session_refs` sidecar carries JS-serialized session refs. The `deserializers` object maps `type_tag` strings to `(bytes: Uint8Array) => unknown` callbacks. For every entry in the sidecar whose tag appears in `deserializers`, the callback is invoked synchronously with the captured bytes; the return value is ignored (callbacks should populate any application state they need). The snapshot's bytes are also exposed inside step handlers via `ctx.getSessionRefSerializable(key)` after resume, mirroring the Node binding's path. ```typescript const handler = await wf.resumeWithSerializableRefs(snapshot, { 'app::EmbeddingHandle': (bytes) => { myStore.rehydrate(bytes); // populate user-side state }, }); const result = await handler.awaitResult(); ``` Snapshots without serialized session refs work fine with `resumeFromSnapshot()`; this method is only required when the original pause used `SessionPausePolicy::PickleOrSerialize`. --- ## WorkflowHandler Live handle to an in-flight workflow run, returned by `runHandler()` / `runWithHandler()` / `resumeFromSnapshot()` / `resumeWithSerializableRefs()`. The handler lets JS callers drive a workflow run beyond the simple "fire and forget" pattern of `run()`. ### Methods | Method | Signature | Description | |---|---|---| | `awaitResult` | `() => Promise` | Await the workflow's terminal payload. Consumes the inner handler. | | `pause` | `() => Promise` | Park the event loop and capture a quiescent snapshot. | | `snapshot` | `() => Promise` | Capture the current snapshot **without** halting the loop. Use this for logging or telemetry; pair `pause()` -> `snapshot()` -> `resumeInPlace()` for a quiescent view. 
| | `resumeInPlace` | `() => void` | Resume a paused event loop. The same handler instance remains valid for `awaitResult` / `nextEvent` etc. | | `cancel` | `() => void` | Tear down the event loop. Best-effort; errors if the loop has already exited. | | `abort` | `() => void` | Pure alias for `cancel()`. Matches `JsWorkflowHandler::abort` in the Node bindings — use whichever name reads better. | | `runId` | `() => Promise` | Return the run's UUID as a string. First call captures a snapshot to read it, then caches. | | `nextEvent` | `() => Promise` | Pull the next event from the broadcast stream. Resolves with `null` when the stream closes. | | `streamEvents` | `(callback: (event: any) => void) => Promise` | Subscribe to the broadcast stream and forward each event to a JS callback until the stream closes. Mirrors the Node binding. Single Promise drives the subscription — no need to wrap repeated `nextEvent()` calls. | | `respondToInput` | `(requestId: string, response: any) => void` | Deliver a human-in-the-loop response to a workflow that auto-parked on an `InputRequestEvent`. Pass the matching `request_id` and a JSON-serializable response value. | ### Streaming events ```typescript const handler = await wf.runWithHandler({ topic: 'WASM' }); await handler.streamEvents((event) => { console.log(event.event_type, event.data); }); const result = await handler.awaitResult(); ``` Events emitted before `streamEvents()` is called are not replayed — call it before `awaitResult()` to avoid races. ### Pause and snapshot ```typescript const handler = await wf.runWithHandler({ topic: 'WASM' }); await handler.pause(); const snap = await handler.snapshot(); localStorage.setItem('snap', JSON.stringify(snap)); handler.resumeInPlace(); const result = await handler.awaitResult(); ``` ### Human-in-the-loop ```typescript const handler = await wf.runWithHandler({}); const event = await handler.nextEvent(); if (event?.event_type === 'InputRequestEvent') { const userInput = await prompt(event.data.prompt); handler.respondToInput(event.data.request_id, { answer: userInput }); } const result = await handler.awaitResult(); ``` --- ## Pipeline Pipelines compose multiple `Workflow`s into a sequential or parallel chain. Each `Stage` wraps one workflow plus optional input-mapping and conditional-execution callbacks; the resulting `Pipeline` runs the stages in order, threading the previous stage's output into the next stage's input. ### `new PipelineBuilder(name: string)` Construct a new builder with the given pipeline name. ### Methods | Method | Signature | Description | |---|---|---| | `pipelineBuilder.stage(stage)` | `(stage: Stage) => void` | Append a sequential stage. Consumes the stage — the same `Stage` instance cannot be added to two pipelines. | | `pipelineBuilder.parallel(parallel)` | `(parallel: ParallelStage) => void` | Append a parallel stage that fans out across multiple branches. | | `pipelineBuilder.timeoutPerStage(seconds)` | `(seconds: number) => void` | Set a per-stage timeout. Exceeding it surfaces as a stage failure with `WorkflowError::Timeout`. | | `pipelineBuilder.onPersist(callback)` | `(callback: (snapshot: any) => Promise) => void` | Persist callback that receives a typed `PipelineSnapshot` (serialized to a JS object via `serde-wasm-bindgen`) after each stage completes. The engine awaits the returned `Promise` before continuing. 
| | `pipelineBuilder.onPersistJson(callback)` | `(callback: (json: string) => Promise) => void` | Persist callback that receives the snapshot serialized as a JSON string. The engine awaits the returned `Promise` before continuing. | | `pipelineBuilder.build()` | `() => Pipeline` | Finalize and return a runnable `Pipeline`. | ### IndexedDB persistence example ```typescript const builder = new PipelineBuilder('research'); builder.stage(new Stage('outline', outlineWf)); builder.stage(new Stage('draft', draftWf, (state) => ({ outline: state.outline }))); builder.onPersistJson(async (json) => { const db = await openDb(); await db.put('snapshots', { id: 'research', json }); }); const pipeline = builder.build(); ``` ### Stage ```typescript new Stage( name: string, workflow: Workflow, input_mapper?: ((state: BlazenState) => unknown) | null, condition?: ((state: BlazenState) => boolean) | null, ); ``` `input_mapper` is an optional `(state: BlazenState) => unknown` JS callable invoked before the stage runs. Its return value becomes the workflow's input. When `null` / `undefined`, the previous stage's output (or the pipeline input for the first stage) is passed through directly. `condition` is an optional `(state: BlazenState) => boolean` JS callable that decides whether the stage runs. When `null` / `undefined` the stage always runs; when the callable returns `false` the stage is skipped (its `StageResult.skipped` is `true` and `output` is `null`). ```typescript const stage = new Stage( 'summarize', summarizeWf, (state) => ({ text: state.draft, maxWords: 100 }), (state) => state.draft != null && state.draft.length > 200, ); ``` ### ParallelStage ```typescript new ParallelStage(name: string, branches: Stage[], join_strategy?: JoinStrategy | null); ``` Each branch is a `Stage`; branches execute concurrently and are joined according to `JoinStrategy.WaitAll` (default) or `JoinStrategy.FirstCompletes`. Branch `Stage` instances are consumed when the parallel stage is constructed. ```typescript import { ParallelStage, Stage, JoinStrategy } from '@blazen/sdk'; const fanOut = new ParallelStage( 'fan-out', [new Stage('a', wfA), new Stage('b', wfB)], JoinStrategy.WaitAll, ); ``` --- ## Context (WasmContext) Shared workflow context accessible by all steps. Unlike the Node.js SDK, all methods are **synchronous** -- no `await` needed. ### StateValue Values stored in the context can be any `StateValue`: ```typescript type StateValue = string | number | boolean | null | Uint8Array | StateValue[] | { [key: string]: StateValue }; ``` ### Methods | Method | Signature | Description | |---|---|---| | `ctx.set(key, value)` | `(key: string, value: StateValue) => void` | Store a value. Auto-detects `Uint8Array` and stores it as binary; everything else is stored as-is. | | `ctx.get(key)` | `(key: string) => StateValue \| null` | Retrieve a value. Returns `Uint8Array` for binary data, the original `JsValue` for everything else, or `null` if the key is missing. | | `ctx.setBytes(key, data)` | `(key: string, data: Uint8Array) => void` | Explicitly store binary data. | | `ctx.getBytes(key)` | `(key: string) => Uint8Array \| null` | Retrieve binary data. Returns `null` if the key is missing. | | `ctx.sendEvent(event)` | `(event: object) => void` | Queue an event into the workflow event loop. | | `ctx.writeEventToStream(event)` | `(event: object) => void` | No-op in WASM. Present for API compatibility with the Node.js and Rust SDKs. | | `ctx.runId()` | `() => string` | Returns the unique UUID v4 for the current workflow run. 
| | `ctx.insertSessionRefSerializable(typeName, bytes)` | `(typeName: string, bytes: Uint8Array) => string` | Store an opaque, user-serialized payload in the session-ref registry under a fresh registry key. `typeName` is a stable identifier the caller chooses (e.g. `"app::EmbeddingHandle"`); it is captured into snapshot metadata along with the bytes when the workflow is paused under `SessionPausePolicy::PickleOrSerialize`. Returns the registry key as a string. JS code must serialize the value itself (typically into a `Uint8Array`) before calling and deserialize on retrieval. | | `ctx.getSessionRefSerializable(key)` | `(key: string) => { typeName: string; bytes: Uint8Array } \| null` | Retrieve a previously inserted opaque payload. Returns `null` if the registry has no entry under `key`, or if the entry exists but was inserted via the non-serializable path (`set` / `setBytes` / language-specific live refs). | The session-ref-serializable wire format is **cross-binding compatible** with the Node binding's `NodeSessionRefSerializable`: the same `typeName` / `bytes` pair round-trips through a snapshot taken on one binding and resumed on the other. ```typescript // Inside a step handler. const key = ctx.insertSessionRefSerializable( 'app::EmbeddingHandle', new TextEncoder().encode(JSON.stringify({ id: 42 })), ); ctx.state.set('embedding_key', key); // In a later step (or after resume). const stored = ctx.getSessionRefSerializable(ctx.state.get('embedding_key') as string); if (stored) { const obj = JSON.parse(new TextDecoder().decode(stored.bytes)); } ``` ### Properties | Property | Type | Description | |---|---|---| | `ctx.workflowName` | `string` | Getter property returning the workflow name. | | `ctx.state` | `StateNamespace` | Getter returning the persistable workflow state namespace. Survives snapshotting when the WASM runner gains snapshot support. Routes through the same JS / bytes dispatch as `ctx.set` / `ctx.get`. | | `ctx.session` | `SessionNamespace` | Getter returning the live in-process JS reference namespace. **Identity IS preserved** within a single workflow run (unlike the Node bindings). Excluded from snapshots. | --- ## StateNamespace Persistable workflow state, accessed via `ctx.state`. Routes values through the same `set` / `get` / `setBytes` / `getBytes` dispatch as the legacy `ctx.set` / `ctx.get`, so anything stored here will survive snapshotting once the WASM runner gains snapshot support. All methods are **synchronous** -- the WASM runtime has no tokio. ### Methods | Method | Signature | Description | |---|---|---| | `state.set(key, value)` | `(key: string, value: unknown) => void` | Store a value. Auto-detects `Uint8Array` and stores it as binary; everything else is stored as-is. | | `state.get(key)` | `(key: string) => unknown` | Retrieve a value. Returns `Uint8Array` for binary data, the original `JsValue` for everything else, or `null` if the key is missing. | | `state.setBytes(key, data)` | `(key: string, data: Uint8Array) => void` | Explicitly store binary data. | | `state.getBytes(key)` | `(key: string) => Uint8Array \| null` | Retrieve binary data. Returns `null` if the key is missing. | ```typescript ctx.state.set("counter", 5); const count = ctx.state.get("counter"); ``` --- ## SessionNamespace Live in-process JS references, accessed via `ctx.session`. Values are stored as raw `JsValue` in a separate map and are **excluded** from any snapshot. 
> **Identity IS preserved within a run on WASM.** Because the WASM runtime is single-threaded, session values are stored as raw `JsValue` and `ctx.session.get(key) === obj` holds after `ctx.session.set(key, obj)`. This is a meaningful differentiator from the Node bindings, where identity is **not** preserved due to napi-rs threading constraints (values are round-tripped through `serde_json::Value`). All methods are **synchronous**. ### Methods | Method | Signature | Description | |---|---|---| | `session.set(key, value)` | `(key: string, value: unknown) => void` | Store a live JS reference under `key`. The value is kept as-is. | | `session.get(key)` | `(key: string) => unknown` | Retrieve the value previously stored under `key`. Returns `null` if missing. | | `session.has(key)` | `(key: string) => boolean` | Check whether a value exists under `key`. | | `session.remove(key)` | `(key: string) => void` | Remove the value stored under `key`. | ```typescript const conn = openConnection(); ctx.session.set("conn", conn); console.log(ctx.session.get("conn") === conn); // true ``` > WASM does **not** currently support cross-process snapshot/resume of session entries. Session values exist only within a single workflow run. --- ## BlazenState A protocol for structured per-field state storage in the WASM context. Objects carrying the `__blazen_state__: true` marker are automatically decomposed by `ctx.set()` and reconstructed by `ctx.get()`. ### BlazenStateMeta Configuration is read from the object's constructor via a static `meta` property. ```typescript interface BlazenStateMeta { /** Field names excluded from serialization (recreated via restore). */ transient?: string[]; /** Name of the method to call after reconstruction. */ restore?: string; } ``` ### Detection Any JS object with the property `__blazen_state__` set to a truthy value is treated as a `BlazenState`: ```typescript const state = new MyState(); state.__blazen_state__ = true; // enables the protocol ``` ### Per-field storage When `ctx.set(key, state)` receives a `BlazenState` object: 1. Each enumerable field is stored individually at `{key}.{fieldName}` (skipping the marker and transient fields). 2. A metadata entry is written at `{key}.__blazen_meta__` recording the field list, class name, transient array, and restore method name. When `ctx.get(key)` finds a `{key}.__blazen_meta__` entry: 1. Each recorded field is loaded individually. 2. The fields are assembled into a new plain object. 3. The `__blazen_state__` marker is set on the result. 4. If a `restore` method name was recorded, that method is called on the reconstructed object. ### restore() The `restore` entry in `meta` is a **string** naming the method on the instance to call after reconstruction. This method receives no arguments and is called synchronously: ```typescript class MyState { dbPath = ''; conn = null; static meta = { transient: ['conn'], restore: 'reconnect', }; reconnect() { this.conn = openDb(this.dbPath); } } ``` ### Synchronous execution All `BlazenState` operations in WASM are synchronous. Unlike the Node.js SDK (where `saveTo()` / `loadFrom()` return `Promise`s), the WASM context processes `BlazenState` objects inline during `ctx.set()` and `ctx.get()` -- no `await` needed. --- ## Events Events are plain objects with a `type` field. ### Start Event ```typescript { type: 'blazen::StartEvent', ...input } ``` ### Stop Event ```typescript { type: 'blazen::StopEvent', result: { ... } } ``` --- ## EmbeddingModel Generate vector embeddings from text. 
Created via static factory methods. All factory methods take no arguments -- API keys are read from environment variables (`OPENAI_API_KEY`, `TOGETHER_API_KEY`, `COHERE_API_KEY`, `FIREWORKS_API_KEY`).

```typescript
import { EmbeddingModel } from '@blazen/sdk';

const model = EmbeddingModel.openai();
const together = EmbeddingModel.together();
const cohere = EmbeddingModel.cohere();
const fireworks = EmbeddingModel.fireworks();
```

### Provider Factory Methods

| Method | Default Model | Default Dimensions |
|---|---|---|
| `EmbeddingModel.openai()` | `text-embedding-3-small` | 1536 |
| `EmbeddingModel.together()` | `togethercomputer/m2-bert-80M-8k-retrieval` | 768 |
| `EmbeddingModel.cohere()` | `embed-v4.0` | 1024 |
| `EmbeddingModel.fireworks()` | `nomic-ai/nomic-embed-text-v1.5` | 768 |

### Properties

| Property | Type | Description |
|---|---|---|
| `.modelId` | `string` | The model identifier. |
| `.dimensions` | `number` | Output vector dimensionality. |

### `await model.embed(texts: string[]): Promise<number[][]>`

Embed one or more texts, returning a nested array of float vectors.

```typescript
const result = await model.embed(['Hello', 'World']);
console.log(result.length); // 2
console.log(result[0].length); // 1536
```

### `EmbeddingModel.tract(modelUrl: string, tokenizerUrl: string, options?: TractOptions | null): Promise<EmbeddingModel>`

Local embedding via `tract-onnx` — pure-Rust ONNX inference that runs entirely inside the WASM module with no JS libraries required. Both URLs are fetched via `web_sys::fetch` because the `hf-hub` crate is not available on `wasm32`; both endpoints must respond with CORS headers permitting the calling origin.

```typescript
import { EmbeddingModel, TractOptions } from '@blazen/sdk';

const opts = new TractOptions();
opts.modelName = 'BGESmallENV15';

const embedder = await EmbeddingModel.tract(
  'https://huggingface.co/Xenova/bge-small-en-v1.5/resolve/main/onnx/model.onnx',
  'https://huggingface.co/Xenova/bge-small-en-v1.5/resolve/main/tokenizer.json',
  opts,
);
const vecs = await embedder.embed(['Hello world']);
```

---

## TractEmbedModel

Standalone wasm-only ONNX embedding model. The same backend powers `EmbeddingModel.tract(...)`, but `TractEmbedModel` is exposed directly for callers who want the typed class without going through the generic `EmbeddingModel` factory.

### `TractEmbedModel.create(modelUrl: string, tokenizerUrl: string, options?: TractOptions | null): Promise<TractEmbedModel>`

Async constructor. The ONNX weights and `tokenizer.json` are fetched over HTTP via `web_sys::fetch` (the `hf-hub` crate doesn't compile to `wasm32`). `modelUrl` should point to a raw ONNX protobuf; `tokenizerUrl` to a HuggingFace-format `tokenizer.json`. Both URLs must be CORS-enabled.

```typescript
import { TractEmbedModel, TractOptions } from '@blazen/sdk';

const opts = new TractOptions();
opts.modelName = 'BGESmallENV15';
opts.maxBatchSize = 32;

const model = await TractEmbedModel.create(
  'https://huggingface.co/Xenova/bge-small-en-v1.5/resolve/main/onnx/model.onnx',
  'https://huggingface.co/Xenova/bge-small-en-v1.5/resolve/main/tokenizer.json',
  opts,
);
console.log(model.modelId, model.dimensions);
const vectors = await model.embed(['hello', 'world']);
```

### Properties

| Property | Type | Description |
|---|---|---|
| `.modelId` | `string` | The Hugging Face model id this instance was loaded from. |
| `.dimensions` | `number` | Output embedding dimensionality. |

### `await model.embed(texts: string[]): Promise<Float32Array[]>`

Embed one or more texts. Returns a nested array of `Float32Array` vectors.
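The returned vectors are plain numeric arrays, so downstream similarity math is straightforward. A minimal sketch, reusing the `model` constructed above (the `cosine` helper is user code, not part of the SDK):

```typescript
// Cosine similarity between two embedding vectors (user-side helper).
function cosine(a: Float32Array | number[], b: Float32Array | number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const [query, doc] = await model.embed(['event-driven workflows', 'workflow engine']);
console.log(cosine(query, doc)); // closer to 1.0 = more similar
```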
--- ## MediaSource Top-level type alias re-exporting `ImageSource` so the same `Url` / `Base64` shape is reused across image, audio, video, and file modalities: ```typescript type MediaSource = ImageSource; ``` Use `MediaSource` in your own type annotations whenever the modality is generic; the runtime shape is identical to `ImageSource`. --- ## Token Estimation Lightweight token counting functions available without external data files. ### `estimateTokens(text: string, contextSize?: number): number` Estimate token count for a string (~3.5 characters per token). ```typescript import { estimateTokens } from '@blazen/sdk'; const count = estimateTokens('Hello, world!'); // 4 ``` ### `countMessageTokens(messages: object[], contextSize?: number): number` Estimate total tokens for an array of chat messages (plain objects with `role` and `content` fields). Includes per-message overhead. ```typescript import { countMessageTokens } from '@blazen/sdk'; const count = countMessageTokens([ { role: 'system', content: 'You are helpful.' }, { role: 'user', content: 'Hello!' }, ]); ``` `contextSize` defaults to `128000` if omitted. --- ## Custom Providers via JS Handlers `CompletionModel` and `EmbeddingModel` can be created from JavaScript handler functions using static factory methods. This lets you implement custom providers without subclassing. ### CompletionModel.fromJsHandler ```typescript const model = CompletionModel.fromJsHandler("my-llm", async (request) => { // request contains messages, tools, temperature, etc. return { content: "Hello from my custom model", model: "my-llm", }; }); const response = await model.complete([ChatMessage.user("Hi")]); ``` The handler receives a request object and should return a `CompletionResponse`-shaped object. ### EmbeddingModel.fromJsHandler ```typescript const embedder = EmbeddingModel.fromJsHandler("my-embedder", 128, async (texts) => { return texts.map(() => new Array(128).fill(0.1)); }); const result = await embedder.embed(["Hello", "World"]); ``` The handler receives a `string[]` and should return a `number[][]` of embeddings. --- ## Per-Capability Provider Classes Seven provider classes let you implement a single compute capability by passing handler functions to the constructor. | Class | Constructor Handler | Description | |---|---|---| | `TTSProvider` | `(request) => Promise` | Text-to-speech synthesis | | `MusicProvider` | `{ generateMusic, generateSfx }` | Music and sound effect generation | | `ImageProvider` | `{ generateImage, upscaleImage }` | Image generation and upscaling | | `VideoProvider` | `{ textToVideo, imageToVideo }` | Video generation | | `ThreeDProvider` | `(request) => Promise` | 3D model generation | | `BackgroundRemovalProvider` | `(request) => Promise` | Background removal | | `VoiceProvider` | `{ cloneVoice, listVoices, deleteVoice }` | Voice cloning and management | ### Constructor Single-method providers take a provider ID and a handler function: ```typescript const tts = new TTSProvider("elevenlabs", async (request) => { const audio = await elevenlabs.textToSpeech(request); return { audioData: audio, format: "mp3" }; }); const result = await tts.textToSpeech({ text: "Hello world", voice: "alice" }); ``` Multi-method providers take a provider ID and a handlers object: ```typescript const music = new MusicProvider("suno", { generateMusic: async (request) => { /* ... */ }, generateSfx: async (request) => { /* ... */ }, }); const image = new ImageProvider("dalle", { generateImage: async (request) => { /* ... 
*/ }, upscaleImage: async (request) => { /* ... */ }, }); const video = new VideoProvider("runway", { textToVideo: async (request) => { /* ... */ }, imageToVideo: async (request) => { /* ... */ }, }); const voice = new VoiceProvider("elevenlabs", { cloneVoice: async (request) => { /* ... */ }, listVoices: async () => { /* ... */ }, deleteVoice: async (voiceId) => { /* ... */ }, }); ``` --- ## MemoryBackend Custom memory storage backends are created by passing handler functions to the `MemoryBackend` constructor. ```typescript const backend = new MemoryBackend({ put: async (entry) => { /* store entry */ }, get: async (id) => { /* retrieve by id, return null if missing */ }, delete: async (id) => { /* delete by id, return true if existed */ }, list: async () => { /* return all entries */ }, len: async () => { /* return entry count */ }, searchByBands: async (bands, limit) => { /* return candidates */ }, }); ``` ### Handler Methods | Method | Signature | Description | |---|---|---| | `put` | `(entry: any) => Promise` | Insert or update a stored entry. | | `get` | `(id: string) => Promise` | Retrieve a stored entry by id. | | `delete` | `(id: string) => Promise` | Delete an entry by id. Returns `true` if it existed. | | `list` | `() => Promise` | Return all stored entries. | | `len` | `() => Promise` | Return the number of stored entries. | | `searchByBands` | `(bands: any, limit: number) => Promise` | Return candidate entries sharing at least one LSH band. | --- ## InMemoryBackend A typed, Rust-native in-memory `MemoryBackend` implementation. Unlike `MemoryBackend` (which round-trips every call through user-supplied JS callbacks), `InMemoryBackend` keeps reads and writes inside the WASM linear memory — no JS overhead per call. ```typescript import { InMemoryBackend, Memory, EmbeddingModel } from '@blazen/sdk'; const backend = new InMemoryBackend(); const embedder = EmbeddingModel.openai(); const memory = Memory.fromBackend(embedder, backend); await memory.add('doc1', 'hello world', null); ``` ### Methods | Method | Signature | Description | |---|---|---| | `put` | `(entry: WasmStoredEntry) => Promise` | Insert or update a stored entry. | | `get` | `(id: string) => Promise` | Retrieve a stored entry by id. | | `delete` | `(id: string) => Promise` | Delete an entry by id. Returns `true` if it existed. | | `list` | `() => Promise` | Return all stored entries. | | `len` | `() => Promise` | Return the number of stored entries. | | `isEmpty` | `() => Promise` | `true` if the backend contains no entries. | | `searchByBands` | `(bands: string[], limit: number) => Promise` | Return candidate entries sharing at least one LSH band. | ### Memory factory methods | Factory | Signature | Description | |---|---|---| | `Memory.fromBackend` | `(embedder: EmbeddingModel, backend: InMemoryBackend) => Memory` | Full-mode memory (embedding-based search) backed by a typed `InMemoryBackend`. | | `Memory.localFromBackend` | `(backend: InMemoryBackend) => Memory` | Local-only mode (`SimHash` only, `searchLocal()` available; `search()` rejects) backed by a typed `InMemoryBackend`. | ```typescript const localMem = Memory.localFromBackend(new InMemoryBackend()); await localMem.add('doc', 'hello world', null); const hits = await localMem.searchLocal('hello', 5, null); ``` --- ## MemoryResult Standalone class representing a single result returned by `Memory.search()` / `Memory.searchLocal()`. Exposed primarily as a typed return value for downstream code that wants to construct `MemoryResult`s from JS (e.g. 
when implementing a custom `MemoryStore`).

### Constructor

```typescript
new MemoryResult(id: string, text: string, score: number, metadata: any);
```

### Properties

| Property | Type | Description |
|---|---|---|
| `.id` | `string` | The entry identifier. |
| `.text` | `string` | The stored text content. |
| `.score` | `number` | Similarity score in `[0, 1]`, higher means more similar. |
| `.metadata` | `any` | Arbitrary metadata, decoded from JSON to a JS value. |

---

## ModelManager

VRAM budget-aware model manager with LRU eviction. Not typically used in WASM (where GPU model loading is uncommon), but available for tracking model state.

> **Backed by the real `blazen_manager::ModelManager`.** Method names match the native and Node bindings; the WASM constructor takes a single `number` of gigabytes as a positional argument (no options object). Unlike the Node binding, WASM byte-quantity getters return plain `number` (`f64`) — JS doubles carry 53 bits of mantissa, more than enough for any realistic VRAM budget — so there is no BigInt migration on this surface.

### Constructor

```typescript
const manager = new ModelManager(8); // 8 GB budget
```

| Argument | Type | Description |
|---|---|---|
| `budgetGb` | `number` | VRAM budget in gigabytes. Converted to bytes internally (`budgetGb * 1_073_741_824`). |

### Methods

| Method | Signature | Description |
|---|---|---|
| `register` | `await manager.register(id, model, vramEstimate, lifecycle)` | Register a model with its estimated VRAM footprint (`vramEstimate: number` bytes) and a JS lifecycle object exposing async `load()` / `unload()` methods. |
| `load` | `await manager.load(id)` | Load a model, evicting LRU models if needed. |
| `unload` | `await manager.unload(id)` | Unload a model and free its VRAM. |
| `isLoaded` | `await manager.isLoaded(id): boolean` | Check if a model is currently loaded. |
| `ensureLoaded` | `await manager.ensureLoaded(id)` | Alias for `load()`. |
| `usedBytes` | `await manager.usedBytes(): number` | Total VRAM currently used by loaded models, in bytes. |
| `availableBytes` | `await manager.availableBytes(): number` | Available VRAM within the budget, in bytes. |
| `status` | `await manager.status(): { id: string; loaded: boolean; vramEstimate: number }[]` | Status of all registered models. |

### Properties

| Property | Type | Description |
|---|---|---|
| `.budgetBytes` | `number` | Read-only getter returning the configured budget in bytes (`budgetGb * 1_073_741_824`). |
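A minimal registration sketch. It assumes `model` is whatever JS handle you want tracked (it is opaque to the manager), and the `load()` / `unload()` callbacks are illustrative stand-ins for your own allocation and teardown code:

```typescript
const manager = new ModelManager(8); // 8 GB budget

// `model` is any JS value you want the manager to track; the lifecycle object
// tells the manager how to bring it up and tear it down when LRU eviction runs.
await manager.register('bge-small', model, 512 * 1024 * 1024, {
  load: async () => { /* allocate buffers / upload weights here */ },
  unload: async () => { /* free the VRAM here */ },
});

await manager.ensureLoaded('bge-small'); // alias for load()
console.log(await manager.usedBytes(), 'bytes in use within the budget');
```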
---

## ModelRegistry

JS-callback ABC for advertising a model catalog. Wraps a JS object implementing `listModels()` and `getModel(modelId)` so browser code can plug a custom registry into Blazen's model-info lookup surface. Mirrors the trait at `blazen_llm::traits::ModelRegistry` and reaches parity with `PyModelRegistry` (Python) / `JsModelRegistry` (Node).

### Constructor

```typescript
import init, { ModelRegistry } from "@blazen/sdk";
import type { ModelInfo } from "@blazen/sdk";

await init();

const registry = new ModelRegistry({
  async listModels(): Promise<ModelInfo[]> {
    const res = await fetch("/api/models");
    return res.json();
  },
  async getModel(modelId: string): Promise<ModelInfo | null> {
    const res = await fetch(`/api/models/${modelId}`);
    return res.ok ? res.json() : null;
  },
});

const models = await registry.listModels();
```

The constructor argument must implement the `ModelRegistryImpl` interface (auto-emitted into `crates/blazen-wasm-sdk/pkg/blazen_wasm_sdk.d.ts`):

```typescript
export interface ModelRegistryImpl {
  listModels(): Promise<ModelInfo[]> | ModelInfo[];
  getModel(modelId: string): Promise<ModelInfo | null> | ModelInfo | null;
}
```

Both methods may return either a `Promise` or a synchronous value; the binding awaits whichever is produced.

### Methods

| Method | Signature | Description |
|---|---|---|
| `listModels` | `await registry.listModels(): Promise` | Returns whatever the JS `listModels()` callback resolved to. |
| `getModel` | `await registry.getModel(modelId: string): Promise` | Returns whatever the JS `getModel()` callback resolved to, or `null` if the model is unknown. |

The registry returns plain `ModelInfo` objects — the same tsify-generated shape produced elsewhere on the WASM surface and documented in `crates/blazen-wasm-sdk/pkg/blazen_wasm_sdk.d.ts`.

---

## Pricing Functions

### registerPricing()

Register custom pricing for a model.

```typescript
import { registerPricing } from "@blazen/sdk";

registerPricing("my-model", 1.0, 2.0);
// Arguments: modelId, inputPerMillion, outputPerMillion
```

### lookupPricing()

Look up pricing for a model by ID. Returns `null` if the model is unknown.

```typescript
import { lookupPricing } from "@blazen/sdk";

const pricing = lookupPricing("gpt-4o");
if (pricing) {
  console.log(`Input: $${pricing.inputPerMillion}/M tokens`);
}
```

The returned object has the shape:

```typescript
{ inputPerMillion: number; outputPerMillion: number; }
```

---

## OTLP Telemetry

OpenTelemetry trace export over HTTP/protobuf. Behind the `otlp-http` Cargo feature on the `blazen-wasm-sdk` crate (the default `opentelemetry-otlp/grpc-tonic` transport is wasm-incompatible because `tonic` requires `tokio` networking that does not exist on `wasm32`). The WASM build instead routes spans through a custom `WasmFetchHttpClient` that posts protobuf bodies via `web_sys::fetch`.

### `new OtlpConfig(endpoint: string, serviceName: string)`

| Argument | Type | Description |
|---|---|---|
| `endpoint` | `string` | Full HTTP/protobuf traces endpoint, e.g. `"http://localhost:4318/v1/traces"`. |
| `serviceName` | `string` | Reported to the backend as the `service.name` resource attribute. |

Read-only getters: `.endpoint`, `.serviceName`.

### `initOtlp(config: OtlpConfig): void`

Install the global OTLP exporter and a `tracing-subscriber` stack with an OpenTelemetry layer. Must be called **once** at startup; subsequent calls fail because the global subscriber can only be installed a single time.

```typescript
import init, { OtlpConfig, initOtlp } from '@blazen/sdk';

await init();
const cfg = new OtlpConfig('http://localhost:4318/v1/traces', 'my-wasm-app');
initOtlp(cfg);
// All subsequent workflow / pipeline / completion spans are exported.
```

If the collector is unreachable the export simply drops spans; it never blocks the calling workflow.

---

## Error Handling

All errors are thrown as JavaScript `Error` objects.
The message format indicates the category: | Error Pattern | Description | |---|---| | `"authentication failed: ..."` | Invalid or expired API key | | `"rate limited"` | Provider rate limit hit | | `"timed out after {ms}ms"` | Request timed out | | `"{provider} error: ..."` | Provider-specific error | | `"invalid input: ..."` | Validation error | | `"unsupported: ..."` | Feature not supported by provider | ```typescript try { const response = await model.complete([ChatMessage.user('Hello')]); } catch (e) { if (e.message.startsWith('rate limited')) { // Back off and retry } } ``` --- # Rust Examples Source: https://blazen.dev/docs/examples/rust Language: rust Section: examples # Rust Examples Four complete, runnable examples that demonstrate core Blazen workflow patterns. --- ## Basic Workflow A 3-step sequential pipeline: **StartEvent** → **GreetEvent** → **StopEvent**. ```rust #[derive(Debug, Clone, Serialize, Deserialize, Event)] struct GreetEvent { name: String } #[step] async fn parse_input(event: StartEvent, _ctx: Context) -> Result { Ok(GreetEvent { name: event.data["name"].as_str().unwrap_or("World").to_string() }) } ``` ```sh cargo run -p blazen --example basic_workflow ``` --- ## Streaming Workflow Publishes progress events while processing, observable via `stream_events()`. ```rust ctx.write_event_to_stream(ProgressEvent { step: i, message: format!("Step {}", i) }); ``` ```sh cargo run -p blazen --example streaming_workflow ``` --- ## Branching Workflow Conditional routing based on sentiment analysis using `#[step(emits = [...])]`. ```rust #[step(emits = [PositiveEvent, NegativeEvent])] async fn classify(event: AnalyzeEvent, _ctx: Context) -> Result { // route to PositiveEvent or NegativeEvent based on sentiment } ``` ```sh cargo run -p blazen --example branching_workflow ``` --- ## LLM RAG Workflow Multi-step RAG pipeline using context for shared state between steps. ```rust // Typed JSON via set/get ctx.set("documents", serde_json::json!(docs)); let docs = ctx.get("documents").unwrap(); // Direct StateValue access for cross-language or binary data ctx.set_value("embeddings", StateValue::Bytes(embedding_bytes.into())); ``` ```sh cargo run -p blazen --example llm_rag_workflow ``` --- ## Custom CompletionModel (trait impl) In Rust, custom providers are built by implementing the `CompletionModel` trait. The trait-impl is a first-class citizen -- it works with `run_agent`, `with_retry`, `with_cache`, and every other helper. 
```rust
use async_trait::async_trait;
use futures::stream::{self, Stream};
use std::pin::Pin;

use blazen_llm::{
    BlazenError, CompletionRequest, CompletionResponse, StreamChunk,
    traits::CompletionModel, types::Role,
};

struct EchoLLM;

#[async_trait]
impl CompletionModel for EchoLLM {
    fn model_id(&self) -> &str {
        "echo-llm"
    }

    async fn complete(
        &self,
        request: CompletionRequest,
    ) -> Result<CompletionResponse, BlazenError> {
        let last = request
            .messages
            .iter()
            .rev()
            .find(|m| m.role == Role::User)
            .and_then(|m| m.content.text_content())
            .unwrap_or_default();

        Ok(CompletionResponse {
            content: Some(format!("echo: {last}")),
            tool_calls: Vec::new(),
            reasoning: None,
            citations: Vec::new(),
            artifacts: Vec::new(),
            usage: None,
            model: self.model_id().to_string(),
            finish_reason: Some("stop".to_string()),
            cost: None,
            timing: None,
            images: Vec::new(),
            audio: Vec::new(),
            videos: Vec::new(),
            metadata: serde_json::Value::Null,
        })
    }

    async fn stream(
        &self,
        request: CompletionRequest,
    ) -> Result<
        Pin<Box<dyn Stream<Item = Result<StreamChunk, BlazenError>> + Send>>,
        BlazenError,
    > {
        let response = self.complete(request).await?;
        let content = response.content.unwrap_or_default();
        let chunks: Vec<Result<StreamChunk, BlazenError>> = content
            .split(' ')
            .map(|word| {
                Ok(StreamChunk {
                    delta: Some(format!("{word} ")),
                    ..Default::default()
                })
            })
            .collect();
        Ok(Box::pin(stream::iter(chunks)))
    }
}
```

```sh
cargo run -p blazen --example custom_completion_model
```

---

## Custom MemoryBackend (trait impl)

Implement the `MemoryBackend` trait from `blazen-memory` to plug in any storage layer (Postgres, SQLite, DynamoDB, a `DashMap`). The reference `InMemoryBackend` is already provided; this example shows the pattern for a custom one.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use async_trait::async_trait;
use tokio::sync::RwLock;

use blazen_memory::{Memory, MemoryBackend, MemoryError, StoredEntry};

struct DictBackend {
    store: RwLock<HashMap<String, StoredEntry>>,
}

impl DictBackend {
    fn new() -> Self {
        Self {
            store: RwLock::new(HashMap::new()),
        }
    }
}

#[async_trait]
impl MemoryBackend for DictBackend {
    async fn put(&self, entry: StoredEntry) -> Result<(), MemoryError> {
        self.store.write().await.insert(entry.id.clone(), entry);
        Ok(())
    }

    async fn get(&self, id: &str) -> Result<Option<StoredEntry>, MemoryError> {
        Ok(self.store.read().await.get(id).cloned())
    }

    async fn delete(&self, id: &str) -> Result<bool, MemoryError> {
        Ok(self.store.write().await.remove(id).is_some())
    }

    async fn list(&self) -> Result<Vec<StoredEntry>, MemoryError> {
        Ok(self.store.read().await.values().cloned().collect())
    }

    async fn len(&self) -> Result<usize, MemoryError> {
        Ok(self.store.read().await.len())
    }

    async fn search_by_bands(
        &self,
        bands: &[String],
        limit: usize,
    ) -> Result<Vec<StoredEntry>, MemoryError> {
        let set: std::collections::HashSet<_> = bands.iter().cloned().collect();
        Ok(self
            .store
            .read()
            .await
            .values()
            .filter(|e| e.bands.iter().any(|b| set.contains(b)))
            .take(limit)
            .cloned()
            .collect())
    }
}

// Usage -- the custom backend is a drop-in for the built-in ones:
// let embedder = Arc::new(
//     blazen_embed::EmbedModel::from_options(blazen_embed::EmbedOptions::default()).await?,
// ) as Arc;
// let memory = Memory::new(embedder, DictBackend::new());
```

```sh
cargo run -p blazen --example custom_memory_backend
```

---

## ModelManager with VRAM Budget

The `blazen-manager` crate tracks VRAM across multiple local models and runs LRU eviction when loading would exceed the budget.

```rust
use std::sync::Arc;

use blazen_manager::ModelManager;
use blazen_llm::LocalModel; // Replace with your local model constructors (mistral.rs, llama.cpp, candle).
async fn load_local_model(_id: &str) -> Arc<LocalModel> {
    unimplemented!("construct your local model here")
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // 24 GB budget (GPU-typical for a consumer card).
    let manager = ModelManager::with_budget_gb(24.0);

    let llama_8b = load_local_model("llama-8b").await;
    let qwen_14b = load_local_model("qwen-14b").await;
    let mistral_24b = load_local_model("mistral-24b").await;

    manager.register("llama-8b", llama_8b, 8 * 1024 * 1024 * 1024).await;
    manager.register("qwen-14b", qwen_14b, 14 * 1024 * 1024 * 1024).await;
    manager.register("mistral-24b", mistral_24b, 20 * 1024 * 1024 * 1024).await;

    // Fits alongside qwen-14b (8 + 14 = 22 GB).
    manager.load("llama-8b").await?;
    manager.load("qwen-14b").await?;

    // 20 GB does not fit next to 8 + 14 = 22 GB -- LRU (llama-8b) is evicted.
    manager.load("mistral-24b").await?;

    for s in manager.status().await {
        println!(
            "{}: loaded={}, vram={} bytes",
            s.id, s.loaded, s.vram_estimate
        );
    }

    Ok(())
}
```

```sh
cargo run -p blazen --example model_manager_budget
```

---

## Pricing Registration and Cost Tracking

Register pricing for any model ID (your own model, a local finetune, a custom deployment). Every `CompletionResponse` carries a `.cost` field computed from the registered rate.

```rust
use blazen_llm::{
    ChatMessage, PricingEntry,
    providers::openai_compat::{AuthMethod, OpenAiCompatConfig, OpenAiCompatProvider},
    register_pricing,
    traits::CompletionModel,
    types::CompletionRequest,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Register pricing once, globally, for any model ID.
    register_pricing(
        "my-finetuned-model",
        PricingEntry {
            input_per_million: 1.0,
            output_per_million: 2.0,
        },
    );

    // Point a provider at your deployment using the registered ID.
    let model = OpenAiCompatProvider::new(OpenAiCompatConfig {
        provider_name: "local".to_string(),
        base_url: "http://localhost:8080/v1".to_string(),
        api_key: "local".to_string(),
        default_model: "my-finetuned-model".to_string(),
        auth_method: AuthMethod::Bearer,
        extra_headers: Vec::new(),
        query_params: Vec::new(),
        supports_model_listing: true,
    });

    let request = CompletionRequest::new(vec![
        ChatMessage::user("Summarize Rust in one line."),
    ]);

    let response = model.complete(request).await?;
    println!("{}", response.content.unwrap_or_default());
    println!("usage: {:?}", response.usage);
    if let Some(cost) = response.cost {
        println!("cost: ${cost:.6}"); // computed from registered pricing
    }

    Ok(())
}
```

```sh
cargo run -p blazen --example pricing_and_cost
```
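A quick worked example of what the registered rate implies, assuming the conventional per-million arithmetic (the engine computes `.cost` itself, so treat this as an illustration of the formula rather than its exact rounding): at $1.00 input / $2.00 output per million tokens, a response that used 150 prompt tokens and 80 completion tokens costs 150 / 1,000,000 × $1.00 + 80 / 1,000,000 × $2.00 = $0.00031.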
---

## Custom TTS Provider (AudioGeneration trait)

For Rust, per-capability custom providers are built by implementing the capability trait from `blazen-llm::compute` (e.g. `AudioGeneration`, `ImageGeneration`, `VideoGeneration`). Every capability trait extends `ComputeProvider`, so you implement both.

```rust
use async_trait::async_trait;

use blazen_llm::{
    BlazenError, GeneratedAudio, MediaOutput, MediaType, RequestTiming,
    compute::{
        AudioGeneration, AudioResult, ComputeProvider, ComputeRequest, ComputeResult,
        JobHandle, JobStatus, SpeechRequest,
    },
};

struct MyElevenLabs {
    api_key: String,
}

#[async_trait]
impl ComputeProvider for MyElevenLabs {
    fn provider_id(&self) -> &str {
        "elevenlabs"
    }

    // TTS is synchronous -- we don't use the submit/status/result flow.
    // Mark those endpoints as unsupported so callers can't accidentally
    // queue jobs.
    async fn submit(&self, _r: ComputeRequest) -> Result<JobHandle, BlazenError> {
        Err(BlazenError::unsupported("use text_to_speech() directly"))
    }

    async fn status(&self, _j: &JobHandle) -> Result<JobStatus, BlazenError> {
        Err(BlazenError::unsupported("no job queue"))
    }

    async fn result(&self, _j: JobHandle) -> Result<ComputeResult, BlazenError> {
        Err(BlazenError::unsupported("no job queue"))
    }

    async fn cancel(&self, _j: &JobHandle) -> Result<(), BlazenError> {
        Err(BlazenError::unsupported("no job queue"))
    }
}

#[async_trait]
impl AudioGeneration for MyElevenLabs {
    async fn text_to_speech(
        &self,
        request: SpeechRequest,
    ) -> Result<AudioResult, BlazenError> {
        // In a real implementation, make an HTTP call with self.api_key.
        let _api_key = &self.api_key;

        Ok(AudioResult {
            audio: vec![GeneratedAudio {
                media: MediaOutput::from_base64("AAEC", MediaType::Wav),
                duration_seconds: None,
                sample_rate: Some(44_100),
                channels: Some(1),
            }],
            timing: RequestTiming::default(),
            cost: None,
            metadata: serde_json::json!({
                "voice": request.voice,
                "text": request.text,
            }),
        })
    }

    // generate_music / generate_sfx default to BlazenError::Unsupported.
}
```

```sh
cargo run -p blazen --example custom_tts_provider
```

---

## Langfuse Exporter

The `blazen-telemetry` crate ships a `tracing_subscriber::Layer` that maps Blazen's `workflow.run`, `workflow.step`, and `llm.complete` spans to Langfuse traces, spans, and generations. `init_langfuse` returns the `LangfuseLayer` -- you compose it into the registry yourself, which lets you stack it alongside `fmt`, `EnvFilter`, or other exporters. `LangfuseConfig::new(public_key, secret_key)` defaults to the `cloud.langfuse.com` host, batch size 100, and a 5 s flush interval. Override any of those with the chained builders.

```rust
use blazen_telemetry::{LangfuseConfig, init_langfuse};
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let cfg = LangfuseConfig::new(
        std::env::var("LANGFUSE_PUBLIC_KEY")?,
        std::env::var("LANGFUSE_SECRET_KEY")?,
    )
    .with_host("https://cloud.langfuse.com")
    .with_batch_size(100)
    .with_flush_interval_ms(5000);

    let layer = init_langfuse(cfg)?;
    tracing_subscriber::registry().with(layer).init();

    // ... your workflow code ...
    Ok(())
}
```

```sh
cargo run -p blazen --example langfuse_exporter
```

---

## OTLP HTTP Exporter

`init_otlp_http` is the wasm-compatible variant of `init_otlp` (which uses gRPC + tonic). It is gated behind the `otlp-http` Cargo feature and builds the OpenTelemetry HTTP/protobuf span exporter with a target-appropriate `HttpClient` (reqwest on native, `web_sys::fetch` on `wasm32`). Unlike `init_langfuse`, this function installs the global `tracing` subscriber internally -- you do **not** need to call `tracing_subscriber::registry().init()` yourself. `OtlpConfig` is a plain struct -- there is no `::new()` constructor; populate `endpoint` and `service_name` directly. For HTTP, point `endpoint` at the `/v1/traces` path on your collector.

```rust
use blazen_telemetry::{OtlpConfig, init_otlp_http};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let cfg = OtlpConfig {
        endpoint: "https://otel-collector.example.com:4318/v1/traces".to_string(),
        service_name: "my-service".to_string(),
    };
    init_otlp_http(cfg)?;

    // ... workflow code ...
    Ok(())
}
```

```sh
cargo run -p blazen --example otlp_http_exporter
```

---

# Python Examples
Source: https://blazen.dev/docs/examples/python
Language: python
Section: examples

# Python Examples

Six complete, runnable examples that demonstrate core Blazen workflow patterns in Python using the subclass event model.
--- ## Basic Workflow A 3-step sequential pipeline: **StartEvent** → **GreetEvent** → **FormattedEvent** → **StopEvent**. ```python class GreetEvent(Event): name: str @step async def parse_input(ctx: Context, ev: Event): return GreetEvent(name=ev.name) @step async def greet(ctx: Context, ev: GreetEvent): return StopEvent(result={"greeting": f"Hello, {ev.name}!"}) ``` ```sh python examples/basic_workflow.py ``` --- ## Streaming Workflow Publishes typed progress events while processing via `ctx.write_event_to_stream()`. ```python class ProgressEvent(Event): step: int ctx.write_event_to_stream(ProgressEvent(step=i)) async for event in handler.stream_events(): print(event.event_type) ``` ```sh python examples/streaming_workflow.py ``` --- ## Branching Workflow Conditional fan-out by returning a list of typed events. ```python class PositiveEvent(Event): text: str class NegativeEvent(Event): text: str return [PositiveEvent(text=text), NegativeEvent(text=text)] ``` ```sh python examples/branching_workflow.py ``` --- ## LLM RAG Workflow Multi-step RAG pipeline with context sharing between steps. Uses typed `ChatMessage` and `CompletionResponse`: ```python from blazen import CompletionModel, ChatMessage, CompletionResponse # Reads OPENAI_API_KEY from the environment by default. model = CompletionModel.openai() response: CompletionResponse = await model.complete([ ChatMessage.system("Answer based on the provided documents."), ChatMessage.user(query), ]) print(response.content) # typed attribute access print(response.usage) # TokenUsage with .prompt_tokens, .completion_tokens, .total_tokens ctx.set("documents", docs) ``` ```sh python examples/llm_rag_workflow.py ``` --- ## Human-in-the-Loop Side-effect steps that pause for external input with typed review events. ```python class ReviewComplete(Event): pass ctx.send_event(ReviewComplete()) return None ``` ```sh python examples/human_in_the_loop.py ``` --- ## Stateful Workflow Showcases the two explicit context namespaces and identity-preserving event payloads: - `ctx.state` for persistable values (counters, paths) -- survives `pause()`/`resume()`. - `ctx.session` for live in-process references (DB connections, handles) -- excluded from snapshots. - `StopEvent(result=conn)` preserves `is`-identity -- the caller gets back the same Python object. ```python import sqlite3 from blazen import Workflow, step, Context, StartEvent, StopEvent @step async def setup(ctx: Context, ev: StartEvent): conn = sqlite3.connect(":memory:") ctx.state["row_count"] = 0 ctx.session["db"] = conn # identity preserved return StopEvent(result=conn) # result = await handler.result() # assert result.result is conn # same Python object ``` ```sh python examples/stateful_workflow.py ``` --- ## Example 7: Subclassing CompletionModel (Custom Provider) Build a custom provider by subclassing `CompletionModel`. Override `complete()` and/or `stream()` to plug in any backend (local inference, a proxy, a mock, etc.). Once subclassed the model is a first-class citizen -- it works with `run_agent`, `with_retry()`, `with_cache()`, and every other helper. 
```python import asyncio from blazen import ChatMessage, CompletionModel, run_agent class EchoLLM(CompletionModel): """A toy provider that echoes the last user message back.""" def __init__(self): super().__init__(model_id="echo-llm", context_length=4096) async def complete(self, messages, options=None): last = next( (m.content for m in reversed(messages) if m.role == "user"), "" ) # Return a dict -- the dispatcher depythonizes into CompletionResponse. # Only `content` and `model` are required; every other field defaults. return {"content": f"echo: {last}", "model": self.model_id} async def main(): model = EchoLLM() # Works with run_agent just like any built-in provider. result = await run_agent( model, [ChatMessage.user("hello world")], tools=[], ) print(result.response.content) # -> "echo: hello world" asyncio.run(main()) ``` ```sh python examples/subclass_completion_model.py ``` --- ## Example 8: Custom MemoryBackend (DictBackend) Subclass `MemoryBackend` to plug in any storage layer -- Postgres, DynamoDB, SQLite, a plain dict. Every async method is dispatched from Rust back into Python, so you get full control while reusing Blazen's embedding and SimHash search pipeline. ```python import asyncio from blazen import EmbeddingModel, Memory, MemoryBackend class DictBackend(MemoryBackend): """In-process dict backed memory backend.""" def __init__(self): super().__init__() self._store: dict[str, dict] = {} async def put(self, entry): self._store[entry["id"]] = entry async def get(self, entry_id): return self._store.get(entry_id) async def delete(self, entry_id): return self._store.pop(entry_id, None) is not None async def list(self): return list(self._store.values()) async def len(self): return len(self._store) async def search_by_bands(self, bands, limit): # Return any entry that shares at least one LSH band with the query. band_set = set(bands) hits = [ e for e in self._store.values() if band_set.intersection(e.get("bands", [])) ] return hits[:limit] async def main(): embedder = EmbeddingModel.local() memory = Memory(embedder, DictBackend()) await memory.add("fact-1", "Rust has ownership and borrowing.") await memory.add("fact-2", "Python uses reference counting.") results = await memory.search("memory management", limit=2) for r in results: print(r.score, r.text) asyncio.run(main()) ``` ```sh python examples/custom_memory_backend.py ``` --- ## Example 9: ModelManager with VRAM Budget Track VRAM across multiple local models. Register each model with an estimated footprint; when loading a new one would exceed the budget, the least-recently-used model is automatically unloaded. ```python import asyncio from blazen import CompletionModel, MistralRsOptions, ModelManager async def main(): # 24 GB budget (GPU-typical for a consumer card). manager = ModelManager(budget_gb=24) llama_8b = CompletionModel.mistralrs( options=MistralRsOptions("meta-llama/Llama-3.1-8B-Instruct"), ) qwen_14b = CompletionModel.mistralrs( options=MistralRsOptions("Qwen/Qwen2.5-14B-Instruct"), ) mistral_24b = CompletionModel.mistralrs( options=MistralRsOptions("mistralai/Mistral-Small-24B"), ) await manager.register("llama-8b", llama_8b, vram_estimate_bytes=8 * 1024**3) await manager.register("qwen-14b", qwen_14b, vram_estimate_bytes=14 * 1024**3) await manager.register("mistral-24b", mistral_24b, vram_estimate_bytes=20 * 1024**3) # Fits alongside qwen-14b (8 + 14 = 22 GB). await manager.load("llama-8b") await manager.load("qwen-14b") # 20 GB does not fit next to 8 + 14 = 22 GB -- LRU (llama-8b) is evicted. 
await manager.load("mistral-24b") for s in await manager.status(): print(f"{s.id}: loaded={s.loaded}, vram={s.vram_estimate:,} bytes") asyncio.run(main()) ``` ```sh python examples/model_manager_budget.py ``` --- ## Example 10: Pricing Registration and Cost Tracking Register pricing for any model ID (your own model, a local finetune, a custom deployment). Every `CompletionResponse` then carries a `.cost` field computed from the registered rate. ```python import asyncio from blazen import ( ChatMessage, CompletionModel, ModelPricing, lookup_pricing, register_pricing, ) class MyFinetune(CompletionModel): """A stand-in for a custom deployment that reports its usage.""" def __init__(self): super().__init__(model_id="my-finetuned-model") async def complete(self, messages, options=None): return { "content": "Rust is a systems language with memory safety without GC.", "model": self.model_id, "tool_calls": [], "citations": [], "artifacts": [], "images": [], "audio": [], "videos": [], "usage": { "prompt_tokens": 150, "completion_tokens": 80, "total_tokens": 230, }, "metadata": {}, } async def main(): # Register pricing once, globally, for any model ID. register_pricing( "my-finetuned-model", ModelPricing(input_per_million=1.0, output_per_million=2.0), ) # Readback -- pricing is centrally stored. pricing = lookup_pricing("my-finetuned-model") assert pricing is not None print(f"input: ${pricing.input_per_million}/M, output: ${pricing.output_per_million}/M") model = MyFinetune() response = await model.complete([ChatMessage.user("Summarize Rust in one line.")]) print(response.content) print(f"usage: {response.usage}") # cost is computed from the registered pricing + usage. print(f"cost: ${response.cost:.6f}") asyncio.run(main()) ``` ```sh python examples/pricing_and_cost.py ``` --- ## Example 11: Per-Capability Provider (Custom TTS) Subclass `TTSProvider` to plug in any TTS backend (ElevenLabs, Coqui, a local model). The per-capability base classes (`TTSProvider`, `ImageProvider`, `VideoProvider`, `MusicProvider`, `ThreeDProvider`, `BackgroundRemovalProvider`, `VoiceProvider`) exist for users who only need to implement one capability. ```python import asyncio from blazen import TTSProvider, SpeechRequest class MyElevenLabs(TTSProvider): """A minimal custom TTS provider.""" def __init__(self, api_key: str): super().__init__( provider_id="elevenlabs", base_url="https://api.elevenlabs.io/v1", ) self._api_key = api_key async def text_to_speech(self, request): # In a real implementation, make an HTTP call with self._api_key # and return audio bytes. Here we just echo the request. return { "audio_data": b"", "format": "wav", "voice": request.voice, "text": request.text, } async def main(): tts = MyElevenLabs(api_key="sk-...") result = await tts.text_to_speech( SpeechRequest(text="Hello from Blazen!", voice="alice") ) print(result["format"], len(result["audio_data"]), "bytes") asyncio.run(main()) ``` ```sh python examples/custom_tts_provider.py ``` --- ## Example 12: Typed Error Handling Blazen exposes a typed exception hierarchy rooted at `BlazenError`. `ProviderError` carries structured HTTP context (`provider`, `status`, `endpoint`, `request_id`, `detail`, `raw_body`, `retry_after_ms`) so callers can branch on the failure mode instead of regex-matching error strings. 
```python import asyncio from blazen import ( BlazenError, ChatMessage, CompletionModel, ProviderError, RateLimitError, ) async def main(): model = CompletionModel.openai() try: response = await model.complete([ChatMessage.user("Hello")]) print(response.content) except RateLimitError: print("rate limited; backing off") except ProviderError as e: print(f"provider {e.provider} returned {e.status}: {e.detail}") except BlazenError as e: print(f"blazen error: {e}") asyncio.run(main()) ``` ```sh python examples/typed_error_handling.py ``` --- ## Example 13: Custom Progress Reporting via ProgressCallback Subclass `ProgressCallback` to receive download progress notifications from `ModelCache.download()`. The same instance can be reused across multiple downloads -- the cache calls `on_progress(downloaded, total)` repeatedly as bytes arrive. ```python import asyncio from blazen import ModelCache, ProgressCallback class TerminalProgress(ProgressCallback): def on_progress(self, downloaded: int, total: int | None) -> None: if total: pct = downloaded / total * 100 print(f"\r{pct:.1f}% ({downloaded}/{total})", end="", flush=True) else: print(f"\r{downloaded} bytes", end="", flush=True) async def main(): cache = ModelCache() path = await cache.download( "mistralai/Mistral-7B-Instruct-v0.3", "config.json", TerminalProgress(), ) print(f"\ndownloaded to {path}") asyncio.run(main()) ``` ```sh python examples/progress_callback.py ``` --- ## Example 14: Telemetry -- Langfuse Trace Export Wire Blazen into Langfuse with a single `init_langfuse()` call. Once configured, every workflow run, agent step, and LLM call ships traces to Langfuse in the background. Tune `batch_size` and `flush_interval_ms` to balance latency against API request volume. ```python import os from blazen import LangfuseConfig, init_langfuse init_langfuse( LangfuseConfig( public_key=os.environ["LANGFUSE_PUBLIC_KEY"], secret_key=os.environ["LANGFUSE_SECRET_KEY"], host="https://cloud.langfuse.com", batch_size=100, flush_interval_ms=5000, ) ) # All subsequent workflow runs ship traces to Langfuse. ``` ```sh python examples/langfuse_telemetry.py ``` --- # Node.js Examples Source: https://blazen.dev/docs/examples/node Language: node Section: examples # Node.js Examples Six complete, runnable examples that demonstrate core Blazen workflow patterns. --- ## Basic Workflow A 3-step sequential pipeline: **StartEvent** → **GreetEvent** → **FormattedEvent** → **StopEvent**. ```javascript wf.addStep("greet", ["GreetEvent"], async (event, ctx) => { return { type: "blazen::StopEvent", result: { greeting: `Hello, ${event.name}!` } }; }); ``` ```sh npx tsx examples/basic_workflow.ts ``` --- ## Streaming Workflow Publishes progress events while processing via `ctx.writeEventToStream()`. ```javascript await ctx.writeEventToStream({ type: "Progress", step: i }); const result = await wf.runStreaming({}, (event) => console.log(event)); ``` ```sh npx tsx examples/streaming_workflow.ts ``` --- ## Branching Workflow Conditional fan-out by returning an array of events. ```javascript return [ { type: "PositiveEvent", text }, { type: "NegativeEvent", text }, ]; ``` ```sh npx tsx examples/branching_workflow.ts ``` --- ## LLM RAG Workflow Multi-step RAG pipeline using context for shared state between steps. Uses typed `ChatMessage` and `CompletionResponse`: ```javascript import { CompletionModel, ChatMessage } from "blazen"; // Reads OPENAI_API_KEY from the environment by default. 
const model = CompletionModel.openai(); const response = await model.complete([ ChatMessage.system("Answer based on the provided documents."), ChatMessage.user(query), ]); console.log(response.content); // typed string console.log(response.usage); // { promptTokens, completionTokens, totalTokens } await ctx.set("documents", docs); ``` ```sh npx tsx examples/llm_rag_workflow.ts ``` --- ## Human-in-the-Loop Side-effect steps that pause for external input via `ctx.sendEvent()`. ```javascript await ctx.sendEvent({ type: "ReviewComplete" }); return null; ``` ```sh npx tsx examples/human_in_the_loop.ts ``` --- ## Stateful Workflow Demonstrates the two explicit context namespaces on the Node bindings: - `ctx.state` -- persistable values (survives `pause()`/`resume()`). - `ctx.session` -- in-process-only values (excluded from snapshots). **Important:** On the Node binding, `ctx.session` values are routed through `serde_json::Value` -- JS object identity is **NOT** preserved across `session.get(key)`. This is a napi-rs threading limitation (`Reference` is `!Send` because its `Drop` must run on the v8 main thread). For true identity preservation of live JS objects, use the Python or WASM bindings. ```javascript const wf = new Workflow("stateful-example"); wf.addStep("setup", ["blazen::StartEvent"], async (event, ctx) => { await ctx.state.set("counter", 5); // persistable await ctx.session.set("reqId", "abc123"); // in-process only return { type: "blazen::StopEvent", result: {} }; }); ``` ```sh npx tsx examples/stateful_workflow.ts ``` --- ## Example 7: Subclassing CompletionModel (Custom Provider) Extend `CompletionModel` in TypeScript/JavaScript to build a custom provider. Override `complete()` and/or `stream()` with any backend -- local inference, a proxy, a mock, etc. Subclass instances work with `runAgent`, `withRetry()`, `withCache()`, and every other helper. ```typescript import { ChatMessage, CompletionModel, runAgent, type CompletionResponse, } from "blazen"; class EchoLLM extends CompletionModel { constructor() { super({ modelId: "echo-llm", contextLength: 4096 }); } async complete(messages: ChatMessage[]): Promise { const last = [...messages].reverse().find((m) => m.role === "user"); return { content: `echo: ${last?.content ?? ""}`, model: this.modelId, toolCalls: [], } as CompletionResponse; } // Override stream() when you want incremental output. async *stream(messages: ChatMessage[]) { const last = [...messages].reverse().find((m) => m.role === "user"); for (const word of `echo: ${last?.content ?? ""}`.split(" ")) { yield { delta: word + " " }; } } } const model = new EchoLLM(); const result = await runAgent( model, [ChatMessage.user("hello world")], [], // no tools async () => { throw new Error("no tools"); }, // toolHandler -- not called without tools ); console.log(result.response.content); // -> "echo: hello world" ``` ```sh npx tsx examples/subclass_completion_model.ts ``` --- ## Example 8: Custom MemoryBackend (DictBackend) Extend `MemoryBackend` to plug in any storage layer (Postgres, DynamoDB, SQLite, a plain Map). Each async method is dispatched from Rust back into JS, so you get full control while reusing Blazen's embedding and SimHash search pipeline. ```typescript import { EmbeddingModel, Memory, MemoryBackend } from "blazen"; class DictBackend extends MemoryBackend { private store = new Map(); async put(entry: any): Promise { this.store.set(entry.id, entry); } async get(id: string): Promise { return this.store.get(id) ?? 
null; } async delete(id: string): Promise { return this.store.delete(id); } async list(): Promise { return Array.from(this.store.values()); } async len(): Promise { return this.store.size; } async searchByBands(bands: string[], limit: number): Promise { const set = new Set(bands); const hits: any[] = []; for (const entry of this.store.values()) { const entryBands: string[] = entry.bands ?? []; if (entryBands.some((b) => set.has(b))) { hits.push(entry); if (hits.length >= limit) break; } } return hits; } } const embedder = EmbeddingModel.embed(); const memory = new Memory(embedder, new DictBackend()); await memory.add("fact-1", "Rust has ownership and borrowing."); await memory.add("fact-2", "Python uses reference counting."); const results = await memory.search("memory management", 2); for (const r of results) { console.log(r.score, r.text); } ``` ```sh npx tsx examples/custom_memory_backend.ts ``` --- ## Example 9: ModelManager with VRAM Budget Track VRAM across multiple local models. Register each model with an estimated footprint; when loading a new one would exceed the budget, the least-recently-used model is automatically unloaded. > **Note:** `budgetBytes`, `vramEstimateBytes`, `usedBytes`, `availableBytes`, and `vramEstimate` are JS `bigint`s (since the Rust side uses `u64` and `>4 GiB` budgets used to silently truncate). `budgetGb` is still a regular `number`. ```typescript import { CompletionModel, ModelManager } from "blazen"; // 24 GB budget (GPU-typical for a consumer card). const manager = new ModelManager({ budgetGb: 24 }); const llama8b = CompletionModel.mistralrs({ modelId: "meta-llama/Llama-3.1-8B-Instruct", }); const qwen14b = CompletionModel.mistralrs({ modelId: "Qwen/Qwen2.5-14B-Instruct", }); const mistral24b = CompletionModel.mistralrs({ modelId: "mistralai/Mistral-Small-24B", }); // vramEstimateBytes is a bigint -- use bigint literal arithmetic. await manager.register("llama-8b", llama8b, 8n * 1024n ** 3n); await manager.register("qwen-14b", qwen14b, 14n * 1024n ** 3n); await manager.register("mistral-24b", mistral24b, 20n * 1024n ** 3n); // Fits alongside qwen-14b (8 + 14 = 22 GB). await manager.load("llama-8b"); await manager.load("qwen-14b"); // 20 GB does not fit next to 8 + 14 = 22 GB -- LRU (llama-8b) is evicted. await manager.load("mistral-24b"); for (const s of await manager.status()) { // s.vramEstimate is a bigint; toLocaleString() works directly on bigints. console.log(`${s.id}: loaded=${s.loaded}, vram=${s.vramEstimate.toLocaleString()} bytes`); } // usedBytes() and availableBytes() resolve to bigint -- divide by a bigint to get GB. const used = await manager.usedBytes(); const available = await manager.availableBytes(); console.log(`used=${used / (1024n ** 3n)} GB, available=${available / (1024n ** 3n)} GB`); ``` ```sh npx tsx examples/model_manager_budget.ts ``` --- ## Example 10: Pricing Registration and Cost Tracking Register pricing for any model ID (your own model, a local finetune, a custom deployment). Every `CompletionResponse` then carries a `cost` field computed from the registered rate. 
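The rate is linear in token counts, so for the usage in the example that follows (150 prompt tokens and 80 completion tokens at the registered $1/M and $2/M rates) the expected `cost` is $0.00031. A minimal sketch of the presumed arithmetic -- an illustration, not a Blazen API:

```typescript
// Presumed cost formula: tokens / 1e6 * rate, summed over input and output.
const inputCost = (150 / 1_000_000) * 1.0;  // $0.00015
const outputCost = (80 / 1_000_000) * 2.0;  // $0.00016
console.log((inputCost + outputCost).toFixed(6)); // "0.000310"
```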
```typescript import { ChatMessage, CompletionModel, lookupPricing, registerPricing, type CompletionResponse, } from "blazen"; class MyFinetune extends CompletionModel { constructor() { super({ modelId: "my-finetuned-model" }); } async complete(messages: ChatMessage[]): Promise { return { content: "Rust is a systems language with memory safety without GC.", model: this.modelId, toolCalls: [], usage: { promptTokens: 150, completionTokens: 80, totalTokens: 230 }, } as CompletionResponse; } } // Register pricing once, globally, for any model ID. registerPricing("my-finetuned-model", { inputPerMillion: 1.0, outputPerMillion: 2.0, }); // Readback -- pricing is centrally stored. const pricing = lookupPricing("my-finetuned-model"); if (pricing) { console.log( `input: $${pricing.inputPerMillion}/M, output: $${pricing.outputPerMillion}/M`, ); } const model = new MyFinetune(); const response = await model.complete([ ChatMessage.user("Summarize Rust in one line."), ]); console.log(response.content); console.log("usage:", response.usage); // cost is computed from the registered pricing + usage. console.log(`cost: $${response.cost?.toFixed(6)}`); ``` ```sh npx tsx examples/pricing_and_cost.ts ``` --- ## Example 11: Per-Capability Provider (Custom TTS) Extend `TTSProvider` to plug in any TTS backend (ElevenLabs, Coqui, a local model). The per-capability base classes (`TTSProvider`, `ImageProvider`, `VideoProvider`, `MusicProvider`, `ThreeDProvider`, `BackgroundRemovalProvider`, `VoiceProvider`) exist for users who only need to implement one capability. ```typescript import { TTSProvider } from "blazen"; class MyElevenLabs extends TTSProvider { private apiKey: string; constructor(apiKey: string) { super({ providerId: "elevenlabs", baseUrl: "https://api.elevenlabs.io/v1", }); this.apiKey = apiKey; } async textToSpeech(request: any): Promise { // In a real implementation, make an HTTP call with this.apiKey // and return audio bytes. Here we just echo the request. return { audioData: new Uint8Array([0, 1, 2]), format: "wav", voice: request.voice, text: request.text, }; } } const tts = new MyElevenLabs("sk-..."); const result = await tts.textToSpeech({ text: "Hello from Blazen!", voice: "alice", }); console.log(result.format, result.audioData.length, "bytes"); ``` ```sh npx tsx examples/custom_tts_provider.ts ``` --- ## Example 12: Typed Error Handling Blazen exports a typed error hierarchy so you can branch on failure modes with `instanceof` instead of string-matching messages. `ProviderError` carries structured fields (`provider`, `status`, `endpoint`, `requestId`, `detail`, `retryAfterMs`) populated from the underlying HTTP response. `RateLimitError`, `AuthError`, `TimeoutError`, `ContentPolicyError`, and the per-provider classes (`MistralRsError`, `WhisperError`, `PiperError`, ...) all extend `BlazenError`, so a single `instanceof BlazenError` check catches everything Blazen raises while letting unrelated runtime errors keep propagating. ```typescript import { CompletionModel, ChatMessage, RateLimitError, ProviderError, BlazenError, } from "blazen"; const model = CompletionModel.openai(); try { const response = await model.complete([ChatMessage.user("Hello")]); console.log(response.content); } catch (e) { if (e instanceof RateLimitError) { // Backoff and retry -- retryAfterMs is populated from the Retry-After header // when the upstream provider supplies one. 
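    // A minimal backoff sketch (retryAfterMs can be undefined when the provider
    // sends no Retry-After header):
    //   await new Promise((r) => setTimeout(r, Number(e.retryAfterMs ?? 1_000)));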
console.warn("Rate limited; retry later"); } else if (e instanceof ProviderError) { console.error( `Provider ${e.provider} returned ${e.status}: ${e.detail}`, ); } else if (e instanceof BlazenError) { console.error("Blazen error:", e.message); } else { throw e; } } ``` ```sh npx tsx examples/typed_error_handling.ts ``` --- ## Example 13: Custom Progress Reporting via ProgressCallback Subclass `ProgressCallback` to plug structured download progress into any UI. The base class exists as a real Rust-backed type so subclass instances can be passed straight into `ModelCache.download` (and any other Blazen API that accepts a progress hook). `onProgress` receives byte counts as `bigint` to safely represent multi-gigabyte downloads; `total` is `null` when the server does not advertise `Content-Length`. ```typescript import { ModelCache, ProgressCallback } from "blazen"; class TerminalProgress extends ProgressCallback { override onProgress(downloaded: bigint, total?: bigint | null): void { if (total) { const pct = ((Number(downloaded) / Number(total)) * 100).toFixed(1); process.stderr.write(`\r${pct}% (${downloaded}/${total})`); } else { process.stderr.write(`\r${downloaded} bytes`); } } } const cache = ModelCache.create(); const path = await cache.download( "mistralai/Mistral-7B-Instruct-v0.3", "config.json", new TerminalProgress(), ); console.log(`\ncached at ${path}`); ``` ```sh npx tsx examples/progress_callback.ts ``` --- ## Example 14: Pipeline State Persistence Pipelines are multi-stage workflows with built-in checkpoint support. Register a JSON persistence callback with `onPersistJson` and Blazen will hand you a snapshot string after each stage completes -- ideal for shipping checkpoints to durable storage (S3, Postgres, an HTTP service) so a crashed run can be resumed later via `Pipeline.resume(snapshot)`. Use `onPersist` instead if you'd rather receive the typed `PipelineSnapshot` object directly. ```typescript import { PipelineBuilder, Stage, Workflow } from "blazen"; const workflowA = new Workflow("step-1"); // ... addStep(...) calls populating workflowA ... const workflowB = new Workflow("step-2"); // ... addStep(...) calls populating workflowB ... const pipeline = new PipelineBuilder("ingest") .stage(new Stage("step-1", workflowA)) .stage(new Stage("step-2", workflowB)) .onPersistJson(async (snapshot: string) => { await fetch("/api/checkpoint", { method: "POST", body: snapshot, headers: { "Content-Type": "application/json" }, }); }) .build(); const handler = await pipeline.start({ input: "..." }); const result = await handler.result(); console.log(result.finalOutput); ``` ```sh npx tsx examples/pipeline_persistence.ts ``` --- # WASM Examples Source: https://blazen.dev/docs/examples/wasm Language: wasm Section: examples # WASM Examples Three complete examples that demonstrate real-world usage of the Blazen WASM SDK. --- ## Browser Chat App A minimal chat interface that runs Blazen entirely in the browser. Tokens stream into the DOM as they arrive. ```html
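<!-- The page markup was not captured in this extract; what follows is an
     illustrative sketch. It assumes a bundler (or import map) that serves
     @blazen/sdk to the browser; the element ids and layout are placeholders,
     not part of the SDK. -->
<!doctype html>
<html>
  <body>
    <div id="log"></div>
    <form id="chat">
      <input id="msg" placeholder="Ask something..." autocomplete="off" />
      <button type="submit">Send</button>
    </form>

    <script type="module">
      import init, { CompletionModel, ChatMessage } from "@blazen/sdk";

      await init(); // load the WASM binary once per page

      const log = document.getElementById("log");
      const form = document.getElementById("chat");
      const input = document.getElementById("msg");

      form.addEventListener("submit", async (e) => {
        e.preventDefault();
        const line = document.createElement("p");
        log.appendChild(line);

        // Same construction as the other WASM examples; stream() invokes the
        // callback with { delta } chunks, which are appended to the DOM as they arrive.
        const model = CompletionModel.openai();
        await model.stream([ChatMessage.user(input.value)], (chunk) => {
          if (chunk.delta) line.textContent += chunk.delta;
        });
        input.value = "";
      });
    </script>
  </body>
</html>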
``` --- ## Node.js Serverless Function A serverless API endpoint that uses the WASM SDK with tool calling. Deploy to any platform that supports Node.js (Vercel, AWS Lambda, etc.). ```typescript import init, { CompletionModel, ChatMessage, runAgent } from '@blazen/sdk'; let initialized = false; const tools = [ { name: 'lookupOrder', description: 'Look up an order by ID', parameters: { type: 'object', properties: { orderId: { type: 'string' } }, required: ['orderId'], }, }, { name: 'cancelOrder', description: 'Cancel an order by ID', parameters: { type: 'object', properties: { orderId: { type: 'string' }, reason: { type: 'string' }, }, required: ['orderId'], }, }, ]; async function toolHandler(toolName: string, args: Record) { switch (toolName) { case 'lookupOrder': // Replace with your database call return { orderId: args.orderId, status: 'shipped', eta: '2026-03-21' }; case 'cancelOrder': return { orderId: args.orderId, cancelled: true }; default: throw new Error(`Unknown tool: ${toolName}`); } } export default async function handler(req: Request): Promise { if (!initialized) { await init(); initialized = true; } const { message } = await req.json(); // Reads OPENAI_API_KEY from process.env. const model = CompletionModel.openai(); const result = await runAgent( model, [ ChatMessage.system('You are a customer support agent. Use tools to look up and manage orders.'), ChatMessage.user(message), ], tools, toolHandler, { maxIterations: 5 } ); return new Response(JSON.stringify({ reply: result.response.content, iterations: result.iterations, }), { headers: { 'Content-Type': 'application/json' }, }); } ``` --- ## Tauri Desktop App Use the WASM SDK inside a Tauri v2 app to run AI features locally without a server. ```typescript // src/lib/ai.ts import init, { CompletionModel, ChatMessage, Workflow } from '@blazen/sdk'; let ready = false; export async function ensureInit() { if (!ready) { await init(); ready = true; } } export async function summarize(text: string): Promise { await ensureInit(); const wf = new Workflow('summarizer'); wf.addStep('summarize', ['blazen::StartEvent'], async (event, ctx) => { // The WASM SDK reads ANTHROPIC_API_KEY from the runtime environment. const model = CompletionModel.anthropic(); const response = await model.complete([ ChatMessage.system('Summarize the following text concisely.'), ChatMessage.user(event.text), ]); return { type: 'blazen::StopEvent', result: { summary: response.content }, }; }); const result = await wf.run({ text }); return result.data.summary; } export async function chat( messages: Array<{ role: string; content: string }>, onChunk: (text: string) => void ): Promise { await ensureInit(); // Reads OPENAI_API_KEY from the environment. const model = CompletionModel.openai(); const chatMessages = messages.map((m) => m.role === 'user' ? ChatMessage.user(m.content) : ChatMessage.assistant(m.content) ); await model.stream(chatMessages, (chunk) => { if (chunk.delta) onChunk(chunk.delta); }); } ``` ```typescript // src/App.svelte (or your framework of choice) import { chat } from './lib/ai'; let output = ''; async function handleSend() { output = ''; await chat( [{ role: 'user', content: 'Explain Tauri in one paragraph.' }], (chunk) => { output += chunk; } ); } ``` The WASM binary runs inside the webview's JavaScript context. No Tauri command bridge is needed for AI calls -- only for filesystem or OS-level operations. 
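When a step does need the OS -- writing the summary to disk, say -- the call goes through the normal Tauri command bridge while the AI calls stay in the webview. A minimal sketch, assuming Tauri v2's `@tauri-apps/api` and a hypothetical `save_summary` command implemented on the Rust side:

```typescript
// src/lib/save.ts -- illustration only; `save_summary` is a made-up command name.
import { invoke } from '@tauri-apps/api/core';
import { summarize } from './ai';

export async function summarizeAndSave(text: string, path: string): Promise<void> {
  const summary = await summarize(text);           // pure WASM/JS, no bridge needed
  await invoke('save_summary', { path, summary }); // filesystem write crosses the bridge
}
```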
--- ## Custom CompletionModel via `fromJsHandler` WASM classes cannot be subclassed the way Python or Node classes can -- `wasm-bindgen` forbids it. Instead, the SDK exposes factory methods that accept JS handler functions. `CompletionModel.fromJsHandler` is the WASM equivalent of subclassing. ```typescript import init, { ChatMessage, CompletionModel, runAgent, } from "@blazen/sdk"; await init(); // Build a custom model by passing a complete handler (and optionally a stream handler). const model = CompletionModel.fromJsHandler( "echo-llm", async (request) => { // request has the same shape as CompletionRequest. const last = [...request.messages].reverse().find((m: any) => m.role === "user"); return { content: `echo: ${last?.content ?? ""}`, toolCalls: [], citations: [], artifacts: [], images: [], audio: [], videos: [], model: "echo-llm", metadata: {}, }; }, // Optional stream handler -- fires the onChunk callback with StreamChunk-shaped objects. async (request, onChunk) => { const last = [...request.messages].reverse().find((m: any) => m.role === "user"); for (const word of `echo: ${last?.content ?? ""}`.split(" ")) { onChunk({ delta: word + " " }); } }, // Config object -- everything optional. Pricing auto-registers into the global registry. { contextLength: 4096, maxOutputTokens: 2048, pricing: { inputPerMillion: 0.0, outputPerMillion: 0.0 }, }, ); const result = await runAgent( model, [ChatMessage.user("hello world")], [], // tools -- each item has { name, description, parameters, handler } {}, // options: { toolConcurrency?, maxIterations?, systemPrompt?, ... } ); console.log(result.content); // -> "echo: hello world" ``` --- ## Custom TTSProvider via Handler Per-capability providers (`TTSProvider`, `ImageProvider`, `VideoProvider`, `MusicProvider`, `ThreeDProvider`, `BackgroundRemovalProvider`, `VoiceProvider`) follow the same handler pattern -- pass your async function to the constructor. ```typescript import init, { TTSProvider } from "@blazen/sdk"; await init(); // TTSProvider takes a providerId and a single async handler. const tts = new TTSProvider("elevenlabs", async (request) => { // request: { text, voice, voiceUrl?, language?, speed?, model?, parameters? } // Replace with a real HTTP call to your TTS backend. const audio = new Uint8Array([0, 1, 2]); return { audioData: audio, format: "wav", voice: request.voice, text: request.text, }; }); const result = await tts.textToSpeech({ text: "Hello from Blazen!", voice: "alice", }); console.log(result.format, result.audioData.length, "bytes"); ``` For multi-method providers (e.g. `MusicProvider`), the constructor accepts an object of named handlers: ```typescript import init, { MusicProvider } from "@blazen/sdk"; await init(); const music = new MusicProvider("local-musicgen", { generateMusic: async (request) => { return { audioData: new Uint8Array(), format: "wav" }; }, generateSfx: async (request) => { return { audioData: new Uint8Array(), format: "wav" }; }, }); ``` --- ## ModelManager (WASM) The WASM `ModelManager` tracks VRAM across registered models and evicts the least-recently-used one when the budget would be exceeded. Because WASM classes cannot be subclassed, the manager takes an explicit `lifecycle` object with `load()` and `unload()` async methods. ```typescript import init, { CompletionModel, ModelManager } from "@blazen/sdk"; await init(); // 8 GB VRAM budget (conservative for a laptop GPU). const manager = new ModelManager(8); // Construct models backed by @mlc-ai/web-llm (lazy-loaded at complete-time). 
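// Footprints passed to register() below are plain byte counts
// (4_500_000_000 ≈ 4.5 GB); the constructor argument above is whole gigabytes.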
const llama = CompletionModel.webLlm("Llama-3.1-8B-Instruct-q4f32_1-MLC");
const qwen = CompletionModel.webLlm("Qwen2.5-7B-Instruct-q4f32_1-MLC");

// Each model registers with a lifecycle object. In a real app, this
// calls into the WebLLM engine to load/unload GPU resources.
manager.register("llama-8b", llama, 4_500_000_000, {
  load: async () => { console.log("loading llama..."); },
  unload: async () => { console.log("unloading llama..."); },
});
manager.register("qwen-7b", qwen, 4_200_000_000, {
  load: async () => { console.log("loading qwen..."); },
  unload: async () => { console.log("unloading qwen..."); },
});

await manager.load("llama-8b");
await manager.load("qwen-7b"); // evicts llama-8b (4.5 + 4.2 > 8 GB budget)

for (const s of manager.status()) {
  console.log(`${s.id}: loaded=${s.loaded}, vram=${s.vramEstimate}`);
}
console.log(`used=${manager.usedBytes}, available=${manager.availableBytes}`);
```

---

## ModelRegistry (WASM)

Wraps a JS object exposing `listModels()` and `getModel(modelId)` so browser code can plug a custom model catalog (a fetched manifest, an in-browser registry, a control-plane endpoint) into Blazen's model-info lookup surface. Same shape as the Python `ModelRegistry` ABC and the Node `ModelRegistry` class, so workflow code reads identically across runtimes.

```typescript
import init, { ModelRegistry } from "@blazen/sdk";
import type { ModelInfo } from "@blazen/sdk";

await init();

// Back the registry with whatever source you like -- a fetched manifest,
// an offline IndexedDB cache, or a control-plane endpoint.
const registry = new ModelRegistry({
  async listModels(): Promise<ModelInfo[]> {
    const res = await fetch("/api/models");
    if (!res.ok) throw new Error(`registry fetch failed: ${res.status}`);
    return res.json();
  },
  async getModel(modelId: string): Promise<ModelInfo | null> {
    const res = await fetch(`/api/models/${modelId}`);
    if (res.status === 404) return null;
    if (!res.ok) throw new Error(`registry fetch failed: ${res.status}`);
    return res.json();
  },
});

const all = await registry.listModels();
console.log(`available models: ${all.length}`);

const gpt = await registry.getModel("gpt-4o");
if (gpt) {
  console.log(gpt.id, gpt.provider, gpt.capabilities);
}
```

Both methods may return synchronous values too -- the SDK awaits whatever the callback returned, so a purely in-memory registry can skip the `async` keyword entirely.

---

## Pricing Registration and Cost Tracking (WASM)

`registerPricing` attaches USD-per-million-token rates to any model ID. Completions produced with that model ID then carry a computed `cost` field.

```typescript
import init, {
  ChatMessage,
  CompletionModel,
  computeCost,
  lookupPricing,
  registerPricing,
} from "@blazen/sdk";

await init();

// Register pricing once, globally.
registerPricing("my-finetuned-model", 1.0, 2.0); // $1/M input, $2/M output

// Lookup
const p = lookupPricing("my-finetuned-model");
if (p) {
  console.log(`input: $${p.inputPerMillion}/M, output: $${p.outputPerMillion}/M`);
}

// Compute a cost directly from token counts.
const cost = computeCost("my-finetuned-model", 1500, 800);
console.log(`estimated cost: $${cost?.toFixed(6)}`);

// Or route through any model that emits the same modelId -- for example, a
// custom handler that tags its responses with "my-finetuned-model".
const model = CompletionModel.fromJsHandler( "my-finetuned-model", async (_request) => ({ content: "…", toolCalls: [], citations: [], artifacts: [], images: [], audio: [], videos: [], model: "my-finetuned-model", usage: { promptTokens: 1500, completionTokens: 800, totalTokens: 2300 }, metadata: {}, }), undefined, {}, ); const response = await model.complete([ChatMessage.user("hi")]); console.log(`cost: $${response.cost?.toFixed(6)}`); // populated from the global registry ``` --- ## In-Browser RAG with `TractEmbedModel` + `InMemoryBackend` `TractEmbedModel` runs ONNX-format sentence-transformers entirely in the browser via [tract](https://github.com/sonos/tract) -- no remote embedding API required. Pair it with a typed `InMemoryBackend` and the high-level `Memory` facade for a fully client-side semantic search store. ```typescript import init, { TractEmbedModel, Memory, InMemoryBackend } from "@blazen/sdk"; await init(); // Both URLs must be CORS-enabled. HuggingFace's `resolve/main/...` paths // serve the right headers out of the box. const embedder = await TractEmbedModel.create( "https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx", "https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/tokenizer.json", ); // `fromBackend` keeps reads/writes inside WASM linear memory -- // no JS round-trips per call (unlike `fromJsBackend`). const memory = Memory.fromBackend(embedder, new InMemoryBackend()); await memory.addMany([ { id: "doc1", text: "Blazen is a Rust workflow engine." }, { id: "doc2", text: "WebAssembly runs in browsers, Node.js, and edge runtimes." }, { id: "doc3", text: "Tract is a tiny ONNX inference engine written in Rust." }, ]); const results = await memory.search("What is Blazen?", 3, null); results.forEach((r) => console.log(r.id, r.score, r.text)); ``` For cross-tab durability, swap `InMemoryBackend` for a JS-side IndexedDB backend via `Memory.fromJsBackend(embedder, backend)` -- the backend object just needs to implement `put`, `get`, `delete`, `list`, `len`, and `searchByBands`. --- ## Pipeline Snapshot Persistence to IndexedDB `PipelineBuilder.onPersistJson` hands you a JSON-shaped snapshot every time the pipeline reaches a checkpoint. Persist it to IndexedDB (or any other store) so a refresh or tab close does not lose progress. ```typescript import init, { PipelineBuilder, Stage, Context } from "@blazen/sdk"; await init(); // `idb` here is an `IDBPDatabase` from the `idb` npm package; substitute your // favourite IndexedDB wrapper. The callback fires after each successful stage // commit, so it doubles as a place to update progress UI. const pipeline = new PipelineBuilder("ingest-pipeline") .addStage( new Stage("normalize", async (input: any, _ctx: Context) => ({ text: String(input.text ?? "").trim().toLowerCase(), })), ) .addStage( new Stage("tokenize", async (input: any, _ctx: Context) => ({ tokens: input.text.split(/\s+/).filter(Boolean), })), ) .onPersistJson(async (snapshot: unknown) => { const tx = idb.transaction("checkpoints", "readwrite"); await tx.objectStore("checkpoints").put(snapshot, "current"); await tx.done; }) .build(); const result = await pipeline.run({ text: " Hello WASM World " }); console.log("tokens:", result.tokens); ``` On the next page load, read `checkpoints/current` back out and feed it to `PipelineBuilder.fromSnapshot(...)` to resume mid-flight instead of restarting from stage zero. 
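A minimal resume sketch under the same assumptions (the `idb` handle from above; the exact return shape of `PipelineBuilder.fromSnapshot` and the call that continues execution are assumptions here -- check the pipeline API reference for your SDK version):

```typescript
// On reload: pull the last committed checkpoint out of IndexedDB and rebuild
// the pipeline from it rather than starting at stage zero.
const snapshot = await idb.get("checkpoints", "current");
if (snapshot != null) {
  // Assumed resume path: fromSnapshot restores stage state, and run() is
  // presumed to continue from the first uncommitted stage.
  const resumed = PipelineBuilder.fromSnapshot(snapshot);
  const result = await resumed.run({ text: " Hello WASM World " });
  console.log("resumed tokens:", result.tokens);
}
```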
--- ## Human-in-the-Loop with `runWithHandler` + `streamEvents` `Workflow.runWithHandler` returns a live `WorkflowHandler` instead of awaiting the terminal event. Pair it with `streamEvents` to react to every event the engine publishes -- including `InputRequestEvent`, which is the WASM SDK's hook for human-in-the-loop prompts. Send the answer back through `respondToInput(requestId, response)` and the parked event loop unparks immediately. ```typescript import init, { Workflow } from "@blazen/sdk"; await init(); const workflow = new Workflow("topic-researcher"); workflow.addStep("clarify", ["blazen::StartEvent"], async (event, _ctx) => { // Ask the human to confirm or refine the research topic before we burn // tokens. The engine auto-parks on this event until `respondToInput` lands. return { type: "blazen::InputRequestEvent", request_id: crypto.randomUUID(), prompt: `Confirm the topic to research: "${event.topic}"`, metadata: null, }; }); workflow.addStep("research", ["blazen::InputResponseEvent"], async (event, _ctx) => { // `event.response` is whatever the JS side passed to `respondToInput`. return { type: "blazen::StopEvent", result: { confirmed_topic: event.response }, }; }); const handler = await workflow.runWithHandler({ topic: "tract embeddings" }); // `streamEvents` resolves when the workflow ends; events emitted before this // call are NOT replayed, so subscribe before you await the terminal result. const streaming = handler.streamEvents((event: { event_type: string; data: any }) => { if (event.event_type === "blazen::InputRequestEvent") { const answer = window.prompt(event.data.prompt) ?? ""; handler.respondToInput(event.data.request_id, answer); } else { console.log("event:", event.event_type, event.data); } }); const [, finalResult] = await Promise.all([streaming, handler.awaitResult()]); console.log("done:", finalResult); ``` Outside the browser (Node, Deno, edge runtimes) just swap `window.prompt` for whatever input source you have -- a WebSocket message, a CLI readline, an HTTP form post -- and call `respondToInput` from there. The handler is thread-safe in the sense that it's a single-owner JS object, so as long as the `respondToInput` call happens on the same JS event loop that owns the handler, the workflow unparks correctly.
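For instance, a Node-flavoured version of the same loop might look like this (a sketch, assuming the `workflow` defined above and Node's built-in `readline/promises` as the input source):

```typescript
import readline from "node:readline/promises";

const rl = readline.createInterface({ input: process.stdin, output: process.stdout });

const handler = await workflow.runWithHandler({ topic: "tract embeddings" });

const streaming = handler.streamEvents((event: { event_type: string; data: any }) => {
  if (event.event_type === "blazen::InputRequestEvent") {
    // Ask on stdin instead of window.prompt; respondToInput unparks the step.
    rl.question(`${event.data.prompt}\n> `).then((answer) => {
      handler.respondToInput(event.data.request_id, answer);
    });
  }
});

const [, finalResult] = await Promise.all([streaming, handler.awaitResult()]);
rl.close();
console.log("done:", finalResult);
```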