Batch Completions
Run many completion requests concurrently with bounded parallelism
complete_batch (Python) / completeBatch (Node) / blazen_llm::batch::complete_batch (Rust) drives a CompletionModel with a list of independent conversations in parallel, capped by a configurable concurrency limit. It preserves input order, reports per-request success/failure, and aggregates token usage and cost across the batch.
Overview
Use batch completion when you have many short, independent prompts, such as classification, labelling, scoring, or RAG retrievers that fan out across chunks. It is not a replacement for the OpenAI “Batch API” (half-price, 24-hour latency): Blazen’s batch runs every request live and returns as soon as the slowest request in the flight completes.
Key properties:
- Bounded concurrency: a semaphore caps in-flight requests. 0 means unlimited.
- Partial failures: each request is awaited independently. One failure does not cancel the rest.
- Order-preserving: the output list lines up 1:1 with the input list.
- Aggregated usage: total_usage and total_cost sum across successful responses.
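The first three properties can be illustrated with plain asyncio. The sketch below is not Blazen's implementation; `bounded_batch` and `fake_model` are hypothetical names used only to show how a semaphore, independent awaiting, and ordered output fit together.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple


async def bounded_batch(
    call: Callable[[str], Awaitable[str]],
    prompts: List[str],
    concurrency: int = 4,
) -> Tuple[List[Optional[str]], List[Optional[str]]]:
    """Run `call` over `prompts` with at most `concurrency` calls in flight.

    Returns (responses, errors), both aligned 1:1 with `prompts`.
    """
    # Mirror the documented convention: 0 means unlimited, so no semaphore.
    semaphore = asyncio.Semaphore(concurrency) if concurrency > 0 else None

    async def run_one(prompt: str) -> str:
        if semaphore is None:
            return await call(prompt)
        async with semaphore:  # hold one slot for the duration of the call
            return await call(prompt)

    # return_exceptions=True keeps one failure from cancelling the others;
    # gather returns results in input order regardless of completion order.
    outcomes = await asyncio.gather(
        *(run_one(p) for p in prompts), return_exceptions=True
    )

    responses: List[Optional[str]] = []
    errors: List[Optional[str]] = []
    for outcome in outcomes:
        if isinstance(outcome, BaseException):
            responses.append(None)
            errors.append(str(outcome))
        else:
            responses.append(outcome)
            errors.append(None)
    return responses, errors


async def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real completion call."""
    await asyncio.sleep(0.01)
    if "fail" in prompt:
        raise RuntimeError(f"provider rejected: {prompt}")
    return prompt.upper()
```

Calling `asyncio.run(bounded_batch(fake_model, ["a", "please fail", "c"], concurrency=2))` yields a response list with `None` in the middle slot and the matching error message at the same index.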
Basic usage
# Python
from blazen import CompletionModel, ChatMessage, complete_batch

model = CompletionModel.openai()

conversations = [
    [ChatMessage.user("What is 2 + 2?")],
    [ChatMessage.user("What is the capital of France?")],
    [ChatMessage.user("Who wrote Hamlet?")],
]

# Run at most 4 requests at a time; results come back in input order.
result = await complete_batch(model, conversations, concurrency=4)

for i, resp in enumerate(result.responses):
    if resp is not None:
        print(f"[{i}] {resp.content}")
    else:
        # A failed request leaves None in responses and a message in errors.
        print(f"[{i}] ERROR: {result.errors[i]}")

print("Total tokens:", result.total_usage)
print("Total cost: $", result.total_cost)
// Node
import { CompletionModel, ChatMessage, completeBatch } from "blazen";

const model = CompletionModel.openai({ apiKey: process.env.OPENAI_API_KEY });

const result = await completeBatch(
  model,
  [
    [ChatMessage.user("What is 2 + 2?")],
    [ChatMessage.user("What is the capital of France?")],
    [ChatMessage.user("Who wrote Hamlet?")],
  ],
  { concurrency: 4 },
);

for (let i = 0; i < result.responses.length; i++) {
  const resp = result.responses[i];
  if (resp) {
    console.log(`[${i}]`, resp.content);
  } else {
    console.error(`[${i}] ERROR:`, result.errors[i]);
  }
}
// Rust
use blazen_llm::batch::{complete_batch, BatchConfig};
use blazen_llm::{ChatMessage, CompletionRequest};
use blazen_llm::providers::openai::OpenAiProvider;
use blazen_llm::traits::CompletionModel;

let model = OpenAiProvider::from_env()?;

let requests = vec![
    CompletionRequest::new(vec![ChatMessage::user("What is 2 + 2?")]),
    CompletionRequest::new(vec![ChatMessage::user("What is the capital of France?")]),
    CompletionRequest::new(vec![ChatMessage::user("Who wrote Hamlet?")]),
];

let result = complete_batch(&model, requests, BatchConfig::new(4)).await;

for (i, response) in result.responses.iter().enumerate() {
    match response {
        Ok(r) => println!("[{i}] {}", r.content.as_deref().unwrap_or("")),
        Err(e) => eprintln!("[{i}] ERROR: {e}"),
    }
}
Applying options to every request
Pass a shared CompletionOptions / JsCompletionOptions to apply temperature, max tokens, or a tool set to every request in the flight:
# Python
from blazen import CompletionOptions

result = await complete_batch(
    model,
    conversations,
    concurrency=8,
    options=CompletionOptions(temperature=0.2, max_tokens=200),
)
// Node
const result = await completeBatch(model, conversations, {
  concurrency: 8,
  temperature: 0.2,
  maxTokens: 200,
});
Handling partial failures
Each element of result.responses is either a completion or None / null. The matching index in result.errors holds the error message when a request failed. This lets you retry only the failing subset or surface a structured error to the caller without losing the successful answers.
failed_indices = [i for i, r in enumerate(result.responses) if r is None]
print(f"{len(failed_indices)} of {len(conversations)} requests failed")
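Retrying only the failing subset amounts to re-running the failed inputs and merging the new results back by index. The sketch below shows that bookkeeping in isolation. `retry_failed` and `always_ok_batch` are hypothetical helpers, and `run_batch` stands in for whatever batch call you use; it is assumed to return aligned `(responses, errors)` lists.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

Batch = Tuple[List[Optional[str]], List[Optional[str]]]


async def retry_failed(
    run_batch: Callable[[List[str]], Awaitable[Batch]],
    inputs: List[str],
    responses: List[Optional[str]],
    errors: List[Optional[str]],
) -> Batch:
    """Re-run only the failed inputs and merge the results back by index."""
    failed = [i for i, r in enumerate(responses) if r is None]
    if not failed:
        return list(responses), list(errors)

    # Submit just the failing subset; successful answers are left untouched.
    retry_responses, retry_errors = await run_batch([inputs[i] for i in failed])

    merged_responses, merged_errors = list(responses), list(errors)
    for position, original_index in enumerate(failed):
        merged_responses[original_index] = retry_responses[position]
        merged_errors[original_index] = retry_errors[position]
    return merged_responses, merged_errors


async def always_ok_batch(items: List[str]) -> Batch:
    """Hypothetical batch runner that succeeds for every input."""
    return [f"ok:{item}" for item in items], [None] * len(items)
```

Because the merge is index-based, the output stays aligned with the original input list even after several retry rounds.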
BatchResult
| Field | Type | Description |
|---|---|---|
| responses | list[CompletionResponse \| None] | Per-request results in input order. |
| errors | list[str \| None] | Per-request error messages. None when the request succeeded. |
| total_usage | dict \| None | Summed prompt_tokens, completion_tokens, and total_tokens across successes. |
| total_cost | float \| None | Summed USD cost across successes (only set when the provider reports pricing). |
The Rust version returns BatchResult with responses: Vec<Result<CompletionResponse, BlazenError>> instead — it does not split success and error into separate vectors.
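To make "summed across successes" concrete, the following sketch adds up per-request usage dictionaries and skips failed requests. It assumes each usage record is a dict with the three keys named in the table; `sum_usage` is a hypothetical helper, not part of Blazen. Something similar can be handy when combining totals from an initial batch and a retry batch.

```python
from typing import Dict, List, Optional

USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def sum_usage(usages: List[Optional[Dict[str, int]]]) -> Optional[Dict[str, int]]:
    """Sum token counts over the non-None entries; None if there are none."""
    totals = {key: 0 for key in USAGE_KEYS}
    seen_any = False
    for usage in usages:
        if usage is None:  # failed request, or no usage reported
            continue
        seen_any = True
        for key in USAGE_KEYS:
            totals[key] += usage.get(key, 0)
    return totals if seen_any else None
```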
Node BatchResult class
The Node binding returns a typed BatchResult class. Field access works through getters, so the snippet above (result.responses[i], result.errors[i]) reads exactly like a plain object, but you also get richer summary accessors and a printable form:
| Accessor | Type | Description |
|---|---|---|
| .responses | (CompletionResponse \| null)[] | Per-request results in input order. |
| .errors | (BlazenError \| null)[] | Per-request errors. null when the request succeeded. |
| .totalUsage | TokenUsage | Summed promptTokens, completionTokens, and totalTokens across successes. |
| .totalCost | number | Summed USD cost across successes (zero when no provider in the flight reports pricing). |
| .successCount | number | Number of requests that produced a CompletionResponse. |
| .failureCount | number | Number of requests that produced a BlazenError. |
| .length | number | Total request count; always successCount + failureCount. |
| .toString() | string | Human-readable summary, useful for logs. |
import { BatchResult, completeBatch } from "blazen";

const result = await completeBatch(model, conversations, { concurrency: 8 });

if (result instanceof BatchResult) {
  console.log(`${result.successCount}/${result.length} succeeded`);
  console.log("usage:", result.totalUsage);
  console.log("cost: $", result.totalCost);
  console.log(result.toString());
}
instanceof BatchResult narrows the value for TypeScript and is the canonical way to discriminate the result from any wrapping union you build around it.
The Python BatchResult mirrors the same shape: .responses, .errors, .total_usage, .total_cost, .success_count, .failure_count, plus __len__ so len(result) returns the total request count.
Choosing a concurrency level
- 0 (unlimited) is fine for fast providers with generous rate limits and small batches (under 100 requests).
- For rate-limited providers, set concurrency to your per-second request budget multiplied by the expected per-request latency in seconds. For example, a budget of 10 requests per second at roughly 2 seconds per request supports about 20 requests in flight.
- When combining with with_retry, remember the semaphore slot is held for the full retry chain of a single request, so budget accordingly.
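A sizing rule consistent with Little's law is that the number of requests in flight is roughly the request rate times the time each request takes. The helper below is a hypothetical sketch of that arithmetic, not a Blazen API; it rounds down and never returns less than 1, so that a small budget does not turn into 0 (which would mean unlimited).

```python
import math


def suggest_concurrency(requests_per_second: float, latency_seconds: float) -> int:
    """Estimate a concurrency limit from a rate budget and typical latency.

    Little's law: in-flight requests ~= requests per second * seconds per request.
    """
    if requests_per_second <= 0 or latency_seconds <= 0:
        raise ValueError("rate and latency must both be positive")
    # Round down to stay under the budget, but keep at least one slot.
    return max(1, math.floor(requests_per_second * latency_seconds))
```

Treat the result as a starting point: retries with backoff hold a slot without issuing requests, so measured throughput may come in below the budget.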
See also
- Custom Providers: batch also works with subclassed CompletionModels
- Prompt Templates: render a templated system prompt once and fan it across many user messages
- Chat Window: build each conversation within a token budget before batching