Batch Completions

Run many completion requests concurrently with bounded parallelism

complete_batch (Python) / completeBatch (Node) / blazen_llm::batch::complete_batch (Rust) drives a CompletionModel with a list of independent conversations in parallel, capped by a configurable concurrency limit. It preserves input order, reports per-request success/failure, and aggregates token usage and cost across the batch.

Overview

Use batch completion when you have many short, independent prompts: classification, labelling, scoring, or RAG retrievers that fan out across chunks. It is not a replacement for the OpenAI “Batch API” (half-price, results within 24 hours); Blazen’s batch runs every request live, so the call returns as soon as the slowest request in the flight completes.

Key properties:

Bounded concurrency — a semaphore caps in-flight requests. 0 means unlimited.
Partial failures — each request is awaited independently. One failure does not cancel the rest.
Order-preserving — the output list lines up 1:1 with the input list.
Aggregated usage — total_usage and total_cost sum across successful responses.
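The first three properties can be illustrated with plain asyncio. This is a minimal sketch of the general technique, not Blazen's actual implementation: bounded_batch, fake_worker, and the counters are hypothetical names introduced for the example.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_batch(
    worker: Callable[[T], Awaitable[R]],
    items: Sequence[T],
    concurrency: int = 4,
) -> Tuple[List[Optional[R]], List[Optional[str]]]:
    """Hypothetical helper: run worker over items with bounded parallelism."""
    # A limit of 0 means unlimited, so no semaphore is created at all.
    sem = asyncio.Semaphore(concurrency) if concurrency > 0 else None

    async def run_one(item: T) -> R:
        if sem is None:
            return await worker(item)
        async with sem:  # waits while `concurrency` calls are in flight
            return await worker(item)

    # gather() returns outcomes in input order; return_exceptions=True keeps
    # one failure from cancelling the remaining requests.
    outcomes = await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )
    responses = [None if isinstance(o, BaseException) else o for o in outcomes]
    errors = [str(o) if isinstance(o, BaseException) else None for o in outcomes]
    return responses, errors


in_flight = 0
peak = 0


async def fake_worker(n: int) -> str:
    """Stand-in for a model call; item 1 always fails."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    try:
        await asyncio.sleep(0.01 * (3 - n))  # earlier items take longer
        if n == 1:
            raise ValueError("boom")
        return f"answer {n}"
    finally:
        in_flight -= 1


responses, errors = asyncio.run(bounded_batch(fake_worker, [0, 1, 2], concurrency=2))
print(responses, errors, peak)
```

Although the items finish out of order, the two output lists stay aligned with the input, and the failure at index 1 leaves the other answers intact.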

Basic usage

from blazen import CompletionModel, ChatMessage, complete_batch

model = CompletionModel.openai()

conversations = [
    [ChatMessage.user("What is 2 + 2?")],
    [ChatMessage.user("What is the capital of France?")],
    [ChatMessage.user("Who wrote Hamlet?")],
]

result = await complete_batch(model, conversations, concurrency=4)

for i, resp in enumerate(result.responses):
    if resp is not None:
        print(f"[{i}] {resp.content}")
    else:
        print(f"[{i}] ERROR: {result.errors[i]}")

print("Total tokens:", result.total_usage)
print("Total cost:  $", result.total_cost)

import { CompletionModel, ChatMessage, completeBatch } from "blazen";

const model = CompletionModel.openai({ apiKey: process.env.OPENAI_API_KEY });

const result = await completeBatch(
  model,
  [
    [ChatMessage.user("What is 2 + 2?")],
    [ChatMessage.user("What is the capital of France?")],
    [ChatMessage.user("Who wrote Hamlet?")],
  ],
  { concurrency: 4 },
);

for (let i = 0; i < result.responses.length; i++) {
  const resp = result.responses[i];
  if (resp) {
    console.log(`[${i}]`, resp.content);
  } else {
    console.error(`[${i}] ERROR:`, result.errors[i]);
  }
}

use blazen_llm::batch::{complete_batch, BatchConfig};
use blazen_llm::{ChatMessage, CompletionRequest};
use blazen_llm::providers::openai::OpenAiProvider;
use blazen_llm::traits::CompletionModel;

let model = OpenAiProvider::from_env()?;

let requests = vec![
    CompletionRequest::new(vec![ChatMessage::user("What is 2 + 2?")]),
    CompletionRequest::new(vec![ChatMessage::user("What is the capital of France?")]),
    CompletionRequest::new(vec![ChatMessage::user("Who wrote Hamlet?")]),
];

let result = complete_batch(&model, requests, BatchConfig::new(4)).await;

for (i, response) in result.responses.iter().enumerate() {
    match response {
        Ok(r) => println!("[{i}] {}", r.content.as_deref().unwrap_or("")),
        Err(e) => eprintln!("[{i}] ERROR: {e}"),
    }
}

Applying options to every request

Pass a shared CompletionOptions (Python) or JsCompletionOptions (Node) to apply the same temperature, max tokens, or tool set to every request in the flight:

from blazen import CompletionOptions

result = await complete_batch(
    model,
    conversations,
    concurrency=8,
    options=CompletionOptions(temperature=0.2, max_tokens=200),
)

const result = await completeBatch(model, conversations, {
  concurrency: 8,
  temperature: 0.2,
  maxTokens: 200,
});

Handling partial failures

Each element of result.responses is either a completion or None (Python) / null (Node). When a request fails, the matching index in result.errors holds its error: a message string in Python, a BlazenError in Node. This lets you retry only the failing subset or surface a structured error to the caller without losing the successful answers.

failed_indices = [i for i, r in enumerate(result.responses) if r is None]
print(f"{len(failed_indices)} of {len(conversations)} requests failed")
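Retrying only the failing subset might look like the sketch below. It is written against a generic run_batch callable rather than Blazen itself: retry_failed, run_batch, and fake_run_batch are hypothetical names, and run_batch is assumed to return a (responses, errors) pair aligned with its input.

```python
import asyncio


async def retry_failed(run_batch, inputs, responses, errors, max_rounds=2):
    """Re-run failed inputs and merge the results back by original index."""
    responses, errors = list(responses), list(errors)
    for _ in range(max_rounds):
        failed = [i for i, r in enumerate(responses) if r is None]
        if not failed:
            break
        # Only the failing inputs are sent again.
        new_responses, new_errors = await run_batch([inputs[i] for i in failed])
        for i, resp, err in zip(failed, new_responses, new_errors):
            responses[i], errors[i] = resp, err
    return responses, errors


attempts = {}


async def fake_run_batch(batch):
    """Stand-in batch call: "b" fails once, "c" always fails."""
    out_responses, out_errors = [], []
    for prompt in batch:
        attempts[prompt] = attempts.get(prompt, 0) + 1
        if prompt == "c" or (prompt == "b" and attempts[prompt] == 1):
            out_responses.append(None)
            out_errors.append("rate limited")
        else:
            out_responses.append(prompt.upper())
            out_errors.append(None)
    return out_responses, out_errors


async def main():
    inputs = ["a", "b", "c"]
    responses, errors = await fake_run_batch(inputs)
    return await retry_failed(fake_run_batch, inputs, responses, errors)


responses, errors = asyncio.run(main())
print(responses, errors, attempts)
```

With Blazen, run_batch would presumably be a small wrapper that calls complete_batch and returns result.responses and result.errors. Successful answers are never re-sent, so they cost nothing extra.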

BatchResult

| Field | Type | Description |
| --- | --- | --- |
| `responses` | `list[CompletionResponse \| None]` | Per-request results in input order. |
| `errors` | `list[str \| None]` | Per-request error messages. `None` when the request succeeded. |
| `total_usage` | `dict \| None` | Summed `prompt_tokens`, `completion_tokens`, and `total_tokens` across successes. |
| `total_cost` | `float \| None` | Summed USD cost across successes (only set when the provider reports pricing). |

The Rust version returns BatchResult with responses: Vec<Result<CompletionResponse, BlazenError>> instead — it does not split success and error into separate vectors.
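The aggregation rules for the Python fields can be sketched as follows. This is an illustration under assumptions, not Blazen's code: FakeResponse and its usage and cost fields are hypothetical stand-ins, and returning None when nothing was reported is one plausible reading of the table.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


@dataclass
class FakeResponse:
    """Stand-in for a completion response; field names are assumptions."""
    usage: Optional[dict]
    cost: Optional[float]


def aggregate(
    responses: List[Optional[FakeResponse]],
) -> Tuple[Optional[dict], Optional[float]]:
    """Sum usage and cost over successful responses only."""
    total_usage = None
    total_cost = None
    for resp in responses:
        if resp is None:  # failed request: contributes nothing
            continue
        if resp.usage is not None:
            if total_usage is None:
                total_usage = {key: 0 for key in USAGE_KEYS}
            for key in USAGE_KEYS:
                total_usage[key] += resp.usage.get(key, 0)
        if resp.cost is not None:  # only set when pricing is reported
            total_cost = (total_cost or 0.0) + resp.cost
    return total_usage, total_cost


demo = [
    FakeResponse({"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}, 0.25),
    None,
    FakeResponse({"prompt_tokens": 20, "completion_tokens": 10, "total_tokens": 30}, None),
]
total_usage, total_cost = aggregate(demo)
print(total_usage, total_cost)
```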

Node `BatchResult` class

The Node binding returns a typed BatchResult class. Fields are exposed through getters, so the Basic usage snippet (result.responses[i], result.errors[i]) reads exactly as it would against a plain object, and the class adds summary accessors and a printable form:

| Accessor | Type | Description |
| --- | --- | --- |
| `.responses` | `(CompletionResponse \| null)[]` | Per-request results in input order. |
| `.errors` | `(BlazenError \| null)[]` | Per-request errors. `null` when the request succeeded. |
| `.totalUsage` | `TokenUsage` | Summed `promptTokens`, `completionTokens`, and `totalTokens` across successes. |
| `.totalCost` | `number` | Summed USD cost across successes (zero when no provider in the flight reports pricing). |
| `.successCount` | `number` | Number of requests that produced a `CompletionResponse`. |
| `.failureCount` | `number` | Number of requests that produced a `BlazenError`. |
| `.length` | `number` | Total request count, always `successCount + failureCount`. |
| `.toString()` | `string` | Human-readable summary, useful for logs. |

import { BatchResult, completeBatch } from "blazen";

const result = await completeBatch(model, conversations, { concurrency: 8 });

if (result instanceof BatchResult) {
  console.log(`${result.successCount}/${result.length} succeeded`);
  console.log("usage:", result.totalUsage);
  console.log("cost:  $", result.totalCost);
  console.log(result.toString());
}

instanceof BatchResult narrows the value for TypeScript and is the canonical way to discriminate the result from any wrapping union you build around it.

The Python BatchResult mirrors the same shape: .responses, .errors, .total_usage, .total_cost, .success_count, .failure_count, plus __len__ so len(result) returns the total request count.

Choosing a concurrency level

0 (unlimited) is fine for fast providers with generous rate limits and small batches (under 100 requests).
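For rate-limited providers, set concurrency to roughly your per-second request budget multiplied by the expected per-request latency in seconds (Little's law). For example, a budget of 10 requests per second at about 2 seconds per request supports roughly 20 requests in flight.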
When combining with with_retry, remember the semaphore slot is held for the full retry chain of a single request — budget accordingly.
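A rough way to size the limit is Little's law: requests in flight equal the request rate times the time each request takes. The helpers below are a back-of-the-envelope sketch. suggest_concurrency and estimate_batch_seconds are hypothetical names, and the duration estimate assumes requests run in equal-length waves.

```python
import math


def suggest_concurrency(requests_per_second: float, latency_seconds: float) -> int:
    """In-flight requests that keep the request rate within the budget."""
    if requests_per_second <= 0 or latency_seconds <= 0:
        raise ValueError("rate and latency must be positive")
    # Never return 0: in the batch API described above, 0 means unlimited.
    return max(1, math.floor(requests_per_second * latency_seconds))


def estimate_batch_seconds(n_requests: int, concurrency: int, slot_seconds: float) -> float:
    """Wave-based wall-clock estimate.

    slot_seconds is how long one request holds its semaphore slot,
    including any retries and backoff.
    """
    if concurrency <= 0:
        raise ValueError("concurrency must be positive for this estimate")
    return math.ceil(n_requests / concurrency) * slot_seconds


limit = suggest_concurrency(requests_per_second=10, latency_seconds=2.0)
no_retry = estimate_batch_seconds(100, limit, slot_seconds=2.0)
with_retry = estimate_batch_seconds(100, limit, slot_seconds=6.0)
print(limit, no_retry, with_retry)
```

Retries do not raise the safe limit: each slot still runs one attempt at a time. What they change is how long a slot stays occupied, which is why the second estimate is three times the first.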