Local Inference

Run LLMs, embeddings, TTS, and transcription on-device

Blazen can run every major model class entirely on your own hardware: no API key, no network call, no data leaving the machine. Local backends are opt-in via feature flags on the Rust crates, and the prebuilt Python wheels and Node packages include the most common ones.

Overview

| Capability | Backend | Rust crate | Feature flag |
|---|---|---|---|
| LLM chat | mistral.rs | blazen-llm-mistralrs | mistralrs |
| LLM chat | llama.cpp | blazen-llm-llamacpp | llamacpp |
| LLM chat | Candle | blazen-llm-candle | candle |
| Embeddings | Blazen embed (fastembed on glibc/mac/windows, tract on musl/wasm) | blazen-embed | embed |
| Embeddings | Candle | blazen-embed-candle | candle-embed |
| Transcription | whisper.cpp | blazen-audio-whispercpp | whispercpp |
| TTS | Piper | blazen-audio-piper | piper |
| Image generation | Stable Diffusion | blazen-image-diffusion | diffusion |

Every local provider implements the same trait as its remote counterpart — CompletionModel, EmbeddingModel, Transcription, etc. — so they slot into the exact same workflows and can be swapped with a one-line change.
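The swap can be pictured with a small standalone sketch. It does not use Blazen itself: `CompletionLike`, `FakeLocalModel`, `FakeRemoteModel`, and `Reply` are hypothetical stand-ins invented here, and the only point is that workflow code written against a shared interface does not care where inference runs.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Reply:
    """Hypothetical response type with a single text field."""
    content: str


class CompletionLike(Protocol):
    """Hypothetical shared interface, standing in for a trait like CompletionModel."""

    async def complete(self, prompt: str) -> Reply: ...


class FakeLocalModel:
    """Stand-in for an on-device backend."""

    async def complete(self, prompt: str) -> Reply:
        return Reply(content=f"[local] {prompt}")


class FakeRemoteModel:
    """Stand-in for a hosted API backend."""

    async def complete(self, prompt: str) -> Reply:
        return Reply(content=f"[remote] {prompt}")


async def summarize(model: CompletionLike, text: str) -> str:
    # Workflow code depends only on the shared interface.
    reply = await model.complete(f"Summarize: {text}")
    return reply.content


# Swapping backends is the one-line change: pick a different constructor here.
model: CompletionLike = FakeLocalModel()
result = asyncio.run(summarize(model, "hello"))
```

Replacing `FakeLocalModel()` with `FakeRemoteModel()` changes nothing in `summarize`, which is the property the shared traits are meant to give real workflows.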

Local LLM (mistral.rs)

from blazen import CompletionModel, MistralRsOptions, Quantization, Device, ChatMessage

model = CompletionModel.mistralrs(options=MistralRsOptions(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    quantization=Quantization.Q4KM,
    device=Device.Cuda,
    context_length=8192,
))

# First call downloads the GGUF weights (~4 GB for Q4KM Mistral-7B).
# Subsequent calls reuse the cached weights.
await model.load()

response = await model.complete([ChatMessage.user("Hello!")])
print(response.content)

# Free VRAM when you need the GPU for something else.
await model.unload()

The same flow in Rust:

use blazen_llm_mistralrs::{MistralRsOptions, MistralRsProvider};
use blazen_llm::{ChatMessage, CompletionRequest, LocalModel, CompletionModel};

let mut provider = MistralRsProvider::new(
    MistralRsOptions::new("mistralai/Mistral-7B-Instruct-v0.3")
        .with_quantization("Q4KM")
        .with_device("cuda"),
)?;

provider.load().await?;
let response = provider
    .complete(CompletionRequest::new(vec![ChatMessage::user("Hello!")]))
    .await?;
println!("{}", response.content.unwrap_or_default());

CompletionModel.mistralrs, CompletionModel.llamacpp, and CompletionModel.candle all follow the same shape: required model_id, optional quantization and device hints, optional context length and cache directory.

Typed streaming chunks

Streaming local completions yield named classes rather than anonymous objects, so editor autocomplete and type-checkers know exactly what fields are available.

  • mistral.rs (un-prefixed): ChatMessageInput, InferenceChunk, InferenceChunkStream, InferenceResult, InferenceUsage.
  • llama.cpp (LlamaCpp prefix): parallel surface — LlamaCppChatMessageInput, LlamaCppInferenceChunk, LlamaCppInferenceChunkStream, LlamaCppInferenceResult, LlamaCppInferenceUsage.
  • Candle: CandleInferenceResult only (single-shot, no streaming).

In TypeScript:

const stream: InferenceChunkStream = await model.completeStream([
  { role: "user", content: "Hello!" } satisfies ChatMessageInput,
]);
for await (const chunk of stream) {
  process.stdout.write(chunk.content ?? "");
}

Swap InferenceChunkStream for LlamaCppInferenceChunkStream when you build the model via CompletionModel.llamacpp — the surface is identical, only the prefix changes.

The Python equivalent:

stream = await model.complete_stream([ChatMessage.user("Hello!")])
async for chunk in stream:
    print(chunk.content or "", end="")
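Because each chunk is a named type, collecting a stream into a complete reply is a short loop. The sketch below is standalone: `Chunk` and `fake_stream` are hypothetical stand-ins for a chunk class and a stream, and the only assumption carried over from the examples above is that a chunk exposes an optional `content` field.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a typed streaming chunk."""
    content: Optional[str] = None


async def fake_stream() -> AsyncIterator[Chunk]:
    # Simulates a backend emitting pieces of text; one chunk carries no content.
    for piece in ["Hel", "lo", None, "!"]:
        yield Chunk(content=piece)


async def collect(stream: AsyncIterator[Chunk]) -> str:
    parts = []
    async for chunk in stream:
        # Treat a missing content field as an empty string.
        parts.append(chunk.content or "")
    return "".join(parts)


text = asyncio.run(collect(fake_stream()))
```

The `or ""` guard mirrors the streaming loops above and keeps content-free chunks from breaking the join.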

Local embeddings

from blazen import EmbeddingModel, EmbedOptions

model = EmbeddingModel.local(options=EmbedOptions(
    model_name="BGESmallENV15",     # 384 dims, ~33 MB download
    cache_dir="/tmp/blazen-embed",
    max_batch_size=256,
))

resp = await model.embed(["hello", "world"])
print(len(resp.embeddings[0]))  # 384

The Node equivalent:

import { EmbeddingModel } from "blazen";

const model = EmbeddingModel.embed({
  modelName: "BGESmallENV15",
  cacheDir: "/tmp/blazen-embed",
});

const resp = await model.embed(["hello", "world"]);

Blazen’s embed backend runs through ONNX Runtime on glibc/mac/windows and pure-Rust tract on musl — CPU-only, no GPU required, no Python ML runtime needed. Models cache locally after the first download.

Browser embeddings (WASM tract)

In the browser there is no hf-hub and no filesystem cache, so TractEmbedModel.create(modelUrl, tokenizerUrl, options) fetches the ONNX weights and the tokenizer.json directly over HTTP via web_sys::fetch. Host the two files on any CDN (or your own origin with permissive CORS) and pass the URLs in:

import { TractEmbedModel } from "@blazen/sdk";

const model = await TractEmbedModel.create(
  "https://example.com/all-MiniLM-L6-v2.onnx",
  "https://example.com/tokenizer.json",
);
const result = await model.embed(["hello", "world"]);

Inference runs entirely in the browser, on the main thread or in a Web Worker if you spawn one, using pure-Rust tract: no WebGPU, no ONNX Runtime Web, no server round-trip.

Local transcription (whisper.cpp)

from blazen import Transcription, TranscriptionRequest, WhisperOptions, WhisperModel

transcriber = Transcription.whispercpp(options=WhisperOptions(
    model=WhisperModel.Base,
    language="en",
))

result = await transcriber.transcribe(
    TranscriptionRequest.from_file("/path/to/audio.wav")
)
print(result.text)

See the dedicated Transcription guide for audio format requirements and model-size tradeoffs.

Memory budgeting with ModelManager (CPU RAM + GPU VRAM)

When running several local models side by side, the ModelManager tracks per-pool memory budgets (host RAM and GPU VRAM as separate buckets) and evicts the least-recently-used model in the same pool when a new load would exceed that pool’s budget. Models in different pools never evict each other.

from blazen import ModelManager, CompletionModel, EmbeddingModel, MistralRsOptions

# 64 GB of CPU RAM, 24 GB of GPU VRAM (RTX 4090).
manager = ModelManager(cpu_ram_gb=64, gpu_vram_gb=24)

llm = CompletionModel.mistralrs(options=MistralRsOptions(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    device="cuda:0",
))
embedder = EmbeddingModel.local()  # runs on CPU

await manager.register("llm", llm, memory_estimate_bytes=6 * 1024**3)
await manager.register("embed", embedder, memory_estimate_bytes=100 * 1024**2)

await manager.load("llm")
await manager.ensure_loaded("embed")   # different pool, no eviction

print(await manager.used_bytes(pool="gpu:0"))       # bytes on the GPU pool
print(await manager.available_bytes(pool="gpu:0"))  # room left on the GPU pool
print(await manager.used_bytes(pool="cpu"))         # bytes on the CPU pool

The Rust equivalent:

use blazen_manager::ModelManager;
use blazen_llm::Pool;

// 64 GB of CPU RAM, 24 GB of GPU VRAM -- pools are tracked independently.
let manager = ModelManager::with_budgets_gb(64.0, 24.0);
// `llm` is a local model constructed beforehand (not shown here).
manager.register("llm", llm, 6 * 1024 * 1024 * 1024).await;
manager.load("llm").await?;

If loading a registered model would exceed the pool budget, the manager first unloads the least-recently-used model in the same pool whose removal creates enough headroom. This lets you register far more models than fit in any one pool and rely on the LRU policy to keep the hot set resident.
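The per-pool LRU policy can be illustrated without Blazen. The following is a minimal sketch, not the ModelManager implementation: `PoolBudget` is a hypothetical class, and it assumes the simplest reading of the policy, evicting in least-recently-used order until the new model fits.

```python
from collections import OrderedDict
from typing import List


class PoolBudget:
    """Tracks loaded models for one pool and evicts least-recently-used first."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        # Maps model name to size in bytes; the oldest entry comes first.
        self.loaded: "OrderedDict[str, int]" = OrderedDict()

    def used_bytes(self) -> int:
        return sum(self.loaded.values())

    def touch(self, name: str) -> None:
        # Mark a model as most recently used.
        self.loaded.move_to_end(name)

    def load(self, name: str, size_bytes: int) -> List[str]:
        """Load a model and return the names evicted to make room."""
        if name in self.loaded:
            self.touch(name)
            return []
        if size_bytes > self.budget_bytes:
            raise MemoryError(f"{name} can never fit in this pool")
        evicted = []
        while self.used_bytes() + size_bytes > self.budget_bytes:
            victim, _ = self.loaded.popitem(last=False)  # oldest entry
            evicted.append(victim)
        self.loaded[name] = size_bytes
        return evicted


GIB = 1024**3
# One independent budget per pool, so pools never evict each other.
pools = {"cpu": PoolBudget(64 * GIB), "gpu:0": PoolBudget(24 * GIB)}

pools["gpu:0"].load("llm-a", 10 * GIB)
pools["gpu:0"].load("llm-b", 10 * GIB)
pools["cpu"].load("embed", 1 * GIB)
pools["gpu:0"].touch("llm-a")                     # llm-b becomes the LRU entry
evicted = pools["gpu:0"].load("llm-c", 10 * GIB)  # 30 GiB would exceed 24 GiB
```

Here `llm-b` is evicted because `llm-a` was used more recently, and the model in the `cpu` pool is untouched by pressure on `gpu:0`.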

Model cache and downloads

All local backends download weights lazily on the first call to load() (or the first inference if you skip load). Weights are cached under the OS default model cache directory unless you override cache_dir on the options struct. Typical sizes:

  • Tiny Whisper: ~32 MB
  • BGE-Small-en-v1.5: ~33 MB
  • Mistral-7B Q4KM: ~4.1 GB
  • Stable Diffusion XL base: ~6 GB
  • Whisper Large-V3: ~3.1 GB

Set a shared cache_dir across providers to keep everything in one place and make disk usage auditable.
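With a single shared directory, auditing disk usage needs only the standard library. The sketch below assumes nothing about how Blazen lays out its cache: `cache_usage` and `format_report` are hypothetical helpers that total the bytes under each top-level entry of whatever directory you pass in.

```python
from pathlib import Path
from typing import Dict


def cache_usage(cache_dir: str) -> Dict[str, int]:
    """Return bytes used by each top-level entry under cache_dir."""
    root = Path(cache_dir)
    usage = {}
    for entry in sorted(root.iterdir()):
        if entry.is_file():
            usage[entry.name] = entry.stat().st_size
        else:
            # Sum every regular file below this directory.
            # Symlinked files are counted at their target's size, so a layout
            # that links into a shared blob store may be over-counted.
            usage[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
    return usage


def format_report(usage: Dict[str, int]) -> str:
    """Render per-entry and total usage in MiB, one line each."""
    lines = [f"{name}: {size / 1024**2:.1f} MiB" for name, size in usage.items()]
    lines.append(f"total: {sum(usage.values()) / 1024**2:.1f} MiB")
    return "\n".join(lines)
```

Calling `print(format_report(cache_usage("/tmp/blazen-embed")))` after a first run would list what each backend has downloaded into that directory.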

Choosing CPU vs GPU

  • CPU-only workloads — Blazen embed, whisper.cpp (without cuda/metal/coreml), llama.cpp CPU builds, Piper. Use when the deployment target has no GPU or when latency is not critical.
  • GPU-accelerated — mistral.rs (device=Cuda/Metal), llama.cpp (cuda/metal), Candle, whisper.cpp with the right feature flag, Stable Diffusion. Use when per-token latency matters or the model is too large for CPU throughput.

Feature flags are selected at build time. The default Python wheels ship with embed, whispercpp, and mistralrs enabled; enable more by building from source with extra flags.

See also

  • Embeddings — deeper dive on the local embed model variants
  • Transcription — whisper.cpp audio format requirements and model sizes
  • Media Generation — for cloud-hosted counterparts to local generation
  • Custom Providers — wrap a local binary or gRPC service that Blazen does not ship