Local Inference

Run LLMs, embeddings, TTS, and transcription on-device

Blazen can run every major model class entirely on your own hardware: no API key, no network call once the weights are cached, no data leaving the machine. Local backends are opt-in via feature flags on the Rust crates, and the prebuilt Python wheels and Node packages include the most common ones.

Overview

| Capability | Backend | Rust crate | Feature flag |
|---|---|---|---|
| LLM chat | mistral.rs | blazen-llm-mistralrs | mistralrs |
| LLM chat | llama.cpp | blazen-llm-llamacpp | llamacpp |
| LLM chat | Candle | blazen-llm-candle | candle |
| Embeddings | Blazen embed (fastembed on glibc/mac/windows, tract on musl/wasm) | blazen-embed | embed |
| Embeddings | Candle | blazen-embed-candle | candle-embed |
| Transcription | whisper.cpp | blazen-audio-whispercpp | whispercpp |
| Transcription | Candle Whisper | blazen-audio-stt | candle |
| TTS (native engines) | AnyTTS (Kokoro / VibeVoice / Qwen3-TTS) | blazen-audio-tts | anytts |
| TTS (Bark) | Suno Bark | blazen-audio-tts | bark |
| TTS (F5-TTS) | SWivid F5-TTS | blazen-audio-tts | f5-tts |
| TTS (Piper) | Piper (vendored) | blazen-audio-tts | piper |
| Voice conversion | RVC (HuBERT-base v2 ContentVec) | blazen-audio-vc | rvc |
| Music generation | MusicGen / AudioGen | blazen-audio-music | musicgen / audiogen |
| Music generation | Stable Audio | blazen-audio-music | stable-audio |
| Audio codec | EnCodec / DAC / SNAC | blazen-audio-codec | encodec / dac / snac |
| 3D generation | TripoSR (image-to-mesh) | blazen-3d | triposr |
| 3D generation | Compat3D HTTP proxy | blazen-3d | threed-compat-proxy |
| Image generation | Stable Diffusion | blazen-image-diffusion | diffusion |

Every local provider implements the same trait as its remote counterpart — Model, EmbeddingModel, Transcription, etc. — so they slot into the exact same workflows and can be swapped with a one-line change.
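
The one-line swap can be pictured with a small, self-contained sketch. Nothing here is Blazen's API: ChatBackend, EchoLocal, and EchoRemote are hypothetical stand-ins, and no inference or network call happens. The sketch only illustrates that code written against a shared interface does not care which backend it receives.

```python
from typing import Protocol


class ChatBackend(Protocol):
    """Hypothetical shared interface; stands in for a trait such as Model."""

    def complete(self, prompt: str) -> str: ...


class EchoLocal:
    """Stand-in for an on-device backend (no real inference happens)."""

    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"


class EchoRemote:
    """Stand-in for a hosted backend (no network call happens)."""

    def complete(self, prompt: str) -> str:
        return f"[remote] {prompt}"


def greet(backend: ChatBackend) -> str:
    # The workflow depends only on the shared interface.
    return backend.complete("Hello!")


backend: ChatBackend = EchoLocal()  # swap in EchoRemote() to change backends
print(greet(backend))
```

Only the line that constructs the backend changes; greet stays untouched.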

Local LLM (mistral.rs)

from blazen import Model, MistralRsOptions, Quantization, Device, ChatMessage

model = Model.mistralrs(options=MistralRsOptions(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    quantization=Quantization.Q4KM,
    device=Device.Cuda,
    context_length=8192,
))

# First call downloads the GGUF weights (~4 GB for Q4KM Mistral-7B).
# Subsequent calls reuse the cached weights.
await model.load()

response = await model.complete([ChatMessage.user("Hello!")])
print(response.content)

# Free VRAM when you need the GPU for something else.
await model.unload()

The same flow in Rust:

use blazen_llm_mistralrs::{MistralRsOptions, MistralRsProvider};
use blazen_llm::{ChatMessage, ModelRequest, LocalModel, Model};

let mut provider = MistralRsProvider::new(
    MistralRsOptions::new("mistralai/Mistral-7B-Instruct-v0.3")
        .with_quantization("Q4KM")
        .with_device("cuda"),
)?;

provider.load().await?;
let response = provider
    .complete(ModelRequest::new(vec![ChatMessage::user("Hello!")]))
    .await?;
println!("{}", response.content.unwrap_or_default());

Model.mistralrs, Model.llamacpp, and Model.candle all follow the same shape: required model_id, optional quantization and device hints, optional context length and cache directory.

Typed streaming chunks

Streaming local completions yield named classes rather than anonymous objects, so editor autocomplete and type-checkers know exactly what fields are available.

  • mistral.rs (un-prefixed): ChatMessageInput, InferenceChunk, InferenceChunkStream, InferenceResult, InferenceUsage.
  • llama.cpp (LlamaCpp prefix): parallel surface — LlamaCppChatMessageInput, LlamaCppInferenceChunk, LlamaCppInferenceChunkStream, LlamaCppInferenceResult, LlamaCppInferenceUsage.
  • Candle: CandleInferenceResult only (single-shot, no streaming).

In TypeScript:

import type { ChatMessageInput, InferenceChunkStream } from "blazen";

const stream: InferenceChunkStream = await model.completeStream([
  { role: "user", content: "Hello!" } satisfies ChatMessageInput,
]);
for await (const chunk of stream) {
  process.stdout.write(chunk.content ?? "");
}

Swap InferenceChunkStream for LlamaCppInferenceChunkStream when you build the model via Model.llamacpp. The surface is identical; only the prefix changes. The Python equivalent iterates the stream with async for:

stream = await model.complete_stream([ChatMessage.user("Hello!")])
async for chunk in stream:
    print(chunk.content or "", end="")
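
If you want the full text rather than incremental printing, the same loop can accumulate chunks. The sketch below runs without Blazen installed: FakeChunk and fake_stream are hypothetical stand-ins for a real chunk class and a real stream, and only the content field mirrors the snippets above.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass
class FakeChunk:
    """Hypothetical chunk type; only `content` mirrors the snippets above."""

    content: Optional[str]


async def fake_stream(parts: list[Optional[str]]) -> AsyncIterator[FakeChunk]:
    # Stand-in for a streaming completion; yields one chunk per part.
    for part in parts:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield FakeChunk(content=part)


async def collect(stream: AsyncIterator[FakeChunk]) -> str:
    # Treat a missing content field as an empty string, as the loops above do.
    pieces = []
    async for chunk in stream:
        pieces.append(chunk.content or "")
    return "".join(pieces)


text = asyncio.run(collect(fake_stream(["Hel", None, "lo", "!"])))
print(text)
```

Collecting into a list and joining once avoids repeated string concatenation on long generations.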

Local embeddings

from blazen import EmbeddingModel, EmbedOptions

model = EmbeddingModel.local(options=EmbedOptions(
    model_name="BGESmallENV15",     # 384 dims, ~33 MB download
    cache_dir="/tmp/blazen-embed",
    max_batch_size=256,
))

resp = await model.embed(["hello", "world"])
print(len(resp.embeddings[0]))  # 384

The Node equivalent:

import { EmbeddingModel } from "blazen";

const model = EmbeddingModel.embed({
  modelName: "BGESmallENV15",
  cacheDir: "/tmp/blazen-embed",
});

const resp = await model.embed(["hello", "world"]);

Blazen’s embed backend runs through ONNX Runtime on glibc/mac/windows and pure-Rust tract on musl — CPU-only, no GPU required, no Python ML runtime needed. Models cache locally after the first download.
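
Once you have vectors, a common next step is comparing them. The following is a minimal sketch of cosine similarity in plain Python. It assumes each entry of resp.embeddings behaves like a list of floats, which the len() call above suggests but does not confirm, and it uses toy 3-dimensional vectors in place of real 384-dimensional ones.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for resp.embeddings[0] and resp.embeddings[1].
hello = [1.0, 2.0, 2.0]
world = [2.0, 1.0, 2.0]
print(round(cosine_similarity(hello, world), 4))
```

Values near 1.0 indicate similar direction; for large collections you would normally hand this off to a vector index rather than loop in Python.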

Browser embeddings (WASM tract)

In the browser there is no hf-hub and no filesystem cache, so TractEmbedModel.create(modelUrl, tokenizerUrl, options) fetches the ONNX weights and the tokenizer.json directly over HTTP via web_sys::fetch. Host the two files on any CDN (or your own origin with permissive CORS) and pass the URLs in:

import { TractEmbedModel } from "@blazen/sdk";

const model = await TractEmbedModel.create(
  "https://example.com/all-MiniLM-L6-v2.onnx",
  "https://example.com/tokenizer.json",
);
const result = await model.embed(["hello", "world"]);

Inference runs entirely in the browser, on the main thread or in a Web Worker if you spawn one, using pure-Rust tract: no WebGPU, no ONNX Runtime Web, no server round-trip.

Local transcription (whisper.cpp)

from blazen import Transcription, TranscriptionRequest, WhisperOptions, WhisperModel

transcriber = Transcription.whispercpp(options=WhisperOptions(
    model=WhisperModel.Base,
    language="en",
))

result = await transcriber.transcribe(
    TranscriptionRequest.from_file("/path/to/audio.wav")
)
print(result.text)

See the dedicated Transcription guide for audio format requirements and model-size tradeoffs.
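
Before handing a file to the transcriber, it can help to inspect its header. The sketch below uses only Python's standard wave module. The 16 kHz, mono, 16-bit PCM target is an assumption taken from the conversion command in whisper.cpp's own README; this page does not say whether Blazen resamples for you, so treat is_whisper_ready as a hypothetical pre-flight check, not a Blazen requirement.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Read basic PCM parameters from a WAV header using the stdlib."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate if rate else 0.0,
        }


def is_whisper_ready(info: dict) -> bool:
    # Assumed target format: 16 kHz, mono, 16-bit PCM.
    return (
        info["channels"] == 1
        and info["sample_rate_hz"] == 16_000
        and info["bits_per_sample"] == 16
    )


def write_silence(path: str, seconds: float = 1.0, rate: int = 16_000) -> None:
    """Write a mono 16-bit WAV of silence, used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "silence.wav")
    write_silence(sample)
    info = describe_wav(sample)

print(info, is_whisper_ready(info))
```

Files that fail the check can be converted up front with a tool such as ffmpeg.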

Native 3D (TripoSR)

blazen-3d ships a native TripoSR backend that turns a single image into a textured 3D mesh end-to-end on-device: DINOv2 image encoder, triplane transformer, NeRF/SDF field, and marching-cubes mesh extraction. No remote calls, no Python runtime — weights download from Hugging Face on first use and cache locally. Enable with the triposr feature on blazen-3d (or the same feature on blazen-py / blazen-node). For sites that already have a hosted 3D service, the threed-compat-proxy feature exposes Compat3dProvider, which POSTs to any OpenAI-style upstream over multipart HTTP.

from blazen import ThreeDGeneration, TripoSrOptions

mesher = ThreeDGeneration.triposr(options=TripoSrOptions())
mesh = await mesher.image_to_mesh("/path/to/photo.png")

Native TTS / voice conversion

blazen-audio-tts multiplexes several native text-to-speech engines behind a single backend trait:

  • AnyTTS (feature anytts) covers Kokoro, VibeVoice, and Qwen3-TTS via the upstream any-tts crate.
  • Bark (feature bark) runs Suno’s 3-stage AR transformer with an EnCodec vocoder.
  • F5-TTS (feature f5-tts) runs SWivid’s flow-matching DiT with a Vocos vocoder for zero-shot voice cloning.
  • Piper (feature piper) ships the vendored ONNX + espeak-ng pipeline.

For speaker transformation, blazen-audio-vc (feature rvc) provides Retrieval-based Voice Conversion using a HuBERT-base v2 ContentVec encoder. Every backend implements the same TtsBackend / VcBackend trait so they swap with a one-line change.

Memory budgeting with ModelManager (CPU RAM + GPU VRAM)

ModelManager is the unified registry for Blazen models: register local models and remote providers by name, then dispatch with complete(id, messages) (also stream(id, messages, on_chunk) and get(id)). For local models it additionally tracks per-pool memory budgets (host RAM and GPU VRAM as separate buckets) and evicts the least-recently-used model in the same pool when a new load would exceed that pool’s budget. Models in different pools never evict each other. Remote providers own no local weights, so they register with a 0 estimate, dispatch straight through, and never count against a budget — which lets you mix a cloud fallback alongside your local hot set behind one dispatch surface.

from blazen import (
    ChatMessage,
    ModelManager,
    Model,
    EmbeddingModel,
    MistralRsOptions,
    OpenAiProvider,
    ProviderOptions,
)

# 64 GB of CPU RAM, 24 GB of GPU VRAM (RTX 4090).
manager = ModelManager(cpu_ram_gb=64, gpu_vram_gb=24)

llm = Model.mistralrs(options=MistralRsOptions(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    device="cuda:0",
))
embedder = EmbeddingModel.local()  # runs on CPU

await manager.register("llm", llm, memory_estimate_bytes=6 * 1024**3)
await manager.register("embed", embedder, memory_estimate_bytes=100 * 1024**2)

# Remote provider in the SAME registry: dispatch-only, no footprint.
await manager.register(
    "cloud", OpenAiProvider(options=ProviderOptions(api_key="sk-...")),
)

await manager.load("llm")
await manager.ensure_loaded("embed")   # different pool, no eviction

print(await manager.used_bytes(pool="gpu:0"))       # bytes on the GPU pool
print(await manager.available_bytes(pool="gpu:0"))  # room left on the GPU pool
print(await manager.used_bytes(pool="cpu"))         # bytes on the CPU pool

# Dispatch any registered entry by name — local or remote, same call shape.
local = await manager.complete("llm", [ChatMessage.user("Hello!")])
cloud = await manager.complete("cloud", [ChatMessage.user("Hello!")])

In Rust:

use blazen_manager::ModelManager;
use blazen_llm::Pool;

// 64 GB of CPU RAM, 24 GB of GPU VRAM -- pools are tracked independently.
let manager = ModelManager::with_budgets_gb(64.0, 24.0);
manager.register("llm", llm, 6 * 1024 * 1024 * 1024).await;
manager.load("llm").await?;

If loading a registered model would exceed its pool’s budget, the manager first unloads the least-recently-used model in the same pool whose removal creates enough headroom. This lets you register far more models than fit in any one pool and rely on the LRU policy to keep the hot set resident.
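
The eviction behaviour can be modelled in a few lines of plain Python. This is a toy, not Blazen's ModelManager: PoolBudget is a hypothetical class, and it assumes the simplest reading of the policy, evicting oldest-first within one pool until the new model fits. The pool names "cpu" and "gpu:0" follow the examples above.

```python
from collections import OrderedDict


class PoolBudget:
    """Toy per-pool LRU tracker; not Blazen's ModelManager."""

    def __init__(self, budgets: dict[str, int]) -> None:
        self.budgets = budgets
        # One ordered map per pool: least recently used first, newest last.
        self.loaded = {pool: OrderedDict() for pool in budgets}

    def used(self, pool: str) -> int:
        return sum(self.loaded[pool].values())

    def load(self, pool: str, name: str, size: int) -> list[str]:
        """Mark `name` as loaded in `pool`; return the names evicted."""
        if size > self.budgets[pool]:
            raise MemoryError(f"{name} can never fit in pool {pool}")
        resident = self.loaded[pool]
        if name in resident:
            resident.move_to_end(name)  # already loaded: refresh recency only
            return []
        evicted = []
        # Assumption: evict oldest-first until the new model fits.
        while self.used(pool) + size > self.budgets[pool]:
            victim, _ = resident.popitem(last=False)
            evicted.append(victim)
        resident[name] = size
        return evicted


GIB = 1024**3
pools = PoolBudget({"cpu": 64 * GIB, "gpu:0": 24 * GIB})
pools.load("gpu:0", "llm-a", 14 * GIB)
pools.load("gpu:0", "llm-b", 8 * GIB)
pools.load("cpu", "embed", 1 * GIB)

# 22 GiB already resident on the GPU, so a 6 GiB model forces an eviction.
evicted = pools.load("gpu:0", "llm-c", 6 * GIB)
print(evicted, list(pools.loaded["gpu:0"]), list(pools.loaded["cpu"]))
```

The CPU pool is untouched by the GPU eviction, which is the property that lets an embedder stay resident while LLMs rotate through VRAM.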

Model cache and downloads

All local backends download weights lazily on the first call to load() (or the first inference if you skip load). Weights are cached under the OS default model cache directory unless you override cache_dir on the options struct. Typical sizes:

  • Tiny Whisper: ~32 MB
  • BGE-Small-en-v1.5: ~33 MB
  • Mistral-7B Q4KM: ~4.1 GB
  • Stable Diffusion XL base: ~6 GB
  • Whisper Large-V3: ~3.1 GB
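
For models not listed, a rough size estimate is parameter count times bits per weight. The sketch below is back-of-the-envelope arithmetic only. The inputs are assumptions, not figures from Blazen: roughly 7.24 billion parameters for Mistral-7B and roughly 4.85 effective bits per weight for a Q4KM-style quantization. It ignores file metadata and any tensors kept at higher precision.

```python
def estimate_weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights, ignoring file metadata."""
    return params * bits_per_weight / 8


GIB = 1024**3

# Assumed inputs: ~7.24e9 parameters, ~4.85 effective bits per weight.
mistral_q4 = estimate_weight_bytes(7.24e9, 4.85)
print(f"{mistral_q4 / GIB:.1f} GiB")
```

Under those assumptions the estimate lands in the same range as the Mistral-7B figure in the list. The same function gives a quick upper bound for a memory_estimate_bytes value, though runtime memory also includes the KV cache and activations.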

Set a shared cache_dir across providers to keep everything in one place and make disk usage auditable.
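
With a shared directory, auditing disk usage is a short script. The sketch below sums file sizes per top-level entry using only the standard library. The on-disk layout of Blazen's cache is not described here, so the grouping by top-level entry is an assumption; symlinks are skipped so that link-based cache layouts are not counted twice.

```python
import os
import tempfile


def cache_usage_bytes(cache_dir: str) -> dict[str, int]:
    """Sum file sizes under each top-level entry of cache_dir."""
    usage = {}
    for entry in os.scandir(cache_dir):
        if entry.is_file(follow_symlinks=False):
            usage[entry.name] = entry.stat(follow_symlinks=False).st_size
        elif entry.is_dir(follow_symlinks=False):
            total = 0
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    path = os.path.join(root, name)
                    if not os.path.islink(path):  # skip links to avoid double counting
                        total += os.path.getsize(path)
            usage[entry.name] = total
    return usage


# Demo against a throwaway directory instead of a real model cache.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "whisper"))
    with open(os.path.join(tmp, "whisper", "model.bin"), "wb") as fh:
        fh.write(b"\0" * 1024)
    with open(os.path.join(tmp, "notes.txt"), "wb") as fh:
        fh.write(b"hi")
    usage = cache_usage_bytes(tmp)

print(usage)
```

Point cache_usage_bytes at the directory you pass as cache_dir to see which backend is consuming the most space.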

Choosing CPU vs GPU

  • CPU-only workloads — Blazen embed, whisper.cpp (without cuda/metal/coreml), llama.cpp CPU builds, Piper. Use when the deployment target has no GPU or when latency is not critical.
  • GPU-accelerated — mistral.rs (device=Cuda/Metal), llama.cpp (cuda/metal), Candle, whisper.cpp with the right feature flag, Stable Diffusion. Use when per-token latency matters or the model is too large for CPU throughput.

Feature flags are selected at build time. The default Python wheels are built with the local-all umbrella feature, which enables embed, fastembed, tract, candle-embed, mistralrs, llamacpp, candle-llm, diffusion, whispercpp, tts, tiktoken, distributed, otlp, prometheus, langfuse, audio-music-stable-audio, and training. To add anything outside that set (for example triposr, bark, f5-tts, audio-vc-rvc, or threed-compat-proxy), build from source with maturin develop --features ....

See also

  • Embeddings — deeper dive on the local embed model variants
  • Transcription — whisper.cpp audio format requirements and model sizes
  • Media Generation — for cloud-hosted counterparts to local generation
  • Custom Providers — wrap a local binary or gRPC service that Blazen does not ship