Local Inference
Run LLMs, embeddings, TTS, and transcription on-device
Blazen can run every major model class entirely on your own hardware: no API key, no data leaving the machine, and no network calls once the weights are cached. Local backends are opt-in via feature flags on the Rust crates, and the Python and Node packages ship prebuilt binaries that include the most common ones.
Overview
| Capability | Backend | Rust crate | Feature flag |
|---|---|---|---|
| LLM chat | mistral.rs | blazen-llm-mistralrs | mistralrs |
| LLM chat | llama.cpp | blazen-llm-llamacpp | llamacpp |
| LLM chat | Candle | blazen-llm-candle | candle |
| Embeddings | Blazen embed (fastembed on glibc/mac/windows, tract on musl/wasm) | blazen-embed | embed |
| Embeddings | Candle | blazen-embed-candle | candle-embed |
| Transcription | whisper.cpp | blazen-audio-whispercpp | whispercpp |
| Transcription | Candle Whisper | blazen-audio-stt | candle |
| TTS (native engines) | AnyTTS (Kokoro / VibeVoice / Qwen3-TTS) | blazen-audio-tts | anytts |
| TTS (Bark) | Suno Bark | blazen-audio-tts | bark |
| TTS (F5-TTS) | SWivid F5-TTS | blazen-audio-tts | f5-tts |
| TTS (Piper) | Piper (vendored) | blazen-audio-tts | piper |
| Voice conversion | RVC (HuBERT-base v2 ContentVec) | blazen-audio-vc | rvc |
| Music generation | MusicGen / AudioGen | blazen-audio-music | musicgen / audiogen |
| Music generation | Stable Audio | blazen-audio-music | stable-audio |
| Audio codec | EnCodec / DAC / SNAC | blazen-audio-codec | encodec / dac / snac |
| 3D generation | TripoSR (image-to-mesh) | blazen-3d | triposr |
| 3D generation | Compat3D HTTP proxy | blazen-3d | threed-compat-proxy |
| Image generation | Stable Diffusion | blazen-image-diffusion | diffusion |
Every local provider implements the same trait as its remote counterpart — Model, EmbeddingModel, Transcription, etc. — so they slot into the exact same workflows and can be swapped with a one-line change.
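The swap described above can be illustrated without Blazen at all. The following is a minimal sketch using only the Python standard library; Embedder, FakeLocalEmbedder, FakeRemoteEmbedder, and index_documents are hypothetical stand-ins invented for this example, not Blazen classes. It shows only the general pattern of workflow code depending on a shared interface rather than on a concrete backend.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class Embedder(Protocol):
    """Hypothetical stand-in for a shared embedding interface."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


@dataclass
class FakeLocalEmbedder:
    dims: int = 4

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Deterministic toy vectors; a real local backend would run a model here.
        return [[float(len(t))] * self.dims for t in texts]


@dataclass
class FakeRemoteEmbedder:
    dims: int = 4

    def embed(self, texts: list[str]) -> list[list[float]]:
        # A real remote provider would make a network call here.
        return [[0.0] * self.dims for _ in texts]


def index_documents(embedder: Embedder, docs: list[str]) -> int:
    """Workflow code that depends only on the shared interface."""
    vectors = embedder.embed(docs)
    return len(vectors)


# Swapping backends is the one-line change: only this constructor differs.
embedder: Embedder = FakeLocalEmbedder()
count = index_documents(embedder, ["hello", "world"])
```

Replacing FakeLocalEmbedder() with FakeRemoteEmbedder() leaves index_documents untouched, which is the property the shared traits are meant to provide.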
Local LLM (mistral.rs)
from blazen import Model, MistralRsOptions, Quantization, Device, ChatMessage
model = Model.mistralrs(options=MistralRsOptions(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    quantization=Quantization.Q4KM,
    device=Device.Cuda,
    context_length=8192,
))
# First call downloads the GGUF weights (~4 GB for Q4KM Mistral-7B).
# Subsequent calls reuse the cached weights.
await model.load()
response = await model.complete([ChatMessage.user("Hello!")])
print(response.content)
# Free VRAM when you need the GPU for something else.
await model.unload()
use blazen_llm_mistralrs::{MistralRsOptions, MistralRsProvider};
use blazen_llm::{ChatMessage, ModelRequest, LocalModel, Model};
let mut provider = MistralRsProvider::new(
    MistralRsOptions::new("mistralai/Mistral-7B-Instruct-v0.3")
        .with_quantization("Q4KM")
        .with_device("cuda"),
)?;
provider.load().await?;
let response = provider
    .complete(ModelRequest::new(vec![ChatMessage::user("Hello!")]))
    .await?;
println!("{}", response.content.unwrap_or_default());
Model.mistralrs, Model.llamacpp, and Model.candle all follow the same shape: required model_id, optional quantization and device hints, optional context length and cache directory.
Typed streaming chunks
Streaming local completions yield named classes rather than anonymous objects, so editor autocomplete and type-checkers know exactly what fields are available.
- mistral.rs (un-prefixed): ChatMessageInput, InferenceChunk, InferenceChunkStream, InferenceResult, InferenceUsage.
- llama.cpp (LlamaCpp prefix): parallel surface, LlamaCppChatMessageInput, LlamaCppInferenceChunk, LlamaCppInferenceChunkStream, LlamaCppInferenceResult, LlamaCppInferenceUsage.
- Candle: CandleInferenceResult only; single-shot, no streaming.
const stream: InferenceChunkStream = await model.completeStream([
  { role: "user", content: "Hello!" } satisfies ChatMessageInput,
]);
for await (const chunk of stream) {
  process.stdout.write(chunk.content ?? "");
}
Swap InferenceChunkStream for LlamaCppInferenceChunkStream when you build the model via Model.llamacpp — the surface is identical, only the prefix changes.
stream = await model.complete_stream([ChatMessage.user("Hello!")])
async for chunk in stream:
    print(chunk.content or "", end="")
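When the full response is needed as well as the incremental output, the chunks can be joined as they arrive. The sketch below is standard-library only and runs against a fake stream: FakeChunk, fake_stream, and collect_text are hypothetical names made up for illustration, and the only assumption carried over from the snippets above is that each chunk exposes a content field that may be empty.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass
class FakeChunk:
    """Hypothetical chunk type; mirrors only the `content` field used above."""

    content: Optional[str]


async def fake_stream() -> AsyncIterator[FakeChunk]:
    # Stand-in for a model stream; one chunk deliberately carries no text.
    for piece in ["Hel", None, "lo", "!"]:
        yield FakeChunk(content=piece)


async def collect_text(stream) -> str:
    """Consume an async stream of chunks and return the concatenated text."""
    parts = []
    async for chunk in stream:
        # Treat a missing content field as an empty string.
        parts.append(chunk.content or "")
    return "".join(parts)


text = asyncio.run(collect_text(fake_stream()))
```

Collecting into a list and joining once at the end avoids repeated string concatenation on long responses.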
Local embeddings
from blazen import EmbeddingModel, EmbedOptions
model = EmbeddingModel.local(options=EmbedOptions(
    model_name="BGESmallENV15",  # 384 dims, ~33 MB download
    cache_dir="/tmp/blazen-embed",
    max_batch_size=256,
))
resp = await model.embed(["hello", "world"])
print(len(resp.embeddings[0])) # 384
import { EmbeddingModel } from "blazen";
const model = EmbeddingModel.embed({
  modelName: "BGESmallENV15",
  cacheDir: "/tmp/blazen-embed",
});
const resp = await model.embed(["hello", "world"]);
Blazen’s embed backend runs through ONNX Runtime on glibc/mac/windows and pure-Rust tract on musl — CPU-only, no GPU required, no Python ML runtime needed. Models cache locally after the first download.
Browser embeddings (WASM tract)
In the browser there is no hf-hub and no filesystem cache, so TractEmbedModel.create(modelUrl, tokenizerUrl, options) fetches the ONNX weights and the tokenizer.json directly over HTTP via web_sys::fetch. Host the two files on any CDN (or your own origin with permissive CORS) and pass the URLs in:
import { TractEmbedModel } from "@blazen/sdk";
const model = await TractEmbedModel.create(
  "https://example.com/all-MiniLM-L6-v2.onnx",
  "https://example.com/tokenizer.json",
);
const result = await model.embed(["hello", "world"]);
Inference runs entirely on the main thread (or a Web Worker if you spawn one) using pure-Rust tract — no WebGPU, no ONNX Runtime Web, no server round-trip.
Local transcription (whisper.cpp)
from blazen import Transcription, TranscriptionRequest, WhisperOptions, WhisperModel
transcriber = Transcription.whispercpp(options=WhisperOptions(
    model=WhisperModel.Base,
    language="en",
))
result = await transcriber.transcribe(
    TranscriptionRequest.from_file("/path/to/audio.wav")
)
print(result.text)
See the dedicated Transcription guide for audio format requirements and model-size tradeoffs.
Native 3D (TripoSR)
blazen-3d ships a native TripoSR backend that turns a single image into a textured 3D mesh end-to-end on-device: DINOv2 image encoder, triplane transformer, NeRF/SDF field, and marching-cubes mesh extraction. No remote calls, no Python runtime — weights download from Hugging Face on first use and cache locally. Enable with the triposr feature on blazen-3d (or the same feature on blazen-py / blazen-node). For sites that already have a hosted 3D service, the threed-compat-proxy feature exposes Compat3dProvider, which POSTs to any OpenAI-style upstream over multipart HTTP.
from blazen import ThreeDGeneration, TripoSrOptions
mesher = ThreeDGeneration.triposr(options=TripoSrOptions())
mesh = await mesher.image_to_mesh("/path/to/photo.png")
Native TTS / voice conversion
blazen-audio-tts multiplexes several native text-to-speech engines behind a single backend trait. AnyTTS (feature anytts) covers Kokoro, VibeVoice, and Qwen3-TTS via the upstream any-tts crate; Bark (feature bark) runs Suno’s 3-stage AR transformer with an EnCodec vocoder; F5-TTS (feature f5-tts) runs SWivid’s flow-matching DiT with a Vocos vocoder for zero-shot voice cloning; Piper (feature piper) ships the vendored ONNX + espeak-ng pipeline. For speaker transformation, blazen-audio-vc (feature rvc) provides Retrieval-based Voice Conversion using a HuBERT-base v2 ContentVec encoder. Every backend implements the same TtsBackend / VcBackend trait so they swap with a one-line change.
Memory budgeting with ModelManager (CPU RAM + GPU VRAM)
ModelManager is the unified registry for Blazen models: register local
models and remote providers by name, then dispatch with
complete(id, messages) (also stream(id, messages, on_chunk) and get(id)).
For local models it additionally tracks per-pool memory budgets (host RAM and
GPU VRAM as separate buckets) and evicts the least-recently-used model in the
same pool when a new load would exceed that pool’s budget. Models in different
pools never evict each other. Remote providers own no local weights, so they
register with a 0 estimate, dispatch straight through, and never count against
a budget — which lets you mix a cloud fallback alongside your local hot set
behind one dispatch surface.
from blazen import (
    ChatMessage,
    ModelManager,
    Model,
    EmbeddingModel,
    MistralRsOptions,
    OpenAiProvider,
    ProviderOptions,
)
# 64 GB of CPU RAM, 24 GB of GPU VRAM (RTX 4090).
manager = ModelManager(cpu_ram_gb=64, gpu_vram_gb=24)
llm = Model.mistralrs(options=MistralRsOptions(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    device="cuda:0",
))
embedder = EmbeddingModel.local()  # runs on CPU
await manager.register("llm", llm, memory_estimate_bytes=6 * 1024**3)
await manager.register("embed", embedder, memory_estimate_bytes=100 * 1024**2)
# Remote provider in the SAME registry: dispatch-only, no footprint.
await manager.register(
    "cloud", OpenAiProvider(options=ProviderOptions(api_key="sk-...")),
)
await manager.load("llm")
await manager.ensure_loaded("embed") # different pool, no eviction
print(await manager.used_bytes(pool="gpu:0")) # bytes on the GPU pool
print(await manager.available_bytes(pool="gpu:0")) # room left on the GPU pool
print(await manager.used_bytes(pool="cpu")) # bytes on the CPU pool
# Dispatch any registered entry by name — local or remote, same call shape.
local = await manager.complete("llm", [ChatMessage.user("Hello!")])
cloud = await manager.complete("cloud", [ChatMessage.user("Hello!")])
use blazen_manager::ModelManager;
use blazen_llm::Pool;
// 64 GB of CPU RAM, 24 GB of GPU VRAM -- pools are tracked independently.
let manager = ModelManager::with_budgets_gb(64.0, 24.0);
manager.register("llm", llm, 6 * 1024 * 1024 * 1024).await;
manager.load("llm").await?;
If a register + load call would blow the budget, the manager first unloads the least-recently-used model in the same pool whose removal creates enough headroom. This lets you register far more models than fit in any one pool and rely on the LRU policy to keep the hot set resident.
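To make the policy concrete, here is a toy model of per-pool LRU eviction written with the standard library only. It is a sketch under stated assumptions, not Blazen's implementation: ToyPoolManager and its methods are hypothetical, and the eviction loop is a simplification that removes the oldest resident models one at a time until the new one fits.

```python
from __future__ import annotations

from collections import OrderedDict

GIB = 1024**3


class ToyPoolManager:
    """Toy per-pool LRU budget tracker (hypothetical, for illustration only)."""

    def __init__(self, budgets: dict[str, int]) -> None:
        self.budgets = budgets  # pool name -> budget in bytes
        self.estimates: dict[str, tuple[str, int]] = {}  # id -> (pool, bytes)
        # One OrderedDict per pool; insertion order doubles as recency order.
        self.loaded = {pool: OrderedDict() for pool in budgets}

    def register(self, model_id: str, pool: str, estimate_bytes: int) -> None:
        self.estimates[model_id] = (pool, estimate_bytes)

    def used_bytes(self, pool: str) -> int:
        return sum(self.loaded[pool].values())

    def load(self, model_id: str) -> list[str]:
        """Load a model and return the ids evicted from its pool to make room."""
        pool, size = self.estimates[model_id]
        resident = self.loaded[pool]
        if model_id in resident:
            resident.move_to_end(model_id)  # mark as most recently used
            return []
        if size > self.budgets[pool]:
            raise MemoryError(f"{model_id} can never fit in pool {pool}")
        evicted = []
        # Evict least-recently-used entries from this pool only.
        while self.used_bytes(pool) + size > self.budgets[pool]:
            victim, _ = resident.popitem(last=False)
            evicted.append(victim)
        resident[model_id] = size
        return evicted


manager = ToyPoolManager({"cpu": 64 * GIB, "gpu:0": 24 * GIB})
manager.register("llm-a", "gpu:0", 14 * GIB)
manager.register("llm-b", "gpu:0", 14 * GIB)
manager.register("embed", "cpu", 100 * 1024**2)

manager.load("llm-a")
manager.load("embed")            # different pool, nothing is evicted
evicted = manager.load("llm-b")  # 14 + 14 GiB exceeds 24 GiB, so llm-a goes
```

Keeping one recency queue per pool is what guarantees that a CPU-resident embedder is never unloaded to make room for a GPU model.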
Model cache and downloads
All local backends download weights lazily on the first call to load() (or the first inference if you skip load). Weights are cached under the OS default model cache directory unless you override cache_dir on the options struct. Typical sizes:
- Tiny Whisper: ~32 MB
- BGE-Small-en-v1.5: ~33 MB
- Mistral-7B Q4KM: ~4.1 GB
- Stable Diffusion XL base: ~6 GB
- Whisper Large-V3: ~3.1 GB
Set a shared cache_dir across providers to keep everything in one place and make disk usage auditable.
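One way to audit a shared cache directory is to sum file sizes per top-level entry. The helper below is a minimal standard-library sketch; cache_usage_by_top_level is a hypothetical name, the directory layout in the demo is invented, and nothing here reflects how Blazen actually lays out its cache.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def cache_usage_by_top_level(cache_dir: Path) -> dict[str, int]:
    """Return bytes on disk for each top-level entry under cache_dir."""
    usage = {}
    for entry in sorted(cache_dir.iterdir()):
        if entry.is_dir():
            # Sum every regular file below this directory.
            usage[entry.name] = sum(
                p.stat().st_size for p in entry.rglob("*") if p.is_file()
            )
        else:
            usage[entry.name] = entry.stat().st_size
    return usage


# Demo against a throwaway directory holding fake "weights".
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "whisper").mkdir()
    (root / "whisper" / "tiny.bin").write_bytes(b"\0" * 2048)
    (root / "embed").mkdir()
    (root / "embed" / "model.onnx").write_bytes(b"\0" * 1024)
    usage = cache_usage_by_top_level(root)
    total = sum(usage.values())
```

Note that is_file() follows symlinks, so a cache that links to shared blobs could be double-counted by this sketch.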
Choosing CPU vs GPU
- CPU-only workloads: Blazen embed, whisper.cpp (without cuda/metal/coreml), llama.cpp CPU builds, Piper. Use when the deployment target has no GPU or when latency is not critical.
- GPU-accelerated: mistral.rs (device=Cuda/Metal), llama.cpp (cuda/metal), Candle, whisper.cpp with the right feature flag, Stable Diffusion. Use when per-token latency matters or the model is too large for CPU throughput.
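If the same script has to run on machines with and without a GPU, the device hint can be chosen at startup. The following is a rough heuristic sketched with the standard library: pick_device is a hypothetical helper, the returned strings are illustrative labels rather than confirmed Blazen values, and finding nvidia-smi on the PATH only suggests that a CUDA driver is installed. The matching feature flag still has to be compiled into the build.

```python
import platform
import shutil


def pick_device(has_nvidia_smi: bool, system: str, machine: str) -> str:
    """Return an illustrative device label from simple host facts."""
    if has_nvidia_smi:
        return "cuda"
    if system == "Darwin" and machine == "arm64":
        # Apple Silicon Macs expose a Metal-capable GPU.
        return "metal"
    return "cpu"


# Probe the current host; inputs are passed explicitly so the logic is testable.
device = pick_device(
    has_nvidia_smi=shutil.which("nvidia-smi") is not None,
    system=platform.system(),
    machine=platform.machine(),
)
```

Passing the host facts in as arguments keeps the decision logic separate from the probing, so each branch can be checked without the corresponding hardware.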
Feature flags are selected at build time. The default Python wheels build the local-all umbrella feature, which enables embed, fastembed, tract, candle-embed, mistralrs, llamacpp, candle-llm, diffusion, whispercpp, tts, tiktoken, distributed, otlp, prometheus, langfuse, audio-music-stable-audio, and training. Build from source with maturin develop --features ... to add anything outside that set (for example triposr, bark, f5-tts, audio-vc-rvc, or threed-compat-proxy).
See also
- Embeddings — deeper dive on the local embed model variants
- Transcription — whisper.cpp audio format requirements and model sizes
- Media Generation — for cloud-hosted counterparts to local generation
- Custom Providers — wrap a local binary or gRPC service that Blazen does not ship