Local Inference

Run embeddings and LLMs entirely in the browser with the Blazen WASM SDK

The Blazen WASM SDK can run AI models entirely in the browser, with no API key and no network access once the model has been downloaded. This guide covers three options:

  1. Built-in TractEmbedModel — ONNX embeddings shipped directly inside the SDK, no extra JS dependency.
  2. transformers.js — bring-your-own embedding pipeline via EmbeddingModel.fromJsHandler().
  3. WebLLM — WebGPU-accelerated chat models via CompletionModel.fromJsHandler().

What is possible

TractEmbedModel.create() runs ONNX models inside the WASM module via the tract inference engine. The EmbeddingModel.fromJsHandler() and CompletionModel.fromJsHandler() factories let you plug any JavaScript inference library into the Blazen pipeline. The model runs on the user’s device while Blazen’s Memory, CompletionModel.withFallback(), withRetry(), and withCache() work exactly the same as they do with hosted APIs.

Use cases:

  • Offline-first apps — search and chat without a network connection
  • Privacy-sensitive data — embeddings and queries never leave the device
  • Zero marginal cost — no per-token charges after the model is cached
  • Hybrid patterns — fast local embeddings for search, hosted API for generation

Browser compatibility (April 2026)

Local inference relies on WebGPU (for LLMs) and WebAssembly (for embeddings). Current support:

| Browser | WebGPU since | Notes |
| --- | --- | --- |
| Chrome / Edge | 113 (May 2023) | Full support |
| Safari | 26 (Sep 2025) | Including iOS and iPadOS |
| Firefox | 141 (Windows) / 145 (macOS ARM) | Linux support in progress |

Approximately 65% of global users have WebGPU support. Embeddings via WASM (CPU) work in all modern browsers regardless of WebGPU.
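
Because the LLM path depends on WebGPU, it can help to probe for it before offering local chat. The following is a minimal sketch and not part of the Blazen SDK: detectWebGPU is a hypothetical helper name, and the narrow GpuCapableNavigator interface is an assumption that lets the snippet compile without WebGPU typings. It relies only on navigator.gpu.requestAdapter(), which resolves to null when no suitable adapter exists.

```typescript
// Minimal structural type so the sketch compiles without DOM or WebGPU typings.
interface GpuCapableNavigator {
  gpu?: { requestAdapter(): Promise<unknown> };
}

/** Resolves to true only when WebGPU is exposed and an adapter can be obtained. */
async function detectWebGPU(
  nav: GpuCapableNavigator | undefined,
): Promise<boolean> {
  if (!nav || !nav.gpu) return false; // API not exposed at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter != null; // null means no suitable GPU was found
  } catch {
    return false; // treat any failure as "no WebGPU"
  }
}
```

In a page you would pass the global navigator object to detectWebGPU and use the result to decide whether to show the local LLM option at all.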

Built-in embeddings with TractEmbedModel

TractEmbedModel runs ONNX embedding models directly inside the WASM module — no extra npm dependency, no pipeline() call. The model and tokenizer are fetched from any URL (typically Hugging Face) using web_sys::fetch, so the URLs must be reachable with browser-compatible CORS.

Installation

npm install @blazen/sdk

Usage

import init, { TractEmbedModel } from "@blazen/sdk";

await init();

const model = await TractEmbedModel.create(
  "https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx",
  "https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/tokenizer.json",
);

const result = await model.embed(["hello", "world"]);
console.log(result.embeddings); // (number[])[]
console.log(model.dimensions);  // e.g. 384

result.embeddings is a number[][] with one vector per input. model.dimensions reports the model’s output size once the first inference has run.
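
Since each embedding is a plain number[], two of them can be compared directly. The helper below is a small illustrative sketch (cosineSimilarity is not an SDK function) that scores how closely two vectors point in the same direction.

```typescript
/** Cosine similarity of two equal-length vectors; 1 means same direction. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors compare as 0
}
```

With the result from the example above, cosineSimilarity(result.embeddings[0], result.embeddings[1]) would give the similarity between "hello" and "world".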

When to pick TractEmbedModel vs transformers.js

| Concern | TractEmbedModel | transformers.js |
| --- | --- | --- |
| Dependencies | Just @blazen/sdk | + @huggingface/transformers |
| Model fetch | Direct web_sys::fetch of ONNX + tokenizer URLs | Hugging Face Hub via JS loader |
| GPU acceleration | CPU only (tract WASM) | WebGPU when available, WASM fallback |
| Bundle impact | None beyond the SDK | ~1.2 MB JS + ~3.5 MB ORT WASM lazy-loaded |

Use TractEmbedModel when you want the smallest bundle and full control of the model URL. Use transformers.js when you want WebGPU acceleration or richer pre/post-processing.

CORS requirement

Because the SDK fetches the ONNX file from the browser, the host must serve Access-Control-Allow-Origin headers that permit your origin. Hugging Face’s resolve/main URLs already do this. If you self-host weights, configure CORS on your CDN.
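
When self-hosting, the rule a browser applies to the Access-Control-Allow-Origin value is simple enough to check in a deploy script. The function below is a sketch of that rule for requests made without credentials; corsAllowsOrigin is a hypothetical name, and the check is meant to run outside the browser (for example in Node or CI), because a browser fetch that fails CORS rejects before the header can be read.

```typescript
/**
 * Returns true when an Access-Control-Allow-Origin value permits `origin`
 * for a request made without credentials.
 */
function corsAllowsOrigin(headerValue: string | null, origin: string): boolean {
  if (headerValue === null) return false; // header absent: the browser blocks the read
  const value = headerValue.trim();
  return value === "*" || value === origin; // wildcard or exact origin match
}
```

A script could fetch the weights URL, read the header with response.headers.get("access-control-allow-origin"), and pass it in together with the origin your app is served from.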

Local embeddings with transformers.js

transformers.js v4 runs Hugging Face models in the browser via ONNX Runtime. For embeddings, it uses WebAssembly SIMD on the CPU — no GPU required.

Installation

npm install @blazen/sdk @huggingface/transformers

Usage

import init, { EmbeddingModel, Memory } from '@blazen/sdk';
import { pipeline } from '@huggingface/transformers';

await init();

// Load the transformers.js feature-extraction pipeline
const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Wrap it as a Blazen EmbeddingModel
const embedder = EmbeddingModel.fromJsHandler(
  'all-MiniLM-L6-v2',
  384,    // dimensions for this model
  async (texts) => {
    const output = await pipe(texts, { pooling: 'mean', normalize: true });
    return Array.from({ length: texts.length }, (_, i) => {
      const row = output[i];
      return row.data instanceof Float32Array
        ? row.data
        : new Float32Array(row.data);
    });
  },
);

// Use it with Memory for semantic search
const memory = new Memory(embedder);
await memory.add('doc1', 'Paris is the capital of France');
await memory.add('doc2', 'Rust is a systems programming language');

const results = await memory.search("What is France's capital?", 5);
console.log(results[0].text); // "Paris is the capital of France"

The fromJsHandler API

EmbeddingModel.fromJsHandler(
  modelId: string,      // identifier for logging / display
  dimensions: number,   // vector dimensionality (e.g. 384 for MiniLM)
  handler: (texts: string[]) => Promise<Float32Array[]> | Float32Array[]
): EmbeddingModel

The handler receives an array of strings and must return one Float32Array per input text, each with exactly dimensions elements.
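
A handler that returns the wrong number of vectors or the wrong length tends to fail far from its cause, so it can be worth checking the contract at the boundary. This is an illustrative sketch only; assertEmbeddingShape is a hypothetical helper, not part of the SDK.

```typescript
/** Throws unless `vectors` holds one Float32Array of `dimensions` elements per input. */
function assertEmbeddingShape(
  vectors: Float32Array[],
  inputCount: number,
  dimensions: number,
): Float32Array[] {
  if (vectors.length !== inputCount) {
    throw new Error(`expected ${inputCount} vectors, got ${vectors.length}`);
  }
  vectors.forEach((v, i) => {
    if (!(v instanceof Float32Array)) {
      throw new Error(`vector ${i} is not a Float32Array`);
    }
    if (v.length !== dimensions) {
      throw new Error(`vector ${i} has ${v.length} elements, expected ${dimensions}`);
    }
  });
  return vectors; // returned unchanged so the call can wrap a handler's result
}
```

Inside a handler this would wrap the return value, for example assertEmbeddingShape(vectors, texts.length, 384) for MiniLM.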

Performance

| Backend | Latency (single query) | Notes |
| --- | --- | --- |
| WASM (CPU) | ~170 ms | Works everywhere, no WebGPU needed |
| WebGPU | ~35 ms | Chrome 113+, Safari 26+, Firefox 141+ |

transformers.js runs on WASM unless WebGPU is selected: pass device: 'webgpu' to pipeline() to use the GPU where it is available, and keep WASM as the fallback.

Bundle size

| Component | Size | When loaded |
| --- | --- | --- |
| @huggingface/transformers | ~1.2 MB | Import time |
| ONNX Runtime WASM | ~3.5 MB | Lazy, on first inference |
| Model weights | ~23 MB | Cached in browser after first download |

The model weights are cached in the browser’s Cache API. Subsequent page loads skip the download.

Local LLM with WebLLM

WebLLM runs large language models on WebGPU. It compiles models to the user’s specific GPU at first load, then caches the compiled shaders for near-instant subsequent starts.

Installation

npm install @blazen/sdk @mlc-ai/web-llm

Usage

import init, { CompletionModel, ChatMessage } from '@blazen/sdk';
import * as webllm from '@mlc-ai/web-llm';

await init();

// Create the WebLLM engine (downloads + compiles on first visit)
const engine = await webllm.CreateMLCEngine(
  'Llama-3.2-1B-Instruct-q4f16_1-MLC',
  {
    initProgressCallback: (p) => console.log(p.text),
  },
);

// Wrap it as a Blazen CompletionModel
const model = CompletionModel.fromJsHandler(
  'Llama-3.2-1B-Instruct',
  async (request) => {
    const messages = (request.messages || []).map((m) => ({
      role: m.role,
      content: typeof m.content === 'string'
        ? m.content
        : m.content?.text || '',
    }));

    const reply = await engine.chat.completions.create({
      messages,
      temperature: request.temperature ?? 0.7,
      max_tokens: request.max_tokens ?? 512,
    });

    return {
      content: reply.choices?.[0]?.message?.content || '',
      toolCalls: [],
      citations: [],
      artifacts: [],
      images: [],
      audio: [],
      videos: [],
      model: 'Llama-3.2-1B-Instruct',
      metadata: {},
    };
  },
);

// Use it like any other Blazen model
const response = await model.complete([
  ChatMessage.user('Explain WebAssembly in one sentence.'),
]);
console.log(response.content);

The fromJsHandler API

CompletionModel.fromJsHandler(
  modelId: string,
  completeHandler: (request: CompletionRequest) => Promise<CompletionResponse>,
  streamHandler?: (request: CompletionRequest, onChunk: (chunk: StreamChunk) => void) => Promise<void>
): CompletionModel

The completeHandler receives a CompletionRequest-shaped object and must return a CompletionResponse-shaped object. The optional streamHandler enables token streaming; if omitted, model.stream() falls back to calling completeHandler and yielding the result as a single chunk.
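
The fallback behaviour described above can be pictured with a few lines of plain TypeScript. This is a sketch of the described semantics, not the SDK's implementation: the Sketch* types are deliberately simplified stand-ins, and the { delta } chunk shape is an assumption.

```typescript
// Hypothetical, simplified shapes; the SDK's real types carry more fields.
interface SketchRequest { messages: { role: string; content: string }[] }
interface SketchResponse { content: string }
interface SketchChunk { delta: string }

type CompleteHandler = (req: SketchRequest) => Promise<SketchResponse>;
type StreamHandler = (
  req: SketchRequest,
  onChunk: (chunk: SketchChunk) => void,
) => Promise<void>;

/** Streams via `stream` when given, else emits the full completion as one chunk. */
async function streamWithFallback(
  req: SketchRequest,
  onChunk: (chunk: SketchChunk) => void,
  complete: CompleteHandler,
  stream?: StreamHandler,
): Promise<void> {
  if (stream) {
    await stream(req, onChunk); // token-by-token path
    return;
  }
  const response = await complete(req); // fallback path
  onChunk({ delta: response.content }); // whole answer as a single chunk
}
```

The practical consequence is that UI code written against the streaming interface keeps working with a handler that only implements completion; it just receives one large chunk.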

Practical model sizes

Stick to 1B-3B parameter models for a usable browser experience:

| Model | Download | Cold start | Tokens/sec |
| --- | --- | --- | --- |
| Llama-3.2-1B-Instruct (q4f16) | ~600 MB | ~30 s | 40-60 |
| Llama-3.2-3B-Instruct (q4f16) | ~1.8 GB | ~60 s | 15-30 |
| Llama-3.1-8B-Instruct (q4f16) | ~4.5 GB | 2-3 min | 5-10 |

Models at 7B+ parameters require 4+ GB of GPU memory and have cold starts measured in minutes. They are not recommended unless you know your users have high-end hardware.
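
One way to act on these numbers is to pick the largest model that fits a memory budget and otherwise use a hosted API. The sketch below uses the download sizes from the table as a stand-in for weight size; pickLocalModel and the 1.5x headroom factor (for KV cache and activations) are assumptions for illustration, not measured values.

```typescript
interface ModelOption { id: string; downloadBytes: number }

const GB = 1024 ** 3;
const MB = 1024 ** 2;

// Approximate sizes from the table above, largest first.
const MODELS: ModelOption[] = [
  { id: "Llama-3.1-8B-Instruct", downloadBytes: 4.5 * GB },
  { id: "Llama-3.2-3B-Instruct", downloadBytes: 1.8 * GB },
  { id: "Llama-3.2-1B-Instruct", downloadBytes: 600 * MB },
];

/** Largest model that fits in `budgetBytes`, or null to signal "use a hosted API". */
function pickLocalModel(budgetBytes: number, headroom = 1.5): ModelOption | null {
  // `headroom` is an assumed multiplier on top of the raw weights.
  return MODELS.find((m) => m.downloadBytes * headroom <= budgetBytes) ?? null;
}
```

How you obtain budgetBytes is up to the application; a conservative fixed value per device class is a reasonable starting point.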

Always ship a hosted API fallback

Not every user has WebGPU. Even those who do may be on low-end hardware or a phone with insufficient memory. The recommended production pattern is to try local inference first and fall back to a cloud API:

import init, { CompletionModel, ChatMessage } from '@blazen/sdk';

await init();

let localModel;
try {
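  // localHandler is a completion handler such as the WebLLM one shown earlier.
  // Create the engine inside this try block so that a missing WebGPU adapter
  // or an oversized model is caught here.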
  localModel = CompletionModel.fromJsHandler('local', localHandler);
} catch {
  // WebGPU not available or model too large
}

const apiModel = CompletionModel.openrouter(); // reads OPENROUTER_API_KEY

// If local model loaded, try it first; if it fails, use the API.
// If local model did not load, use the API directly.
const model = localModel
  ? CompletionModel.withFallback([localModel, apiModel])
  : apiModel;

const response = await model.complete([ChatMessage.user('Hello!')]);

CompletionModel.withFallback() is a static method that takes an array of models and tries them in order. If the first model throws a retryable error, the next is tried. Non-retryable errors (auth failures, invalid input) short-circuit.
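
The ordering and short-circuit rules are easier to see in isolation. The following is a sketch of the behaviour described above, not the SDK's code: models are reduced to plain async functions, and a retryable property on the thrown error is an assumed stand-in for however the SDK classifies errors.

```typescript
type Complete = (prompt: string) => Promise<string>;

// Assumption: errors carry a boolean `retryable` flag.
function isRetryable(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { retryable?: unknown }).retryable === true
  );
}

/** Tries each model in order; moves on only after a retryable failure. */
async function completeWithFallback(
  models: Complete[],
  prompt: string,
): Promise<string> {
  let lastError: unknown = new Error("no models configured");
  for (const model of models) {
    try {
      return await model(prompt); // first success wins
    } catch (err) {
      if (!isRetryable(err)) throw err; // auth or input errors short-circuit
      lastError = err; // retryable: fall through to the next model
    }
  }
  throw lastError; // every model failed with a retryable error
}
```

Note that a non-retryable failure in the local model is surfaced immediately rather than masked by the hosted model, which keeps configuration mistakes visible.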

In-browser RAG

Combine local embeddings with Memory for full retrieval-augmented generation that runs entirely on the device. The simplest in-browser store is InMemoryBackend, which keeps vectors in a Map for the lifetime of the page:

import { Memory, InMemoryBackend, TractEmbedModel } from "@blazen/sdk";

const embedder = await TractEmbedModel.create(modelUrl, tokenizerUrl);
const memory = Memory.fromBackend(embedder, new InMemoryBackend());

await memory.upsert([{ id: "doc1", content: "..." }]);
const results = await memory.query("question", 5);

Memory.fromBackend(embedder, backend) is the WASM equivalent of the native MemoryBuilder: it pairs any embedding model with any backend that implements the memory backend trait. InMemoryBackend is ideal for ephemeral session state and demos; swap in a persistent backend (IndexedDB-backed, server-side, etc.) when the session needs to survive a reload.
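
To make the idea of a Map-backed store concrete, here is a toy version in plain TypeScript. It is only a sketch of the concept, assuming unit-length vectors of equal dimension; TinyVectorStore is a hypothetical class and says nothing about how InMemoryBackend is actually implemented.

```typescript
interface StoredDoc { id: string; content: string; vector: number[] }
interface Hit { id: string; content: string; score: number }

/** Toy in-memory store: a Map keyed by document id, searched by brute force. */
class TinyVectorStore {
  private docs = new Map<string, StoredDoc>();

  upsert(doc: StoredDoc): void {
    this.docs.set(doc.id, doc); // the same id overwrites the previous entry
  }

  /** Top-k by dot product, which equals cosine similarity for unit-length vectors. */
  query(queryVector: number[], k: number): Hit[] {
    const hits: Hit[] = [];
    this.docs.forEach((doc) => {
      let score = 0;
      for (let i = 0; i < queryVector.length; i++) {
        score += queryVector[i] * doc.vector[i];
      }
      hits.push({ id: doc.id, content: doc.content, score });
    });
    return hits.sort((a, b) => b.score - a.score).slice(0, k);
  }
}
```

Brute-force scoring like this is linear in the number of documents, which is usually fine for the few thousand entries a single browser session tends to hold.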

Full RAG pipeline with the built-in embedder

import init, { Memory, InMemoryBackend, TractEmbedModel, CompletionModel, ChatMessage } from '@blazen/sdk';

await init();

const embedder = await TractEmbedModel.create(
  'https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx',
  'https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/tokenizer.json',
);
const memory = Memory.fromBackend(embedder, new InMemoryBackend());

await memory.upsert([
  { id: 'faq-1', content: 'Refunds are processed within 5-7 business days.' },
  { id: 'faq-2', content: 'Free shipping on orders over $50.' },
  { id: 'faq-3', content: 'Contact support at help@example.com.' },
]);

async function answerQuestion(question: string) {
  const context = await memory.query(question, 3);
  const contextText = context.map((r) => r.content).join('\n');
  const llm = CompletionModel.openrouter();
  const response = await llm.complete([
    ChatMessage.system(`Answer using only this context:\n${contextText}`),
    ChatMessage.user(question),
  ]);
  return response.content;
}

The embedding + similarity search runs locally. Only the final generation call hits the API, and even that can be replaced with a local WebLLM model.

Same pipeline with a transformers.js embedder

import init, { EmbeddingModel, Memory, InMemoryBackend, CompletionModel, ChatMessage } from '@blazen/sdk';
import { pipeline } from '@huggingface/transformers';

await init();

const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embedder = EmbeddingModel.fromJsHandler('MiniLM', 384, async (texts) => {
  const output = await pipe(texts, { pooling: 'mean', normalize: true });
  return Array.from({ length: texts.length }, (_, i) => {
    const row = output[i];
    return row.data instanceof Float32Array ? row.data : new Float32Array(row.data);
  });
});

const memory = Memory.fromBackend(embedder, new InMemoryBackend());
await memory.upsert([{ id: 'doc1', content: 'Paris is the capital of France' }]);
const results = await memory.query("What is France's capital?", 5);

Either embedder slots into the same Memory.fromBackend(...) call — pick whichever fits your bundle and acceleration constraints.

Managing memory pressure with ModelManager

ModelManager wraps the same blazen_manager::ModelManager used by the native runtime: an LRU cache that evicts the least-recently-used model when the byte budget is exceeded. In a long-running browser session that loads several Tract models (e.g. one embedder, one reranker, one classifier), ModelManager prevents the page from accumulating unbounded GPU/heap memory.

The JS-facing API matches the native one:

| Method | Purpose |
| --- | --- |
| register(id, model, memoryEstimateBytes, lifecycle) | Declare a model with { load, unload } lifecycle hooks (plus optional memoryBytes() and device() for pool routing) |
| unregister(id) | Drop a registration (and any loaded weights) |
| load(id) | Load (or reuse) and mark as MRU within its pool |
| unload(id) | Free without unregistering |
| isLoaded(id) | Check whether weights are currently resident |
| usedBytes(pool?) / availableBytes(pool?) | Per-pool budget telemetry. pool defaults to "cpu"; pass "gpu:0" etc. for GPU pools |
| pools() | List every configured pool and its byte budget |
| budgetBytes | Read-only getter for the CPU pool budget (use pools() for other pools) |
| status() | Snapshot of every registration (each entry carries memoryEstimateBytes and pool) |

import init, { ModelManager, TractEmbedModel } from '@blazen/sdk';

await init();

const manager = new ModelManager(0.5); // 0.5 GB (512 MB) CPU pool budget -- constructor takes gigabytes

manager.register('mini-lm', null, 90 * 1024 * 1024, {
  load: async () => TractEmbedModel.create(miniLmOnnx, miniLmTokenizer),
  unload: async () => {},
  isLoaded: () => false,
  memoryBytes: async () => 90 * 1024 * 1024,
  device: () => 'cpu',
});
manager.register('bge-small', null, 130 * 1024 * 1024, {
  load: async () => TractEmbedModel.create(bgeOnnx, bgeTokenizer),
  unload: async () => {},
  isLoaded: () => false,
  memoryBytes: async () => 130 * 1024 * 1024,
  device: () => 'cpu',
});

const embedder = await manager.load('mini-lm');  // loads + marks MRU
console.log(await manager.usedBytes(), '/', manager.budgetBytes); // usedBytes is a method; budgetBytes is the CPU pool getter

for (const s of await manager.status()) {
  console.log(`${s.id}: loaded=${s.loaded}, pool=${s.pool}, memory=${s.memoryEstimateBytes}`);
}

When a load() call would push usedBytes(pool) past that pool’s budget, the manager evicts the LRU entry from the same pool first. A GPU model never evicts a CPU model and vice versa. Use it whenever your app might keep more than one Tract model alive at a time.
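
The eviction rule can be modelled in a few dozen lines. The class below is a conceptual sketch of per-pool LRU eviction under a byte budget; TinyLruBudget is a hypothetical name, it tracks only sizes (no real weights), and it is not the ModelManager implementation.

```typescript
interface Entry { id: string; pool: string; bytes: number }

/** Toy per-pool LRU: loading past a pool's budget evicts that pool's oldest entries. */
class TinyLruBudget {
  private loaded: Entry[] = []; // least recently used first
  private budgets: Record<string, number>;

  constructor(budgets: Record<string, number>) {
    this.budgets = budgets;
  }

  usedBytes(pool: string): number {
    return this.loaded
      .filter((e) => e.pool === pool)
      .reduce((sum, e) => sum + e.bytes, 0);
  }

  /** Loads (or touches) an entry and returns the ids evicted to make room. */
  load(entry: Entry): string[] {
    const budget = this.budgets[entry.pool] ?? 0;
    if (entry.bytes > budget) {
      throw new Error(`${entry.id} exceeds the ${entry.pool} budget`);
    }
    this.loaded = this.loaded.filter((e) => e.id !== entry.id); // re-added as MRU below
    const evicted: string[] = [];
    while (this.usedBytes(entry.pool) + entry.bytes > budget) {
      // Oldest entry in the same pool; other pools are never touched.
      const victimIndex = this.loaded.findIndex((e) => e.pool === entry.pool);
      evicted.push(this.loaded[victimIndex].id);
      this.loaded.splice(victimIndex, 1);
    }
    this.loaded.push(entry);
    return evicted;
  }

  isLoaded(id: string): boolean {
    return this.loaded.some((e) => e.id === id);
  }
}
```

Keeping one ordered list and filtering by pool is the simplest way to show that eviction order is per pool; a production structure would more likely keep a separate list per pool.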

COOP/COEP trap

Some guides for WASM threading recommend adding these HTTP headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Do not add these headers unless you specifically need SharedArrayBuffer for WASM multi-threading (SIMD alone does not require it). They break:

  • Stripe, PayPal, and other payment iframes
  • YouTube / Vimeo embeds
  • Google OAuth popups
  • Most third-party ad scripts

transformers.js and WebLLM work without these headers. They use single-threaded WASM or WebGPU, neither of which requires SharedArrayBuffer.

If you do need WASM threads (rare), scope the headers to a dedicated /inference path with a service worker rather than applying them site-wide.
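
The scoping decision itself is a small piece of logic that a server, edge function, or service worker could apply when building a response. This is a hedged sketch: isolationHeadersFor is a hypothetical helper, and the /inference prefix is the path named above.

```typescript
/** Returns cross-origin isolation headers only for paths under /inference. */
function isolationHeadersFor(pathname: string): Record<string, string> {
  const isolated =
    pathname === "/inference" || pathname.startsWith("/inference/");
  if (!isolated) return {}; // the rest of the site keeps its third-party embeds
  return {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
  };
}
```

Matching on "/inference/" with the trailing slash avoids accidentally isolating unrelated paths such as /inference-docs.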

iOS Safari warning

WebGPU in Safari 26+ works on iOS and iPadOS, but iPhones have limited unified memory:

| Device | RAM | Can run 1B LLM? | Can run 3B LLM? |
| --- | --- | --- | --- |
| iPhone 15 Pro / 16 / 16 Pro | 8 GB | Yes | Marginal |
| iPhone 15 | 6 GB | Marginal | No |
| iPhone 14 and earlier | 6 GB or less | No | No |

Embedding models (~23 MB) work fine on all iPhones. For LLMs, detect available memory or WebGPU adapter limits before attempting to load a model, and fall back to a hosted API.

Troubleshooting

@huggingface/transformers not found

The import must resolve at runtime. If you are not using a bundler, add an import map:

<script type="importmap">
{
  "imports": {
    "@huggingface/transformers": "https://cdn.jsdelivr.net/npm/@huggingface/transformers"
  }
}
</script>

WebGPU initialization failure

Error: WebGPU is not supported in this browser

The user’s browser does not support WebGPU. This affects LLM inference via WebLLM but not embeddings via transformers.js (which uses WASM). Use the fallback pattern described above.

Model download stalls

Large models (3B+) require a stable connection for the initial download. If the download stalls, the browser’s Cache API may have a corrupt entry. Clear site data and retry, or switch to a smaller model.

SharedArrayBuffer is not defined

This error means the page is not cross-origin isolated, usually because the COOP/COEP headers are missing or misconfigured. See the COOP/COEP section above. If you are using a library that requires SharedArrayBuffer, verify that both headers are set correctly:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Where possible, prefer libraries that do not require them.

Slow first inference

The first inference after page load is always slower because:

  1. transformers.js: The ONNX Runtime WASM module (~3.5 MB) loads lazily on first call.
  2. WebLLM: The model is compiled to GPU shaders on first use (~30-60 seconds).

Subsequent inferences are fast. For transformers.js, you can warm up the pipeline at load time:

// Warm up: embed an empty string to trigger WASM loading
await embedder.embed(['']);

Next steps