Audio Transcription

Convert speech to text with fal.ai, local whisper.cpp, or pure-Rust Candle Whisper

Blazen’s Transcription provider converts audio into text with optional timestamped segments, language detection, and speaker diarization. Three backends ship out of the box:

fal.ai — remote Whisper hosted on fal’s compute platform. Accepts URL audio sources.
Candle Whisper — fully local, offline, pure-Rust transcription via candle-transformers. Lives in blazen-audio-stt.
whisper.cpp — fully local, offline transcription via the C++ Whisper binding. Accepts local file paths only.

All three expose the same Transcription handle, the same TranscriptionRequest, and return the same TranscriptionResult shape.

Overview

TranscriptionRequest carries either a URL (audio_url) or a local path (via from_file). The result bundles the raw text, timestamped segments, and detected language, along with the timing, cost, and metadata fields described under TranscriptionResult shape.
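Because fal.ai takes URL sources and whisper.cpp takes local paths only, calling code often has to decide which kind of source it holds before building a request. The sketch below shows one way to make that decision with the standard library alone. The helper name classify_audio_source is hypothetical and not part of Blazen.

```python
from urllib.parse import urlparse


def classify_audio_source(source: str) -> str:
    """Return "url" for http(s) sources and "file" for anything else.

    Hypothetical helper for illustration; it does not call Blazen.
    """
    parsed = urlparse(source)
    # Require both a web scheme and a host so that Windows drive letters
    # such as "C:\\audio.wav" are not mistaken for URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"
```

A caller could then pass "url" sources as audio_url and route "file" sources through TranscriptionRequest.from_file.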

fal.ai (remote)

from blazen import Transcription, TranscriptionRequest, FalOptions

transcriber = Transcription.fal(options=FalOptions(api_key="fal-..."))

result = await transcriber.transcribe(TranscriptionRequest(
    audio_url="https://example.com/interview.mp3",
    language="en",
    diarize=True,
))

print(result.text)
for seg in result.segments:
    speaker = seg.speaker or "unknown"
    print(f"[{seg.start_seconds:.2f}-{seg.end_seconds:.2f}] {speaker}: {seg.text}")

import { Transcription } from "blazen";

const transcriber = Transcription.fal({ apiKey: "fal-..." });

const result = await transcriber.transcribe({
  audioUrl: "https://example.com/interview.mp3",
  language: "en",
  diarize: true,
});

console.log(result.text);
for (const seg of result.segments) {
  console.log(`[${seg.startSeconds}-${seg.endSeconds}]`, seg.speaker ?? "unknown", seg.text);
}

use blazen_llm::compute::{Transcription, TranscriptionRequest};
use blazen_llm::providers::fal::FalProvider;

let fal = FalProvider::new(std::env::var("FAL_KEY")?);

let result = fal
    .transcribe(
        TranscriptionRequest::new("https://example.com/interview.mp3")
            .with_language("en")
            .with_diarize(true),
    )
    .await?;

println!("{}", result.text);
for seg in &result.segments {
    println!("[{:.2}-{:.2}] {}", seg.start_seconds, seg.end_seconds, seg.text);
}

Candle Whisper (local, pure-Rust)

The Candle backend lives in the blazen-audio-stt crate and runs Whisper through candle-transformers, with no C++ dependency and no whisper.cpp build step. It is the recommended choice where the C++ binding is unwanted: cross-compilation targets where CMake is hard to set up, wasm32-wasi, locked-down build sandboxes, or builds that need a smaller native footprint.

Two backends are exposed from blazen-audio-stt:

CandleWhisperBackend — one-shot file transcription, mirrors the whisper.cpp ergonomics.
WhisperStreamingBackend — the same Candle decoder fronted by Silero VAD and a sliding-window chunker, for live microphone / long-form audio.

Both implement the SttBackend trait and return the same TranscriptionResult shape as the other backends. Weights are downloaded from Hugging Face on first use and cached on disk afterwards; no API key is required.

use std::path::Path;
use blazen_audio_stt::{CandleWhisperBackend, CandleWhisperConfig, SttBackend};

let backend = CandleWhisperBackend::new(CandleWhisperConfig {
    language: Some("en".into()),
    ..Default::default()
});

let result = backend
    .transcribe(Path::new("/path/to/audio.wav"), None)
    .await?;

println!("{}", result.text);

For streaming, swap in WhisperStreamingBackend:

use blazen_audio_stt::{WhisperStreamingBackend, WhisperStreamingConfig};

let streaming = WhisperStreamingBackend::new(WhisperStreamingConfig {
    model_id: "openai/whisper-base".into(),
    chunk_seconds: 30.0,
    chunk_overlap_seconds: 5.0,
    ..Default::default()
});
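To make chunk_seconds and chunk_overlap_seconds concrete, the sketch below computes the window boundaries that a plain sliding-window chunker would produce for a recording of a given length. It only illustrates the arithmetic and is not Blazen's implementation, which also gates audio through Silero VAD. The function name chunk_windows is hypothetical.

```python
def chunk_windows(
    total_seconds: float,
    chunk_seconds: float = 30.0,
    overlap_seconds: float = 5.0,
) -> list[tuple[float, float]]:
    """Return (start, end) windows covering the audio with a fixed overlap."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    # Each new window starts one "step" after the previous one, so
    # consecutive windows share overlap_seconds of audio.
    step = chunk_seconds - overlap_seconds
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the audio
        start += step
    return windows
```

With the values from the config above, a 70-second recording yields three windows: 0 to 30, 25 to 55, and 50 to 70 seconds. The overlap gives the decoder a chance to re-hear words that straddle a boundary.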

Hardware acceleration is selected at build time by enabling the matching candle-core feature on blazen-audio-stt: cuda or metal for GPU execution, or accelerate for the CPU-side Apple Accelerate framework.

whisper.cpp (local)

whisper.cpp runs inference entirely on-device. The first call downloads the GGML model file (about 32 MB for Tiny, up to about 3.1 GB for LargeV3) into the cache directory, and later calls reuse it. After that initial download no network access is needed, and no API key is ever required.

Audio input must be 16-bit PCM mono WAV at 16 kHz. URL sources are not supported — use TranscriptionRequest.from_file with a local path.
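Since the format requirement is strict, a pre-flight check can catch unsuitable files before they reach the backend. The sketch below uses only Python's standard wave module; the helper name check_whisper_wav is hypothetical and not part of Blazen.

```python
import wave


def check_whisper_wav(path: str) -> list[str]:
    """Return a list of format problems; an empty list means the file qualifies.

    wave.open itself raises wave.Error for files that are not PCM WAV.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(
                f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit"
            )
        if wav.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, got {wav.getframerate()} Hz")
    return problems
```

Files that fail the check can be converted with a tool such as ffmpeg, for example: ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav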

from blazen import Transcription, TranscriptionRequest, WhisperOptions, WhisperModel

transcriber = Transcription.whispercpp(options=WhisperOptions(
    model=WhisperModel.Base,
    language="en",
    cache_dir="/tmp/whisper-models",
))

result = await transcriber.transcribe(
    TranscriptionRequest.from_file("/path/to/audio.wav")
)
print(result.text)

import { Transcription } from "blazen";

// Node.js currently exposes whisper.cpp via the Rust crate feature flag;
// see the Rust example below for full control over model size and device.
// For remote transcription from Node.js, use the fal.ai backend instead:
const transcriber = Transcription.fal();

use blazen_audio_whispercpp::{WhisperCppProvider, WhisperOptions, WhisperModel};
use blazen_llm::compute::{Transcription, TranscriptionRequest};

let provider = WhisperCppProvider::new(
    WhisperOptions::new()
        .with_model(WhisperModel::Base)
        .with_language("en"),
)?;

let result = provider
    .transcribe(TranscriptionRequest::from_file("/path/to/audio.wav"))
    .await?;

println!("{}", result.text);

Model sizes

Variant	Parameters	Download size	Relative speed
`Tiny`	39 M	~32 MB	fastest
`Base`	74 M	~74 MB	fast
`Small`	244 M	~244 MB	balanced
`Medium`	769 M	~769 MB	slower
`LargeV3`	1550 M	~3.1 GB	highest quality

Enable GPU acceleration with the cuda, metal, or coreml feature flags on blazen-audio-whispercpp at build time.

TranscriptionResult shape

Field	Type	Description
`text`	`str`	Full transcript concatenated from all segments.
`segments`	`list[TranscriptionSegment]`	Timestamped utterances with optional `speaker` labels when diarization is enabled.
`language`	`str | None`	Detected ISO 639-1 language code.
`timing`	`RequestTiming`	Queue, execution, and total latency breakdown.
`cost`	`float | None`	USD cost if reported by the provider.
`metadata`	`dict`	Raw provider-specific fields.

Custom backends

Subclass Transcription (Python/Node) or implement the Transcription trait (Rust) to plug in your own provider — AssemblyAI, Deepgram, a self-hosted Whisper endpoint, etc. See Custom Providers for the full pattern.