Audio Transcription

Convert speech to text with fal.ai, local whisper.cpp, or pure-Rust Candle Whisper

Blazen’s Transcription provider converts audio into text with optional timestamped segments, language detection, and speaker diarization. Three backends ship out of the box:

  • fal.ai — remote Whisper hosted on fal’s compute platform. Accepts URL audio sources.
  • Candle Whisper — fully local, offline, pure-Rust transcription via candle-transformers. Lives in blazen-audio-stt.
  • whisper.cpp — fully local, offline transcription via the C++ Whisper binding. Accepts local file paths only.

All three expose the same Transcription handle, accept the same TranscriptionRequest, and return the same TranscriptionResult shape.

Overview

TranscriptionRequest carries either a URL (audio_url) or a local path (via from_file). The result bundles the raw text, timestamped segments, and detected language along with the usual timing / cost / metadata block.
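Because the two input modes map to different constructors, a small dispatcher can keep call sites uniform. The following is a minimal sketch, not part of Blazen: is_remote_audio and build_request are hypothetical helper names, and the only Blazen calls used are the audio_url keyword and TranscriptionRequest.from_file shown on this page.

```python
from urllib.parse import urlparse


def is_remote_audio(source: str) -> bool:
    """Return True when `source` looks like an http(s) URL rather than a local path."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def build_request(source: str):
    """Build a TranscriptionRequest from either a URL or a local file path.

    Hypothetical helper: blazen is imported lazily so the URL check above
    can be used without the package installed.
    """
    from blazen import TranscriptionRequest

    if is_remote_audio(source):
        return TranscriptionRequest(audio_url=source)
    return TranscriptionRequest.from_file(source)
```

Keep in mind that the backend still has to support the chosen mode: fal.ai accepts URL sources, while whisper.cpp accepts local file paths only.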

fal.ai (remote)

from blazen import Transcription, TranscriptionRequest, FalOptions

transcriber = Transcription.fal(options=FalOptions(api_key="fal-..."))

result = await transcriber.transcribe(TranscriptionRequest(
    audio_url="https://example.com/interview.mp3",
    language="en",
    diarize=True,
))

print(result.text)
for seg in result.segments:
    speaker = seg.speaker or "unknown"
    print(f"[{seg.start_seconds:.2f}-{seg.end_seconds:.2f}] {speaker}: {seg.text}")

import { Transcription } from "blazen";

const transcriber = Transcription.fal({ apiKey: "fal-..." });

const result = await transcriber.transcribe({
  audioUrl: "https://example.com/interview.mp3",
  language: "en",
  diarize: true,
});

console.log(result.text);
for (const seg of result.segments) {
  console.log(`[${seg.startSeconds}-${seg.endSeconds}]`, seg.speaker ?? "unknown", seg.text);
}

use blazen_llm::compute::{Transcription, TranscriptionRequest};
use blazen_llm::providers::fal::FalProvider;

let fal = FalProvider::new(std::env::var("FAL_KEY")?);

let result = fal
    .transcribe(
        TranscriptionRequest::new("https://example.com/interview.mp3")
            .with_language("en")
            .with_diarize(true),
    )
    .await?;

println!("{}", result.text);
for seg in &result.segments {
    println!("[{:.2}-{:.2}] {}", seg.start_seconds, seg.end_seconds, seg.text);
}
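
Timestamped segments lend themselves to subtitle export. As an illustration only, the sketch below renders segments as SubRip (SRT) cues. The Segment dataclass is a stand-in defined here so the example runs without Blazen; its field names (start_seconds, end_seconds, text, speaker) are taken from the Python example above, and srt_timestamp and segments_to_srt are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Segment:
    """Stand-in for a transcription segment; field names follow the examples above."""
    start_seconds: float
    end_seconds: float
    text: str
    speaker: Optional[str] = None


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues, prefixing the speaker label when present."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        label = f"{seg.speaker}: " if seg.speaker else ""
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start_seconds)} --> {srt_timestamp(seg.end_seconds)}\n"
            f"{label}{seg.text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

With a real result, passing result.segments to segments_to_srt should work as long as the segment objects expose the same four attributes.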

Candle Whisper (local, pure-Rust)

The Candle backend lives in the blazen-audio-stt crate and runs Whisper through candle-transformers — no C++ dependency, no whisper.cpp build. It’s the recommended choice for environments where the C++ binding is unwanted (cross-compilation targets that hate cmake, wasm32-wasi, locked-down build sandboxes, or simply a smaller native footprint).

Two backends are exposed from blazen-audio-stt:

  • CandleWhisperBackend — one-shot file transcription, mirrors the whisper.cpp ergonomics.
  • WhisperStreamingBackend — the same Candle decoder fronted by Silero VAD and a sliding-window chunker, for live microphone / long-form audio.

Both implement the SttBackend trait and return the same TranscriptionResult shape as the other backends. Weights stream from Hugging Face on first use and are cached on disk afterwards; no API key is required.

use std::path::Path;
use blazen_audio_stt::{CandleWhisperBackend, CandleWhisperConfig, SttBackend};

let backend = CandleWhisperBackend::new(CandleWhisperConfig {
    language: Some("en".into()),
    ..Default::default()
});

let result = backend
    .transcribe(Path::new("/path/to/audio.wav"), None)
    .await?;

println!("{}", result.text);

For streaming, swap in WhisperStreamingBackend:

use blazen_audio_stt::{WhisperStreamingBackend, WhisperStreamingConfig};

let streaming = WhisperStreamingBackend::new(WhisperStreamingConfig {
    model_id: "openai/whisper-base".into(),
    chunk_seconds: 30.0,
    chunk_overlap_seconds: 5.0,
    ..Default::default()
});
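
To make the chunk_seconds and chunk_overlap_seconds settings concrete, the sketch below computes the windows a plain sliding-window chunker would produce. This shows the arithmetic only and is not Blazen's implementation: chunk_windows is a hypothetical helper, and the real backend also uses Silero VAD, which may place boundaries differently.

```python
def chunk_windows(
    total_seconds: float,
    chunk_seconds: float = 30.0,
    overlap_seconds: float = 5.0,
) -> list[tuple[float, float]]:
    """Return (start, end) windows covering the audio, each overlapping the previous one."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    step = chunk_seconds - overlap_seconds  # how far each window advances
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the final window reached the end of the audio
        start += step
    return windows
```

With the values from the configuration above, 70 seconds of audio yields three windows: 0 to 30, 25 to 55, and 50 to 70. Each window re-reads the last 5 seconds of the previous one, so words cut at a boundary appear whole in at least one chunk.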

Hardware acceleration is selected at build time by enabling the matching candle-core feature on blazen-audio-stt: cuda or metal for GPU execution, or accelerate for Apple's Accelerate framework on the CPU.

whisper.cpp (local)

whisper.cpp runs entirely on-device. The first call downloads the GGML model file (32 MB for Tiny, up to 3.1 GB for LargeV3) into the cache directory and reuses it afterwards. No API key or network access is required for subsequent runs.

Audio input must be 16-bit PCM mono WAV at 16 kHz. URL sources are not supported — use TranscriptionRequest.from_file with a local path.
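
Since the format requirement is strict, it can help to check a file before handing it to the backend. The sketch below does so with Python's standard wave module; check_whisper_wav is a hypothetical helper, not a Blazen API.

```python
import wave


def check_whisper_wav(path: str) -> list[str]:
    """Return a list of problems; empty when the file is 16-bit PCM mono at 16 kHz."""
    problems = []
    # wave.open raises wave.Error for files it cannot parse as PCM WAV.
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected 1 channel, found {wav.getnchannels()}")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"expected 16-bit samples, found {wav.getsampwidth() * 8}-bit")
        if wav.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, found {wav.getframerate()} Hz")
    return problems
```

If the check reports problems, one common way to convert the file is ffmpeg, for example: ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav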

from blazen import Transcription, TranscriptionRequest, WhisperOptions, WhisperModel

transcriber = Transcription.whispercpp(options=WhisperOptions(
    model=WhisperModel.Base,
    language="en",
    cache_dir="/tmp/whisper-models",
))

result = await transcriber.transcribe(
    TranscriptionRequest.from_file("/path/to/audio.wav")
)
print(result.text)

import { Transcription } from "blazen";

// Node.js currently exposes whisper.cpp via the Rust crate feature flag;
// see the Rust example below for full control over model size and device.
const transcriber = Transcription.fal(); // remote fal.ai backend

use blazen_audio_whispercpp::{WhisperCppProvider, WhisperOptions, WhisperModel};
use blazen_llm::compute::{Transcription, TranscriptionRequest};

let provider = WhisperCppProvider::new(
    WhisperOptions::new()
        .with_model(WhisperModel::Base)
        .with_language("en"),
)?;

let result = provider
    .transcribe(TranscriptionRequest::from_file("/path/to/audio.wav"))
    .await?;

println!("{}", result.text);

Model sizes

Variant   Parameters   Download size   Relative speed
Tiny      39 M         ~32 MB          fastest
Base      74 M         ~74 MB          fast
Small     244 M        ~244 MB         balanced
Medium    769 M        ~769 MB         slower
LargeV3   1550 M       ~3.1 GB         highest quality

Enable GPU acceleration with the cuda, metal, or coreml feature flags on blazen-audio-whispercpp at build time.

TranscriptionResult shape

Field      Type                         Description
text       str                          Full transcript concatenated from all segments.
segments   list[TranscriptionSegment]   Timestamped utterances with optional speaker labels when diarization is enabled.
language   str | None                   Detected ISO 639-1 language code.
timing     RequestTiming                Queue, execution, and total latency breakdown.
cost       float | None                 USD cost if reported by the provider.
metadata   dict                         Raw provider-specific fields.
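
When diarization is enabled, consecutive segments often belong to the same speaker. As a rough sketch of post-processing the segments field, the helper below merges such runs into speaker turns. Segment is a stand-in dataclass using the attribute names from the Python examples on this page, and merge_speaker_turns is a hypothetical helper, not a Blazen API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Segment:
    """Stand-in for TranscriptionSegment; attribute names follow the examples on this page."""
    start_seconds: float
    end_seconds: float
    text: str
    speaker: Optional[str] = None


def merge_speaker_turns(segments: Iterable[Segment]) -> list[Segment]:
    """Merge consecutive segments that carry the same speaker label into one turn.

    Unlabeled segments (speaker is None) are merged with each other as well.
    The input segments are not modified.
    """
    turns: list[Segment] = []
    for seg in segments:
        last = turns[-1] if turns else None
        if last is not None and last.speaker == seg.speaker:
            # Same speaker continues: extend the current turn.
            last.end_seconds = seg.end_seconds
            last.text = f"{last.text} {seg.text.strip()}"
        else:
            # Speaker changed: start a new turn from a copy of the segment.
            turns.append(
                Segment(seg.start_seconds, seg.end_seconds, seg.text.strip(), seg.speaker)
            )
    return turns
```

The merged turns keep the start of the first segment and the end of the last one, which is usually enough for a readable interview transcript.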

Custom backends

Subclass Transcription (Python/Node) or implement the Transcription trait (Rust) to plug in your own provider — AssemblyAI, Deepgram, a self-hosted Whisper endpoint, etc. See Custom Providers for the full pattern.

See also