Media Generation

Generate images, video, audio, and 3D models with fal.ai

Blazen exposes media generation through both remote providers (e.g. fal.ai) and a growing set of native local backends. The remote FalProvider covers 600+ models across image, video, TTS, music, 3D, background removal, and upscaling, while native crates (blazen-audio-tts, blazen-audio-vc, blazen-audio-music, blazen-audio-codec, blazen-3d) implement the same capability traits without leaving the host. The FalProvider also acts as an EmbeddingModel and Model, so a single handle covers every fal capability.

Overview

The provider implements a family of capability traits (ImageGeneration, VideoGeneration, AudioGeneration, ThreeDGeneration, Transcription, BackgroundRemoval). Each capability takes a typed request (for example ImageRequest, UpscaleRequest, BackgroundRemovalRequest, VideoRequest, SpeechRequest, MusicRequest, or ThreeDRequest) and returns a typed result containing one or more MediaOutput objects, each carrying a URL, a base64 payload, or raw text content.

Authentication: pass an API key via options, or set the FAL_KEY environment variable.
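As a minimal sketch of that lookup, the helper below prefers an explicitly passed key and otherwise falls back to FAL_KEY. The name resolve_fal_key and the precedence (explicit option first) are assumptions made for illustration, not Blazen's documented behavior.

```python
import os
from typing import Mapping, Optional


def resolve_fal_key(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a fal API key: the explicit value first, then FAL_KEY.

    Hypothetical helper; the precedence shown here is an assumption.
    """
    # Allow a custom mapping so the lookup can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = explicit or env.get("FAL_KEY")
    if not key:
        raise RuntimeError("no fal API key: pass api_key or set FAL_KEY")
    return key
```

The returned string could then be passed as the api_key option shown in the examples that follow.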

Image generation

from blazen import FalProvider, FalOptions, ImageRequest

fal = FalProvider(options=FalOptions(api_key="fal-..."))

result = await fal.generate_image(ImageRequest(
    prompt="a cat astronaut on Mars, cinematic lighting",
    width=1024,
    height=1024,
    num_images=2,
))

for img in result.images:
    print(img.media.url, img.width, img.height)

import { FalProvider } from "blazen";

const fal = FalProvider.create({ apiKey: "fal-..." });

const result = await fal.generateImage({
  prompt: "a cat astronaut on Mars, cinematic lighting",
  width: 1024,
  height: 1024,
  numImages: 2,
});

for (const img of result.images) {
  console.log(img.media.url, img.width, img.height);
}

use blazen_llm::compute::{ImageGeneration, ImageRequest};
use blazen_llm::providers::fal::FalProvider;

let fal = FalProvider::new(std::env::var("FAL_KEY")?);

let result = fal
    .generate_image(
        ImageRequest::new("a cat astronaut on Mars, cinematic lighting")
            .with_size(1024, 1024)
            .with_count(2),
    )
    .await?;

for img in &result.images {
    if let Some(url) = &img.media.url {
        println!("{url}");
    }
}

Upscaling and background removal

from blazen import UpscaleRequest, BackgroundRemovalRequest

upscaled = await fal.upscale_image(UpscaleRequest(
    image_url="https://example.com/small.png",
    scale=4.0,
))

no_bg = await fal.remove_background(BackgroundRemovalRequest(
    image_url="https://example.com/product.jpg",
))

const upscaled = await fal.upscaleImage({
  imageUrl: "https://example.com/small.png",
  scale: 4,
});

const noBg = await fal.removeBackground({
  imageUrl: "https://example.com/product.jpg",
});

FalProvider also exposes upscale_image_aura, upscale_image_clarity, and upscale_image_creative for the respective fal upscaler apps.

Video generation

Both text-to-video and image-to-video are supported:

from blazen import VideoRequest

clip = await fal.text_to_video(VideoRequest(
    prompt="a drone flying through a sunlit forest",
    duration_seconds=5.0,
    width=1920,
    height=1080,
))
print(clip.video.media.url, clip.video.duration_seconds)

from_image = await fal.image_to_video(VideoRequest(
    prompt="animate this painting",
    image_url="https://example.com/input.png",
    duration_seconds=4.0,
))

const clip = await fal.textToVideo({
  prompt: "a drone flying through a sunlit forest",
  durationSeconds: 5,
  width: 1920,
  height: 1080,
});

const fromImage = await fal.imageToVideo({
  prompt: "animate this painting",
  imageUrl: "https://example.com/input.png",
  durationSeconds: 4,
});

Text-to-speech, music, and sound effects

from blazen import SpeechRequest, MusicRequest

speech = await fal.text_to_speech(SpeechRequest(
    text="Hello, world!",
    voice="af_heart",
    speed=1.0,
))
audio_url = speech.audio[0].media.url

music = await fal.generate_music(MusicRequest(
    prompt="upbeat lo-fi hip-hop",
    duration_seconds=30.0,
))

sfx = await fal.generate_sfx(MusicRequest(prompt="thunder clap"))

const speech = await fal.textToSpeech({
  text: "Hello, world!",
  voice: "af_heart",
  speed: 1,
});

const music = await fal.generateMusic({
  prompt: "upbeat lo-fi hip-hop",
  durationSeconds: 30,
});

const sfx = await fal.generateSfx({ prompt: "thunder clap" });

use blazen_llm::compute::{AudioGeneration, MusicRequest, SpeechRequest};

let speech = fal
    .text_to_speech(
        SpeechRequest::new("Hello, world!")
            .with_voice("af_heart")
            .with_speed(1.0),
    )
    .await?;

let music = fal
    .generate_music(MusicRequest::new("upbeat lo-fi hip-hop").with_duration(30.0))
    .await?;

Native TTS engines

The blazen-audio-tts crate ships several local backends that implement the same SpeechSynthesis trait used by the fal client. An OpenAI-compatible HTTP backend is on by default; opt into native engines with crate features:

  • anytts — multiplexed loader for Kokoro-82M, VibeVoice, and Qwen3-TTS
  • bark — Suno’s Bark
  • f5-tts — F5-TTS flow-matching synthesizer
  • piper — Piper neural TTS (ONNX runtime)

use blazen_audio_tts::{AnyTtsBackend, SpeechSynthesis, SpeechRequest};

let tts = AnyTtsBackend::kokoro_from_pretrained().await?;
let audio = tts
    .synthesize(SpeechRequest::new("Hello from Kokoro").with_voice("af_heart"))
    .await?;

Voice conversion (RVC)

The blazen-audio-vc crate (rvc feature) performs end-to-end RVC inference natively. It loads HuBERT-base v2 ContentVec for content embeddings and applies a v2 RVC speaker model on top; v1 RVC checkpoints are rejected at load time due to topology mismatch. Pass a reference utterance plus a source clip and you get the converted audio back.

use blazen_audio_vc::{RvcBackend, VoiceConversion, VoiceConversionRequest};

let vc = RvcBackend::from_pretrained("./models/rvc/my-voice.pth").await?;
let converted = vc
    .convert(VoiceConversionRequest::new(source_audio).with_reference(reference_audio))
    .await?;

Native music & SFX

The blazen-audio-music crate implements MusicGeneration (and SFX) using local diffusion / autoregressive models. Enable the backend you need:

  • musicgen — Meta’s MusicGen (text-to-music)
  • audiogen — Meta’s AudioGen (text-to-SFX/ambience)
  • stable-audio — Stability’s Stable Audio Open

Each backend exposes the same MusicRequest API used by the fal path, so you can swap providers without touching call sites.
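The swap can be pictured with a small structural-typing sketch. Everything below is a stand-in: MusicRequestLike, MusicBackend, and the two fake backends are hypothetical names that only mirror the shape of the request shown earlier, and they are not Blazen classes.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass
class MusicRequestLike:
    """Stand-in for a music request (prompt plus duration)."""
    prompt: str
    duration_seconds: float = 10.0


class MusicBackend(Protocol):
    """Any object with this async method can serve the call site."""
    async def generate_music(self, request: MusicRequestLike) -> str: ...


class FakeRemoteBackend:
    async def generate_music(self, request: MusicRequestLike) -> str:
        return f"remote:{request.prompt}"


class FakeLocalBackend:
    async def generate_music(self, request: MusicRequestLike) -> str:
        return f"local:{request.prompt}"


async def make_track(backend: MusicBackend, prompt: str) -> str:
    # The call site depends only on the shared request type and method,
    # so the backend can change without edits here.
    return await backend.generate_music(MusicRequestLike(prompt, 30.0))
```

Calling asyncio.run(make_track(FakeLocalBackend(), "upbeat lo-fi hip-hop")) and then passing FakeRemoteBackend() instead exercises the same call site with a different backend.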

Audio codecs

The blazen-audio-codec crate wraps neural audio codecs for compressing waveforms into discrete token streams (useful as inputs to LM-style audio models). Features:

  • encodec — Meta’s EnCodec (32 codebooks at the default 24 kbps bandwidth)
  • dac — Descript Audio Codec
  • snac — SNAC (multi-scale neural audio codec)

use blazen_audio_codec::{EncodecBackend, AudioCodec};

let codec = EncodecBackend::from_pretrained_24khz().await?;
let codes = codec.encode(&waveform).await?;
let recon = codec.decode(&codes).await?;
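The codebook count quoted for encodec follows from simple bitrate arithmetic. The sketch below assumes the published EnCodec 24 kHz configuration (75 frames per second and 1024-entry codebooks, so 10 bits per code); the function name is illustrative and not part of blazen-audio-codec.

```python
FRAME_RATE_HZ = 75      # frames per second (assumed: EnCodec 24 kHz model)
CODEBOOK_SIZE = 1024    # entries per codebook (assumed), i.e. 10 bits per code


def codebooks_for_bandwidth(kbps: float) -> int:
    """Number of residual codebooks that fit in a target bandwidth."""
    bits_per_code = CODEBOOK_SIZE.bit_length() - 1           # log2(1024) = 10
    bits_per_second_per_codebook = FRAME_RATE_HZ * bits_per_code  # 750 bps
    return int(kbps * 1000 // bits_per_second_per_codebook)
```

Under these assumptions, 24 kbps divided by 750 bps per codebook gives 32 codebooks, and the lowest EnCodec setting of 1.5 kbps gives 2.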

3D generation

from blazen import ThreeDRequest

mesh = await fal.generate_3d(ThreeDRequest(
    prompt="a low-poly spaceship",
    format="glb",
))

from_image = await fal.generate_3d(ThreeDRequest.from_image(
    "https://example.com/photo.png",
).with_format("obj"))

const mesh = await fal.generate3d({
  prompt: "a low-poly spaceship",
  format: "glb",
});

Native 3D (TripoSR)

The blazen-3d crate (triposr feature) ships a fully native image-to-mesh pipeline — no remote call, no API key. The backend implements the same ThreeDGeneration trait as the fal path and returns a mesh ready to serialize as GLB/OBJ.

use blazen_3d::{TripoSrBackend, ThreeDGeneration, ThreeDRequest};

let triposr = TripoSrBackend::from_pretrained().await?;
let mesh = triposr
    .generate_3d(ThreeDRequest::from_image_path("./photo.png").with_format("glb"))
    .await?;

Compat3D HTTP proxy

The blazen-3d crate also exposes a threed-compat-proxy feature that talks to a local HTTP service implementing the Compat3D surface. This adds the verbs that native single-shot generators don’t cover: texturize, rig, refine, and animate against an existing mesh.

The proxy is wired through all 8 bindings — Rust, Python, Node, WASM, Go, Swift, Kotlin, and Ruby — so the same API surface is available everywhere Blazen runs.

Output format

Every result wraps one or more MediaOutput records. Each output exposes:

Field        Type          Description
url          str | None    Downloadable URL if the provider returned one.
base64       str | None    Inline base64 payload, when the provider returned raw bytes.
raw_content  str | None    Raw text for text-based formats (SVG, GLTF JSON, OBJ).
media_type   MediaType     Format enum plus mime() / extension() / is_image() helpers.
file_size    int | None    Byte count if reported.
metadata     dict          Arbitrary provider-specific fields.
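Because any of the three payload fields may be empty, consumers usually need a small dispatch step. The sketch below uses a stand-in dataclass (MediaOutputLike, a hypothetical name) that carries only the url, base64, and raw_content fields listed above; the check order is an assumption, not documented Blazen behavior.

```python
from base64 import b64decode
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaOutputLike:
    """Stand-in with the three payload fields; not the real MediaOutput."""
    url: Optional[str] = None
    base64: Optional[str] = None
    raw_content: Optional[str] = None


def inline_bytes(output: MediaOutputLike) -> Optional[bytes]:
    """Return the inline payload as bytes, or None when only a URL exists."""
    if output.base64 is not None:
        # Inline binary payload: decode the base64 text to raw bytes.
        return b64decode(output.base64)
    if output.raw_content is not None:
        # Text formats (SVG, OBJ, ...) are encoded as UTF-8.
        return output.raw_content.encode("utf-8")
    # Nothing inline: the caller has to download from output.url.
    return None
```

When inline_bytes returns None, the caller would fetch output.url with whatever HTTP client the application already uses.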

See also