Media Generation
Generate images, video, audio, and 3D models with fal.ai
Blazen exposes media generation through both remote providers (e.g. fal.ai) and a growing set of native local backends. The remote FalProvider covers 600+ models across image, video, TTS, music, 3D, background removal, and upscaling, while native crates (blazen-audio-tts, blazen-audio-vc, blazen-audio-music, blazen-audio-codec, blazen-3d) implement the same capability traits without leaving the host. The FalProvider also acts as an EmbeddingModel and Model, so a single handle covers every fal capability.
Overview
The provider implements a family of capability traits (ImageGeneration, VideoGeneration, AudioGeneration, ThreeDGeneration, Transcription, BackgroundRemoval). Each capability takes a typed request (ImageRequest, VideoRequest, SpeechRequest, MusicRequest, ThreeDRequest) and returns a typed result containing one or more MediaOutput objects with a URL, base64 payload, or raw text content.
Authentication: pass an API key via options, or set the FAL_KEY environment variable.
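As a rough illustration of the two authentication routes, the sketch below resolves a key from an explicit argument or from FAL_KEY. The helper name `resolve_fal_key` and the precedence (explicit argument first) are assumptions made for this example, not documented Blazen behavior.

```python
import os
from typing import Optional


def resolve_fal_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: return the explicit key if given, else FAL_KEY."""
    # Assumption: an explicitly passed key wins over the environment variable.
    key = api_key or os.environ.get("FAL_KEY")
    if not key:
        raise RuntimeError("no fal.ai key: pass api_key or set FAL_KEY")
    return key
```

Resolving the key once at startup makes a missing credential fail early rather than on the first generation call.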
Image generation
# Python (run inside an async function, e.g. via asyncio.run)
from blazen import FalProvider, FalOptions, ImageRequest

fal = FalProvider(options=FalOptions(api_key="fal-..."))
result = await fal.generate_image(ImageRequest(
    prompt="a cat astronaut on Mars, cinematic lighting",
    width=1024,
    height=1024,
    num_images=2,
))
for img in result.images:
    print(img.media.url, img.width, img.height)

// JavaScript / TypeScript
import { FalProvider } from "blazen";

const fal = FalProvider.create({ apiKey: "fal-..." });
const result = await fal.generateImage({
  prompt: "a cat astronaut on Mars, cinematic lighting",
  width: 1024,
  height: 1024,
  numImages: 2,
});
for (const img of result.images) {
  console.log(img.media.url, img.width, img.height);
}

// Rust (inside an async fn that returns a Result)
use blazen_llm::compute::{ImageGeneration, ImageRequest};
use blazen_llm::providers::fal::FalProvider;

let fal = FalProvider::new(std::env::var("FAL_KEY")?);
let result = fal
    .generate_image(
        ImageRequest::new("a cat astronaut on Mars, cinematic lighting")
            .with_size(1024, 1024)
            .with_count(2),
    )
    .await?;
for img in &result.images {
    // The URL is optional; some outputs carry inline data instead.
    if let Some(url) = &img.media.url {
        println!("{url}");
    }
}
Upscaling and background removal
# Python, reusing the fal provider created in the image example
from blazen import UpscaleRequest, BackgroundRemovalRequest

upscaled = await fal.upscale_image(UpscaleRequest(
    image_url="https://example.com/small.png",
    scale=4.0,
))
no_bg = await fal.remove_background(BackgroundRemovalRequest(
    image_url="https://example.com/product.jpg",
))

// JavaScript / TypeScript
const upscaled = await fal.upscaleImage({
  imageUrl: "https://example.com/small.png",
  scale: 4,
});
const noBg = await fal.removeBackground({
  imageUrl: "https://example.com/product.jpg",
});
FalProvider also exposes upscale_image_aura, upscale_image_clarity, and upscale_image_creative for the respective fal upscaler apps.
Video generation
Both text-to-video and image-to-video are supported:
# Python, reusing the fal provider created in the image example
from blazen import VideoRequest

# Text-to-video
clip = await fal.text_to_video(VideoRequest(
    prompt="a drone flying through a sunlit forest",
    duration_seconds=5.0,
    width=1920,
    height=1080,
))
print(clip.video.media.url, clip.video.duration_seconds)

# Image-to-video: the same request type plus a source image URL
from_image = await fal.image_to_video(VideoRequest(
    prompt="animate this painting",
    image_url="https://example.com/input.png",
    duration_seconds=4.0,
))

// JavaScript / TypeScript
const clip = await fal.textToVideo({
  prompt: "a drone flying through a sunlit forest",
  durationSeconds: 5,
  width: 1920,
  height: 1080,
});
const fromImage = await fal.imageToVideo({
  prompt: "animate this painting",
  imageUrl: "https://example.com/input.png",
  durationSeconds: 4,
});
Text-to-speech, music, and sound effects
# Python, reusing the fal provider created in the image example
from blazen import SpeechRequest, MusicRequest

speech = await fal.text_to_speech(SpeechRequest(
    text="Hello, world!",
    voice="af_heart",
    speed=1.0,
))
audio_url = speech.audio[0].media.url

music = await fal.generate_music(MusicRequest(
    prompt="upbeat lo-fi hip-hop",
    duration_seconds=30.0,
))
# Sound effects take a MusicRequest as well
sfx = await fal.generate_sfx(MusicRequest(prompt="thunder clap"))

// JavaScript / TypeScript
const speech = await fal.textToSpeech({
  text: "Hello, world!",
  voice: "af_heart",
  speed: 1,
});
const music = await fal.generateMusic({
  prompt: "upbeat lo-fi hip-hop",
  durationSeconds: 30,
});
const sfx = await fal.generateSfx({ prompt: "thunder clap" });

// Rust, reusing the fal provider created in the image example
use blazen_llm::compute::{AudioGeneration, MusicRequest, SpeechRequest};

let speech = fal
    .text_to_speech(
        SpeechRequest::new("Hello, world!")
            .with_voice("af_heart")
            .with_speed(1.0),
    )
    .await?;
let music = fal
    .generate_music(MusicRequest::new("upbeat lo-fi hip-hop").with_duration(30.0))
    .await?;
Native TTS engines
The blazen-audio-tts crate ships several local backends that implement the same SpeechSynthesis trait used by the fal client. An OpenAI-compatible HTTP backend is on by default; opt into native engines with crate features:
- anytts: multiplexed loader for Kokoro-82M, VibeVoice, and Qwen3-TTS
- bark: Suno's Bark
- f5-tts: F5-TTS flow-matching synthesizer
- piper: Piper neural TTS (ONNX runtime)
use blazen_audio_tts::{AnyTtsBackend, SpeechSynthesis, SpeechRequest};

let tts = AnyTtsBackend::kokoro_from_pretrained().await?;
let audio = tts
    .synthesize(SpeechRequest::new("Hello from Kokoro").with_voice("af_heart"))
    .await?;
Voice conversion (RVC)
The blazen-audio-vc crate (rvc feature) performs end-to-end RVC inference natively. It loads HuBERT-base v2 ContentVec for content embeddings and applies a v2 RVC speaker model on top; v1 RVC checkpoints are rejected at load time due to topology mismatch. Pass a reference utterance plus a source clip and you get the converted audio back.
use blazen_audio_vc::{RvcBackend, VoiceConversion, VoiceConversionRequest};

// source_audio and reference_audio are assumed to have been loaded earlier.
let vc = RvcBackend::from_pretrained("./models/rvc/my-voice.pth").await?;
let converted = vc
    .convert(VoiceConversionRequest::new(source_audio).with_reference(reference_audio))
    .await?;
Native music & SFX
The blazen-audio-music crate implements MusicGeneration (and SFX) using local diffusion / autoregressive models. Enable the backend you need:
- musicgen: Meta's MusicGen (text-to-music)
- audiogen: Meta's AudioGen (text-to-SFX/ambience)
- stable-audio: Stability's Stable Audio Open
Each backend exposes the same MusicRequest API used by the fal path, so you can swap providers without touching call sites.
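The provider-swapping idea can be sketched with a structural interface. Everything below is a stand-in written for illustration: `MusicRequestLike`, `MusicBackend`, and the two fake backends are hypothetical names, not Blazen classes, and they return strings instead of audio.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class MusicRequestLike:
    """Hypothetical stand-in for a shared music request type."""
    prompt: str
    duration_seconds: Optional[float] = None


class MusicBackend(Protocol):
    """Any object with this method can serve the call site below."""
    async def generate_music(self, request: MusicRequestLike) -> str: ...


class FakeRemoteBackend:
    async def generate_music(self, request: MusicRequestLike) -> str:
        return f"remote:{request.prompt}"


class FakeLocalBackend:
    async def generate_music(self, request: MusicRequestLike) -> str:
        return f"local:{request.prompt}"


async def make_jingle(backend: MusicBackend) -> str:
    # The call site depends only on the shared method, not on the backend type.
    return await backend.generate_music(MusicRequestLike("upbeat lo-fi hip-hop", 30.0))


if __name__ == "__main__":
    # Swapping providers means passing a different object; make_jingle is unchanged.
    print(asyncio.run(make_jingle(FakeRemoteBackend())))
    print(asyncio.run(make_jingle(FakeLocalBackend())))
```

Because the call site only relies on the shared request shape and method name, choosing a remote or local backend becomes a construction-time decision.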
Audio codecs
The blazen-audio-codec crate wraps neural audio codecs for compressing waveforms into discrete token streams (useful as inputs to LM-style audio models). Features:
- encodec: Meta's EnCodec (32 codebooks at the default 24 kbps bandwidth)
- dac: Descript Audio Codec
- snac: SNAC (multi-scale neural audio codec)
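use blazen_audio_codec::{EncodecBackend, AudioCodec};

// waveform is assumed to hold audio samples loaded earlier.
let codec = EncodecBackend::from_pretrained_24khz().await?;
let codes = codec.encode(&waveform).await?; // waveform -> discrete tokens
let recon = codec.decode(&codes).await?;    // tokens -> reconstructed waveform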
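The 32-codebook figure quoted for EnCodec can be checked with a little arithmetic. The sketch below assumes the published EnCodec 24 kHz configuration (75 frames per second, 1024-entry codebooks, so 10 bits per code); the function name `codebooks_for_bandwidth` is made up for this example and is not part of blazen-audio-codec.

```python
def codebooks_for_bandwidth(
    bandwidth_bps: int,
    frame_rate_hz: int = 75,    # assumption: EnCodec 24 kHz frame rate
    codebook_size: int = 1024,  # assumption: entries per codebook
) -> int:
    """Number of residual codebooks that fit in the target bitrate."""
    # Each frame emits one index per codebook; an index into 1024 entries costs 10 bits.
    bits_per_code = (codebook_size - 1).bit_length()
    return bandwidth_bps // (frame_rate_hz * bits_per_code)


if __name__ == "__main__":
    print(codebooks_for_bandwidth(24_000))  # 24 kbps -> 32 codebooks
    print(codebooks_for_bandwidth(6_000))   # 6 kbps -> 8 codebooks
```

Under these assumptions every second of audio becomes 75 frames of 32 tokens each, which is the sequence length a downstream LM-style audio model would see.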
3D generation
# Python, reusing the fal provider created in the image example
from blazen import ThreeDRequest

# Text-to-3D
mesh = await fal.generate_3d(ThreeDRequest(
    prompt="a low-poly spaceship",
    format="glb",
))
# Image-to-3D
from_image = await fal.generate_3d(ThreeDRequest.from_image(
    "https://example.com/photo.png",
).with_format("obj"))

// JavaScript / TypeScript
const mesh = await fal.generate3d({
  prompt: "a low-poly spaceship",
  format: "glb",
});
Native 3D (TripoSR)
The blazen-3d crate (triposr feature) ships a fully native image-to-mesh pipeline — no remote call, no API key. The backend implements the same ThreeDGeneration trait as the fal path and returns a mesh ready to serialize as GLB/OBJ.
use blazen_3d::{TripoSrBackend, ThreeDGeneration, ThreeDRequest};

let triposr = TripoSrBackend::from_pretrained().await?;
let mesh = triposr
    .generate_3d(ThreeDRequest::from_image_path("./photo.png").with_format("glb"))
    .await?;
Compat3D HTTP proxy
The blazen-3d crate also exposes a threed-compat-proxy feature that talks to a local HTTP service implementing the Compat3D surface. This adds the verbs that native single-shot generators don’t cover: texturize, rig, refine, and animate against an existing mesh.
The proxy is wired through all 8 bindings — Rust, Python, Node, WASM, Go, Swift, Kotlin, and Ruby — so the same API surface is available everywhere Blazen runs.
Output format
Every result wraps one or more MediaOutput records. Each output exposes:
| Field | Type | Description |
|---|---|---|
| url | str \| None | Downloadable URL if the provider returned one. |
| base64 | str \| None | Inline base64 payload, when the provider returned raw bytes. |
| raw_content | str \| None | Raw text for text-based formats (SVG, GLTF JSON, OBJ). |
| media_type | MediaType | Format enum plus mime() / extension() / is_image() helpers. |
| file_size | int \| None | Byte count if reported. |
| metadata | dict | Arbitrary provider-specific fields. |
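Since any of url, base64, or raw_content may be empty, callers usually need a small fallback chain to get at the payload. The sketch below uses a stand-in dataclass with the field names from the table; `MediaOutputLike`, `media_bytes`, and the lookup order (inline data first, download last) are assumptions for illustration rather than Blazen's own API.

```python
from base64 import b64decode
from dataclasses import dataclass, field
from typing import Optional
from urllib.request import urlopen


@dataclass
class MediaOutputLike:
    """Stand-in with fields named after the table above (not the Blazen class)."""
    url: Optional[str] = None
    base64: Optional[str] = None
    raw_content: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def media_bytes(output: MediaOutputLike) -> bytes:
    """Return the payload as bytes, preferring inline data over a download."""
    if output.base64 is not None:
        return b64decode(output.base64)
    if output.raw_content is not None:
        # Text formats such as SVG or OBJ are encoded as UTF-8.
        return output.raw_content.encode("utf-8")
    if output.url is not None:
        # Network fetch; only reached when no inline payload is present.
        with urlopen(output.url) as response:
            return response.read()
    raise ValueError("output has no url, base64, or raw_content")
```

Checking inline fields before the URL avoids a network round trip when the provider already returned the data.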
See also
- Transcription — convert audio to text with fal or whisper.cpp
- Custom Providers — wrap your own image/video/audio backend
- Batch Completions — run many LLM prompts concurrently