Always-On Agents & Prompt Caching
Build persistent live workflows and bots, signal them while running, and cut input cost with prompt caching
Most workflows run once and exit. An always-on workflow does the opposite: it starts, idles, and stays alive waiting for events you feed it over its lifetime — a chat session, an actor mailbox, a long-running supervisor. This guide covers the two halves of running agents that live for hours or days: the live workflow / Bot lifecycle controls, and prompt caching, which keeps the per-turn input cost of a long conversation from growing without bound.
Live workflows
.live() reconfigures a workflow as a persistent, always-on run. It does three things at once:
- Disables the execution timeout — the run continues until you explicitly stop it (an unbounded run, not a one-shot).
- Enables lenient routing — events with no registered handler are dropped with a warning instead of terminating the loop with a “no handler” error. This is what lets the loop survive the initial start event and any stray traffic.
- Bounds the in-memory history buffer — to a default cap, so an unbounded run does not grow its event history forever.
use blazen_core::builder::WorkflowBuilder;
use std::time::Duration;
let workflow = WorkflowBuilder::new("supervisor")
.step(my_step)
.live() // unlimited timeout + lenient routing
.idle_timeout(Duration::from_secs(300))
.cost_budget_usd(5.0)
.max_history(1000)
.build()?;
let handler = workflow.run(serde_json::Value::Null).await?;
Lifecycle controls
These builder knobs are the guard rails for an unbounded run:
.idle_timeout(duration)— shut the workflow down after this much wall-time elapses with no event processed. Surfaces as aIdleTimeouterror on the run’s terminal result. Use it so an abandoned session does not idle forever..cost_budget_usd(ceiling)— abort the run once accumulated LLM cost exceeds this USD ceiling. It is a soft cap checked between processed events, so a single in-flight turn can overshoot slightly; surfaces as aCostBudgetExceedederror..max_history(n)— cap the retained in-memory history buffer atnevents, ring-trimming the oldest entries on overflow..live()sets a default cap automatically; call.max_history()to override it.
Signals: injecting events into a running workflow
A live loop is driven by events you push in after it starts. handler.emit(event) injects an event into the running workflow — the Temporal-signal / actor-mailbox pattern.
// Drive the live loop by emitting events into its mailbox.
handler.emit(MyInboundEvent { /* ... */ })?;
The important semantic: emit injects an event into the loop’s stream; it does not unpark a paused workflow. A workflow that has suspended on a human-in-the-loop gate or a persisted pause is resumed through the snapshot/resume path, not through emit. Use emit to feed work to a loop that is running and idle, the way you would post a message to an actor’s mailbox or send a signal to a running Temporal workflow.
The Bot abstraction
Bot is the opinionated, batteries-included always-on agent built on top of a .live() workflow. Instead of wiring the step, the conversation memory, and the lifecycle controls yourself, you configure a builder and get a running conversational agent. Internally each turn loads the persisted conversation window, appends the user’s message, optionally compacts old history, runs the LLM + tool agentic loop, persists the updated window, and folds the turn’s usage/cost into the workflow’s cost-budget accumulator.
Builder → send / responses / shutdown / snapshot
from blazen import Bot, Model
bot = await Bot.builder(
Model.openai(),
system_prompt="You are a concise assistant.",
history_tokens=8192, # conversation-memory token budget (ChatWindow)
max_iterations=10, # cap agentic tool-call rounds per turn
idle_timeout_ms=300_000, # shut down after 5 min idle
cost_budget_usd=1.0, # abort once cost exceeds $1
inject_time=True, # prepend current UTC time + add current-time tool
summarize=False, # compact old history via summarization
).build()
# Subscribe BEFORE sending so the first reply is never missed.
stream = bot.responses()
bot.send("Hello!") # non-blocking: drives one agentic turn
async for reply in stream:
print(reply)
bot.shutdown() # terminate the live event loop
import { Model, BotBuilder } from "blazen";
const model = Model.openai({ apiKey: process.env.OPENAI_API_KEY });
const bot = await new BotBuilder(model)
.systemPrompt("You are concise.")
.historyTokens(8192)
.maxIterations(10)
.idleTimeout(300) // seconds
.costBudgetUsd(1.0)
.injectTime(true)
.build();
// `responses` pushes each reply to the callback until the bot shuts down.
bot.responses((reply) => console.log(reply.text));
bot.send("Hello!"); // non-blocking
// ... later ...
bot.shutdown();
use blazen_core::bot::Bot;
use blazen_llm::providers::openai::OpenAiProvider;
use std::sync::Arc;
use std::time::Duration;
use tokio_stream::StreamExt;
let model = Arc::new(OpenAiProvider::from_env()?);
let bot = Bot::builder(model)
.system_prompt("You are concise.")
.history_tokens(8192)
.max_iterations(10)
.idle_timeout(Duration::from_secs(300))
.cost_budget_usd(1.0)
.build()
.await?;
let mut responses = bot.responses(); // subscribe first
bot.send("Hello!")?; // non-blocking
if let Some(reply) = responses.next().await {
println!("{}", reply.text);
}
bot.shutdown()?;
The four lifecycle methods are:
send(text)— emit a message into the bot, driving one agentic turn. Non-blocking; the reply arrives on theresponsesstream. Fails once the loop has exited (aftershutdown, an idle timeout, or a cost-budget breach).responses()— subscribe to the bot’s replies. Each subscription starts from the current point in time; replies emitted before you subscribe are not replayed, so subscribe before the firstsend.shutdown()— terminate the live event loop. After this,sendfails.snapshot()— capture a serializable snapshot of the bot’s state (including its persisted conversation memory) without stopping the loop.
Conversation memory across turns
A bot remembers. Each turn’s user message and assistant reply are persisted into a ChatWindow — a rolling buffer capped at history_tokens — and reloaded on the next turn, so the model sees the running conversation. When the window exceeds its token budget, the oldest non-system messages are evicted. See the Chat Window guide for the eviction semantics.
Current-time injection
With inject_time enabled (the default), each turn prepends the current UTC time to the system prompt and registers a current-time tool so the model can fetch the time on demand. Set it to false for deterministic, time-independent turns (useful in tests).
Summarization of old history
With summarize enabled, the bot compacts the older prefix of the conversation via LLM summarization before each turn, keeping recent messages verbatim. This lets a long-running chat retain the gist of old context instead of merely dropping the oldest messages when the window fills. It costs an extra summarization call per turn, so it is off by default.
Pause / resume via snapshots
snapshot() captures the bot’s full state — including its persisted ChatWindow — as a serializable WorkflowSnapshot without stopping the loop. Persist that snapshot, restart the process later, and resume the workflow to continue the same conversation from exactly where it left off. This is the path for pausing and resuming a long-lived agent across restarts (distinct from emit, which feeds a live loop rather than reviving a stopped one).
Prompt caching
For a long-lived agent, the system prompt, tool definitions, and accumulated history are re-sent on every turn. Prompt caching lets the provider serve that stable prefix from its own cache instead of re-processing it, which cuts input-token cost and latency on every turn after the first. Blazen turns this on by default wherever the provider supports it.
CachePolicy
A request’s caching behavior is set by its CachePolicy:
Auto(default) — the provider caches its stable prefix using its native mechanism. On by default; you do not have to do anything.Off— disable prompt caching for this request (where the provider allows it).Explicit { breakpoints }— pin an exact number of cache breakpoints at the end of the stable prefix. Providers cap the count (Anthropic allows 4); providers without explicit-breakpoint support treat it asAuto.Handle { id }— reference a provider-managed cache by id (e.g. a GeminicachedContents/{id}resource). Providers without managed caches treat it asAuto.
from blazen import ModelRequest, ChatMessage, CachePolicy
# Explicit breakpoints (Anthropic and other supporting providers):
req = ModelRequest(messages=[ChatMessage.user("...")],
cache=CachePolicy.explicit(2))
# Disable caching for a one-off request:
req = ModelRequest(messages=[ChatMessage.user("...")],
cache=CachePolicy.off())
import { CachePolicy } from "blazen";
CachePolicy.auto(); // provider picks breakpoints (default)
CachePolicy.off(); // disable caching
CachePolicy.explicit(2); // place 2 explicit breakpoints
CachePolicy.handle("cachedContents/abc123"); // reference a managed cache
use blazen_llm::types::cache_control::CachePolicy;
let policy = CachePolicy::Explicit { breakpoints: 2 };
let off = CachePolicy::Off;
let handle = CachePolicy::Handle { id: "cachedContents/abc123".into() };
Per-provider support
Caching mechanisms differ by provider. Under Auto, Blazen applies whatever the provider supports; providers with no caching simply ignore the policy.
| Provider | Prompt caching |
|---|---|
| Anthropic | Ephemeral cache breakpoints (Auto + Explicit) |
| OpenAI, Azure | Automatic prefix caching |
| Groq, Together, xAI, Fireworks, Mistral, Moonshot, OpenRouter, Fal, DeepSeek | Automatic prefix caching |
| Bedrock | Automatic prefix caching |
| Gemini | Implicit automatic caching + explicit managed cachedContents (Handle) |
| Perplexity, Cohere | None |
Cache token metrics
When a provider reports cache activity, it surfaces on the response’s token usage:
cached_input_tokens— input tokens that were served from the prompt cache (a cache hit). These are the tokens you did not pay full price to re-process.cache_creation_tokens— input tokens that were written into the provider’s prompt cache (a cache write, charged at the cache-creation rate on the turn that populates it).
resp = await model.complete([ChatMessage.user("...")])
print(resp.usage.cached_input_tokens) # tokens served from cache (hit)
print(resp.usage.cache_creation_tokens) # tokens written to cache (write)
const resp = await model.complete([ChatMessage.user("...")]);
console.log(resp.usage.cachedInputTokens); // tokens served from cache
console.log(resp.usage.cacheCreationTokens); // tokens written to cache
Watch cached_input_tokens climb across the turns of a long conversation — that is the cache doing its job.
Managed caches with CacheManager
For providers with explicit managed caches (Gemini cachedContents), CacheManager is the lifecycle wrapper: create a named cache from a large context once, get back a handle/id with a TTL, and reference it on later requests via CachePolicy::Handle. Providers without managed-cache support surface an “unsupported” error from create / get / delete, and an empty list from list — so call supports() first.
from blazen import CacheManager, CacheCreateRequest, CachePolicy, Model, ModelRequest
mgr = CacheManager(Model.gemini())
if mgr.supports():
handle = await mgr.create(CacheCreateRequest(
model="gemini-2.5-flash",
messages=[...], # the large, reusable context to cache
ttl_seconds=3600,
))
# Reference the managed cache on later requests:
req = ModelRequest(messages=[...], cache=CachePolicy.handle(handle.id))
# ... use the handle across many requests, then clean up:
await mgr.delete(handle.id)
import { CacheManager, CachePolicy } from "blazen";
const mgr = new CacheManager(model);
if (mgr.supports()) {
const handle = await mgr.create(
{ model: "gemini-2.5-flash", ttlSeconds: 3600 },
[/* the large, reusable context messages */],
);
// reference it later via CachePolicy.handle(handle.id)
await mgr.delete(handle.id);
}
CacheManager exposes create, get, delete, and list (plus *_blocking siblings on the Go/Ruby UniFFI surface for synchronous callers). A managed cache is the right tool when a single large context — a document, a codebase, a knowledge base — is reused across many requests; automatic Auto caching covers the common case of a stable system-prompt + tool prefix without any explicit management.
See also
- Chat Window — the token-limited conversation buffer a
Botuses for memory - Memory — long-term semantic recall to pair with a bot’s short-term window
- Telemetry — track usage, cost, and cache-hit metrics across runs
- Distributed — run live workflows across multiple peers