Control Plane

Schedule workflows across a fleet of workers with a central gRPC control plane

The Blazen Control Plane (blazen-controlplane) is a central gRPC server that a fleet of workers connect into. It owns the authoritative view of running workflows, the connected-worker registry, and the assignment queue. Orchestrators submit workflows to it; the control plane decides which worker runs each one and streams the results back.

Where Distributed Workflows (blazen-peer) is a flat mesh of equal peers that dial each other directly, the control plane is hub-and-spoke: workers always open the connection outbound, so only the control plane needs a reachable address. That makes it NAT-friendly and a natural fit for a pool of GPU boxes phoning home to a coordinator.

:::note[Control Plane vs. ModelManager] The control plane orchestrates workflows across workers. It does not centrally manage LLM provider instances or their credentials. Managing named provider instances inside one process is the job of the ModelManager — see Custom Providers. Each worker holds its own provider credentials locally. :::

When to use it

  • You have a pool of workers (GPU machines, agent runners, transcoders) and want to submit work to the pool without knowing which node will run it.
  • You need capability-based routing — “run this on a worker that can do inference:vision:depth” — instead of addressing a specific host.
  • You want a NAT-friendly topology where workers dial in and the coordinator stays put.
  • You need priority scheduling, draining, idempotent submits, or human-in-the-loop input during a run.

If you just want one workflow to delegate a sub-step to one other machine, blazen-peer is simpler. Reach for the control plane when you’re scheduling across many workers.

The worker / orchestrator model

Two kinds of clients connect to one server:

  orchestrator ──┐  submit / cancel / describe / subscribe /
                 │  list-workers / drain / respond-to-input

        ControlPlaneServer
          (registry · queue · admission)

  worker ────────┘  opens a bidi session: Hello → Welcome,
                    then heartbeats, results, and events
  • Server (ControlPlaneServer) — the hub. Keeps the worker registry, the per-capability assignment queue (priority WFQ scheduler), and the admission/routing logic. It does not run workflows itself.
  • Worker (Worker + AssignmentHandler) — dials in over a bidirectional stream, advertises its capabilities, tags, labels, and taints, and runs assignments it’s given.
  • Orchestrator (Client, implementing OrchestratorClient) — submits workflow runs and subscribes to their events.

Run a server

use blazen_controlplane::ControlPlaneServer;

let server = ControlPlaneServer::new("cp-node-1");
// .with_tls(tls_config)   // optional mTLS
// .with_store(arc_store)  // optional durable AssignmentStore (Valkey)
server.serve("0.0.0.0:7445".parse()?).await?;

Connect a worker

A worker pairs a WorkerConfig (what it can do, how it authenticates, how its capacity is bounded) with an AssignmentHandler (what it actually does with an assignment):

use blazen_controlplane::{Worker, WorkerConfig, AssignmentHandler,
    AssignmentContext, AssignmentFailure};
use blazen_controlplane::protocol::Assignment;
use blazen_core::distributed::{AdmissionMode, WorkerCapability};
use async_trait::async_trait;
use serde_json::Value;

struct MyHandler;

#[async_trait]
impl AssignmentHandler for MyHandler {
    async fn handle(&self, assignment: Assignment, ctx: AssignmentContext)
        -> Result<Value, AssignmentFailure>
    {
        // Run the workflow. Optionally emit progress with ctx.emit_event(...)
        // or pause for input with ctx.request_input(...).
        Ok(serde_json::json!({ "ok": true }))
    }
}

let config = WorkerConfig::new("http://cp-host:7445", "worker-a")
    .with_capability(WorkerCapability { kind: "inference:llm:ollama".into(), version: 1 })
    .with_tag("region", "us-west")
    .with_admission(AdmissionMode::Fixed { max_in_flight: 4 });

Worker::connect(config)?.run(MyHandler).await?;

The worker reconnects automatically under its configured retry policy if the connection drops.

Drive it from an orchestrator

Client implements blazen_core::distributed::OrchestratorClient:

use blazen_controlplane::Client;
use blazen_core::distributed::{OrchestratorClient, SubmitWorkflowRequest};

let client = Client::connect("http://cp-host:7445", None /* tls */, None /* bearer */).await?;

let snapshot = client.submit_workflow(SubmitWorkflowRequest {
    workflow_name: "summarize".into(),
    workflow_version: None,
    input: serde_json::json!({ "url": "https://example.com" }),
    required_tags: vec!["region=us-west".into()],
    idempotency_key: Some("dedupe-1".into()),
    deadline_ms: Some(30_000),
    wait_for_worker: true,
    resource_hint: None,
}).await?;

// Also available: cancel_workflow, describe_workflow, subscribe_run_events,
// list_workers — plus respond_to_input and drain_worker on Client directly.

Capability-based routing

Every assignment names a required capability { kind, version }. The control plane routes it through a stateless filter pipeline:

  1. Capability match — only workers advertising the exact kind@version.
  2. Tag predicate — drop workers whose tags don’t AND-match every key=value (or key=*) requirement.
  3. Node selector + taints/tolerationsrequired labels must all match, forbidden none, preferred adds to the tie-break score; un-tolerated NoSchedule taints exclude a worker, PreferNoSchedule only penalizes it.
  4. Capacity filter — drop workers whose per-worker admission strategy is full.
  5. Tie-break — least-loaded remaining worker (round-robin on ties), with a bonus for workers that already have the requested model resident.

The kind strings (inference:llm:*, inference:vision:*, media:*, agent:*, …) and the node-label / taint / priority conventions are the authoritative registry in the Capability Namespaces document — pick kinds from there rather than inventing strings.

Per-worker admission strategy

Each worker declares how its capacity is bounded:

StrategyMeaning
Fixed { max_in_flight }Hard count cap. Best for fungible CPU work.
VramBudget { max_vram_mb }VRAM-sum cap; assignments must carry a resource_hint.vram_mb.
ReactiveWorker self-decides via offer / claim / decline.

For Reactive workers the server sends an offer; the worker replies Claim or Decline { reason }, and the queue re-routes on decline.

Priority and fairness

Assignments carry a priority: u8 (lower = served first; default 128). The queue runs a deficit-round-robin weighted-fair-queueing scheduler across 8 priority bands (width 32, weights stepping 256 → 32), so high-priority work is favored but low-priority work never starves.

Authentication

The control plane uses the same auth surface as blazen-peer:

  • Bearer token — a shared secret sent as authorization: Bearer <token> metadata on every request. Read from BLAZEN_PEER_TOKEN, or supplied explicitly: orchestrators pass it to Client::connect, workers via WorkerConfig::with_bearer_token. The server verifies it on every RPC with a constant-time compare. If no token is configured server-side, auth is off — intended for local / loopback development only.
  • mTLS — configure the server with with_tls and clients with a ClientTlsConfig (or the with_mtls PEM-loading helpers). Recommended for production fleets.

The crate also describes planned enrollment policies — Open, Allowlist, and Signed-handshake — but only the bearer + mTLS surface above is enforced today.

Human-in-the-loop input requests

A running assignment can pause, ask for input, and resume with the answer — useful for approvals, clarifying questions, or any human-in-the-loop step.

  1. The worker calls ctx.request_input(prompt, metadata, timeout_ms). This emits an "input.request" event carrying { request_id, prompt, metadata } and blocks.
  2. The control plane fans that event out to subscribers (subscribe_run_events / subscribe_all), so an orchestrator — or a UI behind it — sees the request.
  3. The orchestrator answers with client.respond_to_input(run_id, request_id, response).
  4. The control plane delivers the answer to the worker assigned that run, and request_input returns the JSON value to the handler so the workflow continues.

The call fails if the run is cancelled, the worker disconnects, or timeout_ms elapses with no answer.

Transports and durability

  • gRPC (default) over HTTP/2, with optional mTLS.
  • HTTP/SSE bridge (http-transport feature) for clients that can’t speak HTTP/2 — browsers and some serverless / wasi targets.
  • OpenAI-compatible REST (http-rest feature) plus Blazen-admin endpoints.
  • Durable assignment queue (valkey-store feature) so queued and in-flight work survives a control-plane restart; the default store is in-memory.