LLM

Call chat completion and embedding models in Kotlin


CompletionModel is the chat-completion handle and EmbeddingModel is the embedding handle. Both are constructed via free-function factories from dev.zorpx.blazen.uniffi and accept the same idiomatic call shapes across every provider Blazen supports.

Picking a provider

The package exposes one new*CompletionModel / new*EmbeddingModel free function per supported backend. Each cloud factory takes an API key (pass the empty string to fall back to the provider’s well-known environment variable, like OPENAI_API_KEY), an optional model id, and an optional baseUrl override for proxies and self-hosted gateways:

import dev.zorpx.blazen.uniffi.newAnthropicCompletionModel
import dev.zorpx.blazen.uniffi.newGeminiCompletionModel
import dev.zorpx.blazen.uniffi.newGroqCompletionModel
import dev.zorpx.blazen.uniffi.newOpenaiCompletionModel

val openai    = newOpenaiCompletionModel(apiKey = "", model = "gpt-4o-mini", baseUrl = null)
val anthropic = newAnthropicCompletionModel(apiKey = "", model = "claude-3-5-sonnet-latest", baseUrl = null)
val gemini    = newGeminiCompletionModel(apiKey = "", model = "gemini-1.5-flash", baseUrl = null)
val groq      = newGroqCompletionModel(apiKey = "", model = "llama-3.1-70b-versatile", baseUrl = null)

The full list covers OpenAI, Anthropic, Gemini, Azure OpenAI, AWS Bedrock, OpenRouter, Groq, Together, Mistral, DeepSeek, Fireworks, Perplexity, xAI, Cohere, fal.ai, and a generic newOpenaiCompatCompletionModel(...) factory for anything that speaks the standard OpenAI Chat Completions wire format (vLLM, llama-server, LM Studio, …).

Each handle implements AutoCloseable — use model.use { } (or try { ... } finally { model.close() }) to release the native handle when you’re done.
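
To make the cleanup guarantee concrete, here is a minimal sketch of what use { } does when the body throws. FakeHandle is a hypothetical stand-in for a native-backed model handle, not a Blazen type. It only records whether close() ran.

```kotlin
// Hypothetical stand-in for a native-backed handle. The real CompletionModel
// comes from dev.zorpx.blazen.uniffi; this class only tracks close().
class FakeHandle : AutoCloseable {
    var closed = false
        private set

    fun work(fail: Boolean): String {
        if (fail) throw IllegalStateException("simulated failure")
        return "ok"
    }

    override fun close() {
        closed = true
    }
}

// Runs work() inside use { } and reports the outcome plus the closed flag.
fun runAndReport(fail: Boolean): Pair<String, Boolean> {
    val handle = FakeHandle()
    val outcome = try {
        // use { } calls close() whether the lambda returns or throws.
        handle.use { it.work(fail) }
    } catch (e: IllegalStateException) {
        "failed: ${e.message}"
    }
    return outcome to handle.closed
}

fun main() {
    println(runAndReport(fail = false)) // (ok, true)
    println(runAndReport(fail = true))  // (failed: simulated failure, true)
}
```

In both cases the handle ends up closed, which is why use { } is usually preferable to a bare close() call at the end of a function.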

Building a request

CompletionRequest carries the message list plus the usual knobs. Every named argument except messages is nullable — pass null to use the provider’s default:

import dev.zorpx.blazen.uniffi.ChatMessage
import dev.zorpx.blazen.uniffi.CompletionRequest

val request = CompletionRequest(
    messages = listOf(
        ChatMessage(role = "system", content = "You are a helpful assistant.",
                    mediaParts = emptyList(), toolCalls = emptyList(),
                    toolCallId = null, name = null),
        ChatMessage(role = "user", content = "What is the capital of France?",
                    mediaParts = emptyList(), toolCalls = emptyList(),
                    toolCallId = null, name = null),
    ),
    tools = emptyList(),
    temperature = 0.7,
    maxTokens = 256u,
    topP = null,
    model = null,
    responseFormatJson = null,
    system = null,
)

maxTokens is an unsigned UInt?, so its literal takes the u suffix; temperature and topP are plain Double? values.

Building messages

ChatMessage is a data class with one role-agnostic constructor. The mediaParts and toolCalls lists default to empty only in the hand-written wrapper layer — when constructing the generated dev.zorpx.blazen.uniffi.ChatMessage directly, you must pass emptyList() explicitly:

// User message
ChatMessage(role = "user", content = "Hello!",
            mediaParts = emptyList(), toolCalls = emptyList(),
            toolCallId = null, name = null)

// Tool response (carries toolCallId so the model can correlate)
ChatMessage(role = "tool", content = "temperature=22C",
            mediaParts = emptyList(), toolCalls = emptyList(),
            toolCallId = "call_abc123", name = null)

// Multimodal user message (see the Multimodal guide)
ChatMessage(role = "user", content = "What does this image show?",
            mediaParts = listOf(imagePart), toolCalls = emptyList(),
            toolCallId = null, name = null)
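
If you construct the generated class directly and the repeated emptyList() and null arguments get noisy, a few small helpers can hide them. The sketch below is an illustration only: the helper names (message, system, user, toolResult) are hypothetical and not part of Blazen. ChatMessage, ToolCall, and MediaPart are declared locally as stand-ins so the snippet compiles on its own. MediaPart in particular is a placeholder name for whatever element type mediaParts actually holds.

```kotlin
// Local stand-ins so the sketch compiles alone. In a real project, import
// ChatMessage and ToolCall from dev.zorpx.blazen.uniffi instead.
// MediaPart is a placeholder name, not a confirmed Blazen type.
class MediaPart
data class ToolCall(val id: String, val name: String, val argumentsJson: String)
data class ChatMessage(
    val role: String,
    val content: String,
    val mediaParts: List<MediaPart>,
    val toolCalls: List<ToolCall>,
    val toolCallId: String?,
    val name: String?,
)

// Hypothetical helper: fills in the empty lists and nulls for plain text messages.
fun message(role: String, content: String, toolCallId: String? = null) = ChatMessage(
    role = role, content = content,
    mediaParts = emptyList(), toolCalls = emptyList(),
    toolCallId = toolCallId, name = null,
)

fun system(text: String) = message("system", text)
fun user(text: String) = message("user", text)

// Tool results carry the id of the call they answer.
fun toolResult(callId: String, output: String) = message("tool", output, toolCallId = callId)

fun main() {
    val conversation = listOf(
        system("You are a helpful assistant."),
        user("What is the capital of France?"),
    )
    conversation.forEach { println("${it.role}: ${it.content}") }
}
```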

Non-streaming completions

CompletionModel.complete(request) is a suspend fun that returns a CompletionResponse:

import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    val model = newOpenaiCompletionModel(apiKey = "", model = "gpt-4o-mini", baseUrl = null)
    model.use {
        val response = it.complete(request)
        println(response.content)
        println("tokens: in=${response.usage.promptTokens} out=${response.usage.completionTokens}")
        println("finish: ${response.finishReason}")
    }
}

The response carries content: String (the assistant’s text reply), toolCalls: List<ToolCall> (any tool calls the model emitted), finishReason: String, model: String, and usage: TokenUsage.

For callers in non-suspend contexts (a Spring controller, an old-style Runnable, a quick CLI script), completeBlocking(request) blocks the current thread on the shared Tokio runtime and returns the same CompletionResponse.

Streaming completions

completeStreaming(model, request, sink) is a top-level suspend fun that drives the stream through a CompletionStreamSink callback. The most ergonomic way to consume one is to wrap it in a Flow<StreamChunk> via callbackFlow; see the Streaming guide for the full pattern. The short version, assuming a streamChunks extension that wraps completeStreaming as described in that guide:

model.streamChunks(request).collect { chunk ->
    print(chunk.contentDelta)
}

Cancelling the collecting coroutine tears the stream down and stops the underlying provider connection.

Tool calls

A request with tools declared lets the model emit tool invocations. Each ToolCall in the response carries an id, the name of the tool, and argumentsJson — a JSON string the model emitted. Hand that back to your tool, then feed the result into a follow-up tool-role message:

import dev.zorpx.blazen.uniffi.Tool
import dev.zorpx.blazen.uniffi.ToolCall

val weatherTool = Tool(
    name = "get_weather",
    description = "Get the current weather for a city.",
    parametersJson = """{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}""",
)

val messages = mutableListOf(
    ChatMessage(role = "user", content = "Weather in Paris?",
                mediaParts = emptyList(), toolCalls = emptyList(),
                toolCallId = null, name = null),
)
val first = model.complete(CompletionRequest(
    messages = messages, tools = listOf(weatherTool),
    temperature = null, maxTokens = null, topP = null,
    model = null, responseFormatJson = null, system = null,
))

// Append the assistant message (mirroring the upstream protocol).
messages.add(ChatMessage(
    role = "assistant", content = first.content,
    mediaParts = emptyList(), toolCalls = first.toolCalls,
    toolCallId = null, name = null,
))

for (call in first.toolCalls) {
    val result = runWeatherTool(call.argumentsJson)
    messages.add(ChatMessage(
        role = "tool", content = result,
        mediaParts = emptyList(), toolCalls = emptyList(),
        toolCallId = call.id, name = null,
    ))
}

val second = model.complete(CompletionRequest(
    messages = messages, tools = listOf(weatherTool),
    temperature = null, maxTokens = null, topP = null,
    model = null, responseFormatJson = null, system = null,
))
println(second.content)
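
The loop above calls runWeatherTool directly, which works while only one tool is declared. With several tools, each call has to be routed by call.name. A minimal sketch of such a dispatcher follows; the handlers map, the dispatch function, and the error-string convention are assumptions for illustration, and ToolCall is redeclared locally as a stand-in with the three fields described earlier.

```kotlin
// Local stand-in with the fields described above. In a real project, import
// dev.zorpx.blazen.uniffi.ToolCall instead.
data class ToolCall(val id: String, val name: String, val argumentsJson: String)

// Hypothetical registry: tool name -> handler taking the raw JSON arguments.
// A real handler would parse argumentsJson with a JSON library of your choice.
val handlers: Map<String, (String) -> String> = mapOf(
    "get_weather" to { args: String -> "temperature=22C (args: $args)" },
    "always_fails" to { _: String -> throw IllegalArgumentException("bad arguments") },
)

// Routes a call to its handler. Unknown tools and handler failures are
// reported back as text so the model can react in the next turn.
fun dispatch(call: ToolCall): String {
    val handler = handlers[call.name] ?: return "error: unknown tool '${call.name}'"
    return try {
        handler(call.argumentsJson)
    } catch (e: Exception) {
        "error: ${e.message}"
    }
}

fun main() {
    println(dispatch(ToolCall("call_1", "get_weather", """{"city":"Paris"}""")))
    println(dispatch(ToolCall("call_2", "get_time", "{}")))
}
```

Returning the error as the tool message content, rather than throwing, is one possible design: it keeps every toolCallId answered, which some providers require before they accept the follow-up request.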

For the full tool-call loop with iteration budgets and error handling, use the Agent type instead of orchestrating turns by hand.

Embeddings

EmbeddingModel.embed(inputs) returns one vector per input string. Build a model via newOpenaiEmbeddingModel(...) for OpenAI, newFalEmbeddingModel(...) for fal.ai, or one of the local factories (newFastembedEmbeddingModel(...), newTractEmbeddingModel(...), newCandleEmbeddingModel(...)):

import dev.zorpx.blazen.uniffi.newOpenaiEmbeddingModel

val embedder = newOpenaiEmbeddingModel(
    apiKey = "",
    model = "text-embedding-3-small",
    baseUrl = null,
)
val inputs = listOf(
    "blazen orchestrates LLM workflows",
    "kotlin loves coroutines",
)
embedder.use {
    val response = it.embed(inputs)
    // Pair each input with its vector (one vector per input, in the same order).
    for ((input, vector) in inputs.zip(response.embeddings)) {
        println("$input -> ${vector.size} dims, first=${vector.first()}")
    }
    println("model: ${it.modelId()}, dimensions: ${it.dimensions()}")
}

modelId() and dimensions() are method calls (not properties) because the generated UniFFI surface exposes them as fun rather than val.
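
A common next step with the returned vectors is to compare them. The sketch below computes cosine similarity in plain Kotlin; cosineSimilarity is a hypothetical helper, not a Blazen function. It accepts List<Number> so that it works whether the binding exposes the components as Float or Double, which this page does not specify.

```kotlin
import kotlin.math.sqrt

// Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|).
// List<Number> accepts both List<Float> and List<Double>.
fun cosineSimilarity(a: List<Number>, b: List<Number>): Double {
    require(a.size == b.size) { "vectors must have the same length" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        val x = a[i].toDouble()
        val y = b[i].toDouble()
        dot += x * y
        normA += x * x
        normB += y * y
    }
    require(normA > 0.0 && normB > 0.0) { "a zero vector has no direction" }
    return dot / (sqrt(normA) * sqrt(normB))
}

fun main() {
    // Orthogonal vectors score 0, parallel vectors score 1.
    println(cosineSimilarity(listOf(1.0, 0.0), listOf(0.0, 1.0)))
    println(cosineSimilarity(listOf(1f, 2f), listOf(2f, 4f)))
}
```

With a real response this would be called as cosineSimilarity(response.embeddings[0], response.embeddings[1]).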

Local backends

For local inference, several feature-gated factories ship in the same namespace:

import dev.zorpx.blazen.uniffi.newCandleCompletionModel
import dev.zorpx.blazen.uniffi.newLlamacppCompletionModel
import dev.zorpx.blazen.uniffi.newMistralrsCompletionModel

val llama = newLlamacppCompletionModel(
    modelPath = "/models/llama-3.2-1b-q4_k_m.gguf",
    device = "cuda:0",
    quantization = null,
    contextLength = 8192u,
    nGpuLayers = 99u,
)

val mistralrs = newMistralrsCompletionModel(
    modelId = "mistralai/Mistral-7B-Instruct-v0.3",
    device = null,
    quantization = null,
    contextLength = null,
    vision = false,
)

val candle = newCandleCompletionModel(
    modelId = "meta-llama/Llama-3.2-1B-Instruct",
    device = null,
    quantization = null,
    revision = null,
    contextLength = null,
)

Each requires the underlying native lib to be built with the corresponding feature flag (llamacpp, mistralrs, candle-llm). The prebuilt artefact ships with the cloud factories always enabled; check the release notes for the per-platform local-backend matrix.

Errors

Every failure surfaces as a BlazenException (sealed class). Switch on the variant for typed handling, or just read .message for a uniform String:

import dev.zorpx.blazen.uniffi.BlazenException

try {
    val response = model.complete(request)
} catch (e: BlazenException) {
    when (e) {
        is BlazenException.Auth          -> println("bad credentials: ${e.message}")
        is BlazenException.RateLimit     -> println("rate limited: ${e.message}")
        is BlazenException.Timeout       -> println("timed out: ${e.message}")
        is BlazenException.Validation    -> println("bad input: ${e.message}")
        is BlazenException.ContentPolicy -> println("blocked: ${e.message}")
        is BlazenException.Provider      -> println("provider ${e.provider} (${e.kind}): ${e.message}")
        else                              -> println("completion failed: ${e.message}")
    }
}

See the Context guide for the full list of variants and their structured fields.
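
Typed variants make it straightforward to retry only the failures that are likely to be transient. The following is a minimal sketch of a blocking retry helper with exponential backoff. retryBlocking, its parameters, and the choice of which variants count as retryable are assumptions for illustration, not part of Blazen. It is written against plain Exception so it runs without the library.

```kotlin
// Hypothetical helper: retries block() up to maxAttempts times, doubling the
// delay after each retryable failure. Non-retryable failures are rethrown at once.
fun <T> retryBlocking(
    maxAttempts: Int = 3,
    initialDelayMs: Long = 200,
    isRetryable: (Exception) -> Boolean,
    block: () -> T,
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var delayMs = initialDelayMs
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return block()
        } catch (e: Exception) {
            if (!isRetryable(e)) throw e
            lastError = e
            if (attempt < maxAttempts) {
                Thread.sleep(delayMs)
                delayMs *= 2
            }
        }
    }
    // Only reached when every attempt failed with a retryable error.
    throw checkNotNull(lastError)
}

// Possible use with the blocking call described earlier (not compiled here):
// val response = retryBlocking(
//     isRetryable = { it is BlazenException.RateLimit || it is BlazenException.Timeout },
// ) { model.completeBlocking(request) }
```

Because it sleeps the calling thread, this variant pairs with completeBlocking; inside a coroutine, the same loop with delay() in place of Thread.sleep() would be the natural equivalent.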
