LLM
Provider factories, chat completion, and embeddings in Go
The Go binding exposes Blazen’s LLM surface through two opaque handles — *blazen.CompletionModel for chat models and *blazen.EmbeddingModel for embedding models — plus a flat set of provider factory functions that mint them. Both handles own native resources; call Close() when finished (or rely on the finalizer as a safety net).
Provider factories
Every supported provider has a New<Name><Surface> factory. The naming follows Go conventions for acronyms (OpenAI, XAI, LlamaCpp, DeepSeek) rather than UniFFI’s PascalCase-of-snake.
HTTP completion providers
Each of these accepts (apiKey, model, baseURL), where model and baseURL use the empty-string-means-unset convention:
model, err := blazen.NewOpenAICompletion(os.Getenv("OPENAI_API_KEY"), "gpt-4o", "")
The factory list, all with the same (apiKey, model, baseURL) shape:
- NewOpenAICompletion
- NewAnthropicCompletion
- NewGeminiCompletion
- NewMistralCompletion
- NewCohereCompletion
- NewDeepSeekCompletion
- NewFireworksCompletion
- NewGroqCompletion
- NewOpenRouterCompletion
- NewPerplexityCompletion
- NewTogetherCompletion
- NewXAICompletion
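The empty-string-means-unset convention used by these factories can be pictured with a small helper. This is only a sketch of the idea: resolve and the example URLs are hypothetical and not part of the blazen package, and the binding may resolve defaults differently internally.

```go
package main

import "fmt"

// resolve returns fallback when value is empty, mirroring the
// empty-string-means-unset convention described above.
// Hypothetical helper for illustration; not part of blazen.
func resolve(value, fallback string) string {
	if value == "" {
		return fallback
	}
	return value
}

func main() {
	// An empty baseURL stands for "use the provider's default endpoint".
	fmt.Println(resolve("", "https://api.example.com/v1"))
	// A non-empty value overrides the default.
	fmt.Println(resolve("http://localhost:8080/v1", "https://api.example.com/v1"))
}
```

One practical consequence: os.Getenv returns "" for an unset variable, so an optional environment override can be passed straight through as baseURL without a nil check.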
Two providers take additional positional arguments because their request topology is different:
- NewAzureCompletion(apiKey, resourceName, deploymentName, apiVersion): Azure OpenAI deployments are named per-resource.
- NewBedrockCompletion(apiKey, region, model, baseURL): AWS Bedrock is region-scoped.

NewOpenAICompatCompletion(providerName, baseURL, apiKey, model) is a generic OpenAI-compatible factory for Ollama, vLLM, LM Studio, LocalAI, and similar. All four arguments are required; pass a placeholder apiKey when the upstream does not require one.
Local-runtime completion providers
Three local-runtime factories take an options struct so the various optional knobs stay readable. All use the empty-string-means-unset and zero-means-default conventions for optional fields:
// Candle (CPU / CUDA / Metal via the Candle runtime)
model, err := blazen.NewCandleCompletion(blazen.CandleCompletionOpts{
	ModelID:       "meta-llama/Llama-3-8B",
	Device:        "cuda:0",
	Quantization:  "q4_0",
	ContextLength: 4096,
})

// llama.cpp (GGUF files)
model, err = blazen.NewLlamaCppCompletion(blazen.LlamaCppCompletionOpts{
	ModelPath:     "/models/llama3-8b-instruct.Q4_K_M.gguf",
	ContextLength: 8192,
	NGpuLayers:    32,
})

// mistral.rs (multimodal-capable local runtime)
model, err = blazen.NewMistralRsCompletion(blazen.MistralRsCompletionOpts{
	ModelID: "mistralai/Mistral-7B-Instruct-v0.3",
	Vision:  false,
})
Fal.ai’s chat surface uses its own options struct because of the enterprise + auto-route toggles:
model, err := blazen.NewFalCompletion(blazen.FalCompletionOpts{
	APIKey: os.Getenv("FAL_KEY"),
	Model:  "fal-ai/any-llm/openai/gpt-4o",
})
Embedding providers
The embedding factories follow the same conventions:
// HTTP
emb, err := blazen.NewOpenAIEmbedding(os.Getenv("OPENAI_API_KEY"), "text-embedding-3-small", "")
// Fal.ai (Matryoshka-style truncation supported)
emb, err = blazen.NewFalEmbedding(blazen.FalEmbeddingOpts{
	APIKey:     os.Getenv("FAL_KEY"),
	Dimensions: 256,
})

// Local: Candle, FastEmbed (ONNX), Tract (pure-Rust ONNX)
emb, err = blazen.NewCandleEmbedding(blazen.CandleEmbeddingOpts{
	ModelID: "sentence-transformers/all-MiniLM-L6-v2",
})
emb, err = blazen.NewFastEmbedEmbedding(blazen.FastEmbedEmbeddingOpts{})
emb, err = blazen.NewTractEmbedding(blazen.TractEmbeddingOpts{})
CompletionRequest
blazen.CompletionRequest is the provider-agnostic chat request shape:
type CompletionRequest struct {
	Messages           []ChatMessage
	Tools              []Tool
	Temperature        *float64 // nil = provider default
	MaxTokens          *uint32  // nil = provider default
	TopP               *float64 // nil = provider default
	Model              string   // "" = use the model bound at factory time
	ResponseFormatJSON string   // "" = no schema constraint
	System             string   // "" = no system message
}
The pointer-typed numeric fields exist so callers can distinguish “explicitly zero” from “unset” — pass nil to defer to the provider. The string fields use the empty-string-means-unset convention.
ResponseFormatJSON, when set, is a JSON Schema string constraining the model’s output (the provider’s structured-output / JSON-mode feature). System, when set, is prepended as a system-role message before Messages.
temp := 0.2
maxTok := uint32(512)
req := blazen.CompletionRequest{
	System:      "You are a concise assistant.",
	Messages:    []blazen.ChatMessage{{Role: "user", Content: "Hello"}},
	Temperature: &temp,
	MaxTokens:   &maxTok,
}
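Taking the address of a literal is not possible in Go, which is why the example above goes through the temp and maxTok variables. A small generic helper can remove that step. The sketch below is illustrative only: ptr, sampling, and describe are hypothetical names, and sampling mirrors just the optional numeric fields of CompletionRequest so the example compiles on its own.

```go
package main

import "fmt"

// ptr returns a pointer to a copy of v.
// Hypothetical helper for illustration; not part of blazen.
func ptr[T any](v T) *T { return &v }

// sampling mirrors only the optional numeric fields of CompletionRequest.
type sampling struct {
	Temperature *float64
	MaxTokens   *uint32
	TopP        *float64
}

// describe reports whether a field was set explicitly or left to the provider.
func describe(name string, p *float64) string {
	if p == nil {
		return name + ": provider default"
	}
	return fmt.Sprintf("%s: %g (explicit)", name, *p)
}

func main() {
	s := sampling{
		Temperature: ptr(0.0), // explicitly zero, which is different from unset
		MaxTokens:   ptr(uint32(512)),
		// TopP stays nil, so the provider default applies.
	}
	fmt.Println(describe("temperature", s.Temperature))
	fmt.Println(describe("top_p", s.TopP))
	fmt.Println("max tokens:", *s.MaxTokens)
}
```

The distinction matters most for Temperature, where 0 is a meaningful setting rather than a missing value.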
Complete and CompleteBlocking
Every CompletionModel exposes two completion entry points:
// Cancellable: launches the FFI call on a background goroutine and
// returns ctx.Err() if ctx fires first.
func (*CompletionModel) Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error)
// Synchronous: blocks the calling goroutine until the provider responds.
func (*CompletionModel) CompleteBlocking(req CompletionRequest) (*CompletionResponse, error)
Prefer Complete in long-running services where cancellation matters. Use CompleteBlocking in short scripts or one-shot main functions where the async wiring is overkill — it has no select dance, no background goroutine, and exactly one error return path.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
res, err := model.Complete(ctx, blazen.CompletionRequest{
	Messages: []blazen.ChatMessage{{Role: "user", Content: "Pi to 5 digits?"}},
})
if err != nil {
	return err
}
fmt.Println(res.Content) // "3.14159"
fmt.Println(res.FinishReason) // "stop"
fmt.Println(res.Usage.TotalTokens) // aggregated tokens
See the Context guide for the cancellation propagation caveat — returning ctx.Err() unblocks the Go caller, but the Rust-side request continues until it finishes naturally.
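The behaviour described above (a background goroutine, an early return with ctx.Err(), and underlying work that keeps running) follows a common Go pattern. The sketch below shows that general pattern, not Blazen's actual implementation; runCancellable, slowCall, and timesOut are hypothetical names, and slowCall merely sleeps to stand in for a provider round trip.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type result struct {
	val string
	err error
}

// runCancellable runs a blocking call on a background goroutine and returns
// early with ctx.Err() if the context fires first. The blocking call is not
// interrupted; it keeps running until it returns on its own.
func runCancellable(ctx context.Context, blocking func() (string, error)) (string, error) {
	// Buffered so the goroutine can deliver its result and exit even when
	// the caller has already given up.
	done := make(chan result, 1)
	go func() {
		v, err := blocking()
		done <- result{v, err}
	}()

	select {
	case r := <-done:
		return r.val, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// slowCall stands in for a provider round trip that takes d to finish.
func slowCall(d time.Duration) func() (string, error) {
	return func() (string, error) {
		time.Sleep(d)
		return "3.14159", nil
	}
}

// timesOut reports whether a call lasting workMs milliseconds is cut off
// by a deadline of limitMs milliseconds.
func timesOut(workMs, limitMs int) bool {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(limitMs)*time.Millisecond)
	defer cancel()
	_, err := runCancellable(ctx, slowCall(time.Duration(workMs)*time.Millisecond))
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("slow call cut off:", timesOut(200, 20))
	fmt.Println("fast call cut off:", timesOut(5, 1000))
}
```

The buffered channel is the detail worth noting: without it, a goroutine whose caller has already returned would block forever on the send.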
CompletionResponse
type CompletionResponse struct {
	Content      string     // empty when the model emitted only tool calls
	ToolCalls    []ToolCall // model-requested tool invocations
	FinishReason string     // "stop", "tool_calls", "length", or ""
	Model        string     // provider-reported model id
	Usage        TokenUsage
}
Content is the empty string when the response carried only tool calls — dispatch on len(res.ToolCalls) > 0 rather than res.Content != "" to detect that case. FinishReason is the empty string when the provider did not report one.
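The dispatch rule above might look like this in practice. The sketch uses local stand-in types so it compiles on its own: toolCall and completionResponse mirror only the fields discussed here, and the Name field on toolCall is an assumption, since the real ToolCall shape is not shown on this page.

```go
package main

import "fmt"

// toolCall and completionResponse are local stand-ins for the blazen types.
// The Name field is an assumption made for this sketch.
type toolCall struct{ Name string }

type completionResponse struct {
	Content      string
	ToolCalls    []toolCall
	FinishReason string
}

// next decides what the caller should do with a response. It checks
// ToolCalls first, because Content may be empty for a tool-only reply.
func next(res completionResponse) string {
	if len(res.ToolCalls) > 0 {
		return "run tools"
	}
	return "final answer"
}

// finishReason substitutes a label when the provider reported no reason.
func finishReason(res completionResponse) string {
	if res.FinishReason == "" {
		return "unreported"
	}
	return res.FinishReason
}

func main() {
	text := completionResponse{Content: "3.14159", FinishReason: "stop"}
	tools := completionResponse{ToolCalls: []toolCall{{Name: "lookup"}}}

	fmt.Println(next(text), "/", finishReason(text))
	fmt.Println(next(tools), "/", finishReason(tools))
}
```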
Embeddings
EmbeddingModel.Embed and EmbeddingModel.EmbedBlocking follow the same cancellable/blocking pattern, with a different signature:
func (*EmbeddingModel) Embed(ctx context.Context, inputs []string) (*EmbeddingResponse, error)
func (*EmbeddingModel) EmbedBlocking(inputs []string) (*EmbeddingResponse, error)
emb, err := blazen.NewOpenAIEmbedding(os.Getenv("OPENAI_API_KEY"), "text-embedding-3-small", "")
if err != nil {
	return err
}
defer emb.Close()

res, err := emb.Embed(ctx, []string{"hello", "world", "blazen"})
if err != nil {
	return err
}
for i, vec := range res.Embeddings {
	fmt.Printf("input[%d] -> %d-dim vector\n", i, len(vec))
}
fmt.Println(emb.Dimensions()) // model's native dimensionality
The returned EmbeddingResponse.Embeddings is a [][]float64 — one vector per input string, in input order. Dimensions() reports the model’s vector size as a uint32, useful for sizing a downstream vector store.
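Before writing vectors to a store sized from Dimensions(), it can be worth confirming that every returned vector actually has that length. The check below is a sketch with a hypothetical helper name, checkDims; it operates on a plain [][]float64 and a uint32, matching the types described above, so it does not depend on the blazen package.

```go
package main

import "fmt"

// checkDims verifies that every vector has exactly dim components.
// Hypothetical helper for illustration; not part of blazen.
func checkDims(vectors [][]float64, dim uint32) error {
	for i, v := range vectors {
		if uint32(len(v)) != dim {
			return fmt.Errorf("vector %d has %d dimensions, want %d", i, len(v), dim)
		}
	}
	return nil
}

func main() {
	good := [][]float64{{0.1, 0.2, 0.3}, {0.4, 0.5, 0.6}}
	bad := [][]float64{{0.1, 0.2, 0.3}, {0.4, 0.5}}

	fmt.Println(checkDims(good, 3))
	fmt.Println(checkDims(bad, 3))
}
```

With the real types, the call would presumably be checkDims(res.Embeddings, emb.Dimensions()).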
Streaming
For incremental token delivery, use blazen.Stream(ctx, model, req). The streaming surface is documented in detail on the Streaming page.
Lifecycle
Every model handle implements Close(). The call is idempotent and safe from multiple goroutines:
model, err := blazen.NewOpenAICompletion(apiKey, "gpt-4o", "")
if err != nil {
	return err
}
defer model.Close()
A runtime.SetFinalizer is attached as a safety net, but explicit Close() is preferred for predictable resource release — particularly when you are mid-stream (see the streaming guide’s “Lifetime” section).
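An idempotent, goroutine-safe Close() is commonly built on sync.Once. The sketch below illustrates that general technique under the assumption that releasing the native resource must happen exactly once; it is not Blazen's actual implementation, and handle and closeConcurrently are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// handle stands in for a model handle that owns a native resource.
type handle struct {
	once     sync.Once
	releases int32 // how many times the resource was actually released
}

// Close releases the resource at most once, no matter how many goroutines
// call it or how often.
func (h *handle) Close() {
	h.once.Do(func() {
		// Stand-in for freeing the native resource.
		atomic.AddInt32(&h.releases, 1)
	})
}

// closeConcurrently calls Close from n goroutines and returns how many
// releases actually happened.
func closeConcurrently(n int) int32 {
	h := &handle{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			h.Close()
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&h.releases)
}

func main() {
	fmt.Println(closeConcurrently(8)) // prints 1
}
```

This is why pairing defer model.Close() with an additional explicit Close() on an error path is harmless.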
See also
- Streaming: channel-based streaming completions.
- Agent: LLM tool-call loop built on CompletionModel.
- Multimodal: attaching images, audio, and video to chat messages.
- Context: cancellation semantics on Complete and Embed.