Skip to main content

LLM Configuration

Configure AI providers for decision synthesis in self-hosted Align.

Overview

Align uses LLMs for:

  • Decision synthesis - Extracting structured decisions from conversations
  • Context understanding - Understanding the surrounding discussion
  • Embeddings - Semantic search across decisions

Provider Options

ProviderProsCons
OpenAIBest quality, easy setupData leaves your infra
AnthropicHigh quality, safety focusData leaves your infra
GPU InferenceFull sovereignty, flat cost, Align modelsGPU hardware required
CPU InferenceSovereignty, no GPU neededSlower, smaller models only

Option 1: OpenAI

Setup

  1. Get an API key from platform.openai.com

  2. Create the secret:

kubectl create secret generic align-llm \
--namespace align \
--from-literal=openai-api-key="sk-..."
  1. Configure in Helm values:
secrets:
llm:
openaiApiKey: "" # Pulled from secret

Models Used

  • GPT-4 - Decision synthesis
  • text-embedding-3-small - Embeddings (or local)

Option 2: Anthropic

Setup

  1. Get an API key from console.anthropic.com

  2. Create the secret:

kubectl create secret generic align-llm \
--namespace align \
--from-literal=anthropic-api-key="sk-ant-..."
  1. Configure in Helm values:
secrets:
llm:
anthropicApiKey: "" # Pulled from secret

Models Used

  • Claude 3 - Decision synthesis

Run Align's proprietary decision models or open-source LLMs on GPU nodes in your cluster. This provides the best combination of quality, speed, cost, and data sovereignty.

How It Works

Align includes a built-in vLLM deployment that runs on GPU nodes:

Brain Service ──► vLLM Server (GPU node) ──► Llama 8B / Align Decision Model

OpenAI-compatible API
(no code changes needed)

The Brain service automatically routes inference to the local vLLM server when LOCAL_LLM_SERVER_URL is set, with cloud APIs as fallback if the local server is unavailable.

Built-in GPU Deployment

Enable GPU inference in your Helm values:

gpu:
# NVIDIA device plugin (detects GPUs on nodes)
devicePlugin:
enabled: true

# vLLM inference server
llmServer:
enabled: true
image:
repository: vllm/vllm-openai
tag: "v0.8.5"
port: 8001
resources:
requests:
memory: "14Gi"
cpu: "2000m"
nvidia.com/gpu: "1"
limits:
memory: "16Gi"
cpu: "4000m"
nvidia.com/gpu: "1"

The Helm chart automatically:

  • Deploys the NVIDIA device plugin DaemonSet on GPU nodes
  • Deploys the vLLM server with GPU resource requests
  • Injects LOCAL_LLM_SERVER_URL into Brain pods
  • Sets up health checks with generous timeouts (model loading takes 60-120s)

GPU Node Requirements

GPUVRAMModelsInstance (AWS)Cost
NVIDIA T416 GBLlama 8B, Mistral 7Bg4dn.xlarge~$380/mo
NVIDIA A10G24 GBLlama 13B, Mixtral 8x7Bg5.xlarge~$660/mo
NVIDIA A10040 GBLlama 70B (quantized)p4d.24xlarge~$7,000/mo

For most deployments, a single NVIDIA T4 running Llama 8B provides excellent quality for decision synthesis at a flat monthly cost.

GPU Node Setup (Kubernetes)

Your GPU nodes need:

  1. NVIDIA drivers installed (use GPU-optimized AMIs like al2023-nvidia@latest on EKS)
  2. Node label: node-type: gpu
  3. Taint: nvidia.com/gpu=true:NoSchedule (prevents non-GPU pods from scheduling)

Example for AWS EKS with Karpenter:

# Karpenter NodePool for GPU workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu
spec:
template:
metadata:
labels:
node-type: gpu
spec:
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
requirements:
- key: node.kubernetes.io/instance-type
operator: In
values: ["g4dn.xlarge", "g4dn.2xlarge"]

Align Decision Models (Enterprise License)

Enterprise customers can run Align's proprietary fine-tuned models that are trained on decision patterns across tenants:

gpu:
llmServer:
enabled: true
image:
# Align's model server (includes fine-tuned weights)
repository: registry.align.tech/align/llm-server
tag: "v1.0.0"

Model weights are distributed via the same container registry used for all Align images. Your license JWT gates registry access - the same credentials that pull Brain/Gateway images also pull the model server. Models run entirely in your infrastructure with no phone-home.

Environment Variables

VariableDefaultDescription
LOCAL_LLM_SERVER_URL(none)URL of local vLLM server. Auto-set by Helm when gpu.llmServer.enabled
LOCAL_LLM_MODELmeta-llama/Llama-3.1-8B-InstructModel name served by vLLM
LOCAL_LLM_FOR_SCANS_ONLYtrueWhen true, only Discover scans use local GPU; synthesis stays on cloud API for quality. Set false to route all operations to local GPU

Routing Behavior

When LOCAL_LLM_SERVER_URL is set, routing depends on LOCAL_LLM_FOR_SCANS_ONLY:

Scan-only mode (LOCAL_LLM_FOR_SCANS_ONLY=true, default):

  • Decision synthesis (relationship detection, analysis) - routes to cloud API (Claude Sonnet/GPT-4o) for best quality
  • High-throughput operations (Discover scans, fast analysis) - routes to local GPU to save API costs. These are identified by internal force flags, which cover Discover historical scans and other bulk operations.
  • Fallback - if the local server is unreachable or returns errors, operations fall back to cloud APIs

This is the recommended mode: cloud APIs provide the best quality for customer-facing synthesis, while the local GPU handles high-volume operations at flat cost.

Full local mode (LOCAL_LLM_FOR_SCANS_ONLY=false):

  • All normal requests route to the local GPU server
  • Force-cloud operations bypass local and use cloud APIs
  • Fallback - if the local server is unreachable or returns errors, requests fall back to cloud APIs

Use full local mode for maximum data sovereignty or to eliminate cloud API costs entirely.


Option 4: Self-Hosted Models (CPU)

For deployments without GPU hardware, use CPU-based inference servers.

Supported Servers

Any server implementing the OpenAI API format:

ServerUse CaseSetup
OllamaEasy local deploymentollama serve
vLLMProduction GPU inferenceDocker/K8s
LocalAICPU-friendlyDocker
llama.cppGGUF models on CPUBinary/Docker

Quick Start with Ollama

  1. Deploy Ollama in your cluster:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: align
spec:
replicas: 1
selector:
matchLabels:
app: ollama
template:
metadata:
labels:
app: ollama
spec:
containers:
- name: ollama
image: ollama/ollama:latest
ports:
- containerPort: 11434
resources:
limits:
nvidia.com/gpu: 1 # Optional: GPU
volumeMounts:
- name: models
mountPath: /root/.ollama
volumes:
- name: models
persistentVolumeClaim:
claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
name: ollama
namespace: align
spec:
selector:
app: ollama
ports:
- port: 11434
  1. Pull a model:
kubectl exec -it deploy/ollama -n align -- ollama pull llama3:70b
  1. Configure Align:
# values.yaml
secrets:
llm:
custom:
baseUrl: "http://ollama.align.svc.cluster.local:11434/v1"
model: "llama3:70b"
apiKey: "" # Not needed for Ollama
useLocalEmbeddings: true

vLLM for Production

For high-throughput production use:

apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm
namespace: align
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3-70b-chat-hf"
- "--tensor-parallel-size"
- "4"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 4
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token

Configure Align:

secrets:
llm:
custom:
baseUrl: "http://vllm.align.svc.cluster.local:8000/v1"
model: "meta-llama/Llama-3-70b-chat-hf"

Embeddings

Local Embeddings (Default)

Align uses local sentence-transformers by default:

  • Model: all-MiniLM-L6-v2 (384 dimensions)
  • Cost: Free (runs locally in Brain pod)
  • Privacy: Data never leaves your cluster

Enable with:

secrets:
llm:
useLocalEmbeddings: true

OpenAI Embeddings

To use OpenAI embeddings instead:

secrets:
llm:
useLocalEmbeddings: false
openaiApiKey: "sk-..."

Discover Scan Tuning

The Discover feature scans connected tools (Slack, GitHub, Jira, Teams) for historical decisions. Align uses platform-specific batch sizes optimized for each connector type - no manual tuning is needed for most deployments.

How It Works

  • Batch sizes are automatically set per platform (Slack: 12, GitHub: 5, Jira: 8) to balance detection quality with speed
  • Confidence thresholds are platform-specific - messaging platforms (Slack, Teams) use lower thresholds to catch implicit decisions in threads
  • LLM token budgets scale dynamically with batch size - smaller batches get less output budget, reducing latency and cost
  • Event-driven completion - with Redis, scan progress updates are instant via pub/sub; in non-Kubernetes/local dev setups without Redis, a 15-second heartbeat provides reliable completion detection (Helm-based Kubernetes deployments require Redis)
  • GPU inference note - Discover scans use force flags (force_openai / force_anthropic) to bypass the local GPU server and use cloud APIs directly. This is because bulk scan operations need high throughput and rate limits that a single GPU can't match. Normal decision analysis (synthesis, relationship detection) uses the local GPU

Brain Service Configuration

VariableDefaultDescription
HISTORICAL_ANALYSIS_MODELgpt-4o-miniModel used for scanning historical items
PREFERRED_PROVIDERopenaiLLM provider: openai, anthropic, or custom
ANTHROPIC_MODELclaude-sonnet-4-20250514Model when using Anthropic provider

Recommendations by Provider

Local models (Ollama/vLLM):

brain:
extraEnv:
- name: PREFERRED_PROVIDER
value: "custom"
- name: HISTORICAL_ANALYSIS_MODEL
value: "llama3:70b" # Or your preferred model

Cloud APIs (OpenAI):

brain:
extraEnv:
- name: PREFERRED_PROVIDER
value: "openai"
# Uses gpt-4o-mini by default - best cost/quality ratio

Cloud APIs (Anthropic):

brain:
extraEnv:
- name: PREFERRED_PROVIDER
value: "anthropic"
- name: ANTHROPIC_MODEL
value: "claude-sonnet-4-20250514"

Gateway Scan Performance

For tuning scan parallelism and worker concurrency, see the Discover Tuning section in the Configuration Reference.

IMPORT_BATCH_SIZE is now a global override

Batch sizes are now platform-specific and optimized automatically. If IMPORT_BATCH_SIZE is set, it acts as a global override for all platforms. In most cases you can safely remove this variable; if you keep it, it will override all platform-specific defaults.


GPU Inference (vLLM)

TaskModelVRAMNotes
Decision synthesismeta-llama/Llama-3.1-8B-Instruct16 GBBest for T4 GPU, good quality
Decision synthesismistralai/Mistral-7B-Instruct-v0.314 GBStrong quality, fits T4
Decision synthesismeta-llama/Llama-3.1-70B-Instruct40 GB+Best local quality (needs A100)
Align Decision Modelalign/decision-v116 GBFine-tuned for decisions (Enterprise)

CPU Inference (Ollama / llama.cpp)

TaskModelRAMNotes
Decision synthesisllama3:8b8 GBGood quality, reasonable speed
Decision synthesismistral:7b8 GBGood balance
Decision synthesismixtral:8x7b26 GBStrong quality

Cloud APIs

TaskModelNotes
Decision synthesisclaude-sonnet-4Highest quality (cloud)
Decision synthesisgpt-4o-miniBest cost/quality ratio (cloud)
Historical scanninggpt-4o-miniBest for bulk scans (high rate limits)
Historical scanningclaude-haiku-4-5Cheap Anthropic option for scans
Embeddingsall-MiniLM-L6-v2Local, free, runs on CPU in Brain pod

Configuration via UI

You can also configure LLM settings in the Align UI:

  1. Go to SettingsLLM Settings
  2. Select provider
  3. Enter credentials
  4. Save

UI-configured settings are stored encrypted in the database and take precedence over Helm values.


Troubleshooting

Connection refused

Ensure the LLM server is accessible from the Brain pod:

kubectl exec -it deploy/align-brain -n align -- \
curl http://ollama:11434/v1/models

Slow responses

  • Use GPU acceleration (vLLM recommended for production)
  • Reduce model size (8B instead of 70B)
  • Increase Brain pod resources

Model not found

# For Ollama, pull the model first
kubectl exec -it deploy/ollama -n align -- ollama pull llama3:70b

JSON mode issues

Some local models don't support JSON mode reliably. Consider:

  • Using models fine-tuned for structured output
  • Falling back to OpenAI/Anthropic for critical tasks

Security

  • API keys are encrypted at rest in the database (AES-256 via ALIGN_MASTER_ENCRYPTION_KEY)
  • Self-hosted models (GPU or CPU) keep all data in your cluster - no inference data leaves your infrastructure
  • Align's proprietary model weights are distributed via authenticated container registry pulls, gated by your license JWT
  • Model weights run entirely locally after pull - no phone-home or telemetry during inference
  • Use Kubernetes NetworkPolicies to restrict LLM server access to Brain pods only
  • GPU nodes should use dedicated taints (nvidia.com/gpu) to prevent non-GPU workloads from scheduling
  • For air-gapped environments, mirror the model server image to your internal registry