LLM Configuration
Configure AI providers for decision synthesis in self-hosted Align.
Overview
Align uses LLMs for:
- Decision synthesis - Extracting structured decisions from conversations
- Context understanding - Understanding the surrounding discussion
- Embeddings - Semantic search across decisions
Provider Options
| Provider | Pros | Cons |
|---|---|---|
| OpenAI | Best quality, easy setup | Data leaves your infra |
| Anthropic | High quality, safety focus | Data leaves your infra |
| GPU Inference | Full sovereignty, flat cost, Align models | GPU hardware required |
| CPU Inference | Sovereignty, no GPU needed | Slower, smaller models only |
Option 1: OpenAI
Setup
-
Get an API key from platform.openai.com
-
Create the secret:
kubectl create secret generic align-llm \
--namespace align \
--from-literal=openai-api-key="sk-..."
- Configure in Helm values:
secrets:
llm:
openaiApiKey: "" # Pulled from secret
Models Used
- GPT-4 - Decision synthesis
- text-embedding-3-small - Embeddings (or local)
Option 2: Anthropic
Setup
-
Get an API key from console.anthropic.com
-
Create the secret:
kubectl create secret generic align-llm \
--namespace align \
--from-literal=anthropic-api-key="sk-ant-..."
- Configure in Helm values:
secrets:
llm:
anthropicApiKey: "" # Pulled from secret
Models Used
- Claude 3 - Decision synthesis
Option 3: GPU Inference (Recommended for Enterprise)
Run Align's proprietary decision models or open-source LLMs on GPU nodes in your cluster. This provides the best combination of quality, speed, cost, and data sovereignty.
How It Works
Align includes a built-in vLLM deployment that runs on GPU nodes:
Brain Service ──► vLLM Server (GPU node) ──► Llama 8B / Align Decision Model
│
OpenAI-compatible API
(no code changes needed)
The Brain service automatically routes inference to the local vLLM server when LOCAL_LLM_SERVER_URL is set, with cloud APIs as fallback if the local server is unavailable.
Built-in GPU Deployment
Enable GPU inference in your Helm values:
gpu:
# NVIDIA device plugin (detects GPUs on nodes)
devicePlugin:
enabled: true
# vLLM inference server
llmServer:
enabled: true
image:
repository: vllm/vllm-openai
tag: "v0.8.5"
port: 8001
resources:
requests:
memory: "14Gi"
cpu: "2000m"
nvidia.com/gpu: "1"
limits:
memory: "16Gi"
cpu: "4000m"
nvidia.com/gpu: "1"
The Helm chart automatically:
- Deploys the NVIDIA device plugin DaemonSet on GPU nodes
- Deploys the vLLM server with GPU resource requests
- Injects
LOCAL_LLM_SERVER_URLinto Brain pods - Sets up health checks with generous timeouts (model loading takes 60-120s)
GPU Node Requirements
| GPU | VRAM | Models | Instance (AWS) | Cost |
|---|---|---|---|---|
| NVIDIA T4 | 16 GB | Llama 8B, Mistral 7B | g4dn.xlarge | ~$380/mo |
| NVIDIA A10G | 24 GB | Llama 13B, Mixtral 8x7B | g5.xlarge | ~$660/mo |
| NVIDIA A100 | 40 GB | Llama 70B (quantized) | p4d.24xlarge | ~$7,000/mo |
For most deployments, a single NVIDIA T4 running Llama 8B provides excellent quality for decision synthesis at a flat monthly cost.
GPU Node Setup (Kubernetes)
Your GPU nodes need:
- NVIDIA drivers installed (use GPU-optimized AMIs like
al2023-nvidia@lateston EKS) - Node label:
node-type: gpu - Taint:
nvidia.com/gpu=true:NoSchedule(prevents non-GPU pods from scheduling)
Example for AWS EKS with Karpenter:
# Karpenter NodePool for GPU workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu
spec:
template:
metadata:
labels:
node-type: gpu
spec:
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
requirements:
- key: node.kubernetes.io/instance-type
operator: In
values: ["g4dn.xlarge", "g4dn.2xlarge"]
Align Decision Models (Enterprise License)
Enterprise customers can run Align's proprietary fine-tuned models that are trained on decision patterns across tenants:
gpu:
llmServer:
enabled: true
image:
# Align's model server (includes fine-tuned weights)
repository: registry.align.tech/align/llm-server
tag: "v1.0.0"
Model weights are distributed via the same container registry used for all Align images. Your license JWT gates registry access - the same credentials that pull Brain/Gateway images also pull the model server. Models run entirely in your infrastructure with no phone-home.
Environment Variables
| Variable | Default | Description |
|---|---|---|
LOCAL_LLM_SERVER_URL | (none) | URL of local vLLM server. Auto-set by Helm when gpu.llmServer.enabled |
LOCAL_LLM_MODEL | meta-llama/Llama-3.1-8B-Instruct | Model name served by vLLM |
LOCAL_LLM_FOR_SCANS_ONLY | true | When true, only Discover scans use local GPU; synthesis stays on cloud API for quality. Set false to route all operations to local GPU |
Routing Behavior
When LOCAL_LLM_SERVER_URL is set, routing depends on LOCAL_LLM_FOR_SCANS_ONLY:
Scan-only mode (LOCAL_LLM_FOR_SCANS_ONLY=true, default):
- Decision synthesis (relationship detection, analysis) - routes to cloud API (Claude Sonnet/GPT-4o) for best quality
- High-throughput operations (Discover scans, fast analysis) - routes to local GPU to save API costs. These are identified by internal force flags, which cover Discover historical scans and other bulk operations.
- Fallback - if the local server is unreachable or returns errors, operations fall back to cloud APIs
This is the recommended mode: cloud APIs provide the best quality for customer-facing synthesis, while the local GPU handles high-volume operations at flat cost.
Full local mode (LOCAL_LLM_FOR_SCANS_ONLY=false):
- All normal requests route to the local GPU server
- Force-cloud operations bypass local and use cloud APIs
- Fallback - if the local server is unreachable or returns errors, requests fall back to cloud APIs
Use full local mode for maximum data sovereignty or to eliminate cloud API costs entirely.
Option 4: Self-Hosted Models (CPU)
For deployments without GPU hardware, use CPU-based inference servers.
Supported Servers
Any server implementing the OpenAI API format:
| Server | Use Case | Setup |
|---|---|---|
| Ollama | Easy local deployment | ollama serve |
| vLLM | Production GPU inference | Docker/K8s |
| LocalAI | CPU-friendly | Docker |
| llama.cpp | GGUF models on CPU | Binary/Docker |
Quick Start with Ollama
- Deploy Ollama in your cluster:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: align
spec:
replicas: 1
selector:
matchLabels:
app: ollama
template:
metadata:
labels:
app: ollama
spec:
containers:
- name: ollama
image: ollama/ollama:latest
ports:
- containerPort: 11434
resources:
limits:
nvidia.com/gpu: 1 # Optional: GPU
volumeMounts:
- name: models
mountPath: /root/.ollama
volumes:
- name: models
persistentVolumeClaim:
claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
name: ollama
namespace: align
spec:
selector:
app: ollama
ports:
- port: 11434
- Pull a model:
kubectl exec -it deploy/ollama -n align -- ollama pull llama3:70b
- Configure Align:
# values.yaml
secrets:
llm:
custom:
baseUrl: "http://ollama.align.svc.cluster.local:11434/v1"
model: "llama3:70b"
apiKey: "" # Not needed for Ollama
useLocalEmbeddings: true
vLLM for Production
For high-throughput production use:
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm
namespace: align
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3-70b-chat-hf"
- "--tensor-parallel-size"
- "4"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 4
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
Configure Align:
secrets:
llm:
custom:
baseUrl: "http://vllm.align.svc.cluster.local:8000/v1"
model: "meta-llama/Llama-3-70b-chat-hf"
Embeddings
Local Embeddings (Default)
Align uses local sentence-transformers by default:
- Model:
all-MiniLM-L6-v2(384 dimensions) - Cost: Free (runs locally in Brain pod)
- Privacy: Data never leaves your cluster
Enable with:
secrets:
llm:
useLocalEmbeddings: true
OpenAI Embeddings
To use OpenAI embeddings instead:
secrets:
llm:
useLocalEmbeddings: false
openaiApiKey: "sk-..."
Discover Scan Tuning
The Discover feature scans connected tools (Slack, GitHub, Jira, Teams) for historical decisions. Align uses platform-specific batch sizes optimized for each connector type - no manual tuning is needed for most deployments.
How It Works
- Batch sizes are automatically set per platform (Slack: 12, GitHub: 5, Jira: 8) to balance detection quality with speed
- Confidence thresholds are platform-specific - messaging platforms (Slack, Teams) use lower thresholds to catch implicit decisions in threads
- LLM token budgets scale dynamically with batch size - smaller batches get less output budget, reducing latency and cost
- Event-driven completion - with Redis, scan progress updates are instant via pub/sub; in non-Kubernetes/local dev setups without Redis, a 15-second heartbeat provides reliable completion detection (Helm-based Kubernetes deployments require Redis)
- GPU inference note - Discover scans use force flags (
force_openai/force_anthropic) to bypass the local GPU server and use cloud APIs directly. This is because bulk scan operations need high throughput and rate limits that a single GPU can't match. Normal decision analysis (synthesis, relationship detection) uses the local GPU
Brain Service Configuration
| Variable | Default | Description |
|---|---|---|
HISTORICAL_ANALYSIS_MODEL | gpt-4o-mini | Model used for scanning historical items |
PREFERRED_PROVIDER | openai | LLM provider: openai, anthropic, or custom |
ANTHROPIC_MODEL | claude-sonnet-4-20250514 | Model when using Anthropic provider |
Recommendations by Provider
Local models (Ollama/vLLM):
brain:
extraEnv:
- name: PREFERRED_PROVIDER
value: "custom"
- name: HISTORICAL_ANALYSIS_MODEL
value: "llama3:70b" # Or your preferred model
Cloud APIs (OpenAI):
brain:
extraEnv:
- name: PREFERRED_PROVIDER
value: "openai"
# Uses gpt-4o-mini by default - best cost/quality ratio
Cloud APIs (Anthropic):
brain:
extraEnv:
- name: PREFERRED_PROVIDER
value: "anthropic"
- name: ANTHROPIC_MODEL
value: "claude-sonnet-4-20250514"
Gateway Scan Performance
For tuning scan parallelism and worker concurrency, see the Discover Tuning section in the Configuration Reference.
Batch sizes are now platform-specific and optimized automatically. If IMPORT_BATCH_SIZE is set, it acts as a global override for all platforms. In most cases you can safely remove this variable; if you keep it, it will override all platform-specific defaults.
Recommended Models
GPU Inference (vLLM)
| Task | Model | VRAM | Notes |
|---|---|---|---|
| Decision synthesis | meta-llama/Llama-3.1-8B-Instruct | 16 GB | Best for T4 GPU, good quality |
| Decision synthesis | mistralai/Mistral-7B-Instruct-v0.3 | 14 GB | Strong quality, fits T4 |
| Decision synthesis | meta-llama/Llama-3.1-70B-Instruct | 40 GB+ | Best local quality (needs A100) |
| Align Decision Model | align/decision-v1 | 16 GB | Fine-tuned for decisions (Enterprise) |
CPU Inference (Ollama / llama.cpp)
| Task | Model | RAM | Notes |
|---|---|---|---|
| Decision synthesis | llama3:8b | 8 GB | Good quality, reasonable speed |
| Decision synthesis | mistral:7b | 8 GB | Good balance |
| Decision synthesis | mixtral:8x7b | 26 GB | Strong quality |
Cloud APIs
| Task | Model | Notes |
|---|---|---|
| Decision synthesis | claude-sonnet-4 | Highest quality (cloud) |
| Decision synthesis | gpt-4o-mini | Best cost/quality ratio (cloud) |
| Historical scanning | gpt-4o-mini | Best for bulk scans (high rate limits) |
| Historical scanning | claude-haiku-4-5 | Cheap Anthropic option for scans |
| Embeddings | all-MiniLM-L6-v2 | Local, free, runs on CPU in Brain pod |
Configuration via UI
You can also configure LLM settings in the Align UI:
- Go to Settings → LLM Settings
- Select provider
- Enter credentials
- Save
UI-configured settings are stored encrypted in the database and take precedence over Helm values.
Troubleshooting
Connection refused
Ensure the LLM server is accessible from the Brain pod:
kubectl exec -it deploy/align-brain -n align -- \
curl http://ollama:11434/v1/models
Slow responses
- Use GPU acceleration (vLLM recommended for production)
- Reduce model size (8B instead of 70B)
- Increase Brain pod resources
Model not found
# For Ollama, pull the model first
kubectl exec -it deploy/ollama -n align -- ollama pull llama3:70b
JSON mode issues
Some local models don't support JSON mode reliably. Consider:
- Using models fine-tuned for structured output
- Falling back to OpenAI/Anthropic for critical tasks
Security
- API keys are encrypted at rest in the database (AES-256 via
ALIGN_MASTER_ENCRYPTION_KEY) - Self-hosted models (GPU or CPU) keep all data in your cluster - no inference data leaves your infrastructure
- Align's proprietary model weights are distributed via authenticated container registry pulls, gated by your license JWT
- Model weights run entirely locally after pull - no phone-home or telemetry during inference
- Use Kubernetes NetworkPolicies to restrict LLM server access to Brain pods only
- GPU nodes should use dedicated taints (
nvidia.com/gpu) to prevent non-GPU workloads from scheduling - For air-gapped environments, mirror the model server image to your internal registry