# LLM Configuration

Configure AI providers for decision synthesis in self-hosted Align.
## Overview

Align uses LLMs for:

- **Decision synthesis** - Extracting structured decisions from conversations
- **Context understanding** - Understanding the surrounding discussion
- **Embeddings** - Semantic search across decisions
## Provider Options

| Provider | Pros | Cons |
|---|---|---|
| OpenAI | Best quality, easy setup | Data leaves your infra |
| Anthropic | High quality, safety focus | Data leaves your infra |
| Self-Hosted | Full data sovereignty | More setup, hardware needed |
## Option 1: OpenAI

### Setup

1. Get an API key from platform.openai.com.

2. Create the secret:

   ```bash
   kubectl create secret generic align-llm \
     --namespace align \
     --from-literal=openai-api-key="sk-..."
   ```

3. Configure in Helm values:

   ```yaml
   secrets:
     llm:
       openaiApiKey: "" # Pulled from secret
   ```

### Models Used

- GPT-4 - Decision synthesis
- text-embedding-3-small - Embeddings (or local)
## Option 2: Anthropic

### Setup

1. Get an API key from console.anthropic.com.

2. Create the secret:

   ```bash
   kubectl create secret generic align-llm \
     --namespace align \
     --from-literal=anthropic-api-key="sk-ant-..."
   ```

3. Configure in Helm values:

   ```yaml
   secrets:
     llm:
       anthropicApiKey: "" # Pulled from secret
   ```

### Models Used

- Claude 3 - Decision synthesis
## Option 3: Self-Hosted Models

For complete data sovereignty, run your own LLM server.

### Supported Servers

Any server implementing the OpenAI API format works:
| Server | Use Case | Setup |
|---|---|---|
| Ollama | Easy local deployment | `ollama serve` |
| vLLM | Production GPU inference | Docker/K8s |
| LocalAI | CPU-friendly | Docker |
| LM Studio | Desktop GUI | App |
### Quick Start with Ollama

1. Deploy Ollama in your cluster:

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: ollama
     namespace: align
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: ollama
     template:
       metadata:
         labels:
           app: ollama
       spec:
         containers:
           - name: ollama
             image: ollama/ollama:latest
             ports:
               - containerPort: 11434
             resources:
               limits:
                 nvidia.com/gpu: 1 # Optional: GPU
             volumeMounts:
               - name: models
                 mountPath: /root/.ollama
         volumes:
           - name: models
             persistentVolumeClaim:
               claimName: ollama-models
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: ollama
     namespace: align
   spec:
     selector:
       app: ollama
     ports:
       - port: 11434
   ```

2. Pull a model:

   ```bash
   kubectl exec -it deploy/ollama -n align -- ollama pull llama3:70b
   ```

3. Configure Align:

   ```yaml
   # values.yaml
   secrets:
     llm:
       custom:
         baseUrl: "http://ollama.align.svc.cluster.local:11434/v1"
         model: "llama3:70b"
         apiKey: "" # Not needed for Ollama
       useLocalEmbeddings: true
   ```
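The `baseUrl` above points at Ollama's OpenAI-compatible endpoint, so any OpenAI-format client can talk to it. As a quick sanity check of the wire format, here is a sketch of the chat-completion payload such a client would POST to `<baseUrl>/chat/completions` (the prompt text and helper function are illustrative, not Align's actual internals):

```python
import json

def build_chat_request(model: str, system: str, user: str) -> dict:
    """Build an OpenAI-format /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        # Request structured output; some local models honor this
        # unreliably (see "JSON mode issues" under Troubleshooting).
        "response_format": {"type": "json_object"},
    }

# Hypothetical synthesis prompt, for illustration only.
payload = build_chat_request(
    "llama3:70b",
    "Extract the decision from the conversation as JSON.",
    "We agreed to ship v2 on Friday.",
)
body = json.dumps(payload)  # send as the HTTP request body
```

Any server from the table above that accepts this payload shape will work as a drop-in backend.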
### vLLM for Production

For high-throughput production use:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: align
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3-70b-chat-hf"
            - "--tensor-parallel-size"
            - "4"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 4
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
```

Configure Align:

```yaml
secrets:
  llm:
    custom:
      baseUrl: "http://vllm.align.svc.cluster.local:8000/v1"
      model: "meta-llama/Llama-3-70b-chat-hf"
```
## Embeddings

### Local Embeddings (Default)

Align uses local sentence-transformers by default:

- Model: `all-MiniLM-L6-v2` (384 dimensions)
- Cost: Free (runs locally in the Brain pod)
- Privacy: Data never leaves your cluster

Enable with:

```yaml
secrets:
  llm:
    useLocalEmbeddings: true
```
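Semantic search over decisions reduces to comparing embedding vectors, typically by cosine similarity. A minimal sketch of that math (the vectors below are toy 3-dimensional stand-ins, not real `all-MiniLM-L6-v2` output, which has 384 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embedding vectors.
query = [0.1, 0.9, 0.2]
decision_a = [0.1, 0.8, 0.3]  # similar topic -> high similarity
decision_b = [0.9, 0.1, 0.0]  # unrelated topic -> low similarity

assert cosine_similarity(query, decision_a) > cosine_similarity(query, decision_b)
```

Ranking stored decision vectors by this score against the query vector is what makes the search "semantic" rather than keyword-based.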
### OpenAI Embeddings

To use OpenAI embeddings instead:

```yaml
secrets:
  llm:
    useLocalEmbeddings: false
    openaiApiKey: "sk-..."
```
## Discover Scan Tuning

The Discover feature scans connected tools (Slack, GitHub, Jira, Teams) for historical decisions. Align uses platform-specific batch sizes optimized for each connector type, so no manual tuning is needed for most deployments.
### How It Works

- Batch sizes are set automatically per platform (Slack: 12, GitHub: 5, Jira: 8) to balance detection quality with speed
- Confidence thresholds are platform-specific; messaging platforms (Slack, Teams) use lower thresholds to catch implicit decisions in threads
- LLM token budgets scale with batch size: smaller batches get a smaller output budget, reducing latency and cost
- Completion is event-driven: with Redis, scan progress updates arrive instantly via pub/sub; in local-dev setups without Redis, a 15-second heartbeat provides reliable completion detection (Helm-based Kubernetes deployments require Redis)
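To illustrate how the budget scaling works, here is a sketch using the platform batch sizes listed above (the budget formula and the per-item token constant are hypothetical, not Align's actual numbers):

```python
# Platform-specific batch sizes from the list above.
BATCH_SIZES = {"slack": 12, "github": 5, "jira": 8}

TOKENS_PER_ITEM = 300  # hypothetical output budget per scanned item

def output_token_budget(platform: str) -> int:
    """Scale the LLM output-token budget with the platform's batch size."""
    return BATCH_SIZES[platform] * TOKENS_PER_ITEM

# Smaller batches request fewer output tokens per LLM call,
# so GitHub scans are cheaper per call than Slack scans.
assert output_token_budget("github") < output_token_budget("slack")
```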
### Brain Service Configuration

| Variable | Default | Description |
|---|---|---|
| `HISTORICAL_ANALYSIS_MODEL` | `gpt-4o-mini` | Model used for scanning historical items |
| `PREFERRED_PROVIDER` | `openai` | LLM provider: `openai`, `anthropic`, or `custom` |
| `ANTHROPIC_MODEL` | `claude-sonnet-4-20250514` | Model when using the Anthropic provider |
### Recommendations by Provider

**Local models (Ollama/vLLM):**

```yaml
brain:
  extraEnv:
    - name: PREFERRED_PROVIDER
      value: "custom"
    - name: HISTORICAL_ANALYSIS_MODEL
      value: "llama3:70b" # Or your preferred model
```

**Cloud APIs (OpenAI):**

```yaml
brain:
  extraEnv:
    - name: PREFERRED_PROVIDER
      value: "openai"
  # Uses gpt-4o-mini by default - best cost/quality ratio
```

**Cloud APIs (Anthropic):**

```yaml
brain:
  extraEnv:
    - name: PREFERRED_PROVIDER
      value: "anthropic"
    - name: ANTHROPIC_MODEL
      value: "claude-sonnet-4-20250514"
```
## Gateway Scan Performance

For tuning scan parallelism and worker concurrency, see the Discover Tuning section in the Configuration Reference.

Batch sizes are platform-specific and optimized automatically. If `IMPORT_BATCH_SIZE` is set, it acts as a global override for all platforms. In most cases you can safely remove this variable; if you keep it, it overrides all platform-specific defaults.
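If you do need a uniform batch size across all connectors, the override can be set in Helm values. This sketch assumes the gateway exposes the same `extraEnv` pattern shown for the Brain service above; check your chart's values schema:

```yaml
gateway:
  extraEnv:
    - name: IMPORT_BATCH_SIZE
      value: "10" # Global override: applies to Slack, GitHub, Jira, and Teams alike
```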
## Recommended Models

| Task | Model | Notes |
|---|---|---|
| Decision synthesis | llama3:70b | Best quality for local |
| Decision synthesis | llama3:8b | Faster, lower resource |
| Decision synthesis | mistral:7b | Good balance |
| Decision synthesis | mixtral:8x7b | Strong quality, 26GB VRAM |
| Decision synthesis | gpt-4o-mini | Best cost/quality (cloud) |
| Decision synthesis | claude-sonnet-4 | High quality (cloud) |
| Historical scanning | llama3:8b | Fast local scanning |
| Historical scanning | gpt-4o-mini | Best cloud option (no TPM issues) |
| Embeddings | all-MiniLM-L6-v2 | Local, fast, good quality |
## Configuration via UI

You can also configure LLM settings in the Align UI:

1. Go to **Settings → LLM Settings**
2. Select a provider
3. Enter credentials
4. Save

UI-configured settings are stored encrypted in the database and take precedence over Helm values.
## Troubleshooting

### Connection refused

Ensure the LLM server is reachable from the Brain pod:

```bash
kubectl exec -it deploy/align-brain -n align -- \
  curl http://ollama:11434/v1/models
```

### Slow responses

- Use GPU acceleration (vLLM is recommended for production)
- Use a smaller model (8B instead of 70B)
- Increase Brain pod resources

### Model not found

```bash
# For Ollama, pull the model first
kubectl exec -it deploy/ollama -n align -- ollama pull llama3:70b
```

### JSON mode issues

Some local models don't support JSON mode reliably. Consider:

- Using models fine-tuned for structured output
- Falling back to OpenAI/Anthropic for critical tasks
## Security

- API keys are encrypted at rest
- Self-hosted models keep all data in your cluster
- Use Kubernetes NetworkPolicies to restrict access to the LLM server
- Consider mTLS for production deployments
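As a starting point for the NetworkPolicy suggestion above, the sketch below admits traffic to Ollama only from the Brain pod. The `app: align-brain` label is an assumption; match it to the labels your actual Brain deployment uses:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-ingress
  namespace: align
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: align-brain # Assumed label; verify against your deployment
      ports:
        - protocol: TCP
          port: 11434
```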