LLM Configuration

Configure AI providers for decision synthesis in self-hosted Align.

Overview

Align uses LLMs for:

Decision synthesis - Extracting structured decisions from conversations
Context understanding - Understanding the surrounding discussion
Embeddings - Semantic search across decisions

Provider Options

Provider	Pros	Cons
OpenAI	Best quality, easy setup	Data leaves your infra
Anthropic	High quality, safety focus	Data leaves your infra
GPU Inference	Full sovereignty, flat cost, Align models	GPU hardware required
CPU Inference	Sovereignty, no GPU needed	Slower, smaller models only

Option 1: OpenAI

Setup

Get an API key from platform.openai.com
Create the secret:

kubectl create secret generic align-llm \
  --namespace align \
  --from-literal=openai-api-key="sk-..."

Configure in Helm values:

secrets:
  llm:
    openaiApiKey: ""  # Pulled from secret

Models Used

GPT-4o - Decision synthesis (`gpt-4o-mini` is the default for historical scans)
text-embedding-3-small - Embeddings (or local)

Option 2: Anthropic

Setup

Get an API key from console.anthropic.com
Create the secret:

kubectl create secret generic align-llm \
  --namespace align \
  --from-literal=anthropic-api-key="sk-ant-..."

Configure in Helm values:

secrets:
  llm:
    anthropicApiKey: ""  # Pulled from secret
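
Options 1 and 2 both create a secret named align-llm, and `kubectl create` fails if a secret with that name already exists. If you want both provider keys available, one approach is to write them in a single idempotent step. This is a minimal sketch, assuming the chart reads both keys from the same secret; the function name `upsert_llm_secret` is illustrative, not part of Align.

```shell
# Create or update the align-llm secret with both provider keys.
# --dry-run=client renders the manifest locally; `kubectl apply` then
# creates the secret, or updates it in place if it already exists.
upsert_llm_secret() {
  local openai_key="$1" anthropic_key="$2"
  kubectl create secret generic align-llm \
    --namespace align \
    --from-literal=openai-api-key="${openai_key}" \
    --from-literal=anthropic-api-key="${anthropic_key}" \
    --dry-run=client -o yaml | kubectl apply --namespace align -f -
}

# Example (keys read from the environment rather than typed inline):
# upsert_llm_secret "${OPENAI_API_KEY}" "${ANTHROPIC_API_KEY}"
```

Re-running the function with a rotated key updates the existing secret instead of failing.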

Models Used

Claude Sonnet 4 (`claude-sonnet-4-20250514` by default, configurable via `ANTHROPIC_MODEL`) - Decision synthesis

Option 3: GPU Inference (Recommended for Enterprise)

Run Align's proprietary decision models or open-source LLMs on GPU nodes in your cluster. This provides the best combination of quality, speed, cost, and data sovereignty.

How It Works

Align includes a built-in vLLM deployment that runs on GPU nodes:

Brain Service ──► vLLM Server (GPU node) ──► Llama 8B / Align Decision Model
                       │
                  OpenAI-compatible API
                  (no code changes needed)

When LOCAL_LLM_SERVER_URL is set, the Brain service routes inference to the local vLLM server and falls back to cloud APIs if the local server is unavailable. Which operations go to the local server depends on LOCAL_LLM_FOR_SCANS_ONLY (see Routing Behavior).
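
Because vLLM speaks the OpenAI wire format, the local server can be exercised with the same request shape a cloud client would send. This is a minimal sketch, assuming the base URL you pass does not already end in `/v1` (the exact form of LOCAL_LLM_SERVER_URL is not specified here) and that the model name matches what the server is serving; `llm_chat_once` is an illustrative name.

```shell
# Send one tiny OpenAI-style chat completion to an OpenAI-compatible server.
# Usage: llm_chat_once <base-url> <model>
llm_chat_once() {
  local base_url="${1%/}" model="$2"   # strip one trailing slash, if any
  curl -fsS --max-time 60 "${base_url}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"${model}\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"max_tokens\":1}"
}

# Example, run from a pod that can reach the server:
# llm_chat_once "${LOCAL_LLM_SERVER_URL}" "meta-llama/Llama-3.1-8B-Instruct"
```

A JSON response containing a `choices` array indicates the server is loaded and answering; a non-zero exit means the request failed or timed out.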

Built-in GPU Deployment

Enable GPU inference in your Helm values:

gpu:
  # NVIDIA device plugin (detects GPUs on nodes)
  devicePlugin:
    enabled: true

  # vLLM inference server
  llmServer:
    enabled: true
    image:
      repository: vllm/vllm-openai
      tag: "v0.8.5"
    port: 8001
    resources:
      requests:
        memory: "14Gi"
        cpu: "2000m"
        nvidia.com/gpu: "1"
      limits:
        memory: "16Gi"
        cpu: "4000m"
        nvidia.com/gpu: "1"

The Helm chart automatically:

Deploys the NVIDIA device plugin DaemonSet on GPU nodes
Deploys the vLLM server with GPU resource requests
Injects LOCAL_LLM_SERVER_URL into Brain pods
Sets up health checks with generous timeouts (model loading takes 60-120s)
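
Since model loading can take 60-120 seconds, scripts that run right after a deploy may need to wait for the server before sending requests. The sketch below polls vLLM's `/health` endpoint; the function name, timeout, and polling interval are illustrative choices, not chart settings.

```shell
# Poll <base-url>/health until it answers successfully or the timeout expires.
# Usage: wait_for_llm <base-url> [timeout-seconds] [interval-seconds]
wait_for_llm() {
  local base_url="${1%/}" timeout="${2:-180}" interval="${3:-5}"
  local waited=0
  until curl -fs --max-time 5 -o /dev/null "${base_url}/health"; do
    if [ "${waited}" -ge "${timeout}" ]; then
      echo "LLM server not ready after ${timeout}s" >&2
      return 1
    fi
    sleep "${interval}"
    waited=$((waited + interval))
  done
  echo "LLM server ready after ~${waited}s"
}

# Example:
# wait_for_llm "${LOCAL_LLM_SERVER_URL}" 180 5
```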

GPU Node Requirements

GPU	VRAM	Models	Instance (AWS)	Cost
NVIDIA T4	16 GB	Llama 8B, Mistral 7B	g4dn.xlarge	~$380/mo
NVIDIA A10G	24 GB	Llama 13B, Mixtral 8x7B	g5.xlarge	~$660/mo
NVIDIA A100	40 GB	Llama 70B (quantized)	p4d.24xlarge (8 GPUs per instance)	~$7,000/mo

For most deployments, a single NVIDIA T4 running Llama 8B provides excellent quality for decision synthesis at a flat monthly cost.

GPU Node Setup (Kubernetes)

Your GPU nodes need:

NVIDIA drivers installed (use GPU-optimized AMIs like al2023-nvidia@latest on EKS)
Node label: node-type: gpu
Taint: nvidia.com/gpu=true:NoSchedule (prevents non-GPU pods from scheduling on these nodes)

Example for AWS EKS with Karpenter:

# Karpenter NodePool for GPU workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    metadata:
      labels:
        node-type: gpu
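    spec:
      # Required by karpenter.sh/v1. Assumption: an EC2NodeClass named "gpu" already exists; use your own name.
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu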
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge", "g4dn.2xlarge"]
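
On clusters that do not use Karpenter, the same label and taint can be applied to an existing GPU node with kubectl. A small sketch; `mark_gpu_node` is an illustrative name, and the node name is whatever `kubectl get nodes` reports.

```shell
# Apply the label and taint that the GPU workloads expect to one node.
# --overwrite makes both commands safe to re-run.
mark_gpu_node() {
  local node="$1"
  kubectl label nodes "${node}" node-type=gpu --overwrite &&
    kubectl taint nodes "${node}" nvidia.com/gpu=true:NoSchedule --overwrite
}

# Example:
# mark_gpu_node ip-10-0-1-23.ec2.internal
```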

Align Decision Models (Enterprise License)

Enterprise customers can run Align's proprietary fine-tuned models that are trained on decision patterns across tenants:

gpu:
  llmServer:
    enabled: true
    image:
      # Align's model server (includes fine-tuned weights)
      repository: registry.align.tech/align/llm-server
      tag: "v1.0.0"

Model weights are distributed via the same container registry used for all Align images. Your license JWT gates registry access - the same credentials that pull Brain/Gateway images also pull the model server. Models run entirely in your infrastructure with no phone-home.

Environment Variables

Variable	Default	Description
`LOCAL_LLM_SERVER_URL`	(none)	URL of local vLLM server. Auto-set by Helm when `gpu.llmServer.enabled`
`LOCAL_LLM_MODEL`	`meta-llama/Llama-3.1-8B-Instruct`	Model name served by vLLM
`LOCAL_LLM_FOR_SCANS_ONLY`	`true`	When `true`, only Discover scans use local GPU; synthesis stays on cloud API for quality. Set `false` to route all operations to local GPU

Routing Behavior

When LOCAL_LLM_SERVER_URL is set, routing depends on LOCAL_LLM_FOR_SCANS_ONLY:

Scan-only mode (LOCAL_LLM_FOR_SCANS_ONLY=true, default):

Decision synthesis (relationship detection, analysis) - routes to cloud API (Claude Sonnet/GPT-4o) for best quality
High-throughput operations (Discover scans, fast analysis) - route to the local GPU to save API costs. These are identified by internal force flags, which cover Discover historical scans and other bulk operations.
Fallback - if the local server is unreachable or returns errors, operations fall back to cloud APIs

This is the recommended mode: cloud APIs provide the best quality for customer-facing synthesis, while the local GPU handles high-volume operations at flat cost.

Full local mode (LOCAL_LLM_FOR_SCANS_ONLY=false):

All normal requests route to the local GPU server
Force-cloud operations bypass local and use cloud APIs
Fallback - if the local server is unreachable or returns errors, requests fall back to cloud APIs

Use full local mode for maximum data sovereignty or to minimize cloud API costs. Note that force-cloud operations and fallback requests still reach cloud APIs in this mode.
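
The routing rules in this section can be summarized as a small decision function. This is a sketch of the documented behavior, not Align's implementation: the operation kinds (`scan`, `synthesis`, `force-cloud`) and the function name are hypothetical labels, and the fallback to cloud APIs on local errors is not modeled.

```shell
# Print "local" or "cloud" for an operation, following the documented rules.
# Usage: route_request <scan|synthesis|force-cloud> <server-url> [scans-only]
route_request() {
  local kind="$1" server_url="$2" scans_only="${3:-true}"
  # No local server configured: everything uses cloud APIs.
  if [ -z "${server_url}" ]; then echo "cloud"; return; fi
  # Force-cloud operations always bypass the local server.
  if [ "${kind}" = "force-cloud" ]; then echo "cloud"; return; fi
  if [ "${scans_only}" = "true" ]; then
    # Scan-only mode: only high-throughput scans run locally.
    if [ "${kind}" = "scan" ]; then echo "local"; else echo "cloud"; fi
  else
    # Full local mode: all normal requests run locally.
    echo "local"
  fi
}

# Example:
# route_request synthesis "${LOCAL_LLM_SERVER_URL:-}" "${LOCAL_LLM_FOR_SCANS_ONLY:-true}"
```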

Option 4: Self-Hosted Models (CPU)

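For deployments without GPU hardware, use CPU-based inference servers. The same `custom` provider configuration also works with any self-managed OpenAI-compatible server, including GPU-backed ones that you run outside the built-in deployment.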

Supported Servers

Any server implementing the OpenAI API format:

Server	Use Case	Setup
Ollama	Easy local deployment	`ollama serve`
vLLM	Production GPU inference	Docker/K8s
LocalAI	CPU-friendly	Docker
llama.cpp	GGUF models on CPU	Binary/Docker

Quick Start with Ollama

Deploy Ollama in your cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: align
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
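          # CPU-only clusters: leave this commented out, or the pod will stay Pending.
          # Uncomment to run Ollama on a GPU node.
          # resources:
          #   limits:
          #     nvidia.com/gpu: 1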
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: align
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
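
The Deployment above mounts a PersistentVolumeClaim named `ollama-models`, which must exist before the pod can start. A minimal sketch of creating it follows; the 100Gi default size and the reliance on the cluster's default StorageClass are assumptions to adjust for your models and storage, and `create_ollama_pvc` is an illustrative name.

```shell
# Create the PVC that the Ollama Deployment mounts at /root/.ollama.
# Usage: create_ollama_pvc [size]
create_ollama_pvc() {
  local size="${1:-100Gi}"  # Assumption: enough room for the models you pull
  kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: align
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
EOF
}

# Example:
# create_ollama_pvc 100Gi
```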

Pull a model:

kubectl exec -it deploy/ollama -n align -- ollama pull llama3:70b

Configure Align:

# values.yaml
secrets:
  llm:
    custom:
      baseUrl: "http://ollama.align.svc.cluster.local:11434/v1"
      model: "llama3:70b"
      apiKey: ""  # Not needed for Ollama
    useLocalEmbeddings: true

vLLM for Production

For high-throughput production use:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: align
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
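          args:
            - "--model"
            - "meta-llama/Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size"
            - "4"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 4
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
---
# Service so Brain can reach vLLM at the baseUrl configured below
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: align
spec:
  selector:
    app: vllm
  ports:
    - port: 8000

Configure Align:

secrets:
  llm:
    custom:
      baseUrl: "http://vllm.align.svc.cluster.local:8000/v1"
      model: "meta-llama/Llama-3.1-70B-Instruct"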

Embeddings

Local Embeddings (Default)

Align uses local sentence-transformers by default:

Model: all-MiniLM-L6-v2 (384 dimensions)
Cost: Free (runs locally in Brain pod)
Privacy: Data never leaves your cluster

Enable with:

secrets:
  llm:
    useLocalEmbeddings: true

OpenAI Embeddings

To use OpenAI embeddings instead:

secrets:
  llm:
    useLocalEmbeddings: false
    openaiApiKey: "sk-..."

Discover Scan Tuning

The Discover feature scans connected tools (Slack, GitHub, Jira, Teams) for historical decisions. Align uses platform-specific batch sizes optimized for each connector type - no manual tuning is needed for most deployments.

How It Works

Batch sizes are automatically set per platform (Slack: 12, GitHub: 5, Jira: 8) to balance detection quality with speed
Confidence thresholds are platform-specific - messaging platforms (Slack, Teams) use lower thresholds to catch implicit decisions in threads
LLM token budgets scale dynamically with batch size - smaller batches get less output budget, reducing latency and cost
Event-driven completion - with Redis, scan progress updates are instant via pub/sub. Local dev and other non-Kubernetes setups without Redis fall back to a 15-second heartbeat for completion detection. Helm-based Kubernetes deployments require Redis.
GPU inference note - when a local GPU server is configured, where Discover scans run depends on LOCAL_LLM_FOR_SCANS_ONLY (see Routing Behavior). Operations that carry a force flag (force_openai / force_anthropic) bypass the local GPU server and use cloud APIs directly, which suits bulk work that needs more throughput and higher rate limits than a single GPU can provide. In full local mode, normal decision analysis (synthesis, relationship detection) uses the local GPU

Brain Service Configuration

Variable	Default	Description
`HISTORICAL_ANALYSIS_MODEL`	`gpt-4o-mini`	Model used for scanning historical items
`PREFERRED_PROVIDER`	`openai`	LLM provider: `openai`, `anthropic`, or `custom`
`ANTHROPIC_MODEL`	`claude-sonnet-4-20250514`	Model when using Anthropic provider

Recommendations by Provider

Local models (Ollama/vLLM):

brain:
  extraEnv:
    - name: PREFERRED_PROVIDER
      value: "custom"
    - name: HISTORICAL_ANALYSIS_MODEL
      value: "llama3:70b"  # Or your preferred model

Cloud APIs (OpenAI):

brain:
  extraEnv:
    - name: PREFERRED_PROVIDER
      value: "openai"
    # Uses gpt-4o-mini by default - best cost/quality ratio

Cloud APIs (Anthropic):

brain:
  extraEnv:
    - name: PREFERRED_PROVIDER
      value: "anthropic"
    - name: ANTHROPIC_MODEL
      value: "claude-sonnet-4-20250514"

Gateway Scan Performance

For tuning scan parallelism and worker concurrency, see the Discover Tuning section in the Configuration Reference.

IMPORT_BATCH_SIZE is now a global override

Batch sizes are now platform-specific and set automatically. If IMPORT_BATCH_SIZE is set, it overrides the defaults for every platform, so in most cases you can safely remove it.
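
The override rule can be sketched as a lookup with one escape hatch. The platform defaults are the values listed under Discover Scan Tuning; the function name and the lowercase platform keys are illustrative, and platforms whose defaults are not documented here (such as Teams) are reported as unknown rather than guessed.

```shell
# Print the effective batch size for a platform.
# IMPORT_BATCH_SIZE, when set and non-empty, overrides every platform default.
effective_batch_size() {
  local platform="$1"
  if [ -n "${IMPORT_BATCH_SIZE:-}" ]; then
    echo "${IMPORT_BATCH_SIZE}"
    return
  fi
  case "${platform}" in
    slack)  echo 12 ;;
    github) echo 5 ;;
    jira)   echo 8 ;;
    *)      echo "no documented default for ${platform}" >&2; return 1 ;;
  esac
}

# Example:
# IMPORT_BATCH_SIZE=20 effective_batch_size github
```

With the variable unset, each platform gets its own default; with it set, every platform gets the same value, which is why removing it is usually the better choice.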

Recommended Models

GPU Inference (vLLM)

Task	Model	VRAM	Notes
Decision synthesis	`meta-llama/Llama-3.1-8B-Instruct`	16 GB	Best for T4 GPU, good quality
Decision synthesis	`mistralai/Mistral-7B-Instruct-v0.3`	14 GB	Strong quality, fits T4
Decision synthesis	`meta-llama/Llama-3.1-70B-Instruct`	40 GB+	Best local quality (needs A100)
Align Decision Model	`align/decision-v1`	16 GB	Fine-tuned for decisions (Enterprise)

CPU Inference (Ollama / llama.cpp)

Task	Model	RAM	Notes
Decision synthesis	`llama3:8b`	8 GB	Good quality, reasonable speed
Decision synthesis	`mistral:7b`	8 GB	Good balance
Decision synthesis	`mixtral:8x7b`	26 GB	Strong quality

Cloud APIs

Task	Model	Notes
Decision synthesis	`claude-sonnet-4`	Highest quality (cloud)
Decision synthesis	`gpt-4o-mini`	Best cost/quality ratio (cloud)
Historical scanning	`gpt-4o-mini`	Best for bulk scans (high rate limits)
Historical scanning	`claude-haiku-4-5`	Cheap Anthropic option for scans
Embeddings	`all-MiniLM-L6-v2`	Local, free, runs on CPU in Brain pod

Configuration via UI

You can also configure LLM settings in the Align UI:

Go to Settings → LLM Settings
Select provider
Enter credentials
Save

UI-configured settings are stored encrypted in the database and take precedence over Helm values.

Troubleshooting

Connection refused

Ensure the LLM server is accessible from the Brain pod:

kubectl exec -it deploy/align-brain -n align -- \
  curl http://ollama:11434/v1/models

Slow responses

Use GPU acceleration (vLLM recommended for production)
Reduce model size (8B instead of 70B)
Increase Brain pod resources

Model not found

# For Ollama, pull the model first
kubectl exec -it deploy/ollama -n align -- ollama pull llama3:70b

JSON mode issues

Some local models don't support JSON mode reliably. Consider:

Using models fine-tuned for structured output
Falling back to OpenAI/Anthropic for critical tasks

Security

API keys and OAuth tokens are encrypted at rest in the database (AES-256 via ALIGN_MASTER_ENCRYPTION_KEY). This key is required: the gateway refuses to start without it rather than storing secrets in plaintext (ALIGN_ALLOW_PLAINTEXT_TOKENS=true is a local-dev-only escape hatch).
Self-hosted models (GPU or CPU) keep all data in your cluster - no inference data leaves your infrastructure
Align's proprietary model weights are distributed via authenticated container registry pulls, gated by your license JWT
Model weights run entirely locally after pull - no phone-home or telemetry during inference
Use Kubernetes NetworkPolicies to restrict LLM server access to Brain pods only
GPU nodes should use dedicated taints (nvidia.com/gpu) to prevent non-GPU workloads from scheduling on them
For air-gapped environments, mirror the model server image to your internal registry
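
For the air-gapped case, mirroring amounts to a pull, a retag, and a push. This is a sketch assuming Docker is available on a machine that can reach both registries and is already logged in to each; the internal registry host in the example is hypothetical, and `mirror_image` is an illustrative name.

```shell
# Copy one image from the Align registry to an internal registry.
# Usage: mirror_image <source-image> <internal-registry-host>
mirror_image() {
  local src="$1" internal_registry="$2"
  # Keep the repository path and tag; swap only the registry host.
  local dst="${internal_registry}/${src#*/}"
  docker pull "${src}" &&
    docker tag "${src}" "${dst}" &&
    docker push "${dst}" &&
    echo "${dst}"
}

# Example:
# mirror_image registry.align.tech/align/llm-server:v1.0.0 registry.internal.example.com
```

The function prints the mirrored image reference, which you can then use for `gpu.llmServer.image.repository` and `tag` in your Helm values.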
