Observability & Monitoring
Align uses OpenTelemetry (OTel) as a vendor-neutral observability layer. You can route traces, metrics, and logs to any backend your organization uses - Datadog, Grafana, Prometheus, New Relic, Splunk, or any OTel-compatible system.
All telemetry stays within your infrastructure. Align does not collect, transmit, or have access to any of your observability data. There are no hardcoded external endpoints, no usage reporting, and no analytics beacons. Self-hosted Align is fully air-gap compatible. See Data Privacy & Sovereignty below for details.
Architecture
+----------+    +----------+    +----------+
| Gateway  |    |  Brain   |    |Connectors|
| (Node.js)|    | (Python) |    | (Node.js)|
+----+-----+    +----+-----+    +----+-----+
     |               |               |
     +-------+-------+-------+-------+
             | OTLP (HTTP)   |
             v               v
       +-----+---------------+-----+
       |      OTel Collector       |
       | (routes to your backend)  |
       +--+------+------+------+---+
          |      |      |      |
          v      v      v      v
      Datadog Grafana  Prom  CloudWatch
               Cloud         (or any OTLP
                              backend)
What's collected:
- Traces - Distributed request tracing across gateway, brain, and connectors (auto-instrumented HTTP, database, Redis)
- Metrics - API latency, error rates, request counts, plus business metrics (decisions captured, conflicts detected, etc.)
- Logs - Structured JSON logs enriched with trace context (traceId, spanId) for log-to-trace correlation
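To make the log-to-trace correlation concrete, here is a minimal Python sketch of a JSON log line that carries traceId and spanId. The field names come from the list above; the formatter class, the other fields, and the example IDs are illustrative assumptions, not Align's actual logging code.

```python
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log line, with trace context."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # traceId/spanId are expected as extra attributes on the record;
            # they are None when no span is active.
            "traceId": getattr(record, "traceId", None),
            "spanId": getattr(record, "spanId", None),
        }
        return json.dumps(payload)


# Illustrative record; the IDs are placeholders in the W3C hex format.
example_record = logging.makeLogRecord({
    "msg": "decision captured",
    "levelname": "INFO",
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "spanId": "00f067aa0ba902b7",
})
example_line = JsonTraceFormatter().format(example_record)
print(example_line)
```

A backend that indexes the traceId field can then link each log line to the trace that produced it.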
Quick Start
1. Enable Observability
# In your values override file
observability:
  enabled: true
This deploys the OTel Collector and configures all services to send telemetry via OTLP.
2. Configure Your Backend
Choose one or more backends. The OTel Collector can send to multiple destinations simultaneously.
Grafana Cloud
observability:
  enabled: true
  grafanaCloud:
    enabled: true
    otlpEndpoint: "https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"
    # Create a secret with your Grafana Cloud API token:
    #   kubectl create secret generic align-grafana-cloud \
    #     --from-literal=token="your-grafana-cloud-api-token"
    authToken: "" # Or reference a secret (see below)
Grafana Cloud's free tier includes 10,000 metrics series, 50 GB of logs, and 50 GB of traces per month, which is more than enough for most deployments.
Datadog
observability:
  enabled: true
  datadog:
    enabled: true
    site: "datadoghq.com" # or datadoghq.eu, us3.datadoghq.com, etc.
    # Create a secret with your Datadog API key:
    #   kubectl create secret generic align-datadog \
    #     --from-literal=api-key="your-datadog-api-key"
    apiKey: ""
Prometheus (Self-Hosted)
observability:
  enabled: true
  prometheus:
    enabled: true
    # The collector exposes a /metrics endpoint for Prometheus to scrape
With Prometheus enabled, the OTel Collector exposes a /metrics endpoint. Configure your Prometheus instance to scrape it:
# prometheus.yml
scrape_configs:
  - job_name: 'align-otel-collector'
    static_configs:
      - targets: ['align-otel-collector:8889']
Custom OTLP Backend
For any OTLP-compatible backend (New Relic, Honeycomb, Lightstep, Jaeger, etc.), use extraEnv on the gateway to set the standard OTel environment variables:
gateway:
  extraEnv:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "https://your-otlp-endpoint:4318"
    - name: OTEL_EXPORTER_OTLP_HEADERS
      value: "api-key=your-api-key"
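OTEL_EXPORTER_OTLP_HEADERS takes a comma-separated list of key=value pairs, so several headers can go in one variable. The Python sketch below shows how such a value breaks down into individual headers. The function name is hypothetical and the parsing is simplified for illustration; the OTel SDKs do this for you.

```python
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Split an OTEL_EXPORTER_OTLP_HEADERS value into a header dict (sketch)."""
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed header entry: {pair!r}")
        # Values may be percent-encoded; decode them before use.
        headers[key.strip()] = unquote(value.strip())
    return headers


print(parse_otlp_headers("api-key=your-api-key"))
```

If your backend needs more than one header, separate the pairs with commas in the same value string.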
3. Verify
Once deployed, check that the OTel Collector is running:
kubectl get pods -n align | grep otel-collector
kubectl logs -n align deploy/align-otel-collector
What Gets Instrumented
Auto-Instrumented (zero code changes)
| Signal | What | Details |
|---|---|---|
| Traces | HTTP requests | Every inbound/outbound HTTP call across all services |
| Traces | Database queries | PostgreSQL query duration, statement type |
| Traces | Redis operations | Cache hits/misses, pub/sub |
| Metrics | Request latency | p50, p95, p99 by endpoint |
| Metrics | Error rates | 4xx/5xx by service and endpoint |
| Logs | Structured JSON | All service logs with trace context (traceId, spanId) |
Business Metrics (Align-specific)
These custom metrics are exported alongside infrastructure metrics:
| Metric | Type | Description |
|---|---|---|
| align.decisions.captured | Counter | Decisions captured (by platform, method) |
| align.conflicts.detected | Counter | Conflicts detected (by severity) |
| align.duplicates.detected | Counter | Duplicates detected (by similarity range) |
| align.webhooks.processed | Counter | Webhooks processed (by platform, success/failure) |
| align.searches.executed | Counter | Searches executed (by type: keyword/semantic/hybrid) |
| align.extraction.duration | Histogram | AI extraction duration in ms (by confidence range) |
Configuration Reference
Full Values
observability:
  # Master switch - enables OTel Collector and SDK instrumentation
  enabled: false
  grafanaCloud:
    enabled: false
    otlpEndpoint: "" # e.g., https://otlp-gateway-prod-eu-west-2.grafana.net/otlp
    authToken: "" # Grafana Cloud API token (base64-encoded instanceId:token)
  datadog:
    enabled: false
    apiKey: "" # Datadog API key
    site: "datadoghq.com" # Datadog site
  prometheus:
    enabled: false # Expose /metrics endpoint for Prometheus scraping
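The authToken comment above describes the value as base64-encoded instanceId:token. A minimal Python sketch of producing that string follows. The function name and the sample inputs are placeholders, and you should confirm the exact credential format against your Grafana Cloud stack before relying on it.

```python
import base64


def grafana_cloud_auth_token(instance_id: str, api_token: str) -> str:
    """Return base64("<instanceId>:<token>") as an ASCII string (sketch)."""
    raw = f"{instance_id}:{api_token}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder inputs; substitute your own instance ID and API token.
print(grafana_cloud_auth_token("123456", "your-grafana-cloud-api-token"))
```

Prefer storing the result in a Kubernetes secret over committing it to a values file.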
Environment Variables
These are automatically set by the Helm chart when observability.enabled: true:
| Variable | Default | Description |
|---|---|---|
| OTEL_ENABLED | false | Master switch for OTel SDK |
| OTEL_SERVICE_NAME | align-<service> | Service name in traces/metrics |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://align-otel-collector:4318 | OTLP endpoint |
| OTEL_EXPORTER_OTLP_PROTOCOL | http/protobuf | OTLP protocol |
| OTEL_RESOURCE_ATTRIBUTES | deployment.environment=<env> | Resource attributes |
You can override any of these per-service using extraEnv.
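As a rough illustration of how these variables combine, the Python sketch below resolves them against the defaults from the table and derives the per-signal OTLP/HTTP URLs. The function name is hypothetical. The /v1/traces, /v1/metrics, and /v1/logs suffixes follow the standard OTLP/HTTP convention for a base endpoint; the SDKs append them automatically.

```python
from typing import Mapping


def resolve_otlp_settings(env: Mapping[str, str], service: str) -> dict:
    """Resolve OTel settings from an environment mapping (illustrative sketch)."""
    endpoint = env.get(
        "OTEL_EXPORTER_OTLP_ENDPOINT", "http://align-otel-collector:4318"
    ).rstrip("/")
    return {
        "enabled": env.get("OTEL_ENABLED", "false").lower() == "true",
        "service_name": env.get("OTEL_SERVICE_NAME", f"align-{service}"),
        "protocol": env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf"),
        # With OTLP over HTTP, each signal has its own path under the base endpoint.
        "traces_url": f"{endpoint}/v1/traces",
        "metrics_url": f"{endpoint}/v1/metrics",
        "logs_url": f"{endpoint}/v1/logs",
    }


print(resolve_otlp_settings({"OTEL_ENABLED": "true"}, "gateway"))
```

In a running pod you would pass os.environ as the mapping; a plain dict is used here so the sketch runs anywhere.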
Log Levels
All services support the LOG_LEVEL environment variable:
gateway:
  extraEnv:
    - name: LOG_LEVEL
      value: "debug" # trace, debug, info, warn, error, fatal
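The comment above lists the accepted level names. Assuming they are ordered from least to most severe and that a service drops messages below the configured level (the usual convention, though this page does not spell it out), the filtering can be sketched in Python as:

```python
# Level names from the LOG_LEVEL comment; the ordering is an assumption.
LEVELS = ("trace", "debug", "info", "warn", "error", "fatal")


def is_enabled(configured: str, message_level: str) -> bool:
    """Return True if a message at message_level passes the configured level."""
    return LEVELS.index(message_level.lower()) >= LEVELS.index(configured.lower())


print(is_enabled("debug", "info"))
print(is_enabled("warn", "info"))
```

Under that assumption, setting LOG_LEVEL to debug emits everything except trace messages.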
Local Development
For local development, use Docker Compose with the observability profile:
# Start with OTel Collector + Jaeger UI
docker compose --profile observability up
# View traces in Jaeger
open http://localhost:16686
This starts an OTel Collector (ports 4317/4318) and Jaeger for local trace visualization without any cloud backend.
Business Telemetry vs Infrastructure Observability
Align maintains two complementary telemetry systems:
| | Business Telemetry | Infrastructure Observability |
|---|---|---|
| Purpose | Product analytics, usage tracking | Service health, debugging |
| Storage | PostgreSQL (telemetry_events table) | Your OTel backend (Grafana, Datadog, etc.) |
| Data | Decisions captured, conflicts, engagement | Request latency, error rates, traces |
| Aggregation | Hourly/daily rollups via CronJob | Backend-native (e.g., Prometheus rules) |
| Config | telemetry.* values | observability.* values |
Both systems run independently. Business telemetry is always stored in PostgreSQL regardless of OTel configuration.
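To illustrate what an hourly rollup means here, the Python sketch below groups events into hour buckets and counts them per event type. The event shape (timestamp, event type) and the event names are hypothetical; the real telemetry_events schema and the CronJob logic are not shown in this document.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable


def hourly_rollup(events: Iterable[tuple[datetime, str]]) -> dict:
    """Count events per (hour bucket, event type). Illustrative only."""
    counts: Counter = Counter()
    for timestamp, event_type in events:
        # Truncate the timestamp to the start of its hour.
        bucket = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(bucket, event_type)] += 1
    return dict(counts)


# Hypothetical sample events.
base = datetime(2024, 1, 1, 10, 15, tzinfo=timezone.utc)
sample = [
    (base, "decision_captured"),
    (base.replace(minute=59), "decision_captured"),
    (base.replace(hour=11), "conflict_detected"),
]
for (bucket, event_type), count in sorted(hourly_rollup(sample).items()):
    print(bucket.isoformat(), event_type, count)
```

A daily rollup would work the same way with the hour also truncated to zero.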
Disabling Observability
To run without any OTel overhead (default):
observability:
  enabled: false
When disabled:
- No OTel Collector is deployed
- OTel SDKs are disabled in all services (near-zero overhead)
- Structured logging still works (logs go to stdout as structured JSON)
- Business telemetry still records to PostgreSQL
- Health check endpoints (/health, /ready) still function
Data Privacy & Sovereignty
Align's observability architecture is designed with a strict no-phone-home policy. This section explains exactly what happens with your data.
What stays local (always)
| Data | Where it lives | External access |
|---|---|---|
| Business telemetry events | Your PostgreSQL database | None |
| Hourly/daily metric rollups | Your PostgreSQL database | None |
| Structured application logs | stdout (your log aggregator) | None |
| License validation | Local JWT signature check | None |
| Health check responses | In-cluster only | None |
What you control (opt-in only)
When you enable observability.enabled: true, the OTel Collector runs inside your cluster and sends data only to backends you configure:
- Grafana Cloud - your account, your endpoint, your API token
- Datadog - your account, your API key
- Prometheus - your instance scrapes a local /metrics endpoint
- Custom OTLP - any endpoint you specify
If you don't configure a backend, telemetry goes nowhere. The collector receives data from Align services but has no default export destination.
What Align never does
- No usage analytics sent to Align servers
- No crash/error reporting to external services
- No update or version checks phoning home
- No telemetry data forwarded to Align
- No third-party analytics SDKs (Segment, Mixpanel, PostHog, etc.)
- No hardcoded external endpoints in any service or collector config
Air-gapped deployments
Align works fully offline. For air-gapped environments:
- Mirror container images to your internal registry
- Set global.imageRegistry to your internal registry
- Optionally enable observability with an internal Prometheus or Grafana instance
- No internet connectivity required at any point after image pull
Compliance notes
- GDPR/CCPA: No personal data leaves your infrastructure via telemetry. Business metrics (decision counts, conflict counts) are aggregate counters with no PII.
- SOC 2: Observability is fully opt-in with audit-friendly configuration (all settings in Helm values, no hidden defaults).
- HIPAA: No PHI is included in telemetry signals. Trace data includes HTTP paths and status codes but not request/response bodies.
Troubleshooting
No traces appearing
- Check the OTel Collector is running: kubectl get pods | grep otel
- Check collector logs: kubectl logs deploy/align-otel-collector
- Verify OTEL_ENABLED=true in service pods: kubectl exec deploy/align-gateway -- env | grep OTEL
- Ensure your backend credentials are correct (check collector logs for auth errors)
High cardinality warnings
If your backend warns about high cardinality, reduce the business-metric data it receives by sampling business events:
gateway:
  extraEnv:
    - name: TELEMETRY_SAMPLING_RATE
      value: "0.1" # Sample 10% of business events
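A sampling rate of 0.1 means roughly one business event in ten is recorded. The Python sketch below shows one common way to make that decision, an independent random draw per event. How Align applies TELEMETRY_SAMPLING_RATE internally is not documented here, so treat the function as an assumption-laden illustration.

```python
import random


def should_record(sampling_rate: float, rng: random.Random) -> bool:
    """Decide whether to record one event at the given rate (sketch)."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    # rng.random() is uniform on [0.0, 1.0), so rate 0.0 never records
    # and rate 1.0 always records.
    return rng.random() < sampling_rate


rng = random.Random(42)  # seeded only to make the example repeatable
recorded = sum(should_record(0.1, rng) for _ in range(10_000))
print(f"recorded {recorded} of 10000 events")
```

Sampled counters under-report by the sampling factor, so scale dashboard values accordingly when comparing against unsampled data.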
Collector resource usage
The default OTel Collector is lightweight (128Mi memory). For high-traffic deployments:
# Override collector resources in your values
# The collector template supports standard K8s resource specs