Observability & Monitoring

Align uses OpenTelemetry (OTel) as a vendor-neutral observability layer. You can route traces, metrics, and logs to any backend your organization uses - Datadog, Grafana, Prometheus, New Relic, Splunk, or any OTel-compatible system.

No phone-home

All telemetry stays within your infrastructure. Align does not collect, transmit, or have access to any of your observability data. There are no hardcoded external endpoints, no usage reporting, and no analytics beacons. Self-hosted Align is fully air-gap compatible. See Data Privacy below for details.

Architecture

+----------+    +----------+    +----------+
| Gateway  |    |  Brain   |    |Connectors|
| (Node.js)|    | (Python) |    | (Node.js)|
+----+-----+    +----+-----+    +----+-----+
     |               |               |
     +-------+-------+-------+-------+
             |  OTLP (HTTP)  |
             v               v
    +--------+---------------+--------+
    |         OTel Collector          |
    |    (routes to your backend)     |
    +--+---------+--------+--------+--+
       |         |        |        |
       v         v        v        v
    Datadog   Grafana   Prom   CloudWatch
               Cloud          (or any OTLP
                                backend)

What's collected:

  • Traces - Distributed request tracing across gateway, brain, and connectors (auto-instrumented HTTP, database, Redis)
  • Metrics - API latency, error rates, request counts, plus business metrics (decisions captured, conflicts detected, etc.)
  • Logs - Structured JSON logs enriched with trace context (traceId, spanId) for log-to-trace correlation
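
To make the log-to-trace correlation concrete, here is a minimal sketch of a JSON log line that carries trace context. Only the traceId and spanId field names come from the list above; the formatter class, the remaining fields, and the example IDs are illustrative assumptions, not Align's actual logger.

```python
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Hypothetical formatter: renders each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
            # Trace context is read from the record; None when no span is active.
            "traceId": getattr(record, "traceId", None),
            "spanId": getattr(record, "spanId", None),
        }
        return json.dumps(payload)


def render_example() -> str:
    """Format one record with made-up trace identifiers attached."""
    record = logging.LogRecord(
        name="align.brain", level=logging.INFO, pathname="example.py",
        lineno=0, msg="decision captured", args=(), exc_info=None,
    )
    record.traceId = "4bf92f3577b34da6a3ce929d0e0e4736"  # illustrative value
    record.spanId = "00f067aa0ba902b7"                    # illustrative value
    return JsonTraceFormatter().format(record)
```

Because every line carries the same traceId as the spans of the request that produced it, a backend can jump from a log line to the matching trace by filtering on that one field.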

Quick Start

1. Enable Observability

# In your values override file
observability:
  enabled: true

This deploys the OTel Collector and configures all services to send telemetry via OTLP.

2. Configure Your Backend

Choose one or more backends. The OTel Collector can send to multiple destinations simultaneously.

Grafana Cloud

observability:
  enabled: true
  grafanaCloud:
    enabled: true
    otlpEndpoint: "https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"
    # Create a secret with your Grafana Cloud API token:
    # kubectl create secret generic align-grafana-cloud \
    #   --from-literal=token="your-grafana-cloud-api-token"
    authToken: "" # Or reference a secret (see below)

Free tier

Grafana Cloud's free tier includes 10,000 metrics series, 50GB logs, and 50GB traces per month - more than enough for most deployments.

Datadog

observability:
  enabled: true
  datadog:
    enabled: true
    site: "datadoghq.com" # or datadoghq.eu, us3.datadoghq.com, etc.
    # Create a secret with your Datadog API key:
    # kubectl create secret generic align-datadog \
    #   --from-literal=api-key="your-datadog-api-key"
    apiKey: ""

Prometheus (Self-Hosted)

observability:
  enabled: true
  prometheus:
    enabled: true
    # The collector exposes a /metrics endpoint for Prometheus to scrape

With Prometheus enabled, the OTel Collector exposes a /metrics endpoint. Configure your Prometheus instance to scrape it:

# prometheus.yml
scrape_configs:
  - job_name: 'align-otel-collector'
    static_configs:
      - targets: ['align-otel-collector:8889']

Custom OTLP Backend

For any OTLP-compatible backend (New Relic, Honeycomb, Lightstep, Jaeger, etc.), set the standard OTel environment variables through extraEnv. The example below configures the gateway; other services accept the same per-service override:

gateway:
  extraEnv:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "https://your-otlp-endpoint:4318"
    - name: OTEL_EXPORTER_OTLP_HEADERS
      value: "api-key=your-api-key"
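
If your backend needs more than one header, OTEL_EXPORTER_OTLP_HEADERS takes comma-separated key=value pairs. The sketch below builds that string from a dictionary. The function name and the example header names are hypothetical. Percent-encoding the values follows the OpenTelemetry specification's Baggage-style format, so it is worth confirming that your SDK version decodes them.

```python
from urllib.parse import quote


def format_otlp_headers(headers: dict[str, str]) -> str:
    """Join headers into the key=value,key=value form used by OTEL_EXPORTER_OTLP_HEADERS.

    Values are percent-encoded so that spaces, commas, and '=' inside a value
    cannot be mistaken for separators.
    """
    pairs = []
    for key, value in headers.items():
        pairs.append(f"{key.strip()}={quote(value.strip(), safe='')}")
    return ",".join(pairs)


# Hypothetical header names, for illustration only.
example_value = format_otlp_headers({"api-key": "abc123", "x-team": "platform ops"})
```

The resulting string goes into the value field of the OTEL_EXPORTER_OTLP_HEADERS entry shown above.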

3. Verify

Once deployed, check that the OTel Collector is running:

kubectl get pods -n align | grep otel-collector
kubectl logs -n align deploy/align-otel-collector

What Gets Instrumented

Auto-Instrumented (zero code changes)

| Signal | What | Details |
|---|---|---|
| Traces | HTTP requests | Every inbound/outbound HTTP call across all services |
| Traces | Database queries | PostgreSQL query duration, statement type |
| Traces | Redis operations | Cache hits/misses, pub/sub |
| Metrics | Request latency | p50, p95, p99 by endpoint |
| Metrics | Error rates | 4xx/5xx by service and endpoint |
| Logs | Structured JSON | All service logs with trace context (traceId, spanId) |

Business Metrics (Align-specific)

These custom metrics are exported alongside infrastructure metrics:

| Metric | Type | Description |
|---|---|---|
| align.decisions.captured | Counter | Decisions captured (by platform, method) |
| align.conflicts.detected | Counter | Conflicts detected (by severity) |
| align.duplicates.detected | Counter | Duplicates detected (by similarity range) |
| align.webhooks.processed | Counter | Webhooks processed (by platform, success/failure) |
| align.searches.executed | Counter | Searches executed (by type: keyword/semantic/hybrid) |
| align.extraction.duration | Histogram | AI extraction duration in ms (by confidence range) |
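
If you scrape these metrics with Prometheus, expect the names to look different: Prometheus metric names cannot contain dots, so exporters typically replace them with underscores and append _total to counters. The helper below is a rough sketch of that translation for building queries. The function is hypothetical, and the exact suffix rules depend on your collector version and exporter settings, so check the collector's /metrics output for the real names.

```python
import re


def to_prometheus_name(otel_name: str, is_counter: bool = False) -> str:
    """Approximate the Prometheus-style name for an OTel metric name."""
    # Prometheus names allow only [a-zA-Z0-9_:]; everything else becomes "_".
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Counters are conventionally exposed with a "_total" suffix.
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name


decisions_series = to_prometheus_name("align.decisions.captured", is_counter=True)
```

Histograms such as align.extraction.duration are usually exposed as several series (with _bucket, _sum, and _count suffixes), and some exporter configurations also append a unit suffix.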

Configuration Reference

Full Values

observability:
  # Master switch - enables OTel Collector and SDK instrumentation
  enabled: false

  grafanaCloud:
    enabled: false
    otlpEndpoint: "" # e.g., https://otlp-gateway-prod-eu-west-2.grafana.net/otlp
    authToken: "" # Grafana Cloud API token (base64-encoded instanceId:token)

  datadog:
    enabled: false
    apiKey: "" # Datadog API key
    site: "datadoghq.com" # Datadog site

  prometheus:
    enabled: false # Expose /metrics endpoint for Prometheus scraping

Environment Variables

These are automatically set by the Helm chart when observability.enabled: true:

| Variable | Default | Description |
|---|---|---|
| OTEL_ENABLED | false | Master switch for OTel SDK |
| OTEL_SERVICE_NAME | align-<service> | Service name in traces/metrics |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://align-otel-collector:4318 | OTLP endpoint |
| OTEL_EXPORTER_OTLP_PROTOCOL | http/protobuf | OTLP protocol |
| OTEL_RESOURCE_ATTRIBUTES | deployment.environment=<env> | Resource attributes |

You can override any of these per-service using extraEnv.

Log Levels

All services support the LOG_LEVEL environment variable:

gateway:
  extraEnv:
    - name: LOG_LEVEL
      value: "debug" # trace, debug, info, warn, error, fatal

Local Development

For local development, use Docker Compose with the observability profile:

# Start with OTel Collector + Jaeger UI
docker compose --profile observability up

# View traces in Jaeger
open http://localhost:16686

This starts an OTel Collector (ports 4317/4318) and Jaeger for local trace visualization without any cloud backend.
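
To confirm the local pipeline end to end without running any Align service, you can post a hand-built span to the collector. The sketch below uses only the Python standard library and the JSON encoding of OTLP over HTTP. It assumes the collector's OTLP/HTTP receiver listens on localhost:4318 and accepts JSON on /v1/traces, which is standard OTLP behavior but worth verifying against your collector config. The function names and the align-smoke-test service name are made up for this example.

```python
import json
import secrets
import time
import urllib.request


def build_test_span(service_name: str = "align-smoke-test") -> dict:
    """Build a minimal OTLP/JSON trace payload containing a single span."""
    now_ns = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                ],
            },
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": "smoke-test",
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now_ns - 1_000_000),  # 1 ms earlier
                    "endTimeUnixNano": str(now_ns),
                }],
            }],
        }],
    }


def send_test_span(endpoint: str = "http://localhost:4318") -> int:
    """POST the payload to the OTLP/HTTP traces path and return the HTTP status."""
    request = urllib.request.Request(
        f"{endpoint}/v1/traces",
        data=json.dumps(build_test_span()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling send_test_span() while the observability profile is running should return a success status if the collector accepted the span. If the local collector forwards traces to Jaeger, the test service should then be selectable in the Jaeger UI.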

Business Telemetry vs Infrastructure Observability

Align maintains two complementary telemetry systems:

|  | Business Telemetry | Infrastructure Observability |
|---|---|---|
| Purpose | Product analytics, usage tracking | Service health, debugging |
| Storage | PostgreSQL (telemetry_events table) | Your OTel backend (Grafana, Datadog, etc.) |
| Data | Decisions captured, conflicts, engagement | Request latency, error rates, traces |
| Aggregation | Hourly/daily rollups via CronJob | Backend-native (e.g., Prometheus rules) |
| Config | telemetry.* values | observability.* values |

Both systems run independently. Business telemetry is always stored in PostgreSQL regardless of OTel configuration.

Disabling Observability

To run without any OTel overhead (default):

observability:
  enabled: false

When disabled:

  • No OTel Collector is deployed
  • OTel SDKs are disabled in all services (near-zero overhead)
  • Structured logging still works (logs go to stdout as structured JSON)
  • Business telemetry still records to PostgreSQL
  • Health check endpoints (/health, /ready) still function

Data Privacy & Sovereignty

Align's observability architecture is designed with a strict no phone-home policy. This section explains exactly what happens with your data.

What stays local (always)

| Data | Where it lives | External access |
|---|---|---|
| Business telemetry events | Your PostgreSQL database | None |
| Hourly/daily metric rollups | Your PostgreSQL database | None |
| Structured application logs | stdout (your log aggregator) | None |
| License validation | Local JWT signature check | None |
| Health check responses | In-cluster only | None |

What you control (opt-in only)

When you enable observability.enabled: true, the OTel Collector runs inside your cluster and sends data only to backends you configure:

  • Grafana Cloud - your account, your endpoint, your API token
  • Datadog - your account, your API key
  • Prometheus - your instance scrapes a local /metrics endpoint
  • Custom OTLP - any endpoint you specify

If you don't configure a backend, telemetry goes nowhere. The collector receives data from Align services but has no default export destination.

What Align never does

  • No usage analytics sent to Align servers
  • No crash/error reporting to external services
  • No update or version checks phoning home
  • No telemetry data forwarded to Align
  • No third-party analytics SDKs (Segment, Mixpanel, PostHog, etc.)
  • No hardcoded external endpoints in any service or collector config

Air-gapped deployments

Align works fully offline. For air-gapped environments:

  1. Mirror container images to your internal registry
  2. Set global.imageRegistry to your internal registry
  3. Optionally enable observability with an internal Prometheus or Grafana instance
  4. No internet connectivity required at any point after image pull

Compliance notes

  • GDPR/CCPA: No personal data leaves your infrastructure via telemetry. Business metrics (decision counts, conflict counts) are aggregate counters with no PII.
  • SOC 2: Observability is fully opt-in with audit-friendly configuration (all settings in Helm values, no hidden defaults).
  • HIPAA: No PHI is included in telemetry signals. Trace data includes HTTP paths and status codes but not request/response bodies.

Troubleshooting

No traces appearing

  1. Check that the OTel Collector is running: kubectl get pods -n align | grep otel
  2. Check collector logs: kubectl logs -n align deploy/align-otel-collector
  3. Verify OTEL_ENABLED=true in service pods: kubectl exec -n align deploy/align-gateway -- env | grep OTEL
  4. Ensure your backend credentials are correct (check collector logs for auth errors)

High cardinality warnings

If your backend warns about high cardinality, reduce the volume of business events by setting a sampling rate:

gateway:
  extraEnv:
    - name: TELEMETRY_SAMPLING_RATE
      value: "0.1" # Sample 10% of business events
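
As a rough illustration of what a rate of 0.1 means, the sketch below applies uniform random sampling to a stream of events. This page does not describe how Align applies TELEMETRY_SAMPLING_RATE internally, so treat the function names and the sampling strategy as assumptions made for the example.

```python
import random


def should_record(sampling_rate: float, rng: random.Random) -> bool:
    """Keep an event with probability equal to sampling_rate (0.0 to 1.0)."""
    return rng.random() < sampling_rate


def count_sampled(total_events: int, sampling_rate: float, seed: int = 42) -> int:
    """Count how many of total_events survive sampling, using a seeded RNG."""
    rng = random.Random(seed)
    return sum(should_record(sampling_rate, rng) for _ in range(total_events))


# With a rate of 0.1, roughly one event in ten is kept.
kept = count_sampled(10_000, 0.1)
```

With sampling like this, recorded counts understate the true totals. Dividing a sampled count by the sampling rate gives an estimate of the original volume.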

Collector resource usage

The default OTel Collector is lightweight (128Mi memory). For high-traffic deployments:

# Override collector resources in your values
# The collector template supports standard K8s resource specs