Monitoring & observability
TL;DR — AuthPlane emits Prometheus metrics on /metrics by default, structured slog logs with trace/span/request IDs, and optional OpenTelemetry traces + metrics via OTLP. This guide has the Prometheus scrape config, six alert rules worth pasting straight in, a Grafana dashboard skeleton, and the OTEL config for the collector. Metrics catalog + names live in Reference: Metrics & CLI.
Prometheus scrape
AuthPlane exposes /metrics on the admin port (9001 by default; registered ahead of the admin API-key auth, so scrapers don’t need credentials) in Prometheus text format. Path and provider are configurable:
observability:
  metrics:
    provider: prometheus   # "prometheus" | "otel" | "both" | "none"
    path: /metrics
Sample prometheus.yml:
scrape_configs:
  - job_name: authplane
    metrics_path: /metrics
    scrape_interval: 15s
    # No basic_auth block: /metrics is registered ahead of the admin
    # API-key auth, so the scraper does not need credentials.
    static_configs:
      - targets: ['authplane:9001']
For Kubernetes with the AuthPlane Helm chart, the chart ships a ServiceMonitor (Prometheus Operator CRD):
serviceMonitor:
  enabled: true
  interval: 15s
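Before pointing Prometheus at the endpoint, it can help to confirm that /metrics answers and carries the counters you expect. The sketch below is a minimal reader for the Prometheus text format, not an official client. The default URL `http://authplane:9001/metrics` is taken from the scrape target above, and the `reason` label in the usage example is a hypothetical label name used only for illustration.

```python
import re
import urllib.request
from collections import defaultdict

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def sum_by_metric(exposition: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comments."""
    totals: dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(3))
    return dict(totals)


def fetch_metrics(url: str = "http://authplane:9001/metrics") -> str:
    """Fetch the raw exposition text; the URL is an assumed deployment address."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Calling `sum_by_metric(fetch_metrics())` from a host that can reach the admin port gives a quick per-metric total. If the call fails with a connection error, check the port and path before debugging the Prometheus side.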
Six alerts worth having
Start here, tune thresholds to your traffic.
groups:
  - name: authplane
    rules:
      - alert: AuthPlaneAuthDeniedSpike
        expr: rate(authserver_auth_denied_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Auth denials spiking on {{ $labels.instance }}"
          description: "> 10 auth denials/sec over 5 min — possible attack or misconfig"
      - alert: AuthPlaneRefreshTokenReuse
        expr: rate(authserver_refresh_token_reuse_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Refresh-token theft detected on {{ $labels.instance }}"
          description: "Refresh-token family revocation triggered — investigate immediately"
      - alert: AuthPlaneDPoPRejectionSpike
        expr: rate(authplane_dpop_proofs_rejected_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DPoP rejections spiking on {{ $labels.instance }}"
          description: "> 5 DPoP proof rejections/sec — client bug, replay attack, or reverse-proxy htu mismatch"
      - alert: AuthPlaneTokenIssuanceSlow
        expr: histogram_quantile(0.99, rate(authserver_token_issuance_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 token issuance > 500 ms on {{ $labels.instance }}"
          description: "DB or Vault Transit latency degrading the token endpoint"
      - alert: AuthPlaneKeyRotationStale
        # No gauge for last-rotation timestamp; alert if the counter has not moved
        # in the target window. Combine with an absent() check to catch fresh
        # deployments where the counter simply hasn't fired yet.
        expr: (increase(authserver_key_rotation_total[90d]) == 0) or absent(authserver_key_rotation_total)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Signing key not rotated in > 90 days on {{ $labels.instance }}"
          description: "Rotate via `authserver admin key rotate` or POST /admin/keys/rotate"
      - alert: AuthPlaneUpstreamRefreshFailing
        expr: rate(authserver_upstream_token_refresh_total{outcome="failed"}[15m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Upstream provider refresh failing on {{ $labels.instance }}"
          description: "Broker refresh grant rejected by upstream — user may need to reconnect"
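When tuning the AuthPlaneTokenIssuanceSlow threshold, it helps to know how histogram_quantile arrives at a p99 from bucket counts: it finds the bucket containing the target rank and interpolates linearly inside it. The sketch below mirrors that idea for classic (cumulative) buckets. It is a simplified illustration, not Prometheus's own implementation, and the bucket boundaries and counts in the usage note are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs, including the
    (math.inf, total) bucket, as exposed by a *_bucket series.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[idx]
    if math.isinf(upper):
        # The quantile falls above the highest finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cum - below)
```

For example, with buckets `[(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 1000), (math.inf, 1000)]` the p99 rank is 990, which lands in the 0.25 to 0.5 s bucket and interpolates to roughly 0.417 s, so the 0.5 s alert would not fire. The estimate can never be finer than the bucket layout, which is worth remembering when a threshold sits close to a bucket boundary.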
Grafana dashboard skeleton
Panels worth having on day one:
Full metric catalog with descriptions in Reference: Metrics & CLI.
Structured logging (slog)
AuthPlane logs via Go’s stdlib slog — JSON by default in production, plain text in dev. Every request emits:
2026-07-01T00:14:20Z INFO msg="token issued"
  grant_type=authorization_code
  client_id=my-client
  sub=user-42
  resource=https://mcp.example.com/mcp
  scope="tools/read"
  jti=jti_abc123
  request_id=r_abc123
  trace_id=t_def456
  span_id=s_ghi789
Configure:
observability:
  logging:
    level: info          # debug | info | warn | error
    format: json         # json | text
    add_source: false    # true = include file:line
    outputs:
      stdout: true       # print to stdout (typical container pattern)
      otel: false        # ship via OTLP to a log backend
      otel_endpoint: ""
      insecure: false
Shipping to Loki, Elasticsearch, Splunk, or CloudWatch: use their standard container-log scraper (Promtail, Fluent Bit, CloudWatch agent) reading stdout. Every field is JSON-queryable.
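Because every field is JSON-queryable, pulling all log lines for one request is a small filter even without a log backend. The sketch below assumes `format: json` and that each line carries a `trace_id` key, as in the sample above. The function name is illustrative and not part of AuthPlane.

```python
import json
from typing import Iterable


def filter_by_trace(lines: Iterable[str], trace_id: str) -> list[dict]:
    """Return the parsed JSON log records whose trace_id matches."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as panics or startup banners
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matches.append(record)
    return matches
```

Passing `sys.stdin` as `lines` lets it sit at the end of a pipe from your container log command. The same approach works with `request_id` if you start from a client-reported request rather than a trace.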
OpenTelemetry — traces + logs + metrics
Wire an OTEL collector for distributed tracing across your MCP client → AuthPlane → your MCP server:
observability:
  logging:
    outputs:
      otel: true
      otel_endpoint: otel-collector.monitoring:4317
      insecure: true     # allow plaintext gRPC (fine inside cluster)
  tracing:
    enabled: true
    endpoint: otel-collector.monitoring:4317
    insecure: true
    sample_rate: 1.0     # 0.0..1.0 — 1.0 = everything
  metrics:
    provider: both       # scrape via /metrics AND push OTLP
    otel_endpoint: otel-collector.monitoring:4317
    insecure: true
Sample rate matters at scale: start at 1.0 for the first week so you see every request, then drop to 0.01–0.1 and use tail-based sampling in the collector to keep latency outliers and errors.
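To make the effect of `sample_rate` concrete, here is a minimal sketch of a ratio-based head sampler: the decision is a pure function of the trace ID, so every service that sees the same ID makes the same choice. This is a simplified illustration in the spirit of OpenTelemetry's trace-ID-ratio samplers. It is not AuthPlane's sampler, and real SDKs differ in exactly which bits they compare.

```python
def should_sample(trace_id_hex: str, sample_rate: float) -> bool:
    """Decide deterministically whether to keep a trace.

    trace_id_hex: 32 hex characters (128-bit trace ID).
    sample_rate:  fraction of traces to keep, 0.0 to 1.0.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)  # low 64 bits of the ID
    threshold = int(sample_rate * (1 << 64))
    return low64 < threshold
```

At 1.0 the threshold exceeds every possible value and all traces are kept; at 0.01 roughly one trace in a hundred passes, assuming trace IDs are uniformly random. At 1,000 requests/sec that is the difference between about 86 million and about 864 thousand traces per day.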
What traces cover
Each incoming request generates a span with children for:
- HTTP handler
- Service-layer operation (AuthorizeService.StartAuthorization, TokenService.ExchangeCode, etc.)
- Storage-adapter operations (Postgres query, SQLite exec)
- Crypto operations (JWT signing, DPoP proof verification)
- Outbound HTTP (upstream OIDC callback, JWKS refresh, broker provider vend)
trace_id propagates from any inbound traceparent header (W3C Trace Context). Ship your MCP client’s traces to the same OTEL backend and you can see the full request path from client tool call → AuthPlane token issuance → your MCP server tool handler in one trace.
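When a trace does not join up across services, the first thing to check is whether the client is sending a well-formed traceparent header. The sketch below parses the version 00 layout from the W3C Trace Context specification (`version-traceid-parentid-flags`). It is a debugging aid under that assumption, not the parser AuthPlane uses, and it deliberately rejects other versions.

```python
import re
from typing import Optional

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(
    r'^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$'
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version != "00":
        return None  # simplification: only the current version is handled
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit = sampled
    }
```

The `trace_id` it returns is the value to search for in your OTEL backend and in the `trace_id` log field.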
Health checks
GET /health # 200 OK / 503 (DB unreachable)
GET /ready # 200 OK / 503 (not ready to serve)
Both are unauthenticated and cheap. Wire to your container orchestrator’s liveness/readiness probes.
Kubernetes example:
livenessProbe:
  httpGet:
    path: /health
    port: 9000
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /ready
    port: 9000
  initialDelaySeconds: 5
  periodSeconds: 10
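Outside Kubernetes (CI jobs, Compose stacks, smoke tests) the same endpoints can gate a script until AuthPlane is ready to serve. The sketch below polls a check function a bounded number of times. The helper names and the example URL `http://authplane:9000/ready` are assumptions based on the probe config above.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers 200; a 503 or connection error counts as not OK."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers HTTPError (e.g. 503), URLError, and timeouts
        return False


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call would be `wait_until_ready(lambda: http_ok("http://authplane:9000/ready"))`, failing the job if it returns False. Passing the check in as a function keeps the retry loop testable without a running server.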
Audit log
Every security-relevant event (login, consent, token issuance, revocation, admin action) writes an audit_events row queryable via:
GET /admin/audit?since=2026-06-01T00:00:00Z&kind=token_issued&user_id=user-42&limit=50
The audit records are exposed only over the admin API — there’s no dedicated admin audit … CLI subcommand. Use curl (or your admin client) against the endpoint above.
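Since there is no CLI for it, a small helper that builds the query URL keeps scripted audit pulls readable. The sketch below only assembles the URL from the filters shown above. The base address `http://authplane:9001` is an assumption (the admin port mentioned earlier), and the admin API key still has to be sent however your deployment requires, which is not shown here.

```python
from urllib.parse import urlencode


def audit_url(base: str, **filters) -> str:
    """Build a GET /admin/audit URL, dropping filters that are None."""
    params = {key: value for key, value in filters.items() if value is not None}
    # safe=":" keeps RFC 3339 timestamps readable, matching the example above.
    return f"{base.rstrip('/')}/admin/audit?{urlencode(params, safe=':')}"
```

For example, `audit_url("http://authplane:9001", since="2026-06-01T00:00:00Z", kind="token_issued", user_id="user-42", limit=50)` reproduces the query shown above, and the result can be handed to curl or any HTTP client.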
The two complement each other: structured logs are the ephemeral, high-cardinality stream, while the audit log is the durable, queryable forensic record.
Related
- Reference: Metrics & CLI — every metric name, label, and description
- Reference: Configuration → observability — every knob
- Operate: Kubernetes — Helm-side wiring (serviceMonitor, extraEnv for OTEL)
- Operate: Docker Compose — SIGHUP for hot key reload (useful in scripted rotation with monitoring)
- Troubleshooting: Debugging — using logs + metrics to diagnose failures