Monitoring & observability

TL;DR: AuthPlane emits Prometheus metrics on /metrics by default, structured slog logs with trace/span/request IDs, and optional OpenTelemetry traces, logs, and metrics via OTLP. This guide has the Prometheus scrape config, six alert rules worth pasting straight in, a Grafana dashboard skeleton, and the OTLP settings for shipping to a collector. The full metric catalog lives in Reference: Metrics & CLI.

Prometheus scrape

AuthPlane exposes /metrics on the admin port (9001 by default; registered ahead of the admin API-key auth, so scrapers don’t need credentials) in Prometheus text format. Path and provider are configurable:

observability:
  metrics:
    provider: prometheus     # "prometheus" | "otel" | "both" | "none"
    path: /metrics

Sample prometheus.yml:

scrape_configs:
  - job_name: authplane
    metrics_path: /metrics
    scrape_interval: 15s
    # No basic_auth needed: /metrics is registered ahead of the admin API-key auth.
    static_configs:
      - targets: ['authplane:9001']

On Kubernetes, the AuthPlane Helm chart ships a ServiceMonitor (a Prometheus Operator CRD). Enable it in your values:

serviceMonitor:
  enabled: true
  interval: 15s
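
If you deploy without the chart, or want a rough idea of what that toggle stands for, a hand-written ServiceMonitor could look like the sketch below. The namespaces, the `app.kubernetes.io/name: authplane` selector label, and the `admin` port name are assumptions for illustration and may not match what the chart renders; only the /metrics path and the 15s interval come from this guide.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: authplane
  namespace: monitoring                     # assumption: where your Prometheus Operator looks
spec:
  namespaceSelector:
    matchNames: [authplane]                 # assumption: namespace AuthPlane runs in
  selector:
    matchLabels:
      app.kubernetes.io/name: authplane     # assumption: label on the AuthPlane Service
  endpoints:
    - port: admin                           # assumption: name of the Service port mapped to 9001
      path: /metrics
      interval: 15s
```

The Prometheus resource's `serviceMonitorSelector` also has to match this object's labels, otherwise the operator ignores it.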

Six alerts worth having

Start with these, then tune the thresholds to your traffic.

groups:
- name: authplane
  rules:

  - alert: AuthPlaneAuthDeniedSpike
    expr: rate(authserver_auth_denied_total[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Auth denials spiking on {{ $labels.instance }}"
      description: "> 10 auth denials/sec over 5 min — possible attack or misconfig"

  - alert: AuthPlaneRefreshTokenReuse
    expr: rate(authserver_refresh_token_reuse_total[5m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Refresh-token reuse detected on {{ $labels.instance }}"
      description: "Refresh-token family revocation triggered; possible token theft, investigate immediately"

  - alert: AuthPlaneDPoPRejectionSpike
    expr: rate(authplane_dpop_proofs_rejected_total[5m]) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "DPoP rejections spiking on {{ $labels.instance }}"
      description: "> 5 DPoP proof rejections/sec — client bug, replay attack, or reverse-proxy htu mismatch"

  - alert: AuthPlaneTokenIssuanceSlow
    expr: histogram_quantile(0.99, rate(authserver_token_issuance_duration_seconds_bucket[5m])) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "p99 token issuance > 500 ms on {{ $labels.instance }}"
      description: "DB or Vault Transit latency degrading the token endpoint"

  - alert: AuthPlaneKeyRotationStale
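    # No gauge for last-rotation timestamp; alert if the counter has not moved
    # in the target window. Combine with an absent() check to catch fresh
    # deployments where the counter simply hasn't fired yet. The 90d range is
    # only meaningful with at least 90 days of Prometheus retention (the
    # default is 15d), and the absent() branch carries no instance label.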
    expr: (increase(authserver_key_rotation_total[90d]) == 0) or absent(authserver_key_rotation_total)
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Signing key not rotated in > 90 days on {{ $labels.instance }}"
      description: "Rotate via `authserver admin key rotate` or POST /admin/keys/rotate"

  - alert: AuthPlaneUpstreamRefreshFailing
    expr: rate(authserver_upstream_token_refresh_total{outcome="failed"}[15m]) > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Upstream provider refresh failing on {{ $labels.instance }}"
      description: "Broker refresh grant rejected by upstream — user may need to reconnect"
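
Before tuning thresholds, it can help to pin the current behaviour down with a promtool unit test. The sketch below exercises AuthPlaneTokenIssuanceSlow with a synthetic histogram. The file names and the bucket boundaries (0.25 s and 1 s) are assumptions for illustration; the real bucket layout is in the metrics reference.

```yaml
# authplane-alerts.test.yml (hypothetical file name)
# Run with: promtool test rules authplane-alerts.test.yml
rule_files:
  - authplane-alerts.yml      # assumption: the rule group above, saved under this name

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic load: no request finishes within 250 ms, every request within 1 s.
      - series: 'authserver_token_issuance_duration_seconds_bucket{instance="authplane:9001", le="0.25"}'
        values: '0+0x20'
      - series: 'authserver_token_issuance_duration_seconds_bucket{instance="authplane:9001", le="1"}'
        values: '0+10x20'
      - series: 'authserver_token_issuance_duration_seconds_bucket{instance="authplane:9001", le="+Inf"}'
        values: '0+10x20'
    alert_rule_test:
      # p99 interpolates to roughly 0.99 s, above the 0.5 s threshold,
      # so the alert should be firing once the 10m "for" window has passed.
      - eval_time: 15m
        alertname: AuthPlaneTokenIssuanceSlow
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: authplane:9001
            exp_annotations:
              summary: "p99 token issuance > 500 ms on authplane:9001"
              description: "DB or Vault Transit latency degrading the token endpoint"
```

promtool compares annotations as well as labels, so the expected strings have to be updated whenever the rule's summary or description changes.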

Grafana dashboard skeleton

Panels worth having on day one:

Panel	PromQL
Requests/sec by grant type	`sum by (grant_type) (rate(authserver_tokens_issued_total[5m]))`
Auth denials/sec	`rate(authserver_auth_denied_total[5m])`
Token issuance p50/p95/p99	`histogram_quantile(0.99, rate(authserver_token_issuance_duration_seconds_bucket[5m]))`, repeated with 0.95 and 0.50 for the other two series
DPoP validated vs rejected	`rate(authplane_dpop_proofs_validated_total[5m])` and `rate(authplane_dpop_proofs_rejected_total[5m])` on the same axis
Introspection latency	`histogram_quantile(0.95, rate(authserver_introspection_duration_seconds_bucket[5m]))`
Refresh-token rotations	`rate(authserver_tokens_refreshed_total[5m])`
Reuse-detected revocations	`rate(authserver_refresh_token_reuse_total[5m])` — should be near-zero
Upstream vends/sec	`rate(authserver_upstream_token_issued_total[5m])` per Broker resource
Active token families	`authserver_active_token_families`
HTTP request rate by status	`sum by (status) (rate(authserver_http_requests_total[5m]))`

Full metric catalog with descriptions in Reference: Metrics & CLI.
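
If the latency panels get slow to render, one option is to precompute the quantiles with Prometheus recording rules and point the panels at the recorded series. This is a sketch, not something AuthPlane ships: the group name and the recorded series names are hypothetical, and summing across instances with `sum by (le)` is an assumption about how you want the data aggregated.

```yaml
groups:
  - name: authplane-dashboard             # hypothetical group name
    interval: 30s
    rules:
      # Fleet-wide token issuance latency, one recorded series per quantile.
      - record: authserver:token_issuance_duration_seconds:p50
        expr: histogram_quantile(0.50, sum by (le) (rate(authserver_token_issuance_duration_seconds_bucket[5m])))
      - record: authserver:token_issuance_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (le) (rate(authserver_token_issuance_duration_seconds_bucket[5m])))
      - record: authserver:token_issuance_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (le) (rate(authserver_token_issuance_duration_seconds_bucket[5m])))
      # Issuance rate per grant type, same expression as the first panel.
      - record: authserver:tokens_issued:rate5m
        expr: sum by (grant_type) (rate(authserver_tokens_issued_total[5m]))
```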

Structured logging (slog)

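AuthPlane logs via Go’s stdlib slog: JSON by default in production, plain text in dev. Every request log carries the request, trace, and span IDs. A token issuance, for example, looks like this (fields wrapped here for readability):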

2026-07-01T00:14:20Z INFO msg="token issued"
  grant_type=authorization_code
  client_id=my-client
  sub=user-42
  resource=https://mcp.example.com/mcp
  scope="tools/read"
  jti=jti_abc123
  request_id=r_abc123
  trace_id=t_def456
  span_id=s_ghi789

Configure:

observability:
  logging:
    level: info            # debug | info | warn | error
    format: json           # json | text
    add_source: false      # true = include file:line
    outputs:
      stdout: true         # print to stdout (typical container pattern)
      otel: false          # ship via OTLP to a log backend
      otel_endpoint: ""
      insecure: false

Shipping to Loki, Elasticsearch, Splunk, or CloudWatch: use their standard container-log scraper (Promtail, Fluent Bit, CloudWatch agent) reading stdout. Every field is JSON-queryable.

OpenTelemetry — traces + logs + metrics

Wire up an OTEL collector to get distributed tracing along the whole path, MCP client → AuthPlane → MCP server:

observability:
  logging:
    outputs:
      otel: true
      otel_endpoint: otel-collector.monitoring:4317
      insecure: true         # allow plaintext gRPC (fine inside cluster)
  tracing:
    enabled: true
    endpoint: otel-collector.monitoring:4317
    insecure: true
    sample_rate: 1.0         # 0.0..1.0 — 1.0 = everything
  metrics:
    provider: both           # scrape via /metrics AND push OTLP
    otel_endpoint: otel-collector.monitoring:4317
    insecure: true

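Sample rate matters at scale. Start at 1.0 for the first week to establish a baseline, then drop to 0.01–0.1. Tail-based sampling in the collector can then keep latency outliers and errors, but only from traces AuthPlane actually exports, so the sample rate here caps what the collector can see.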
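
On the collector side, tail-based sampling could be configured roughly as follows. This is a sketch for the OpenTelemetry Collector contrib distribution (which provides the `tail_sampling` processor), not an AuthPlane-supplied config: the backend address, the 500 ms latency threshold, and the 5% baseline are assumptions to adjust.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317            # matches the :4317 endpoints configured above

processors:
  tail_sampling:
    decision_wait: 10s                    # buffer spans this long before deciding per trace
    policies:                             # a trace is kept if any policy matches
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500               # assumption: same 500 ms bar as the p99 alert
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5          # assumption: small random share of normal traffic
  batch: {}

exporters:
  otlp:
    endpoint: tracing-backend.monitoring:4317   # hypothetical backend address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Tail sampling needs all spans of a trace to reach the same collector instance, so a single collector (or trace-ID-aware load balancing in front of several) is the simplest starting point.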

What traces cover

Each incoming request generates a span with children for:

HTTP handler
Service-layer operation (AuthorizeService.StartAuthorization, TokenService.ExchangeCode, etc.)
Storage-adapter operations (Postgres query, SQLite exec)
Crypto operations (JWT signing, DPoP proof verification)
Outbound HTTP (upstream OIDC callback, JWKS refresh, broker provider vend)

trace_id propagates from any inbound traceparent header (W3C Trace Context). Ship your MCP client’s traces to the same OTEL backend and you can see the full request path from client tool call → AuthPlane token issuance → your MCP server tool handler in one trace.

Health checks

GET /health           # 200 OK / 503 (DB unreachable)
GET /ready            # 200 OK / 503 (not ready to serve)

Both are unauthenticated and cheap. Wire them to your container orchestrator’s liveness/readiness probes.

Kubernetes example:

livenessProbe:
  httpGet:
    path: /health
    port: 9000
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /ready
    port: 9000
  initialDelaySeconds: 5
  periodSeconds: 10

Audit log

Every security-relevant event (login, consent, token issuance, revocation, admin action) writes an audit_events row queryable via:

GET /admin/audit?since=2026-06-01T00:00:00Z&kind=token_issued&user_id=user-42&limit=50

The audit records are exposed only over the admin API — there’s no dedicated admin audit … CLI subcommand. Use curl (or your admin client) against the endpoint above.

The audit log and the structured logs complement each other: structured logs are the ephemeral, high-cardinality stream, while the audit log is the durable, queryable forensic record.

Reference: Metrics & CLI — every metric name, label, and description
Reference: Configuration → observability — every knob
Operate: Kubernetes — Helm-side wiring (serviceMonitor, extraEnv for OTEL)
Operate: Docker Compose — SIGHUP for hot key reload (useful in scripted rotation with monitoring)
Troubleshooting: Debugging — using logs + metrics to diagnose failures