Runbooks

Incident procedures for the fabriq worker — outbox backlog, projection lag, tenant hook trips, conflation depth, and SSE buffering behind proxies.

Each procedure keys off a metric from Observability. Metric names are exact.

Outbox backlog grows

Metric: fabriq_outbox_backlog. Cause: the relay is not draining — the worker is down, no replica holds leadership, or Redis is unreachable.

Confirm worker replicas are running and ready.

kubectl get pods -l app.kubernetes.io/component=worker
kubectl exec <pod> -- wget -qO- localhost:8081/_/readyz
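
With several replicas, probing each pod by hand is slow. The two commands above can be combined into one loop, sketched here as a shell function. The function name and the KUBECTL override are illustrative assumptions, not part of fabriq. The label selector, port, and path are the ones used above.

```shell
# Sketch: probe /_/readyz on every worker pod and print one line per pod.
# KUBECTL is a hypothetical override (defaults to kubectl) for dry runs.
check_workers_ready() {
  kubectl_bin="${KUBECTL:-kubectl}"
  # -o name prints one "pod/<name>" per line, which kubectl exec accepts.
  for pod in $("$kubectl_bin" get pods -l app.kubernetes.io/component=worker -o name); do
    if "$kubectl_bin" exec "$pod" -- wget -qO- localhost:8081/_/readyz >/dev/null 2>&1; then
      echo "$pod ready"
    else
      echo "$pod NOT ready"
    fi
  done
}
```

Run check_workers_ready against the affected namespace. Any "NOT ready" line is the replica to inspect first.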

Confirm exactly one relay leads (advisory lock 1001). Check worker logs for election churn. When its Postgres connection flaps, the relay abdicates leadership by design (the session watchdog), so persistent churn points at the Postgres connection, not the relay.
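
Leadership can also be checked from the Postgres side. The sketch below lists the sessions holding advisory lock 1001. It assumes the relay takes the lock with the single-bigint form, pg_advisory_lock(1001) or its try variant, which pg_locks records as classid = 0, objid = 1001, objsubid = 1. If fabriq uses the two-integer form, the filter needs adjusting. DATABASE_URL, PSQL, and the function name are hypothetical.

```shell
# Sketch: list sessions that hold (or wait on) advisory lock 1001.
# Assumes the single-bigint lock form; see the note above.
# DATABASE_URL and PSQL are hypothetical names, not fabriq settings.
relay_lock_holders() {
  "${PSQL:-psql}" "$DATABASE_URL" -At -F ' ' -c "
    SELECT a.pid, a.application_name, a.client_addr, l.granted
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.locktype = 'advisory'
      AND l.classid = 0 AND l.objid = 1001 AND l.objsubid = 1;"
}
```

Under these assumptions, one row with granted = t means a single leader. No rows means no replica currently holds leadership.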

Check Redis reachability from the worker. The relay is at-least-once: once Redis is reachable again, the backlog drains in ULID order. Redelivery can produce duplicates, but downstream consumers are version-gated and idempotent, so the duplicates are harmless.

A relay outage never causes commands to fail: the outbox is the buffer, so writes still commit. This is a latency incident, not a data-loss incident.

Projection lag

Metric: fabriq_projection_lag_events{projection,tenant}.

Identify the lagging consumer group on the Redis event stream.

XINFO GROUPS fabriq:events

Consumers scale by replica count (Redis consumer groups, no election) — add worker replicas if apply throughput is the limit.

Look for a poisoned event: a handler error loop leaves an entry pending, and consumers keep reclaiming it from one another via XAUTOCLAIM. The same stream id cycling in the logs is the tell. Because appliers are pure and version-gated, the usual cause is an engine outage, not the event itself, so check the target engine.
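
Redis can show the stuck entry directly. The sketch below wraps the extended form of XPENDING, which reports for each pending entry its id, owning consumer, idle time in milliseconds, and delivery count. The function name and the REDIS_CLI override are illustrative assumptions. The group name comes from the XINFO GROUPS output above.

```shell
# Sketch: show the oldest pending entries for one consumer group.
# Each row: entry id, owning consumer, idle ms, delivery count.
# REDIS_CLI is a hypothetical override (defaults to redis-cli).
pending_entries() {
  group="$1"
  count="${2:-10}"
  "${REDIS_CLI:-redis-cli}" XPENDING fabriq:events "$group" - + "$count"
}
```

An entry whose delivery count keeps climbing while the others stay low is a likely candidate for the poisoned event.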

If a projection fell off the stream's MAXLEN horizon, replay from the stream is no longer possible. Rebuild from Postgres — always safe, always possible. See Rebuild & Reconcile.
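
One way to spot this condition is to compare two ids: the group's last-delivered-id from XINFO GROUPS fabriq:events, and the id of the first-entry reported by XINFO STREAM fabriq:events. Redis stream ids have the form ms-seq and order numerically on each part. The helper below is a sketch of that comparison. Its name is an assumption, and a positive result only suggests that entries were trimmed before delivery.

```shell
# Sketch: succeed (exit 0) when stream id $1 sorts before stream id $2.
# Redis stream ids are "<ms>-<seq>", compared numerically part by part.
stream_id_lt() {
  a_ms="${1%-*}"; a_seq="${1#*-}"
  b_ms="${2%-*}"; b_seq="${2#*-}"
  [ "$a_ms" -lt "$b_ms" ] || { [ "$a_ms" -eq "$b_ms" ] && [ "$a_seq" -lt "$b_seq" ]; }
}
```

If stream_id_lt returns true for the group's last-delivered-id against the stream's first entry id, the group may have missed trimmed entries, and a rebuild is the safe path.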

Tenant hook trips

Metric: fabriq_tenant_hook_trips_total.

Any non-zero value is a fabriq bug: page the owning team. It means a query reached an engine without tenant scoping, through either pool-path access to a tenant table or raw SQL touching the readings table without a tenant predicate. RLS contained the blast radius (stamped transactions see only their tenant; unstamped transactions see nothing), but the call site must be found and fixed. The error carries the offending table and operation.

This is a correctness alarm, not a capacity one — do not "wait for it to drain."

Conflation depth grows

Metric: fabriq_conflation_depth. Subscribers are not draining.

Hub delivery is non-blocking: when a subscriber's buffer is full, events are dropped rather than blocking the hub, and the client refetches and resumes via Last-Event-ID. The condition therefore self-heals, at the cost of client refetches.

Investigate slow SSE consumers — a stuck client that holds a subscription without reading backs up its conflation window.

If depth is broadly elevated rather than per-client, raise the hub buffer (WithSubscribeBuffer) and redeploy.

SSE batched behind a proxy

Symptom: clients receive events in batches instead of as they happen — a proxy is buffering the stream.

The SSE bridge sets X-Accel-Buffering: no, flushes after every event, and writes periodic heartbeat comments to keep intermediaries from idling the connection. If events still arrive batched:

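Check that nginx honors the header: X-Accel-Buffering: no has the same effect as proxy_buffering off for that response. nginx ignores the header if proxy_ignore_headers lists X-Accel-Buffering. Also confirm that no intermediate proxy strips it.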

Check that the load-balancer idle timeout is greater than the heartbeat interval; otherwise the load balancer (ALB and similar) closes the connection between events.
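
To see where batching is introduced, it helps to watch the stream with arrival times, once through the proxy and once directly against a worker. The sketch below uses curl with -N, which turns off curl's own output buffering, and prefixes each received line with the local time. The function name and the CURL override are assumptions, and the endpoint URL is a placeholder to fill in for the deployment.

```shell
# Sketch: print each SSE line with the wall-clock time it arrived.
# CURL is a hypothetical override (defaults to curl); $1 is the stream URL.
watch_sse() {
  "${CURL:-curl}" -sS -N -H 'Accept: text/event-stream' "$1" |
    while IFS= read -r line; do
      printf '%s %s\n' "$(date +%T)" "$line"
    done
}
```

Lines that arrive in bursts sharing one timestamp through the proxy, but spread out when fetched directly, point at the proxy layer rather than the SSE bridge.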

Observability

The full metric catalog and what each signals.

Rebuild & Reconcile

Rebuild recovers a projection that fell off the stream horizon.