Incidents

How Yorker groups correlated alerts into incidents, tracks their lifecycle, and dispatches opinionated, investigator-grade notifications.

A Yorker incident is a correlated group of alerts treated as one investigable unit. Each incident has a fingerprint, a severity, a lifecycle, and a notification policy. Incidents reduce noise by collapsing many alerts into one ticket and by emitting structured, investigator-grade payloads to your channels.

Why incidents exist

A single alert answers "is this check failing right now?" It does not answer the question an on-call engineer actually needs: what is the blast radius, and is it related to something else that's breaking?

Synthetic monitors often fire in bursts. An upstream DNS provider hiccups and ten HTTP checks page at once. A CDN edge degrades and browser checks across three regions turn red. Without correlation, you get ten pages for one problem.

Yorker groups those alerts into an incident, computes a scoped hypothesis from the observations (HTTP status codes, locations, shared failing domains, symptom timing), and dispatches one ticket per channel per incident — not one per alert.
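A minimal sketch of that grouping, assuming a fingerprint derived from the shared failing domain. The real correlation scores multiple dimensions; these field names are illustrative, not Yorker's actual schema:

```typescript
// Hypothetical sketch: collapse a burst of alerts into incidents keyed
// by one correlation dimension (the shared failing domain).
interface Alert {
  checkId: string;
  failingDomain: string;
  statusCode: number;
  location: string;
}

interface Incident {
  fingerprint: string;
  alerts: Alert[];
}

function groupIntoIncidents(alerts: Alert[]): Incident[] {
  const byFingerprint = new Map<string, Alert[]>();
  for (const alert of alerts) {
    // One plausible fingerprint: the domain every member is failing on.
    const fingerprint = `shared_failing_domain=${alert.failingDomain}`;
    const group = byFingerprint.get(fingerprint) ?? [];
    group.push(alert);
    byFingerprint.set(fingerprint, group);
  }
  return [...byFingerprint.entries()].map(([fingerprint, members]) => ({
    fingerprint,
    alerts: members,
  }));
}

// Ten alerts against one upstream collapse into a single incident.
const alerts: Alert[] = Array.from({ length: 10 }, (_, i) => ({
  checkId: `chk_${i}`,
  failingDomain: "api.stripe.com",
  statusCode: 503,
  location: "us-east-1",
}));
const incidents = groupIntoIncidents(alerts);
console.log(incidents.length); // 1
```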

The incident lifecycle

Every incident moves through a small set of states. Each state transition is recorded as a first-class event and dispatched to subscribed channels.

  • open — correlated alerts above the score threshold
  • acknowledged — a user clicks "Acknowledge" in the dashboard or API
  • auto_resolved — all member alerts recovered and the 15-minute cool-down elapsed
  • closed — a user closes the incident explicitly
  • reopened — a user reopens a previously closed/resolved incident

The transient transition reopened → open is preserved in the event log so downstream consumers can replay the exact sequence.
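The lifecycle can be pictured as an explicit transition table. The state names come from the table above; the exact set of allowed transitions is an assumption for illustration:

```typescript
// State names are from the docs; which transitions are legal is an
// illustrative guess, not Yorker's authoritative state machine.
type IncidentState =
  | "open"
  | "acknowledged"
  | "auto_resolved"
  | "closed"
  | "reopened";

const transitions: Record<IncidentState, IncidentState[]> = {
  open: ["acknowledged", "auto_resolved", "closed"],
  acknowledged: ["auto_resolved", "closed"],
  auto_resolved: ["closed", "reopened"],
  closed: ["reopened"],
  reopened: ["open"], // transient: a reopen immediately re-enters open
};

function canTransition(from: IncidentState, to: IncidentState): boolean {
  return transitions[from].includes(to);
}

console.log(canTransition("closed", "reopened")); // true
console.log(canTransition("closed", "acknowledged")); // false
```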

Event types

Every lifecycle transition emits one of these events. Every event carries the full observations + hypothesis snapshot so a consumer replaying one event has complete context without querying back.

  • opened — new incident created
  • alert_attached — an additional alert joined an active incident
  • severity_changed — severity escalated or de-escalated
  • acknowledged — a user took ownership
  • note_added — a user added a freeform note
  • auto_resolved — all members recovered and cool-down elapsed
  • closed — a user closed it
  • reopened — a user reopened a previously resolved incident

Each event is persisted to incident_events, emitted as an OTel log record (if an OTLP endpoint is configured for the team), and dispatched to every channel subscribed to incidents for the team.

Default notification routing

Different channel types have different sensible defaults. Yorker opts into the minimum-noise routing that matches each channel's audience:

  • Slack — every lifecycle event (timeline-style thread)
  • Email — opened, auto_resolved, closed only (inboxes should not be a running timeline)
  • Webhook — every lifecycle event
  • PagerDuty — opened, acknowledged, auto_resolved, closed, reopened, note_added
  • ServiceNow — opened, severity_changed, acknowledged, auto_resolved, closed, note_added

PagerDuty skips severity_changed because the Events API v2 has no matching action. ServiceNow skips reopened because Yorker's reopen semantics don't map cleanly to ServiceNow's reopen concept — a Yorker "reopen" after a recurrence creates a new external ticket rather than mutating the old one.
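The routing defaults above can be expressed as plain data; a dispatcher might consult them with a helper like this (the helper and its names are illustrative, not Yorker's internals):

```typescript
// Default event routing per channel type, as data. Channel and event
// names come from the docs; routedByDefault is a hypothetical helper.
type EventType =
  | "opened"
  | "alert_attached"
  | "severity_changed"
  | "acknowledged"
  | "note_added"
  | "auto_resolved"
  | "closed"
  | "reopened";

const ALL_EVENTS: EventType[] = [
  "opened", "alert_attached", "severity_changed", "acknowledged",
  "note_added", "auto_resolved", "closed", "reopened",
];

const defaultRouting: Record<string, EventType[]> = {
  slack: ALL_EVENTS,
  webhook: ALL_EVENTS,
  email: ["opened", "auto_resolved", "closed"],
  pagerduty: ["opened", "acknowledged", "auto_resolved", "closed", "reopened", "note_added"],
  servicenow: ["opened", "severity_changed", "acknowledged", "auto_resolved", "closed", "note_added"],
};

function routedByDefault(channel: string, event: EventType): boolean {
  return defaultRouting[channel]?.includes(event) ?? false;
}

console.log(routedByDefault("email", "note_added")); // false
console.log(routedByDefault("pagerduty", "severity_changed")); // false
console.log(routedByDefault("slack", "alert_attached")); // true
```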

See the Slack, PagerDuty, ServiceNow, Email, and Webhook integration pages for the exact payload shapes.

Scoped hypothesis

Every outbound incident payload carries a hypothesis block that tells the reader what Yorker thinks is going on — scoped to what an external synthetic sensor can prove:

{
  "hypothesis": {
    "summary": "Stripe API is returning 503/504; checkout is blocked.",
    "confidence": 0.75,
    "ruledIn": ["shared_failing_domain=api.stripe.com"],
    "ruledOut": [
      "DNS resolution: NXDOMAIN not observed",
      "TLS: handshake completes"
    ],
    "correlationDimensionsMatched": ["shared_failing_domain", "error_pattern"],
    "scope": "external_symptoms_only"
  }
}

scope: external_symptoms_only is the honesty baseline. Yorker can prove the external symptom — users cannot reach checkout — and can rule out classes of causes it directly measured (DNS, TLS, shared failing domains). It cannot see your backend logs, so it never claims the backend is the culprit.

Dedupe + rate limiting

  • 30s dedupe window — a retry firing the same event to the same channel within 30 seconds is recorded as skipped_dedupe in incident_notification_dispatches, not sent again.
  • 1-per-minute note rate limit — per (channel, incident), a second note_added within 60 seconds of a prior send attempt (successful or failed) is recorded as skipped_rate_limit. Failed attempts count because each one still hit the upstream endpoint — a flaky webhook returning 5xx must not leak a retry burst past the cap. Prevents an operator running a backfill script from spamming hundreds of notes.

Both checks fail open on database errors — losing a notification is worse than double-sending one.
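The two guards can be sketched like this, assuming an in-memory map in place of incident_notification_dispatches and millisecond timestamps. The real checks are database-backed (and fail open on errors); everything here beyond the 30 s and 60 s windows is illustrative:

```typescript
// Toy model of the dedupe window and the per-(channel, incident) note
// rate limit. Keys and return values are illustrative.
type SkipReason = "skipped_dedupe" | "skipped_rate_limit" | null;

interface DispatchLog {
  lastSendByEventKey: Map<string, number>; // (channel, incident, event) -> last attempt ms
  lastNoteAttempt: Map<string, number>;    // (channel, incident) -> last note_added attempt ms
}

function checkGuards(
  log: DispatchLog,
  channel: string,
  incident: string,
  event: string,
  now: number
): SkipReason {
  const eventKey = `${channel}:${incident}:${event}`;
  const lastSend = log.lastSendByEventKey.get(eventKey);
  if (lastSend !== undefined && now - lastSend < 30_000) {
    return "skipped_dedupe"; // same event, same channel, within 30 s
  }
  if (event === "note_added") {
    const noteKey = `${channel}:${incident}`;
    const lastNote = log.lastNoteAttempt.get(noteKey);
    if (lastNote !== undefined && now - lastNote < 60_000) {
      return "skipped_rate_limit"; // counts prior failed attempts too
    }
    log.lastNoteAttempt.set(noteKey, now);
  }
  log.lastSendByEventKey.set(eventKey, now); // record this attempt
  return null; // dispatch proceeds
}

const log: DispatchLog = { lastSendByEventKey: new Map(), lastNoteAttempt: new Map() };
console.log(checkGuards(log, "slack", "inc_1", "note_added", 0));      // null
console.log(checkGuards(log, "slack", "inc_1", "note_added", 45_000)); // skipped_rate_limit
```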

User-editable templates

Every channel's default payload can be overridden with a Handlebars template attached to the notification channel. The rendering context matches serializeIncidentEventForExport plus a few helpers (severityEmoji, eventEmoji, join, ifHasSource, jsonBody).

A render error or JSON-parse failure on the override falls back to the default and logs — a bad template never fails dispatch.
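The fallback rule can be sketched as follows, with a toy stand-in for the Handlebars renderer. renderWithFallback and its signature are hypothetical; only the behavior — bad override falls back to the default, dispatch never fails — comes from the docs:

```typescript
// Render an override template, validating the result parses as JSON;
// any failure logs and falls back to the channel's default payload.
function renderWithFallback(
  override: string | undefined,
  context: Record<string, unknown>,
  renderDefault: (ctx: Record<string, unknown>) => string
): string {
  if (override !== undefined) {
    try {
      const rendered = renderTemplate(override, context);
      JSON.parse(rendered); // the outbound payload must be valid JSON
      return rendered;
    } catch (err) {
      console.warn("template override failed, using default", err);
    }
  }
  return renderDefault(context);
}

// Toy stand-in for a template engine: replaces {{key}} with context[key].
function renderTemplate(tpl: string, ctx: Record<string, unknown>): string {
  return tpl.replace(/\{\{(\w+)\}\}/g, (_, k) => String(ctx[k] ?? ""));
}

const bad = '{"text": "{{title}}'; // unbalanced JSON -> parse fails
const out = renderWithFallback(bad, { title: "API down" }, () =>
  JSON.stringify({ text: "default" })
);
console.log(out); // {"text":"default"}
```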

Template overrides are sent via the notification-channel API:

curl -X PUT https://yorkermonitoring.com/api/notification-channels/nch_abc \
  -H "Authorization: Bearer $YORKER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "incidentTemplate": {
      "channelType": "slack",
      "overrides": {
        "opened": {
          "blocks": "{\"blocks\":[{\"type\":\"section\",\"text\":{\"type\":\"mrkdwn\",\"text\":\"{{severityEmoji incident.severity}} {{incident.title}}\"}}]}"
        }
      }
    }
  }'

To disable a channel from receiving incident events (fall back to legacy per-alert dispatch), set incidentSubscribed: false on the channel.

Audit trail

Every dispatch writes one row to incident_notification_dispatches with status sent, skipped_dedupe, skipped_rate_limit, skipped_not_routed, or failed, plus any channel-specific response payload (PagerDuty dedup_key, ServiceNow sys_id). This is the source of truth for "did we actually notify?" — the UI will expose it in a later iteration.