How Yorker groups correlated alerts into incidents, tracks their lifecycle, and dispatches opinionated, investigator-grade notifications.
Incidents
A Yorker incident is a correlated group of alerts treated as one investigable unit. Each incident has a fingerprint, a severity, a lifecycle, and a notification policy. Incidents reduce noise by collapsing many alerts into one ticket and by emitting structured, investigator-grade payloads to your channels.
Why incidents exist
A single alert answers "is this check failing right now?" It does not answer the question an on-call engineer actually needs: what is the blast radius, and is it related to something else that's breaking?
Synthetic monitors often fire in bursts. An upstream DNS provider hiccups and ten HTTP checks page at once. A CDN edge degrades and browser checks across three regions turn red. Without correlation, you get ten pages for one problem.
Yorker groups those alerts into an incident, computes a scoped hypothesis from the observations (HTTP status codes, locations, shared failing domains, symptom timing), and dispatches one ticket per channel per incident — not one per alert.
The incident lifecycle
Every incident moves through a small set of states. Each state transition is recorded as a first-class event and dispatched to subscribed channels.
| State | Entered by |
|---|---|
| `open` | Correlated alerts above the score threshold |
| `acknowledged` | A user clicks "Acknowledge" in the dashboard or API |
| `auto_resolved` | All member alerts recovered and the 15-minute cool-down elapsed |
| `closed` | A user closes the incident explicitly |
| `reopened` | A user reopens a previously closed/resolved incident |
The transient `reopened` → `open` transition is preserved in the event log so downstream consumers can replay the exact sequence.
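The table above implies a small state machine. A minimal sketch follows; the exact set of allowed transitions is an assumption inferred from the table, not Yorker's actual code:

```python
# Illustrative transition map inferred from the lifecycle table.
# The allowed-transition sets are assumptions, not Yorker's real rules.
VALID_TRANSITIONS = {
    "open": {"acknowledged", "auto_resolved", "closed"},
    "acknowledged": {"auto_resolved", "closed"},
    "auto_resolved": {"closed", "reopened"},
    "closed": {"reopened"},
    "reopened": {"open"},  # transient: a reopen immediately re-enters "open"
}

def can_transition(current: str, target: str) -> bool:
    """Reject lifecycle transitions not listed in the map."""
    return target in VALID_TRANSITIONS.get(current, set())
```

Making the map explicit is what lets the event log replay exactly: every persisted event corresponds to one edge in this graph.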
Event types
Every lifecycle transition emits one of these events. Every event carries the full observations + hypothesis snapshot so a consumer replaying one event has complete context without querying back.
- `opened` — new incident created
- `alert_attached` — an additional alert joined an active incident
- `severity_changed` — severity escalated or de-escalated
- `acknowledged` — a user took ownership
- `note_added` — a user added a freeform note
- `auto_resolved` — all members recovered and cool-down elapsed
- `closed` — a user closed it
- `reopened` — a user reopened a previously resolved incident
Each event is persisted to `incident_events`, emitted as an OTel log record (if an OTLP endpoint is configured for the team), and dispatched to every channel subscribed to incidents for the team.
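Because every event carries the full snapshot, a consumer can be written as a pure function of the event body, with no follow-up queries. The field names `type`, `incident`, `title`, and `hypothesis` below are illustrative assumptions, not the documented payload shape:

```python
# Hedged sketch of a self-contained event consumer. Field names are
# assumptions for illustration; see the integration pages for real payloads.
INCIDENT_EVENT_TYPES = {
    "opened", "alert_attached", "severity_changed", "acknowledged",
    "note_added", "auto_resolved", "closed", "reopened",
}

def handle_incident_event(event: dict) -> str:
    kind = event["type"]
    if kind not in INCIDENT_EVENT_TYPES:
        return "ignored"          # forward-compatible: skip unknown types
    incident = event["incident"]  # full snapshot travels with every event
    hypothesis = incident.get("hypothesis", {})
    return f"{kind}: {hypothesis.get('summary', incident.get('title', ''))}"
```

Skipping unknown types keeps the consumer working if Yorker adds event types later.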
Default notification routing
Different channel types have different sensible defaults. Yorker opts into the minimum-noise routing that matches each channel's audience:
| Channel | Receives by default |
|---|---|
| Slack | Every lifecycle event (timeline-style thread) |
| Email | `opened`, `auto_resolved`, `closed` only (inboxes should not be a running timeline) |
| Webhook | Every lifecycle event |
| PagerDuty | opened, acknowledged, auto_resolved, closed, reopened, note_added |
| ServiceNow | opened, severity_changed, acknowledged, auto_resolved, closed, note_added |
PagerDuty skips `severity_changed` because the Events API v2 has no matching action. ServiceNow skips `reopened` because Yorker's reopen semantics don't map cleanly to ServiceNow's reopen concept — a Yorker "reopen" after a recurrence creates a new external ticket rather than mutating the old one.
See the Slack, PagerDuty, ServiceNow, Email, and Webhook integration pages for the exact payload shapes.
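The defaults above amount to a per-channel routing map. A hedged sketch, where `None` stands for "every lifecycle event" (the map shape and helper are illustrative, not Yorker's API):

```python
# Default routing table inferred from the docs table above.
# None means the channel receives every lifecycle event.
DEFAULT_ROUTING = {
    "slack": None,
    "webhook": None,
    "email": {"opened", "auto_resolved", "closed"},
    "pagerduty": {"opened", "acknowledged", "auto_resolved", "closed",
                  "reopened", "note_added"},
    "servicenow": {"opened", "severity_changed", "acknowledged",
                   "auto_resolved", "closed", "note_added"},
}

def is_routed(channel_type: str, event_type: str) -> bool:
    if channel_type not in DEFAULT_ROUTING:
        return False  # unknown channel types receive nothing in this sketch
    allowed = DEFAULT_ROUTING[channel_type]
    return allowed is None or event_type in allowed
```

An event filtered out here would correspond to a `skipped_not_routed` row in the audit trail.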
Scoped hypothesis
Every outbound incident payload carries a hypothesis block that tells the reader what Yorker thinks is going on — scoped to what an external synthetic sensor can prove:
```json
{
  "hypothesis": {
    "summary": "Stripe API is returning 503/504; checkout is blocked.",
    "confidence": 0.75,
    "ruledIn": ["shared_failing_domain=api.stripe.com"],
    "ruledOut": [
      "DNS resolution: NXDOMAIN not observed",
      "TLS: handshake completes"
    ],
    "correlationDimensionsMatched": ["shared_failing_domain", "error_pattern"],
    "scope": "external_symptoms_only"
  }
}
```

`scope: external_symptoms_only` is the honesty baseline. Yorker can prove the external symptom — users cannot reach checkout — and can rule out classes of causes it directly measured (DNS, TLS, shared failing domains). It cannot see your backend logs, so it never claims the backend is the culprit.
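A consumer might use the hypothesis block to gate automated actions on confidence and scope. A sketch under the field names from the payload above; the function name and the 0.7 cutoff are arbitrary assumptions:

```python
# Hypothetical consumer-side policy, not part of Yorker itself.
def should_autopage(hypothesis: dict, min_confidence: float = 0.7) -> bool:
    # Only act on claims the external sensor can actually prove.
    if hypothesis.get("scope") != "external_symptoms_only":
        return False
    # Require at least one positively ruled-in dimension plus confidence.
    return bool(hypothesis.get("ruledIn")) and \
        hypothesis.get("confidence", 0.0) >= min_confidence
```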
Dedupe + rate limiting
- 30s dedupe window — a retry firing the same event to the same channel within 30 seconds is recorded as `skipped_dedupe` in `incident_notification_dispatches`, not sent again.
- 1-per-minute note rate limit — per (channel, incident), a second `note_added` within 60 seconds of a prior send attempt (successful or failed) is recorded as `skipped_rate_limit`. Failed attempts count because each one still hit the upstream endpoint — a flaky webhook returning 5xx must not leak a retry burst past the cap. This prevents an operator running a backfill script from spamming hundreds of notes.
Both checks fail open on database errors — losing a notification is worse than double-sending one.
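The two guards can be sketched together. This in-memory version stands in for the real checks, which query `incident_notification_dispatches`; the check ordering and key shape are assumptions:

```python
# Illustrative in-memory stand-in for the dispatch guards.
DEDUPE_WINDOW_S = 30
NOTE_RATE_LIMIT_S = 60
_attempts: dict = {}  # (channel, incident, event_type) -> list of attempt times

def dispatch_status(channel: str, incident: str, event_type: str, now: float) -> str:
    key = (channel, incident, event_type)
    times = _attempts.setdefault(key, [])
    try:
        if times and now - times[-1] < DEDUPE_WINDOW_S:
            return "skipped_dedupe"
        if event_type == "note_added" and times and now - times[-1] < NOTE_RATE_LIMIT_S:
            return "skipped_rate_limit"
    except Exception:
        pass  # fail open: losing a notification is worse than double-sending
    times.append(now)  # failed sends are recorded too, so they count toward the cap
    return "sent"
```

Recording every attempt, not just successes, is what keeps a 5xx-retrying webhook from leaking a note burst past the cap.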
User-editable templates
Every channel's default payload can be overridden with a Handlebars template attached to the notification channel. The rendering context matches `serializeIncidentEventForExport` plus a few helpers (`severityEmoji`, `eventEmoji`, `join`, `ifHasSource`, `jsonBody`).
A render error or JSON-parse failure in an override falls back to the default payload and logs the error — a bad template never fails dispatch.
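The fallback rule can be sketched as a wrapper around rendering. Here `str.format` stands in for the Handlebars renderer, and `render_payload` is an illustrative name, not Yorker's implementation:

```python
import json
import logging
from typing import Optional

def render_payload(override: Optional[str], context: dict, default: dict) -> dict:
    """Render a template override; any failure falls back to the default."""
    if override is None:
        return default
    try:
        rendered = override.format(**context)  # stand-in for Handlebars render
        return json.loads(rendered)            # JSON-parse failure also falls back
    except Exception:
        logging.warning("template override failed; using default payload")
        return default
```

Catching broadly here is deliberate: the contract is that no template error, of any kind, may block a dispatch.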
Template overrides are sent via the notification-channel API:
```shell
curl -X PUT https://yorkermonitoring.com/api/notification-channels/nch_abc \
  -H "Authorization: Bearer $YORKER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "incidentTemplate": {
      "channelType": "slack",
      "overrides": {
        "opened": {
          "blocks": "{\"blocks\":[{\"type\":\"section\",\"text\":{\"type\":\"mrkdwn\",\"text\":\"{{severityEmoji incident.severity}} {{incident.title}}\"}}]}"
        }
      }
    }
  }'
```

To disable a channel from receiving incident events (falling back to legacy per-alert dispatch), set `incidentSubscribed: false` on the channel.
Audit trail
Every dispatch writes one row to `incident_notification_dispatches` with status `sent`, `skipped_dedupe`, `skipped_rate_limit`, `skipped_not_routed`, or `failed`, plus any channel-specific response payload (PagerDuty `dedup_key`, ServiceNow `sys_id`). This is the source of truth for "did we actually notify?" — the UI will expose it in a later iteration.
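Answering "did we actually notify?" from those rows might look like the following sketch; the row shape (`channel`, `status` keys) is an assumption:

```python
from collections import Counter

def notify_summary(rows: list) -> dict:
    """Count dispatch outcomes per channel for one incident."""
    summary: dict = {}
    for row in rows:
        summary.setdefault(row["channel"], Counter())[row["status"]] += 1
    return summary
```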