Incidents

How Yorker groups correlated alerts into incidents, tracks their lifecycle, and dispatches opinionated, investigator-grade notifications.

A Yorker incident is a correlated group of alerts treated as one investigable unit. Each incident has a fingerprint, a severity, a lifecycle, and a notification policy. Incidents reduce noise by collapsing many alerts into one ticket and by emitting structured, investigator-grade payloads to your channels.

Why incidents exist

A single alert answers "is this check failing right now?" It does not answer the question an on-call engineer actually needs: what is the blast radius, and is it related to something else that's breaking?

Synthetic monitors often fire in bursts. An upstream DNS provider hiccups and ten HTTP checks page at once. A CDN edge degrades and browser checks across three regions turn red. Without correlation, you get ten pages for one problem.

Yorker groups those alerts into an incident, computes a scoped hypothesis from the observations (HTTP status codes, locations, shared failing domains, symptom timing), and dispatches one ticket per channel per incident — not one per alert.
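A minimal sketch of that grouping, assuming a fingerprint derived from the shared failing domain. The real correlation scores multiple dimensions; these field names are illustrative, not Yorker's actual schema:

```typescript
// Hypothetical sketch: collapse a burst of alerts into incidents keyed
// by one correlation dimension (the shared failing domain).
interface Alert {
  checkId: string;
  failingDomain: string;
  statusCode: number;
  location: string;
}

interface Incident {
  fingerprint: string;
  alerts: Alert[];
}

function groupIntoIncidents(alerts: Alert[]): Incident[] {
  const byFingerprint = new Map<string, Alert[]>();
  for (const alert of alerts) {
    // One plausible fingerprint: the domain every member is failing on.
    const fingerprint = `shared_failing_domain=${alert.failingDomain}`;
    const group = byFingerprint.get(fingerprint) ?? [];
    group.push(alert);
    byFingerprint.set(fingerprint, group);
  }
  return [...byFingerprint.entries()].map(([fingerprint, members]) => ({
    fingerprint,
    alerts: members,
  }));
}

// Ten alerts against one upstream collapse into a single incident.
const alerts: Alert[] = Array.from({ length: 10 }, (_, i) => ({
  checkId: `chk_${i}`,
  failingDomain: "api.stripe.com",
  statusCode: 503,
  location: "us-east-1",
}));
const incidents = groupIntoIncidents(alerts);
console.log(incidents.length); // 1
```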

The incident lifecycle

Every incident moves through a small set of states. Each state transition is recorded as a first-class event and dispatched to subscribed channels.

  • open — correlated alerts above the score threshold
  • acknowledged — a user clicks "Acknowledge" in the dashboard or API
  • auto_resolved — all member alerts recovered and the 15-minute cool-down elapsed
  • closed — a user closes the incident explicitly
  • reopened — a user reopens a previously closed/resolved incident

The transient transition reopened → open is preserved in the event log so downstream consumers can replay the exact sequence.
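The lifecycle can be pictured as an explicit transition table. The state names come from the table above; the exact set of allowed transitions is an assumption for illustration:

```typescript
// State names are from the docs; which transitions are legal is an
// illustrative guess, not Yorker's authoritative state machine.
type IncidentState =
  | "open"
  | "acknowledged"
  | "auto_resolved"
  | "closed"
  | "reopened";

const transitions: Record<IncidentState, IncidentState[]> = {
  open: ["acknowledged", "auto_resolved", "closed"],
  acknowledged: ["auto_resolved", "closed"],
  auto_resolved: ["closed", "reopened"],
  closed: ["reopened"],
  reopened: ["open"], // transient: a reopen immediately re-enters open
};

function canTransition(from: IncidentState, to: IncidentState): boolean {
  return transitions[from].includes(to);
}

console.log(canTransition("closed", "reopened")); // true
console.log(canTransition("closed", "acknowledged")); // false
```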

Event types

Every lifecycle transition emits one of these events. Every event carries the full observations + hypothesis snapshot so a consumer replaying one event has complete context without querying back.

  • opened — new incident created
  • alert_attached — an additional alert joined an active incident
  • severity_changed — severity escalated or de-escalated
  • acknowledged — a user took ownership
  • note_added — a user added a freeform note
  • auto_resolved — all members recovered and cool-down elapsed
  • closed — a user closed it
  • reopened — a user reopened a previously resolved incident

Each event is persisted to incident_events, emitted as an OTel log record (if an OTLP endpoint is configured for the team), and dispatched to every channel subscribed to incidents for the team.

Default notification routing

Different channel types have different sensible defaults. Yorker opts into the minimum-noise routing that matches each channel's audience:

  • Slack — every lifecycle event (timeline-style thread)
  • Email — opened, auto_resolved, closed only (inboxes should not be a running timeline)
  • Webhook — every lifecycle event
  • PagerDuty — opened, acknowledged, auto_resolved, closed, reopened, note_added
  • ServiceNow — opened, severity_changed, acknowledged, auto_resolved, closed, note_added

PagerDuty skips severity_changed because the Events API v2 has no matching action. ServiceNow skips reopened because Yorker's reopen semantics don't map cleanly to ServiceNow's reopen concept — a Yorker "reopen" after a recurrence creates a new external ticket rather than mutating the old one.
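The routing defaults above can be expressed as plain data; a dispatcher might consult them with a helper like this (the helper and its names are illustrative, not Yorker's internals):

```typescript
// Default event routing per channel type, as data. Channel and event
// names come from the docs; routedByDefault is a hypothetical helper.
type EventType =
  | "opened"
  | "alert_attached"
  | "severity_changed"
  | "acknowledged"
  | "note_added"
  | "auto_resolved"
  | "closed"
  | "reopened";

const ALL_EVENTS: EventType[] = [
  "opened", "alert_attached", "severity_changed", "acknowledged",
  "note_added", "auto_resolved", "closed", "reopened",
];

const defaultRouting: Record<string, EventType[]> = {
  slack: ALL_EVENTS,
  webhook: ALL_EVENTS,
  email: ["opened", "auto_resolved", "closed"],
  pagerduty: ["opened", "acknowledged", "auto_resolved", "closed", "reopened", "note_added"],
  servicenow: ["opened", "severity_changed", "acknowledged", "auto_resolved", "closed", "note_added"],
};

function routedByDefault(channel: string, event: EventType): boolean {
  return defaultRouting[channel]?.includes(event) ?? false;
}

console.log(routedByDefault("email", "note_added")); // false
console.log(routedByDefault("pagerduty", "severity_changed")); // false
console.log(routedByDefault("slack", "alert_attached")); // true
```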

See the Slack, PagerDuty, ServiceNow, Email, and Webhook integration pages for the exact payload shapes.

Scoped hypothesis

Every outbound incident payload carries a hypothesis block that tells the reader what Yorker thinks is going on — scoped to what an external synthetic sensor can prove:

{
  "hypothesis": {
    "summary": "Stripe API is returning 503/504; checkout is blocked.",
    "confidence": 0.75,
    "ruledIn": ["shared_failing_domain=api.stripe.com"],
    "ruledOut": [
      "DNS resolution: NXDOMAIN not observed",
      "TLS: handshake completes"
    ],
    "correlationDimensionsMatched": ["shared_failing_domain", "error_pattern"],
    "scope": "external_symptoms_only"
  }
}

scope: external_symptoms_only is the honesty baseline. Yorker can prove the external symptom — users cannot reach checkout — and can rule out classes of causes it directly measured (DNS, TLS, shared failing domains). It cannot see your backend logs, so it never claims the backend is the culprit.

Dedupe + rate limiting

  • 30s dedupe window — a retry firing the same event to the same channel within 30 seconds is recorded as skipped_dedupe in incident_notification_dispatches, not sent again.
  • 1-per-minute note rate limit — per (channel, incident), a second note_added within 60 seconds of a prior send attempt (successful or failed) is recorded as skipped_rate_limit. Failed attempts count because each one still hit the upstream endpoint — a flaky webhook returning 5xx must not leak a retry burst past the cap. Prevents an operator running a backfill script from spamming hundreds of notes.

Both checks fail open on database errors — losing a notification is worse than double-sending one.
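The two guards can be sketched like this, assuming an in-memory map in place of incident_notification_dispatches and millisecond timestamps. The real checks are database-backed (and fail open on errors); everything here beyond the 30 s and 60 s windows is illustrative:

```typescript
// Toy model of the dedupe window and the per-(channel, incident) note
// rate limit. Keys and return values are illustrative.
type SkipReason = "skipped_dedupe" | "skipped_rate_limit" | null;

interface DispatchLog {
  lastSendByEventKey: Map<string, number>; // (channel, incident, event) -> last attempt ms
  lastNoteAttempt: Map<string, number>;    // (channel, incident) -> last note_added attempt ms
}

function checkGuards(
  log: DispatchLog,
  channel: string,
  incident: string,
  event: string,
  now: number
): SkipReason {
  const eventKey = `${channel}:${incident}:${event}`;
  const lastSend = log.lastSendByEventKey.get(eventKey);
  if (lastSend !== undefined && now - lastSend < 30_000) {
    return "skipped_dedupe"; // same event, same channel, within 30 s
  }
  if (event === "note_added") {
    const noteKey = `${channel}:${incident}`;
    const lastNote = log.lastNoteAttempt.get(noteKey);
    if (lastNote !== undefined && now - lastNote < 60_000) {
      return "skipped_rate_limit"; // counts prior failed attempts too
    }
    log.lastNoteAttempt.set(noteKey, now);
  }
  log.lastSendByEventKey.set(eventKey, now); // record this attempt
  return null; // dispatch proceeds
}

const log: DispatchLog = { lastSendByEventKey: new Map(), lastNoteAttempt: new Map() };
console.log(checkGuards(log, "slack", "inc_1", "note_added", 0));      // null
console.log(checkGuards(log, "slack", "inc_1", "note_added", 45_000)); // skipped_rate_limit
```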

User-editable templates

Every channel's default payload can be overridden with a Handlebars template attached to the notification channel. The rendering context matches serializeIncidentEventForExport plus a few helpers (severityEmoji, eventEmoji, join, ifHasSource, jsonBody).

A render error or JSON-parse failure on the override falls back to the default and logs — a bad template never fails dispatch.
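The fallback rule can be sketched as follows, with a toy stand-in for the Handlebars renderer. renderWithFallback and its signature are hypothetical; only the behavior — bad override falls back to the default, dispatch never fails — comes from the docs:

```typescript
// Render an override template, validating the result parses as JSON;
// any failure logs and falls back to the channel's default payload.
function renderWithFallback(
  override: string | undefined,
  context: Record<string, unknown>,
  renderDefault: (ctx: Record<string, unknown>) => string
): string {
  if (override !== undefined) {
    try {
      const rendered = renderTemplate(override, context);
      JSON.parse(rendered); // the outbound payload must be valid JSON
      return rendered;
    } catch (err) {
      console.warn("template override failed, using default", err);
    }
  }
  return renderDefault(context);
}

// Toy stand-in for a template engine: replaces {{key}} with context[key].
function renderTemplate(tpl: string, ctx: Record<string, unknown>): string {
  return tpl.replace(/\{\{(\w+)\}\}/g, (_, k) => String(ctx[k] ?? ""));
}

const bad = '{"text": "{{title}}'; // unbalanced JSON -> parse fails
const out = renderWithFallback(bad, { title: "API down" }, () =>
  JSON.stringify({ text: "default" })
);
console.log(out); // {"text":"default"}
```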

Template overrides are sent via the notification-channel API:

curl -X PUT https://yorkermonitoring.com/api/notification-channels/nch_abc \
  -H "Authorization: Bearer $YORKER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "incidentTemplate": {
      "channelType": "slack",
      "overrides": {
        "opened": {
          "blocks": "{\"blocks\":[{\"type\":\"section\",\"text\":{\"type\":\"mrkdwn\",\"text\":\"{{severityEmoji incident.severity}} {{incident.title}}\"}}]}"
        }
      }
    }
  }'

To disable a channel from receiving incident events (fall back to legacy per-alert dispatch), set incidentSubscribed: false on the channel.

Audit trail

Every dispatch writes one row to incident_notification_dispatches with status sent, skipped_dedupe, skipped_rate_limit, skipped_not_routed, or failed, plus any channel-specific response payload (PagerDuty dedup_key, ServiceNow sys_id). This is the source of truth for "did we actually notify?" — the UI will expose it in a later iteration.