# Alert Correlation
How multi-location correlation and OTel trace linking reduce noise and speed up root cause analysis.
Synthetic monitors generate a lot of signals. Not every failure is a real outage -- network glitches, regional ISP issues, and transient errors produce false positives. Yorker uses multi-location correlation and consecutive failure thresholds to separate real incidents from noise, and OTel trace linking to get you from alert to root cause in one click.
## The noise problem
A single-location failure usually means nothing. A DNS resolver in Frankfurt hiccups for 200ms. A CDN edge node in Sydney drops a connection. If you alert on every individual failure, you get paged for problems your users never notice.
The question is not "did one check fail?" but "is the service actually down?"
## Multi-location correlation
The `multi_location_failure` condition answers that question. It requires N of M monitoring locations to report failure within a time window before triggering an alert.
For example, if your check runs from 6 locations and you configure `minLocations: 3`, the alert only fires when at least 3 locations fail in the same window. A single location flaking does not page you.
```yaml
alerts:
  - name: Homepage Down
    conditions:
      - type: multi_location_failure
        minLocations: 3
    channels:
      - "@pagerduty-oncall"
```

This eliminates geographic noise. If only Tokyo fails while Ashburn, London, Frankfurt, Singapore, and Sydney are all passing, the problem is regional -- not an outage.
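The N-of-M evaluation can be sketched in a few lines. This is a hypothetical illustration, not Yorker's actual implementation; the function name `should_alert` and the `(location, passed)` tuple shape are assumptions for the example.

```python
def should_alert(results, min_locations):
    """Return True when at least min_locations distinct locations
    reported a failure within the same evaluation window.

    `results` is a list of (location, passed) tuples for one window.
    """
    failed = {loc for loc, passed in results if not passed}
    return len(failed) >= min_locations

# One window of checks from 6 locations: only Tokyo failed.
window = [("ashburn", True), ("london", True), ("frankfurt", True),
          ("singapore", True), ("sydney", True), ("tokyo", False)]
should_alert(window, min_locations=3)  # → False: a single-location flake
```

Counting distinct failing locations (a set, not a tally of failed checks) is what makes one flapping region unable to trip the threshold on its own.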
## Consecutive failure thresholds
The `consecutive_failures` condition handles a different class of noise: transient blips. A single timeout or 503 that resolves on the next check interval is not worth alerting on.
```yaml
alerts:
  - name: API Degraded
    conditions:
      - type: consecutive_failures
        count: 5
    channels:
      - "@ops-slack"
```

This alert only fires after 5 checks in a row fail. A one-off timeout is silently recorded in the check history but does not trigger a notification.
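The threshold logic amounts to checking whether the most recent run of results is an unbroken streak of failures. A minimal sketch, assuming a hypothetical `fires_after` helper and a boolean check history:

```python
def fires_after(history, count):
    """True when the most recent `count` check results are all failures.

    `history` is oldest-to-newest booleans (True = check passed).
    """
    if len(history) < count:
        return False
    return all(not passed for passed in history[-count:])

# A one-off timeout mid-history does not fire; a 5-long failing
# streak at the tail does.
fires_after([True, False, True, True, True], 5)                        # → False
fires_after([True, False, True, False, False, False, False, False], 5)  # → True
```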
## Multi-tier alerting
Combine both conditions to build alert tiers that match your incident response workflow:
```yaml
alerts:
  # Critical: multiple locations confirm the outage
  - name: Service Outage
    conditions:
      - type: multi_location_failure
        minLocations: 3
    channels:
      - "@pagerduty-oncall"

  # Warning: persistent failures from any location
  - name: Service Degraded
    conditions:
      - type: consecutive_failures
        count: 5
    channels:
      - "@ops-slack"

  # Info: SSL certificate expiring soon
  - name: SSL Expiry Warning
    conditions:
      - type: ssl_expiry
        daysBeforeExpiry: 14
    channels:
      - "@on-call-email"
```

Critical alerts go to PagerDuty because multiple locations confirm the service is down. Warning alerts go to Slack because the issue is persistent but might be localized. Info alerts go to email for non-urgent action items.
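The tiering logic reduces to a severity-ordered evaluation: the first matching condition wins. A hypothetical sketch (the `classify` function and its argument names are illustrative, not part of Yorker's API):

```python
def classify(window_failures, consecutive_failures,
             min_locations=3, count=5):
    """Pick the highest-severity tier that matches, mirroring the
    config above: a multi-location outage outranks persistent
    degradation from a single location.
    """
    if window_failures >= min_locations:
        return "Service Outage"     # routed to @pagerduty-oncall
    if consecutive_failures >= count:
        return "Service Degraded"   # routed to @ops-slack
    return None                     # no notification

classify(window_failures=4, consecutive_failures=2)  # → "Service Outage"
classify(window_failures=1, consecutive_failures=6)  # → "Service Degraded"
```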
## OTel trace linking
When a check fails, the trace ID from that execution links directly to the distributed trace in your observability backend. The flow looks like this:
1. The runner executes the check and injects a `traceparent` header.
2. Your backend processes the request and records the trace.
3. The check fails (assertion failure, timeout, 5xx response).
4. Yorker creates an alert with the trace ID attached.
5. You click the trace link in the alert notification.
6. Your observability backend shows the full distributed trace: the synthetic request, your API handler, the database query that timed out, the error.
This collapses the "what broke?" investigation from minutes of log searching to a single click. The synthetic check and the backend error are part of the same trace.
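The `traceparent` header follows the W3C Trace Context format: `version-trace_id-parent_id-flags`. A minimal sketch of generating one, as a runner might before sending the synthetic request (the `make_traceparent` helper is illustrative, not Yorker's API):

```python
import secrets

def make_traceparent():
    """Build a W3C Trace Context `traceparent` header value:
    version-trace_id-parent_id-flags, e.g.
    00-<32 hex chars>-<16 hex chars>-01 (01 = sampled).
    """
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes  -> 16 hex chars
    return f"00-{trace_id}-{parent_id}-01", trace_id

header, trace_id = make_traceparent()
# The runner sends `traceparent: <header>` with the synthetic request;
# storing `trace_id` alongside the alert is what makes the one-click
# deep link into the observability backend possible.
```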
The alerts dashboard shows all active, acknowledged, and recovered alerts across your monitors.
## Alert lifecycle
Alerts follow a state machine:
| State | Meaning |
|---|---|
| ACTIVE | The alert condition is met. Notifications have been sent. |
| ACKNOWLEDGED | A team member has acknowledged the alert. No repeat notifications. |
| RESOLVED | A team member manually resolved the alert. |
| RECOVERED | The check started passing again. The alert auto-resolves. |
When a check that triggered an ACTIVE alert starts passing again, the alert transitions to RECOVERED and a recovery notification is sent to the same channels. This closes the loop without manual intervention.
Acknowledged alerts suppress repeat notifications but remain visible in the dashboard until the underlying issue is resolved or the check recovers.
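The lifecycle above can be sketched as a small transition table. This is a hypothetical model of the documented states, not Yorker's internal code; events that do not apply leave the state unchanged:

```python
# (current state, event) -> next state, per the table above.
TRANSITIONS = {
    ("ACTIVE", "acknowledge"): "ACKNOWLEDGED",
    ("ACTIVE", "resolve"): "RESOLVED",
    ("ACTIVE", "check_passes"): "RECOVERED",
    ("ACKNOWLEDGED", "resolve"): "RESOLVED",
    ("ACKNOWLEDGED", "check_passes"): "RECOVERED",
}

def transition(state, event):
    """Return the next alert state, or the current state unchanged if
    the event does not apply (e.g. acknowledging a resolved alert).
    """
    return TRANSITIONS.get((state, event), state)

transition("ACTIVE", "acknowledge")        # → "ACKNOWLEDGED"
transition("ACKNOWLEDGED", "check_passes") # → "RECOVERED"
```

Note that both ACTIVE and ACKNOWLEDGED can reach RECOVERED: acknowledging suppresses repeat notifications, but the auto-resolve path still closes the loop when the check passes again.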