Alerting Rules
Alerting is a critical part of any monitoring setup. The PCA exam covers both Prometheus alerting rules and the basics of Alertmanager configuration.
Architecture
Prometheus alerting works in two stages:
- Prometheus server evaluates alerting rules and sends firing alerts to Alertmanager
- Alertmanager handles deduplication, grouping, routing, and notification delivery
Prometheus ──(alerts)──> Alertmanager ──(notifications)──> Email/Slack/PagerDuty
Alerting Rules
Alerting rules are defined in rule files and loaded via the Prometheus configuration:
    # prometheus.yml
    rule_files:
      - "rules/*.yml"
Rule File Format
    groups:
      - name: example-alerts
        interval: 30s  # Override evaluation interval (optional)
        rules:
          - alert: HighRequestLatency
            expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High request latency on {{ $labels.instance }}"
              description: "95th percentile latency is {{ $value }}s (threshold: 0.5s)"
Rule Fields
| Field | Description |
|---|---|
| `alert` | Alert name; it becomes the `alertname` label and must be a valid label value |
| `expr` | PromQL expression to evaluate; every series it returns becomes an active alert |
| `for` | Duration the expression must stay true before the alert fires |
| `labels` | Additional labels to attach to the alert |
| `annotations` | Informational labels (summary, description) for notifications |
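To see all five fields working together, here is a minimal sketch of a second rule. The alert name `HighErrorRate`, the `status` label on `http_requests_total`, and the 5% threshold are assumptions chosen for illustration, not values from any particular setup. `humanizePercentage` is one of Prometheus's built-in template functions.

```yaml
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate            # hypothetical alert name
        # Ratio of 5xx responses to all responses, per job.
        # Assumes the metric carries a `status` label.
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical            # used later for routing in Alertmanager
        annotations:
          summary: "High 5xx rate for job {{ $labels.job }}"
          description: "Error ratio is {{ $value | humanizePercentage }} (threshold: 5%)"
```

Because the expression aggregates by `job`, only the `job` label is available in `$labels`; labels dropped by the aggregation, such as `instance`, would render as empty.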
The for Clause
The for clause prevents alerts from firing on brief spikes:
- Pending: Expression is true but the `for` duration has not elapsed
- Firing: Expression has been true for the entire `for` duration
- Resolved: Expression is no longer true
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Instance {{ $labels.instance }} is down"
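This alert fires only after `up == 0` has held at every rule evaluation for 5 consecutive minutes. If the target recovers sooner, the alert returns to inactive without ever firing.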
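The pending-to-firing transition can be checked with a `promtool` unit test. The sketch below assumes the `InstanceDown` rule above is saved inside a rule group in a file named `rules/instance.yml`, and the job and instance names are made up for the test. It could be run with `promtool test rules <test-file>`.

```yaml
# instance_test.yml (hypothetical file name)
rule_files:
  - rules/instance.yml        # assumed location of the InstanceDown rule

evaluation_interval: 1m

tests:
  - interval: 1m              # spacing between the sample values below
    input_series:
      # Up at t=0, then down from t=1m onward.
      - series: 'up{job="node", instance="host1:9100"}'
        values: '1 0 0 0 0 0 0 0'
    alert_rule_test:
      # At 3m the target has been down for only 2m: the alert is
      # pending, so no firing alerts are expected.
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts: []
      # At 7m the target has been down for 6m, longer than `for: 5m`,
      # so the alert is expected to be firing.
      - eval_time: 7m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host1:9100
            exp_annotations:
              summary: "Instance host1:9100 is down"
```

`exp_alerts` compares against firing alerts only, which is why the pending alert at 3m is expressed as an empty list.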
Alertmanager Configuration
Basic Configuration
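    # alertmanager.yml
    global:
      resolve_timeout: 5m
      # email_configs needs SMTP settings; these two values are placeholders
      smtp_smarthost: "smtp.example.com:587"
      smtp_from: "alertmanager@example.com"

    route:
      receiver: "default"
      group_by: ["alertname", "job"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h

    receivers:
      - name: "default"
        email_configs:
          - to: "team@example.com"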
Routing Tree
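Alertmanager routes alerts through a tree. Every alert enters at the top-level route, which matches everything, and is then checked against the child routes in the order they are listed. The first matching child handles the alert (unless that route sets `continue: true`); if no child matches, the alert goes to the top-level receiver: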
    route:
      receiver: "default"
      routes:
        - match:
            severity: critical
          receiver: "pagerduty"
        - match:
            severity: warning
          receiver: "slack"
        - match_re:
            service: "^(web|api)$"
          receiver: "team-backend"
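Since Alertmanager 0.22, `match` and `match_re` are deprecated in favor of the unified `matchers` list, so it may help to recognize both forms. The sketch below rewrites part of the tree with `matchers` and adds `continue: true` to show how one alert can reach two receivers. The receivers are declared by name only, with no integrations, to keep the fragment short.

```yaml
# alertmanager.yml (routing only; receivers have no integrations configured)
route:
  receiver: "default"
  routes:
    # Critical alerts go to the pager, then matching continues with siblings.
    - matchers:
        - severity="critical"
      receiver: "pagerduty"
      continue: true
    # Because of `continue: true` above, a critical alert that also has
    # service="web" or service="api" is delivered here as well.
    - matchers:
        - service=~"web|api"
      receiver: "team-backend"

receivers:
  - name: "default"
  - name: "pagerduty"
  - name: "team-backend"
```

Without `continue: true`, a critical alert from the `web` service would stop at the first route and never reach `team-backend`.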
Key Parameters
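| Parameter | Description |
|---|---|
| `group_by` | Labels used to batch alerts into a single notification (reduces notification noise) |
| `group_wait` | How long to wait before sending the first notification for a new group |
| `group_interval` | How long to wait before notifying about new alerts added to a group that has already been notified |
| `repeat_interval` | How long to wait before re-sending a notification for alerts that are still firing |
| `continue` | If `true`, keep matching subsequent sibling routes after this route matches (default: `false`) |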
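The three timing parameters are easy to confuse, so here is a rough timeline written as comments on the route from the basic configuration. The arrival times are invented for illustration, and real notification times are approximate because they depend on when Alertmanager flushes the group.

```yaml
# alertmanager.yml (route section only)
route:
  receiver: "default"
  group_by: ["alertname", "job"]

  # t = 0s: the first alert of a new group arrives.
  # The first notification goes out at about t = 30s, so alerts that
  # arrive inside that window are batched into the same message.
  group_wait: 30s

  # t = 2m: another alert joins the group, which has already been notified.
  # It is included in the next notification, roughly 5m after the
  # previous one (around t = 5m30s).
  group_interval: 5m

  # If the group does not change and its alerts are still firing, the
  # notification is re-sent once at least 4h have passed since the last one.
  repeat_interval: 4h
```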
Inhibition Rules
Suppress alerts when related alerts are firing:
    inhibit_rules:
      - source_match:
          severity: "critical"
        target_match:
          severity: "warning"
        equal: ["alertname", "instance"]
This suppresses warning alerts when a critical alert with the same alertname and instance is firing.
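Newer Alertmanager releases (0.22 and later) express the same rule with `source_matchers` and `target_matchers`; `source_match` and `target_match` still work but are deprecated. A sketch of the equivalent rule in the newer syntax:

```yaml
inhibit_rules:
  # While an alert matching the source matchers is firing ...
  - source_matchers:
      - severity="critical"
    # ... mute alerts matching the target matchers ...
    target_matchers:
      - severity="warning"
    # ... but only when both alerts share the same values for these labels.
    equal: ["alertname", "instance"]
```

The `equal` list is what keeps the rule narrow: a critical alert on one instance does not mute warnings on a different instance.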
Silences
Silences mute alerts for a given time period. They are managed through the Alertmanager web UI or API, not through configuration files.
Recording Rules
Recording rules precompute frequently used or expensive expressions:
    groups:
      - name: request-rates
        rules:
          - record: job:http_requests_total:rate5m
            expr: sum(rate(http_requests_total[5m])) by (job)
Naming convention: level:metric:operations (e.g., job:http_requests_total:rate5m)
Recording rules:
- Reduce query latency for dashboards
- Are evaluated at the global `evaluation_interval` (or the group's `interval`, if set)
- Store results as new time series
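Because the result is stored as an ordinary time series, an alerting rule can query it by name instead of repeating the expensive expression. In the sketch below, the alert name `NoIncomingRequests`, its threshold, and its duration are assumptions for illustration.

```yaml
groups:
  - name: request-rates
    rules:
      # Precompute the per-job request rate once per evaluation.
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Hypothetical alert that reuses the recorded series.
      # It is listed after the recording rule, so within this group
      # it is evaluated after the recorded value has been updated.
      - alert: NoIncomingRequests
        expr: job:http_requests_total:rate5m == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} has received no requests for 10 minutes"
```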
Connecting Prometheus to Alertmanager
    # prometheus.yml
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["alertmanager:9093"]
Key Exam Tips
- for clause: An alert without `for` fires immediately when the expression is true. With `for`, it first enters "pending" state.
- Alertmanager grouping: `group_by` is critical for reducing notification noise. Group by `alertname` at minimum.
- Recording rules vs alerting rules: Recording rules use `record`, alerting rules use `alert`. Both live in rule files.
- Template variables: In annotations, use `{{ $labels.labelname }}` for labels and `{{ $value }}` for the expression value.
- Resolve notifications: Alertmanager sends a resolved notification when an alert stops firing, but only for receivers that have `send_resolved: true`. `resolve_timeout` applies only to alerts that arrive without an end time; Prometheus always sets one, so the setting has no effect on alerts sent by Prometheus.
- Rule evaluation order: Rules within a group are evaluated sequentially. Groups are evaluated concurrently.