- Key functions — rate, irate, increase, histogram_quantile, absent, absent_over_time, delta, predict_linear, age
- Aggregations — sum, count, topk, by/without, quantile_over_time, bool, offset
- Label manipulation — label_replace, label_join
- Useful queries — CPU %, pod restarts, error rate, SLO, memory, network by AZ, nodegroup, cardinality
- Relabeling tricks — drop metrics, series, labels
- Pre-commit validation — promtool, pint
- Architecture — agent mode, VictoriaMetrics, Thanos, Mimir
- Tools
Key functions
rate vs irate
rate calculates per-second increase averaged over the whole range. Use it for alerts and graphs — it’s smoother.
irate only looks at the last two data points. Very responsive but noisy. Avoid for alerting.
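A minimal side-by-side sketch of the two functions. The counter name `http_requests_total` and the 5m window are assumptions for illustration, not taken from any particular setup.

```promql
# Per-second request rate averaged over the last 5 minutes (smooth; suits alerts and graphs)
rate(http_requests_total[5m])

# Per-second rate computed from only the last two samples in the window (spiky)
irate(http_requests_total[5m])
```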
increase — total increase over the range. Only use for display, not for alerts. Use rate for alerts.
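As a rough illustration for a dashboard panel, assuming a counter called `http_requests_total` with a `job` label:

```promql
# Approximate number of requests served during the last hour, per job.
# The result is extrapolated, so it may not be a whole number.
sum by (job) (increase(http_requests_total[1h]))
```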
histogram_quantile: apply rate to the _bucket series first, then aggregate the buckets (keeping the le label), then take the quantile.
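The order of operations could look like this. The histogram name `http_request_duration_seconds_bucket` and the `job` label are assumed; substitute your own.

```promql
# p95 latency per job: rate() innermost, sum by (le, ...) next, quantile outermost
histogram_quantile(
  0.95,
  sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
)
```

Dropping `le` in the aggregation leaves histogram_quantile with no bucket boundaries to work with, so it has to stay in the `by` clause.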
absent — returns 1 with labels from the selector if no time series match. Use it to alert on missing metrics.
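A small sketch of a missing-target check, assuming a scrape job named `node-exporter`:

```promql
# Returns {job="node-exporter"} with value 1 when no matching series exists,
# and an empty result otherwise
absent(up{job="node-exporter"})
```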
predict_linear — least-squares prediction. Good for disk-full alerts.
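One possible disk-full expression, assuming node_exporter's `node_filesystem_avail_bytes`. The 6h lookback and 4h horizon are arbitrary choices for the example.

```promql
# Fires when the linear trend of the last 6h predicts zero free space within 4h
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 4 * 3600) < 0
```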
delta: difference between the first and last value over the interval (it can be negative). For gauges only, not counters.
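For example, with node_exporter's `node_memory_MemAvailable_bytes` gauge (an assumed metric; any gauge works):

```promql
# Change in available memory over the last hour; negative means memory was consumed
delta(node_memory_MemAvailable_bytes[1h])
```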
absent_over_time — like absent, but fires when a range vector is entirely empty. Useful for detecting metric gaps, not just missing series.
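A sketch of the range-based variant, again assuming a job named `node-exporter` and a 10m window:

```promql
# Returns 1 only if the selector produced no samples at all during the last 10 minutes
absent_over_time(up{job="node-exporter"}[10m])
```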
age calculation — how long since something last succeeded:
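A minimal sketch, assuming a hypothetical gauge `backup_last_success_timestamp_seconds` that stores the Unix time of the last successful run:

```promql
# Seconds since the last success; alert if older than 24 hours
time() - backup_last_success_timestamp_seconds > 24 * 3600
```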
Aggregations
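A short sketch of the three basic operators listed in the outline: sum, count and topk. Metric and label names are assumptions for illustration.

```promql
# Total request rate per job
sum by (job) (rate(http_requests_total[5m]))

# Number of healthy targets per job
count by (job) (up == 1)

# Five namespaces using the most CPU
topk(5, sum by (namespace) (rate(container_cpu_usage_seconds_total[5m])))
```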
without excludes listed labels, by keeps only listed labels:
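The two forms might look like this on an assumed `http_requests_total` counter with `job`, `status`, `instance` and `pod` labels:

```promql
# Keep only job and status; every other label is aggregated away
sum by (job, status) (rate(http_requests_total[5m]))

# Drop instance and pod; every other label is kept
sum without (instance, pod) (rate(http_requests_total[5m]))
```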
quantile_over_time: the given quantile of each series' own samples over a past range (per series over time, not across series):
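For instance, assuming node_exporter's `node_load1` gauge:

```promql
# 95th percentile of the 1-minute load average over the past day, one result per series
quantile_over_time(0.95, node_load1[1d])
```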
Boolean filtering — > bool returns 0/1 instead of filtering. Useful in recording rules and arithmetic:
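A sketch using node_exporter memory gauges (assumed metric names, and an arbitrary 10% threshold):

```promql
# 1 for nodes with less than 10% memory available, 0 for all others
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < bool 0.10

# Because every node still yields a value, the results can be summed into a count
sum((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < bool 0.10)
```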
Offset — compare current value against the past. Useful for week-over-week dashboards:
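A possible week-over-week ratio, assuming the counter `http_requests_total`:

```promql
# Current request rate relative to the same moment one week ago (1.0 means unchanged)
sum(rate(http_requests_total[5m]))
  /
sum(rate(http_requests_total[5m] offset 1w))
```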
Label manipulation
label_replace — set a label via regex:
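A minimal sketch that derives a hypothetical `host` label from `instance` by stripping the port:

```promql
# label_replace(vector, destination_label, replacement, source_label, regex)
label_replace(up, "host", "$1", "instance", "([^:]+):.*")
```

If the regex does not match a series, that series is returned unchanged.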
label_join — join multiple label values into one:
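For example, assuming kube-state-metrics' `kube_pod_info`, with `ns_pod` as a made-up destination label:

```promql
# label_join(vector, destination_label, separator, source_label_1, source_label_2, ...)
label_join(kube_pod_info, "ns_pod", "/", "namespace", "pod")
```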
Note: characters that are not valid in Prometheus label names (dots, slashes, hyphens) become underscores in __meta_* labels, so the node label node.kubernetes.io/instance-type becomes __meta_kubernetes_node_label_node_kubernetes_io_instance_type.
Useful queries
Memory: use container_memory_working_set_bytes, not container_memory_rss. The working set is what the OOM killer sees.
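A per-pod sketch, assuming cAdvisor metrics scraped from the kubelet with the usual `namespace`, `pod` and `container` labels:

```promql
# Working-set memory per pod; container!="" skips the pod-level cgroup totals
# so that memory is not counted twice
sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
```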
Network traffic by namespace and AZ:
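One way this could be written, assuming cAdvisor network counters plus kube-state-metrics, and assuming `kube_node_labels` exposes the zone as `label_topology_kubernetes_io_zone`. Label names on your series may differ depending on relabeling.

```promql
# Received bytes per second, grouped by namespace and availability zone
sum by (namespace, label_topology_kubernetes_io_zone) (
  rate(container_network_receive_bytes_total[5m])
  # attach the node each pod runs on
  * on (namespace, pod) group_left (node)
    kube_pod_info
  # attach the zone label of that node
  * on (node) group_left (label_topology_kubernetes_io_zone)
    kube_node_labels
)
```

Both joined metrics have the value 1, so the multiplications only copy labels and leave the traffic values unchanged.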
EKS nodegroup memory usage:
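A rough sketch only. It assumes that cAdvisor series carry a `node` label, that `id="/"` selects the root cgroup of each node, and that kube-state-metrics exposes the EKS label as `label_eks_amazonaws_com_nodegroup`. All three depend on how the cluster is scraped and configured.

```promql
# Working-set memory summed per EKS nodegroup
sum by (label_eks_amazonaws_com_nodegroup) (
  sum by (node) (container_memory_working_set_bytes{id="/"})
  * on (node) group_left (label_eks_amazonaws_com_nodegroup)
    kube_node_labels
)
```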
CPU utilization %:
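A common formulation, assuming node_exporter's `node_cpu_seconds_total`:

```promql
# Percentage of CPU time not spent idle, averaged across all cores of each instance
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```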
Pod restarts:
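A sketch assuming kube-state-metrics' `kube_pod_container_status_restarts_total`, with an arbitrary 1h window:

```promql
# Containers that restarted at least once during the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0
```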
Error rate (5xx):
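For illustration, assuming a counter `http_requests_total` with a `status` label holding the HTTP code:

```promql
# Share of requests answered with a 5xx status over the last 5 minutes (0 to 1)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
```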
SLO availability over 30 days:
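A simple request-based sketch, using the same assumed `http_requests_total` counter and `status` label:

```promql
# Availability over 30 days: 1 minus the fraction of failed requests
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
    /
  sum(increase(http_requests_total[30d]))
)
```

A 30d range is expensive to evaluate on raw series, so a recording rule is likely a better home for this in practice.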
Grafana tip: use $__range to match the dashboard time range automatically:
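For example, in a Grafana panel query (the metric and label are assumed). Grafana substitutes `$__range` before sending the query, so this will not parse in the plain Prometheus UI.

```promql
# Total 5xx responses over whatever time range the dashboard currently shows
sum(increase(http_requests_total{status=~"5.."}[$__range]))
```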
Cardinality — find expensive metrics:
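A widely used sketch for this. It touches every series in the head block, so it can be slow on large servers.

```promql
# Ten metric names with the most active series
topk(10, count by (__name__) ({__name__=~".+"}))
```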
Allow AZ label on nodes for kube-state-metrics:
Relabeling tricks
Drop whole metrics or time series you don’t need — saves cardinality:
Pre-commit hooks for rules validation
Catch broken configs and rules before they reach the cluster:
Also worth using: pint — a linter for PromQL rules.
Architecture notes
Prometheus agent mode: scrapes and remote-writes only. There is no local TSDB, querying, alerting, or rule evaluation; samples are buffered in a WAL until they are sent. Use it when you only need forwarding to a central TSDB.
Long-term storage options:
- VictoriaMetrics: efficient compression, global query view, low resource usage. Cluster version supports remote storage. vmagent works as a lightweight drop-in for scraping.
- Thanos: S3-backed long-term storage, deduplication for HA pairs, global querying. More complex to operate.
- Mimir: Grafana’s horizontally scalable Prometheus backend.
Common pattern: one Prometheus per cluster scraping locally and remote-writing to a centralised Mimir or VictoriaMetrics, with a single Grafana on top.
Useful tools
| Tool | What it does |
|---|---|
| promlens.com | Visual PromQL explainer |
| pint | PromQL linter |
| sloth | SLO-based alerting rules generator |
| kthxbye | Auto-extends Alertmanager silences |
| karma | Alertmanager dashboard |
| awesome-prometheus-alerts | Community alert rules library |
| vscode-promql | VS Code plugin for writing alert rules |