- Key functions — rate, irate, increase, histogram_quantile, absent, absent_over_time, delta, predict_linear, age
- Aggregations — sum, count, topk, by/without, quantile_over_time, bool, offset
- Label manipulation — label_replace, label_join
- Useful queries — CPU %, pod restarts, error rate, SLO, memory, network by AZ, nodegroup, cardinality
- Relabeling tricks — drop metrics, series, labels
- Pre-commit validation — promtool, pint
- Architecture — agent mode, VictoriaMetrics, Thanos, Mimir
- Tools
Key functions
rate vs irate
rate calculates per-second increase averaged over the whole range. Use it for alerts and graphs — it’s smoother.
irate only looks at the last two data points. Very responsive but noisy. Avoid for alerting.
rate(http_requests_total[5m])
irate(http_requests_total[5m])
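The difference can be sketched in plain Python (a simplification: real rate() also extrapolates to the window boundaries and handles counter resets):

```python
# Toy (timestamp_seconds, counter_value) samples in a 5m window, with a spike at the end.
samples = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 2080)]

def rate(samples):
    """Per-second increase averaged across the whole range (smooth)."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

def irate(samples):
    """Per-second increase from the last two samples only (responsive, noisy)."""
    (ta, va), (tb, vb) = samples[-2], samples[-1]
    return (vb - va) / (tb - ta)

print(rate(samples))   # 8.25  -> the spike is averaged away
print(irate(samples))  # 30.0  -> the spike dominates
```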
increase — total increase over the range (it is effectively rate multiplied by the range in seconds, with the same extrapolation). Use it for display; use rate for alerts.
increase(process_cpu_seconds_total[15m])
histogram_quantile — apply rate first, then aggregate the buckets by le, then take the quantile.
histogram_quantile(0.95,
  sum by (le) (rate(request_duration_seconds_bucket{status_code="401"}[10m]))
)
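What the function does with those buckets is linear interpolation; a simplified sketch (real Prometheus additionally special-cases the lowest bucket, and the bucket counts here are made up):

```python
import math

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending with +Inf.
    Interpolates linearly inside the bucket containing the target rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the +Inf bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 0.1s, 40 more under 0.5s, 10 more under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

This is also why dropping the le label in an aggregation breaks the function: without per-bucket counts there is nothing to interpolate.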
absent — returns 1 with labels from the selector if no time series match. Use it to alert on missing metrics.
absent(up{job="node"})
predict_linear — least-squares prediction of a gauge's future value. Good for disk-full alerts, e.g. fire when free space is predicted to hit zero within the hour:
predict_linear(node_filesystem_free_bytes[4h], 3600) < 0
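Under the hood this is an ordinary least-squares fit extrapolated forward; a sketch with made-up disk numbers:

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) pairs, extrapolated
    seconds_ahead past the last sample (a simplification of the PromQL version)."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(v for _, v in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * v for t, v in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope * (samples[-1][0] + seconds_ahead) + intercept

# Free bytes shrinking by 10 GB per hour; where will we be in another hour?
disk = [(0, 100e9), (3600, 90e9), (7200, 80e9)]
print(predict_linear(disk, 3600))  # ~70e9
```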
delta — absolute change over the interval. For gauges only, not counters.
delta(node_memory_MemFree_bytes[1h])
absent_over_time — like absent, but fires when a range vector is entirely empty. Useful for detecting metric gaps, not just missing series.
absent_over_time(up{job="node"}[5m])
age calculation — how long since something last succeeded:
time() - batch_last_success_timestamp_seconds
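Wrapped in an alerting rule it might look like this (the group name, alert name, and threshold are hypothetical):

```yaml
groups:
  - name: batch-jobs                  # hypothetical group name
    rules:
      - alert: BatchJobStale          # hypothetical alert name
        expr: time() - batch_last_success_timestamp_seconds > 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Batch job has not succeeded in over an hour"
```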
Aggregations
sum(rate(container_cpu_usage_seconds_total[1m])) by (namespace)
count(up{job="demo"})
topk(5, sum(...) by (namespace))
bottomk(5, topk(6, pid_usage)) # top 5 excluding the single highest
without excludes listed labels, by keeps only listed labels:
sum without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
count without(device)(node_disk_read_bytes_total)
quantile_over_time — aggregate a time series over past data:
quantile_over_time(0.95, process_resident_memory_bytes[10m])
Boolean filtering — > bool returns 0/1 instead of filtering. Useful in recording rules and arithmetic:
go_goroutines > bool 100
Offset — compare current value against the past. Useful for week-over-week dashboards:
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1w)
Label manipulation
label_replace — set a label via regex:
label_replace(
kube_node_labels,
"instance", "$1.$2.$3.$4:9100",
"label_kubernetes_io_hostname",
"ip-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\\.ec2\\.internal"
)
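To sanity-check the regex, here is the same rewrite in Python (label_replace only rewrites when the regex matches the whole label value, hence fullmatch; the hostname is made up):

```python
import re

pattern = r"ip-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\.ec2\.internal"
hostname = "ip-10-0-42-7.ec2.internal"  # hypothetical node hostname

m = re.fullmatch(pattern, hostname)
if m:  # label_replace leaves the series untouched when the regex doesn't match
    instance = "{}.{}.{}.{}:9100".format(*m.groups())
    print(instance)  # 10.0.42.7:9100
```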
label_join — join multiple label values into one:
label_join(node_filesystem_size_bytes, "device_fstype", ",", "device", "fstype")
Note: characters that are invalid in Prometheus label names (dots, slashes, hyphens) become underscores in __meta_* labels, e.g. node.kubernetes.io/instance-type → __meta_kubernetes_node_label_node_kubernetes_io_instance_type.
Useful queries
Memory — use working_set_bytes, not rss; the working set is what the OOM killer watches.
container_memory_working_set_bytes{container!="POD", image!=""}
Network traffic by namespace and AZ:
topk(5,
sum(
sum(
rate(container_network_transmit_bytes_total[5m])
+ rate(container_network_receive_bytes_total[5m])
) by (pod, namespace, node)
* on (node) group_left(label_topology_kubernetes_io_zone) kube_node_labels
) by (namespace, label_topology_kubernetes_io_zone)
)
EKS nodegroup memory usage:
avg by (label_eks_amazonaws_com_nodegroup) (
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
* on(instance) group_left(label_eks_amazonaws_com_nodegroup)
(label_replace(
max by (label_kubernetes_io_hostname, label_eks_amazonaws_com_nodegroup)(kube_node_labels),
"instance", "$1.$2.$3.$4:9100",
"label_kubernetes_io_hostname",
"ip-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\\.ec2\\.internal"
))
)
CPU utilization %:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Pod restarts over the last 15 minutes (increase is cleaner than the older rate * 60 * 15 idiom):
increase(kube_pod_container_status_restarts_total[15m])
Error rate (5xx):
sum by(service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by(service) (rate(http_requests_total[5m]))
SLO availability over 30 days:
sum(rate(http_requests_total{code=~"2..|3.."}[30d]))
/ sum(rate(http_requests_total[30d])) * 100
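The arithmetic behind the target: a 99.9% SLO over 30 days leaves a small, easy-to-compute error budget:

```python
slo = 0.999                      # 99.9% availability target (example value)
window_days = 30

budget_fraction = 1 - slo        # fraction of requests/time allowed to fail
budget_minutes = budget_fraction * window_days * 24 * 60

print(round(budget_minutes, 1))  # 43.2 minutes of downtime per 30 days
```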
Grafana tip: use $__range to match the dashboard time range automatically:
sum by(uri) (increase(http_requests_total[$__range]))
Cardinality — find expensive metrics:
topk(10, count by (job)({job=~".+"}))
topk(10, count by (__name__)({job="my-service"}))
Allow AZ label on nodes for kube-state-metrics:
kube-state-metrics:
metricLabelsAllowlist:
- nodes=[topology.kubernetes.io/zone]
Relabeling tricks
Drop whole metrics or time series you don’t need — saves cardinality:
metric_relabel_configs:
# drop specific metric names
- source_labels: [__name__]
regex: '(container_tasks_state|container_memory_failures_total)'
action: drop
# drop by label value
- source_labels: [id]
regex: '/system.slice/var-lib-docker-containers.*-shm.mount'
action: drop
# drop a label entirely
- regex: 'container_label_com_amazonaws_ecs_task_arn'
action: labeldrop
Pre-commit hooks for rules validation
Catch broken configs and rules before they reach the cluster:
- id: check-config
stages: [commit]
name: Check prometheus config files
language: docker_image
entry: --entrypoint /bin/promtool prom/prometheus:latest
args: [check, config]
- id: check-rules
stages: [commit]
name: Check prometheus rule files
language: docker_image
entry: --entrypoint /bin/promtool prom/prometheus:latest
args: [check, rules]
- id: test-rules
stages: [commit]
name: Unit test prometheus rule files
language: docker_image
entry: --entrypoint /bin/promtool prom/prometheus:latest
args: [test, rules]
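The test-rules hook expects promtool's unit-test file format; a minimal sketch (alerts.yml, the NodeDown alert, and the input series are all hypothetical):

```yaml
# tests.yml; run with: promtool test rules tests.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="host:9100"}'
        values: '1 1 0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 7m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host:9100
```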
Also worth using: pint — a linter for PromQL rules.
Architecture notes
Prometheus agent mode — scrapes and remote-writes, no local storage or querying. Use it when you only need forwarding to a central TSDB.
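A minimal sketch of that setup (start Prometheus with --enable-feature=agent; the remote endpoint is hypothetical):

```yaml
# prometheus.yml for agent mode: scrape locally, forward everything upstream
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]

remote_write:
  - url: https://metrics.example.com/api/v1/push   # hypothetical central TSDB endpoint
```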
Long-term storage options:
- VictoriaMetrics — efficient compression, global query view, low resource usage; the cluster version scales horizontally. vmagent works as a lightweight drop-in for scraping.
- Thanos — S3-backed long-term storage, deduplication for HA pairs, global querying. More complex to operate.
- Mimir — Grafana’s horizontally scalable Prometheus backend.
Common pattern: one Prometheus (or agent) per cluster scraping locally and remote-writing to a centralised Mimir/VictoriaMetrics, with a single Grafana on top.
Useful tools
| Tool | What it does |
|---|---|
| promlens.com | Visual PromQL explainer |
| pint | PromQL linter |
| sloth | SLO-based alerting rules generator |
| kthxbye | Auto-extend alertmanager silences |
| karma | Alertmanager dashboard |
| awesome-prometheus-alerts | Community alert rules library |
| vscode-promql | VS Code plugin for writing alert rules |