Key functions

rate vs irate

rate calculates per-second increase averaged over the whole range. Use it for alerts and graphs — it’s smoother.

irate only looks at the last two data points. Very responsive but noisy. Avoid for alerting.

rate(http_requests_total[5m])
irate(http_requests_total[5m])

increase — total increase over the range (equivalent to rate multiplied by the range duration, so extrapolation can produce non-integer results). Use it for display; use rate for alerts.

increase(process_cpu_seconds_total[15m])

histogram_quantile — apply rate to the buckets first, then aggregate them with sum by (le) (the le label must survive aggregation), then take the quantile:

histogram_quantile(0.95,
  sum by (le) (rate(request_duration_seconds_bucket{status_code="401"}[10m]))
)

absent — returns 1 with labels from the selector if no time series match. Use it to alert on missing metrics.

absent(up{job="node"})
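Wrapped in an alerting rule (the alert name, for: duration, and labels here are illustrative):

```yaml
groups:
  - name: availability
    rules:
      - alert: NodeExporterAbsent          # hypothetical alert name
        expr: absent(up{job="node"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "node exporter target has disappeared"
```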

predict_linear — forecasts a gauge with simple linear regression over the range; the second argument is how many seconds ahead to predict. Good for disk-full alerts.

predict_linear(node_filesystem_free_bytes[4h], 3600)
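A disk-full alert built on the same function; the 4h lookback, 4h horizon, and mountpoint are arbitrary choices, tune to taste:

```promql
# fire if the linear trend says the filesystem hits zero within 4 hours
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[4h], 4 * 3600) < 0
```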

delta — absolute change over the interval. For gauges only, not counters.

delta(node_memory_MemFree_bytes[1h])

absent_over_time — like absent, but fires when a range vector is entirely empty. Useful for detecting metric gaps, not just missing series.

absent_over_time(up{job="node"}[5m])

age calculation — how long since something last succeeded:

time() - batch_last_success_timestamp_seconds
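That expression slots straight into an alerting rule; metric name as above, the alert name and 1h threshold are assumptions:

```yaml
- alert: BatchJobStale                     # hypothetical name and threshold
  expr: time() - batch_last_success_timestamp_seconds > 3600
  for: 10m
```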

Aggregations

sum(rate(container_cpu_usage_seconds_total[1m])) by (namespace)
count(up{job="demo"})
topk(5, sum(...) by (namespace))
bottomk(5, topk(6, pid_usage))   # top 5 excluding the single highest

without excludes listed labels, by keeps only listed labels:

sum without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
count without(device)(node_disk_read_bytes_total)

quantile_over_time — the quantile of a single series' samples over the range:

quantile_over_time(0.95, process_resident_memory_bytes[10m])

Boolean filtering: the bool modifier returns 0/1 instead of filtering out series. Useful in recording rules and arithmetic:

go_goroutines > bool 100
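Because bool yields 0/1, you can count offenders instead of listing them; a sketch:

```promql
# number of instances currently above the goroutine threshold
sum(go_goroutines > bool 100)
```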

Offset — compare current value against the past. Useful for week-over-week dashboards:

rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1w)

Label manipulation

label_replace — set a label via regex:

label_replace(
  kube_node_labels,
  "instance", "$1.$2.$3.$4:9100",
  "label_kubernetes_io_hostname",
  "ip-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\\.ec2\\.internal"
)

label_join — join multiple label values into one:

label_join(node_filesystem_size_bytes, "device_fstype", ",", "device", "fstype")

Note: characters that are invalid in Prometheus label names (dots, slashes, hyphens, e.g. node.kubernetes.io/instance-type) become underscores in __meta_* labels: __meta_kubernetes_node_label_node_kubernetes_io_instance_type.


Useful queries

Memory — use container_memory_working_set_bytes, not rss. The working set is what the kernel's OOM killer considers when deciding what to kill.

container_memory_working_set_bytes{container!="POD", image!=""}
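A common companion: working set as a fraction of the container's memory limit. The on(...) label set assumes standard kube-state-metrics labels and one series per (namespace, pod, container) on each side:

```promql
container_memory_working_set_bytes{container!="", image!=""}
  / on (namespace, pod, container)
kube_pod_container_resource_limits{resource="memory"}
```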

Network traffic by namespace and AZ:

topk(5,
  sum(
    sum(
      rate(container_network_transmit_bytes_total[5m])
      + rate(container_network_receive_bytes_total[5m])
    ) by (pod, namespace, node)
    * on (node) group_left(label_topology_kubernetes_io_zone) kube_node_labels
  ) by (namespace, label_topology_kubernetes_io_zone)
)

EKS nodegroup memory usage:

avg by (label_eks_amazonaws_com_nodegroup) (
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100
  * on(instance) group_left(label_eks_amazonaws_com_nodegroup)
  (label_replace(
    max by (label_kubernetes_io_hostname, label_eks_amazonaws_com_nodegroup)(kube_node_labels),
    "instance", "$1.$2.$3.$4:9100",
    "label_kubernetes_io_hostname",
    "ip-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\\.ec2\\.internal"
  ))
)

CPU utilization %:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Pod restarts:

rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15
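The rate-then-rescale above approximates the same number as a single increase call, which reads more directly:

```promql
increase(kube_pod_container_status_restarts_total[15m])
```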

Error rate (5xx):

sum by(service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by(service) (rate(http_requests_total[5m]))

SLO availability over 30 days:

sum(rate(http_requests_total{code=~"2..|3.."}[30d]))
/ sum(rate(http_requests_total[30d])) * 100
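From that availability you can derive the error budget remaining; the 99.9% objective below is an assumption:

```promql
# fraction of the 30d error budget still unspent for a 99.9% SLO
1 - (
  (1 - sum(rate(http_requests_total{code=~"2..|3.."}[30d]))
     / sum(rate(http_requests_total[30d])))
  / (1 - 0.999)
)
```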

Grafana tip: use $__range to match the dashboard time range automatically:

sum by(uri) (increase(http_requests_total[$__range]))

Cardinality — find expensive metrics:

topk(10, count by (job)({job=~".+"}))              # series per job
topk(10, count by (__name__)({job="my-service"}))  # series per metric name

Allow AZ label on nodes for kube-state-metrics:

kube-state-metrics:
  metricLabelsAllowlist:
    - nodes=[topology.kubernetes.io/zone]

Relabeling tricks

Drop whole metrics or time series you don’t need — saves cardinality:

metric_relabel_configs:
# drop specific metric names
- source_labels: [__name__]
  regex: '(container_tasks_state|container_memory_failures_total)'
  action: drop

# drop by label value
- source_labels: [id]
  regex: '/system.slice/var-lib-docker-containers.*-shm.mount'
  action: drop

# drop a label entirely
- regex: 'container_label_com_amazonaws_ecs_task_arn'
  action: labeldrop
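The inverse approach: keep an explicit allowlist and drop everything else. The metric names here are illustrative:

```yaml
metric_relabel_configs:
  # keep only the metrics named here; everything else is dropped
  - source_labels: [__name__]
    regex: '(up|container_cpu_usage_seconds_total|container_memory_working_set_bytes)'
    action: keep
```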

Pre-commit hooks for rules validation

Catch broken configs and rules before they reach the cluster:

- id: check-config
  stages: [commit]
  name: Check prometheus config files
  language: docker_image
  entry: --entrypoint /bin/promtool prom/prometheus:latest
  args: [check, config]

- id: check-rules
  stages: [commit]
  name: Check prometheus rule files
  language: docker_image
  entry: --entrypoint /bin/promtool prom/prometheus:latest
  args: [check, rules]

- id: test-rules
  stages: [commit]
  name: Unit test prometheus rule files
  language: docker_image
  entry: --entrypoint /bin/promtool prom/prometheus:latest
  args: [test, rules]
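A minimal unit-test file for the test-rules hook to consume, assuming alerts.yml contains a hypothetical rule like `alert: NodeDown`, `expr: up{job="node"} == 0`, `for: 3m`:

```yaml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node"}'
        values: '1 1 0 0 0 0'          # target goes down at t=2m
    alert_rule_test:
      - eval_time: 5m                  # for: 3m has elapsed by now
        alertname: NodeDown            # hypothetical alert in alerts.yml
        exp_alerts:
          - exp_labels:
              job: node
```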

Also worth using: pint — a linter for PromQL rules.


Architecture notes

Prometheus agent mode — scrapes and remote-writes, no local storage or querying. Use it when you only need forwarding to a central TSDB.
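Agent mode is enabled with the --enable-feature=agent flag (Prometheus 2.32+); the config then only needs scrape jobs plus a remote_write target. The job and endpoint below are placeholders:

```yaml
# agent-mode prometheus.yml: scrape configs + remote_write, no rule files
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
remote_write:
  - url: https://central-tsdb.example.com/api/v1/write
```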

Long-term storage options:

  • VictoriaMetrics — efficient compression, global query view, low resource usage. The cluster version scales horizontally; vmagent is a lightweight drop-in replacement for scraping plus remote write.
  • Thanos — S3-backed long-term storage, deduplication for HA pairs, global querying. More complex to operate.
  • Mimir — Grafana’s horizontally scalable Prometheus backend.

Common pattern: one Prometheus per cluster scraping locally + remote write to centralised Mimir/Victoria, single Grafana on top.
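In that pattern each per-cluster Prometheus usually stamps outgoing samples with external_labels so the central store can tell clusters apart; the names below are placeholders:

```yaml
global:
  external_labels:
    cluster: prod-eu-1                             # placeholder cluster name
remote_write:
  - url: https://mimir.example.com/api/v1/push     # placeholder endpoint
```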


Useful tools

  • promlens.com: visual PromQL explainer
  • pint: PromQL rules linter
  • sloth: generates SLO-based alerting rules
  • kthxbye: automatically extends Alertmanager silences
  • karma: Alertmanager dashboard
  • awesome-prometheus-alerts: community library of alert rules
  • vscode-promql: VS Code plugin for writing alert rules