Published : 2022-04-07

Prometheus alert on node reboot

If you have node_exporter installed and scraped by Prometheus, you have access to an interesting metric: node_boot_time_seconds, which represents the time at which the machine booted, as a Unix timestamp in seconds.

To detect a reboot, we should first check that the current uptime (time() - node_boot_time_seconds) is low, but also that the node had a higher uptime earlier. If you only check the first condition, you will also get an alert every time a new node is created.

We will then create 2 alerts: one that checks whether the node has rebooted in the last 10 minutes, and a second one that covers the last hour. The offset modifier is the key here, since it lets us look at the value the metric had in the past.

groups:
- name: node-exporter.rules
  rules:
  - alert: NodeHasRebooted
    annotations:
      description: Node has rebooted
      summary: Node {{ (or $labels.node $labels.instance) }} has rebooted {{ $value }} seconds ago.
    expr: |
      (time() - node_boot_time_seconds < 600) and (time() - 600 - (node_boot_time_seconds offset 10m) > 600)
    labels:
      severity: critical

  - alert: NodeHasRebooted
    annotations:
      description: Node has rebooted
      summary: Node {{ (or $labels.node $labels.instance) }} has rebooted {{ $value }} seconds ago.
    expr: |
      (time() - node_boot_time_seconds < 3600) and (time() - 3600 - (node_boot_time_seconds offset 60m) > 3600)
    labels:
      severity: warning
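
To check that these rules behave as described, one option is a promtool unit test. The following is only a sketch and has not been run against a real setup: it assumes the rules above are saved as node-exporter.rules.yml (a hypothetical filename), and the instance names node1 and node2 are made up for the example. node1 simulates a reboot, while node2 simulates a freshly created node with no history, which should not trigger the alert.

```yaml
# reboot_test.yml (hypothetical filename)
# Run with: promtool test rules reboot_test.yml
rule_files:
  - node-exporter.rules.yml   # assumed name of the file holding the rules above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # node1 was already up (boot time 0), then reports a new boot time
      # of 1200 (t=20m) from the scrape at t=21m onward.
      - series: 'node_boot_time_seconds{instance="node1"}'
        values: '0+0x20 1200+0x20'
      # node2 is brand new: no samples until t=21m, same boot time.
      - series: 'node_boot_time_seconds{instance="node2"}'
        values: '_x21 1200+0x20'
    alert_rule_test:
      # At t=25m both nodes have 300s of uptime, but only node1 has a
      # sample 10 minutes earlier, so only node1 should alert.
      - eval_time: 25m
        alertname: NodeHasRebooted
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: node1
            exp_annotations:
              description: Node has rebooted
              summary: Node node1 has rebooted 300 seconds ago.
```

In this scenario the warning rule is not expected to fire either, because the test data contains no sample from 60 minutes before the evaluation time.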

Enjoy !