Cert-manager Grafana dashboard and alert rules setup

Dashboard/Alert cert-manager

Introduction

cert-manager is one of those Kubernetes components that quietly does important work in the background. When certificates renew successfully, nobody thinks about it. When they fail, the first sign is often an expired TLS certificate in front of a user-facing service.

In this guide, we will install cert-manager and trust-manager with Prometheus metrics enabled, import a Grafana dashboard using a dashboard ConfigMap, and create Prometheus alert rules for certificates that are not ready, already expired, or expiring soon.

Prerequisites

  1. Kubernetes cluster
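  2. kube-prometheus-stack already installed. This guide assumes the Helm release name kube-prom-stack and Grafana running in the observability namespace; adjust the labels and namespaces if yours differ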
  3. Helm installed locally
  4. Grafana sidecar configured to load dashboard ConfigMaps with the grafana_dashboard: '1' label
  5. kubectl access to apply and verify the manifests
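
Before starting, a small shell function can confirm that the local tools are available. This is a sketch: the function name missing_tools is hypothetical, and it only checks that the commands are on PATH. It does not check that kube-prometheus-stack or the Grafana sidecar are configured.

```shell
# missing_tools is a hypothetical helper: it prints every argument that is
# not an executable command on PATH, one name per line.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Example: stop early if helm or kubectl is missing.
# missing=$(missing_tools helm kubectl)
# [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```

Once helm and kubectl are present, kubectl get crd servicemonitors.monitoring.coreos.com prometheusrules.monitoring.coreos.com is a quick way to confirm that the Prometheus Operator CRDs shipped with kube-prometheus-stack are installed.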

Step-by-step

  1. Add the Jetstack Helm repository:

    bash
    helm repo add jetstack https://charts.jetstack.io
    helm repo update
    
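  2. Create a cert-manager-values.yaml file that enables the ServiceMonitor. The release: kube-prom-stack label lets the Prometheus instance deployed by kube-prometheus-stack select the ServiceMonitor. The extraArgs make the controller run DNS-01 self-checks only against Google's public resolvers (8.8.8.8 and 8.8.4.4); they only matter if you use DNS-01 challenges: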

    yaml
    crds:
      enabled: true
    prometheus:
      servicemonitor:
        enabled: true
        prometheusInstance: kube-prom-stack
        labels:
          release: kube-prom-stack
    extraArgs:
      - --dns01-recursive-nameservers=8.8.8.8:53,8.8.4.4:53
      - --dns01-recursive-nameservers-only
    
  3. Install cert-manager. The '~1.19.0' version constraint installs the latest 1.19.x patch release:

    bash
    helm upgrade --install cert-manager jetstack/cert-manager \
      --namespace cert-manager \
      --create-namespace \
      --version '~1.19.0' \
      --values cert-manager-values.yaml
    
  4. Create a trust-manager-values.yaml file that enables the trust-manager metrics ServiceMonitor with the same release label:

    yaml
    crds:
      enabled: true
    app:
      metrics:
        service:
          servicemonitor:
            enabled: true
            labels:
              release: kube-prom-stack
    
  5. Install trust-manager:

    bash
    helm upgrade --install trust-manager jetstack/trust-manager \
      --namespace cert-manager \
      --version '~0.19.0' \
      --values trust-manager-values.yaml
    
  6. Verify that the cert-manager and trust-manager pods are running and that their ServiceMonitors exist:

    bash
    kubectl get pods -n cert-manager
    kubectl get servicemonitor -n cert-manager
    
  7. Create a Grafana dashboard ConfigMap in the observability namespace (where Grafana runs) and save it as cert-manager-dashboard.yaml. The grafana_dashboard: '1' label lets the Grafana sidecar discover and import it automatically:

    yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cert-manager-dashboard
      namespace: observability
      labels:
        grafana_dashboard: '1'
        app.kubernetes.io/instance: prometheus-community
    data:
      cert-manager.json: |-
        {
          "annotations": {
            "list": [
              {
                "builtIn": 1,
                "datasource": {
                  "type": "grafana",
                  "uid": "-- Grafana --"
                },
                "enable": true,
                "hide": true,
                "iconColor": "rgba(0, 211, 255, 1)",
                "name": "Annotations & Alerts",
                "target": {
                  "limit": 100,
                  "matchAny": false,
                  "tags": [],
                  "type": "dashboard"
                },
                "type": "dashboard"
              }
            ]
          },
          "description": "The dashboard gives an overview of the SSL certs managed by cert-manager in Kubernetes",
          "editable": true,
          "fiscalYearStartMonth": 0,
          "gnetId": 20842,
          "graphTooltip": 0,
          "links": [],
          "liveNow": false,
          "panels": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${datasource}"
              },
              "description": "The number of certificates whose Ready condition is True",
              "fieldConfig": {
                "defaults": {
                  "color": {
                    "mode": "thresholds"
                  },
                  "mappings": [],
                  "noValue": "0",
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      {
                        "color": "green",
                        "value": null
                      }
                    ]
                  }
                },
                "overrides": []
              },
              "gridPos": {
                "h": 8,
                "w": 8,
                "x": 0,
                "y": 0
              },
              "id": 1,
              "options": {
                "colorMode": "value",
                "graphMode": "none",
                "justifyMode": "auto",
                "orientation": "auto",
                "reduceOptions": {
                  "calcs": [
                    "lastNotNull"
                  ],
                  "fields": "",
                  "values": false
                },
                "textMode": "value",
                "wideLayout": true
              },
              "pluginVersion": "11.2.0",
              "targets": [
                {
                  "datasource": {
                    "type": "prometheus",
                    "uid": "${datasource}"
                  },
                  "editorMode": "code",
                  "exemplar": false,
                  "expr": "count(certmanager_certificate_ready_status{condition=\"True\", cluster=~\"$cluster\", exported_namespace=~\"$namespace\"})",
                  "instant": true,
                  "legendFormat": "__auto",
                  "range": false,
                  "refId": "A"
                }
              ],
              "title": "Valid Certificates",
              "type": "stat"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${datasource}"
              },
              "description": "The number of certificates that will expire within the next 14 days",
              "fieldConfig": {
                "defaults": {
                  "color": {
                    "mode": "thresholds"
                  },
                  "mappings": [],
                  "noValue": "0",
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      {
                        "color": "green",
                        "value": null
                      },
                      {
                        "color": "#EAB839",
                        "value": 1
                      }
                    ]
                  }
                },
                "overrides": []
              },
              "gridPos": {
                "h": 8,
                "w": 8,
                "x": 8,
                "y": 0
              },
              "id": 3,
              "options": {
                "colorMode": "value",
                "graphMode": "none",
                "justifyMode": "auto",
                "orientation": "auto",
                "reduceOptions": {
                  "calcs": [
                    "lastNotNull"
                  ],
                  "fields": "",
                  "values": false
                },
                "textMode": "auto",
                "wideLayout": true
              },
              "pluginVersion": "11.2.0",
              "targets": [
                {
                  "datasource": {
                    "type": "prometheus",
                    "uid": "${datasource}"
                  },
                  "editorMode": "code",
                  "exemplar": false,
                  "expr": "count(certmanager_certificate_expiration_timestamp_seconds{cluster=~\"$cluster\", exported_namespace=~\"$namespace\"} < (time()+(14*24*3600)))",
                  "instant": true,
                  "legendFormat": "{{exported_namespace}}/{{name}}",
                  "range": false,
                  "refId": "A"
                }
              ],
              "title": "Expiring Certificates",
              "type": "stat"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${datasource}"
              },
              "description": "Total number of ACME HTTP requests made by cert-manager in the selected time range",
              "fieldConfig": {
                "defaults": {
                  "color": {
                    "mode": "thresholds"
                  },
                  "decimals": 0,
                  "mappings": [],
                  "noValue": "0",
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      {
                        "color": "text",
                        "value": null
                      }
                    ]
                  }
                },
                "overrides": []
              },
              "gridPos": {
                "h": 8,
                "w": 8,
                "x": 16,
                "y": 0
              },
              "id": 2,
              "options": {
                "colorMode": "value",
                "graphMode": "none",
                "justifyMode": "auto",
                "orientation": "auto",
                "reduceOptions": {
                  "calcs": [
                    "lastNotNull"
                  ],
                  "fields": "",
                  "values": false
                },
                "textMode": "auto",
                "wideLayout": true
              },
              "pluginVersion": "11.2.0",
              "targets": [
                {
                  "datasource": {
                    "type": "prometheus",
                    "uid": "${datasource}"
                  },
                  "editorMode": "code",
                  "exemplar": false,
                  "expr": "sum(increase(certmanager_http_acme_client_request_count{cluster=~\"$cluster\"}[$__range]))",
                  "instant": true,
                  "legendFormat": "__auto",
                  "range": false,
                  "refId": "A"
                }
              ],
              "title": "Total ACME Requests",
              "type": "stat"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${datasource}"
              },
              "description": "Time before the certificates expire. Only shows certificates expiring within 45 days",
              "fieldConfig": {
                "defaults": {
                  "color": {
                    "mode": "thresholds"
                  },
                  "mappings": [],
                  "min": 0,
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      {
                        "color": "red",
                        "value": null
                      },
                      {
                        "color": "orange",
                        "value": 1209600
                      },
                      {
                        "color": "green",
                        "value": 2592000
                      },
                      {
                        "color": "dark-green",
                        "value": 5184000
                      }
                    ]
                  },
                  "unit": "s"
                },
                "overrides": []
              },
              "gridPos": {
                "h": 10,
                "w": 12,
                "x": 0,
                "y": 8
              },
              "id": 5,
              "options": {
                "displayMode": "gradient",
                "maxVizHeight": 300,
                "minVizHeight": 16,
                "minVizWidth": 8,
                "namePlacement": "left",
                "orientation": "horizontal",
                "reduceOptions": {
                  "calcs": [
                    "lastNotNull"
                  ],
                  "fields": "",
                  "values": false
                },
                "showUnfilled": true,
                "sizing": "auto",
                "valueMode": "color"
              },
              "pluginVersion": "11.2.0",
              "targets": [
                {
                  "datasource": {
                    "type": "prometheus",
                    "uid": "${datasource}"
                  },
                  "editorMode": "code",
                  "exemplar": false,
                  "expr": "sort(certmanager_certificate_expiration_timestamp_seconds{cluster=~\"$cluster\", exported_namespace=~\"$namespace\"} - time() < 45 * 24 * 3600)",
                  "format": "time_series",
                  "instant": true,
                  "legendFormat": "{{cluster}} - {{exported_namespace}} - {{name}}",
                  "range": false,
                  "refId": "A"
                }
              ],
              "title": "Time to Expiration (<45 days)",
              "type": "bargauge"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${datasource}"
              },
              "description": "Time before the certificates are automatically renewed",
              "fieldConfig": {
                "defaults": {
                  "color": {
                    "mode": "thresholds"
                  },
                  "mappings": [],
                  "min": 0,
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      {
                        "color": "green",
                        "value": null
                      }
                    ]
                  },
                  "unit": "s"
                },
                "overrides": []
              },
              "gridPos": {
                "h": 10,
                "w": 12,
                "x": 12,
                "y": 8
              },
              "id": 6,
              "options": {
                "displayMode": "gradient",
                "maxVizHeight": 300,
                "minVizHeight": 16,
                "minVizWidth": 8,
                "namePlacement": "left",
                "orientation": "horizontal",
                "reduceOptions": {
                  "calcs": [
                    "lastNotNull"
                  ],
                  "fields": "",
                  "values": false
                },
                "showUnfilled": true,
                "sizing": "auto",
                "valueMode": "color"
              },
              "pluginVersion": "11.2.0",
              "targets": [
                {
                  "datasource": {
                    "type": "prometheus",
                    "uid": "${datasource}"
                  },
                  "editorMode": "code",
                  "exemplar": false,
                  "expr": "sort(certmanager_certificate_renewal_timestamp_seconds{cluster=~\"$cluster\", exported_namespace=~\"$namespace\"} - time() < 45 * 24 * 3600)",
                  "format": "time_series",
                  "instant": true,
                  "legendFormat": "{{cluster}} - {{exported_namespace}} - {{name}}",
                  "range": false,
                  "refId": "A"
                }
              ],
              "title": "Time to Automatic Renewal (<45 days)",
              "type": "bargauge"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${datasource}"
              },
              "description": "Time before the certificates expire",
              "fieldConfig": {
                "defaults": {
                  "color": {
                    "mode": "palette-classic"
                  },
                  "mappings": [],
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      {
                        "color": "green",
                        "value": null
                      }
                    ]
                  },
                  "unit": "d"
                },
                "overrides": []
              },
              "gridPos": {
                "h": 9,
                "w": 24,
                "x": 0,
                "y": 18
              },
              "id": 4,
              "options": {
                "legend": {
                  "displayMode": "list",
                  "placement": "right",
                  "showLegend": true
                },
                "tooltip": {
                  "mode": "single",
                  "sort": "none"
                }
              },
              "targets": [
                {
                  "datasource": {
                    "type": "prometheus",
                    "uid": "${datasource}"
                  },
                  "editorMode": "code",
                  "exemplar": true,
                  "expr": "(certmanager_certificate_expiration_timestamp_seconds{cluster=~\"$cluster\", exported_namespace=~\"$namespace\"} - time())/(24*3600)",
                  "instant": false,
                  "legendFormat": "{{cluster}} - {{exported_namespace}} - {{name}}",
                  "range": true,
                  "refId": "A"
                }
              ],
              "title": "Time to Expiration",
              "type": "timeseries"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${datasource}"
              },
              "description": "Marks certificate renewals, detected as changes to each certificate's renewal timestamp",
              "fieldConfig": {
                "defaults": {
                  "color": {
                    "mode": "palette-classic"
                  },
                  "decimals": 1,
                  "mappings": [],
                  "min": 0,
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      {
                        "color": "green",
                        "value": null
                      }
                    ]
                  },
                  "unit": "short"
                },
                "overrides": []
              },
              "gridPos": {
                "h": 9,
                "w": 24,
                "x": 0,
                "y": 27
              },
              "id": 8,
              "options": {
                "legend": {
                  "displayMode": "list",
                  "placement": "right",
                  "showLegend": true
                },
                "tooltip": {
                  "mode": "single",
                  "sort": "none"
                }
              },
              "pluginVersion": "8.3.2",
              "targets": [
                {
                  "datasource": {
                    "type": "prometheus",
                    "uid": "${datasource}"
                  },
                  "editorMode": "code",
                  "exemplar": true,
                  "expr": "changes(certmanager_certificate_renewal_timestamp_seconds{cluster=~\"$cluster\", exported_namespace=~\"$namespace\"}[15m]) > 0",
                  "format": "time_series",
                  "instant": false,
                  "legendFormat": "{{cluster}} - {{exported_namespace}} - {{name}}",
                  "range": false,
                  "refId": "A"
                }
              ],
              "title": "Certificate Renewal Events",
              "type": "timeseries"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${datasource}"
              },
              "description": "Total number of ACME HTTP requests per cluster in the selected time range",
              "fieldConfig": {
                "defaults": {
                  "color": {
                    "mode": "palette-classic"
                  },
                  "mappings": [],
                  "noValue": "0",
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      {
                        "color": "text",
                        "value": null
                      }
                    ]
                  }
                },
                "overrides": []
              },
              "gridPos": {
                "h": 10,
                "w": 24,
                "x": 0,
                "y": 36
              },
              "id": 10,
              "options": {
                "barRadius": 0,
                "barWidth": 0.97,
                "fullHighlight": false,
                "groupWidth": 0.7,
                "legend": {
                  "displayMode": "list",
                  "placement": "bottom",
                  "showLegend": true
                },
                "orientation": "auto",
                "showValue": "auto",
                "stacking": "none",
                "tooltip": {
                  "mode": "single",
                  "sort": "none"
                },
                "xTickLabelRotation": -30,
                "xTickLabelSpacing": 0
              },
              "pluginVersion": "8.3.2",
              "targets": [
                {
                  "datasource": {
                    "type": "prometheus",
                    "uid": "${datasource}"
                  },
                  "editorMode": "code",
                  "exemplar": false,
                  "expr": "sort_desc(sum by (cluster)(increase(certmanager_http_acme_client_request_count{cluster=~\"$cluster\"}[$__range]))) > 0",
                  "format": "table",
                  "instant": true,
                  "legendFormat": "{{cluster}}",
                  "range": false,
                  "refId": "A"
                }
              ],
              "title": "ACME Requests by Cluster",
              "type": "barchart"
            }
          ],
          "schemaVersion": 39,
          "tags": [
            "k8s",
            "cert-manager"
          ],
          "templating": {
            "list": [
              {
                "current": {
                  "selected": false,
                  "text": "default",
                  "value": "default"
                },
                "hide": 0,
                "includeAll": false,
                "multi": false,
                "name": "datasource",
                "options": [],
                "query": "prometheus",
                "queryValue": "",
                "refresh": 1,
                "regex": "",
                "skipUrlSync": false,
                "type": "datasource"
              },
              {
                "allValue": ".*",
                "current": {
                  "selected": false,
                  "text": "All",
                  "value": "$__all"
                },
                "datasource": {
                  "type": "prometheus",
                  "uid": "${datasource}"
                },
                "definition": "label_values(certmanager_clock_time_seconds, cluster)",
                "hide": 0,
                "includeAll": true,
                "multi": false,
                "name": "cluster",
                "options": [],
                "query": {
                  "query": "label_values(certmanager_clock_time_seconds, cluster)",
                  "refId": "StandardVariableQuery"
                },
                "refresh": 1,
                "regex": "",
                "skipUrlSync": false,
                "sort": 0,
                "type": "query"
              },
              {
                "allValue": ".*",
                "current": {
                  "selected": false,
                  "text": "All",
                  "value": "$__all"
                },
                "datasource": {
                  "type": "prometheus",
                  "uid": "${datasource}"
                },
                "definition": "label_values(certmanager_certificate_ready_status{cluster=~\"$cluster\"}, exported_namespace)",
                "hide": 0,
                "includeAll": true,
                "multi": false,
                "name": "namespace",
                "options": [],
                "query": {
                  "query": "label_values(certmanager_certificate_ready_status{cluster=~\"$cluster\"}, exported_namespace)",
                  "refId": "StandardVariableQuery"
                },
                "refresh": 1,
                "regex": "",
                "skipUrlSync": false,
                "sort": 0,
                "type": "query"
              }
            ]
          },
          "time": {
            "from": "now-24h",
            "to": "now"
          },
          "timepicker": {},
          "timezone": "browser",
          "title": "Cert-manager-Kubernetes",
          "uid": "cdhrcds8aosg0c",
          "version": 2,
          "weekStart": ""
        }
    
  8. Apply the Grafana dashboard:

    bash
    kubectl apply -f cert-manager-dashboard.yaml
    
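  9. Create the Prometheus alert rules for cert-manager and save them as cert-manager-prometheus-rules.yaml. The release: kube-prom-stack label lets the kube-prometheus-stack Prometheus select the rule: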

    yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: cert-manager
      namespace: cert-manager
      labels:
        prometheus: kube-prom-stack
        role: alert-rules
        release: kube-prom-stack
    spec:
      groups:
        - name: cert-manager
          rules:
            - alert: CertManagerCertificateReadyStatus
              annotations:
                description: 'Certificate "{{ $labels.exported_namespace }}/{{ $labels.name }}" is not ready.'
                summary: Certificate is not ready
              expr: certmanager_certificate_ready_status{condition!="True"} == 1
              for: 10m
              labels:
                severity: critical
            - alert: CertManagerCertificateExpired
              annotations:
                description: 'Certificate "{{ $labels.exported_namespace }}/{{ $labels.name }}" expired {{ $value | humanizeDuration }} ago.'
                summary: Certificate has expired
              expr: time() - certmanager_certificate_expiration_timestamp_seconds > 0
              for: 5m
              labels:
                severity: critical
            - alert: CertManagerCertificateExpiringSoon
              annotations:
                description: 'Certificate "{{ $labels.exported_namespace }}/{{ $labels.name }}" expires in {{ $value | humanizeDuration }}.'
                summary: Certificate expires in less than 14 days
              expr: certmanager_certificate_expiration_timestamp_seconds - time() > 0 and certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 60 * 60
              for: 1h
              labels:
                severity: warning
    
  10. Apply the alert rules:

    bash
    kubectl apply -f cert-manager-prometheus-rules.yaml
    
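  11. Verify that the PrometheusRule and the dashboard ConfigMap were created (once Prometheus reloads its configuration, the Alerts page of the Prometheus UI should also list the cert-manager rule group):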

    bash
    kubectl get prometheusrule -n cert-manager cert-manager
    kubectl get configmap -n observability cert-manager-dashboard
    
  12. Open Grafana and search for the Cert-manager-Kubernetes dashboard. You should see valid certificates, expiring certificates, ACME request count, renewal events, and time-to-expiration panels.
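
For step 2, the release: kube-prom-stack label only helps if your Prometheus actually selects ServiceMonitors by that label. The sketch below, using the hypothetical function name prometheus_selects_release, reads spec.serviceMonitorSelector.matchLabels.release from every Prometheus custom resource and compares it with the value you used. It assumes the kube-prometheus-stack default selector. If your selector is empty (for example because serviceMonitorSelectorNilUsesHelmValues is false), every ServiceMonitor is selected and the function reports a mismatch even though scraping works.

```shell
# prometheus_selects_release is a hypothetical helper. It succeeds if any
# Prometheus custom resource selects ServiceMonitors with
# matchLabels.release equal to the first argument (the kube-prometheus-stack
# default). KUBECTL defaults to kubectl and can be overridden, e.g. in tests.
prometheus_selects_release() {
  want=$1
  # jsonpath prints one value per Prometheus object, separated by spaces.
  selected=$(${KUBECTL:-kubectl} get prometheus --all-namespaces \
    -o jsonpath='{.items[*].spec.serviceMonitorSelector.matchLabels.release}') || return 1
  for value in $selected; do
    if [ "$value" = "$want" ]; then
      return 0
    fi
  done
  return 1
}
```

Usage: prometheus_selects_release kube-prom-stack || echo "no Prometheus selects release=kube-prom-stack" >&2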
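
For steps 3 and 6, reading kubectl get pods output by eye works, but a scripted check is handy in CI. The hypothetical pods_ready function below reads kubectl get pods --no-headers output on standard input and fails if any pod is not Running with all containers ready. It skips pods in the Completed state, on the assumption that these come from finished Jobs such as Helm hooks.

```shell
# pods_ready is a hypothetical helper. It reads `kubectl get pods --no-headers`
# output (NAME READY STATUS ...) on stdin and succeeds only if every pod is
# Running with all containers ready. Pods in the Completed state are skipped,
# assuming they come from finished Jobs. Empty input counts as a failure.
pods_ready() {
  awk '
    {
      seen = 1
      if ($3 == "Completed") next
      split($2, ready, "/")            # "1/1" -> ready[1]=1, ready[2]=1
      if ($3 != "Running" || ready[1] != ready[2]) {
        print "not ready: " $1
        bad = 1
      }
    }
    END { exit (bad || !seen) }
  '
}
```

Usage: kubectl get pods -n cert-manager --no-headers | pods_ready && echo "cert-manager pods are ready". If you prefer to block until pods become ready, kubectl wait --for=condition=Ready pods --all -n cert-manager is the built-in alternative.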
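
For step 9, it helps to see exactly which state each expiry threshold covers. Unless renewBefore is set, cert-manager renews a certificate when one third of its lifetime remains, so a 90-day certificate is normally renewed about 30 days before expiry. By the time CertManagerCertificateExpiringSoon fires at 14 days, renewal has been failing for more than two weeks. The hypothetical cert_state function below mirrors the two expiry expressions for a single timestamp. It ignores the for: durations and the ready-status rule, so treat it as a way to reason about the thresholds, not as a replacement for Prometheus.

```shell
# cert_state is a hypothetical helper that mirrors the expiry expressions of
# CertManagerCertificateExpired and CertManagerCertificateExpiringSoon for one
# certificate. Arguments: expiration timestamp and (optionally) the current
# time, both in Unix seconds. It ignores the rules' `for:` durations.
cert_state() {
  expiry=$1
  now=${2:-$(date +%s)}
  remaining=$((expiry - now))
  soon=$((14 * 24 * 60 * 60))   # 1209600 seconds, the 14-day threshold
  if [ "$remaining" -lt 0 ]; then
    # time() - expiration > 0
    echo CertManagerCertificateExpired
  elif [ "$remaining" -gt 0 ] && [ "$remaining" -lt "$soon" ]; then
    # expiration - time() > 0 and expiration - time() < 14 days
    echo CertManagerCertificateExpiringSoon
  else
    echo none
  fi
}
```

For example, cert_state 1700000000 1699500000 prints CertManagerCertificateExpiringSoon, because 500000 seconds (about 5.8 days) remain.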

Conclusion

You now have cert-manager metrics flowing into Prometheus, a Grafana dashboard for certificate visibility, and alert rules for the most important certificate states. This gives you an early warning before certificates expire and a quick place to check renewal behavior across clusters and namespaces.
