OpenShift Request For Enhancement: RFE-8847

Automatic certificate reloading for Vector collector to prevent CollectorNodeDown alerts in long-running clusters

    • Feature Request
    • Resolution: Unresolved
    • Logging

      Problem:
      In OpenShift clusters that run for extended periods (for example, one to two years or longer) without restarts, the Vector collector pods do not automatically reload the rotated TLS certificates used for the Prometheus metrics endpoint.
      Even when the cluster CA or service certificates are rotated automatically, the running Vector process keeps the old, now-expired certificate in memory. Prometheus scrapes then fail and critical alerts fire, for example:

      Get "https://<pod-ip>:<port>/metrics": tls: failed to verify certificate: x509: certificate has expired
      
      "alertname": "CollectorNodeDown",
      "message": "Prometheus could not scrape openshift-logging/collector-xxxxx collector component for more than 10m."
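
      One way to confirm the failure mode above is to compare the notAfter timestamp of the certificate a collector pod is serving with the current time. The following is a minimal sketch, assuming the notAfter string has already been obtained in the textual form used by OpenSSL and Python's ssl module. The helper name is_expired is hypothetical and is not part of Vector or the Operator.

```python
import ssl
import time
from typing import Optional


def is_expired(not_after: str, now: Optional[float] = None) -> bool:
    """Return True if a certificate's notAfter timestamp is in the past.

    `not_after` uses the "Jan  5 09:34:43 2018 GMT" form that
    ssl.SSLSocket.getpeercert() reports for the "notAfter" key.
    `now` is seconds since the epoch; it defaults to the current time.
    """
    expiry = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    current = time.time() if now is None else now
    return current > expiry
```

      A result of True for a certificate whose on-disk copy has already been rotated would be consistent with the process still holding the old certificate in memory.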

      Customer Impact:

      • Administrators are forced to manually restart collector pods to refresh certificates and resolve alerts.
      • Monitoring of the logging infrastructure is lost until manual intervention occurs.
      • Regular cluster upgrades would recreate the pods and avoid certificate expiration, but for mission-critical systems (e.g., banking) that run EUS versions (up to 4 years), frequent maintenance or restarts solely to refresh certificates are undesirable.

      Requested Enhancement:
      We request a dynamic certificate reloading mechanism for the OpenShift Logging Vector collector. Vector should detect changes to the metrics certificate files and reload them without a full pod restart. Alternatively, the Operator should manage the certificate rotation lifecycle so that long-running clusters need no manual intervention.
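
      The file-change detection requested above could be sketched roughly as follows. This is an illustration only and not how Vector or the Operator is implemented: the class name CertWatcher, the polling approach, and the on_change callback are all assumptions. It compares a content hash rather than relying on file events, on the assumption that the certificate is mounted from a Secret volume whose files are replaced by swapping symlinks.

```python
import hashlib
from pathlib import Path
from typing import Callable


class CertWatcher:
    """Hypothetical helper: calls `on_change` when the certificate file changes."""

    def __init__(self, cert_path: str, on_change: Callable[[bytes], None]) -> None:
        self._path = Path(cert_path)
        self._on_change = on_change
        # Remember the digest of the certificate that is loaded at startup.
        self._last_digest = self._digest(self._path.read_bytes())

    @staticmethod
    def _digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def poll(self) -> bool:
        """Check the file once; return True if a reload was triggered."""
        data = self._path.read_bytes()
        digest = self._digest(data)
        if digest == self._last_digest:
            return False  # same content, nothing to reload
        self._last_digest = digest
        self._on_change(data)  # hand the new certificate bytes to the caller
        return True
```

      A caller would invoke poll() on a timer and, in on_change, rebuild the TLS configuration of the metrics listener so that new connections use the rotated certificate. A rotation is then picked up within one polling interval, without a pod restart.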

              jamparke@redhat.com Jamie Parker
              rhn-support-yuokada Yuki Okada