Problem:
In OpenShift clusters that run for extended periods (e.g., more than 1 or 2 years) without restarts, the Vector collector pods do not automatically reload the rotated TLS certificates used for the Prometheus metrics endpoint.
Even when the cluster CA or service certificates are rotated automatically, the running Vector process keeps the old certificate in memory and continues to serve it after it has expired. As a result, Prometheus scrapes fail and critical alerts fire, as seen in the following:
Get "https://<pod-ip>:<port>/metrics": tls: failed to verify certificate: x509: certificate has expired
"alertname": "CollectorNodeDown", "message": "Prometheus could not scrape openshift-logging/collector-xxxxx collector component for more than 10m."
Customer Impact:
- Administrators are forced to manually restart collector pods to refresh certificates and resolve alerts.
- Monitoring of the logging infrastructure is lost until manual intervention occurs.
- Regular cluster upgrades recreate the pods and so avoid certificate expiration. However, in mission-critical systems (e.g., banking) running EUS versions (up to 4 years), frequent maintenance or restarts solely to refresh certificates are not acceptable.
Requested Enhancement:
We need a dynamic certificate reloading mechanism for the OpenShift Logging Vector collector. Vector should detect changes to the metrics certificate files and reload them without requiring a full pod restart. Alternatively, the Operator should manage the certificate rotation lifecycle so that long-running clusters need no manual intervention.
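
To illustrate the kind of behavior being requested, here is a minimal sketch in Go of a TLS server that picks up a rotated certificate without restarting. It is not Vector's code (Vector is written in Rust) and not the Operator's implementation. The names certReloader, newCertReloader and newMetricsServer are hypothetical. The idea is to resolve the certificate on each TLS handshake through the standard library's tls.Config.GetCertificate hook and to re-read the key pair from disk only when the certificate file's modification time changes.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"os"
	"sync"
	"time"
)

// certReloader (hypothetical name) caches a key pair and re-reads it from
// disk when the certificate file's modification time changes.
type certReloader struct {
	certPath, keyPath string

	mu      sync.Mutex
	cert    *tls.Certificate
	modTime time.Time
}

// newCertReloader loads the key pair once so that a bad path or an
// unreadable certificate is reported at startup.
func newCertReloader(certPath, keyPath string) (*certReloader, error) {
	r := &certReloader{certPath: certPath, keyPath: keyPath}
	if _, err := r.current(); err != nil {
		return nil, err
	}
	return r, nil
}

// current returns the cached certificate, reloading it first if the file
// on disk has changed. If a reload fails, the last good certificate is
// kept so that a half-written rotation does not break the endpoint.
func (r *certReloader) current() (*tls.Certificate, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	info, err := os.Stat(r.certPath)
	if err != nil {
		if r.cert != nil {
			return r.cert, nil
		}
		return nil, err
	}
	if r.cert != nil && info.ModTime().Equal(r.modTime) {
		return r.cert, nil // unchanged on disk, serve the cached copy
	}

	pair, err := tls.LoadX509KeyPair(r.certPath, r.keyPath)
	if err != nil {
		if r.cert != nil {
			return r.cert, nil
		}
		return nil, err
	}
	r.cert, r.modTime = &pair, info.ModTime()
	return r.cert, nil
}

// GetCertificate matches the signature of tls.Config.GetCertificate and is
// called by crypto/tls for each incoming handshake.
func (r *certReloader) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return r.current()
}

// newMetricsServer (hypothetical name) builds an HTTPS server whose
// certificate is supplied by the reloader instead of being fixed at startup.
func newMetricsServer(addr string, r *certReloader, h http.Handler) *http.Server {
	return &http.Server{
		Addr:    addr,
		Handler: h,
		TLSConfig: &tls.Config{
			MinVersion:     tls.VersionTLS12,
			GetCertificate: r.GetCertificate,
		},
	}
}
```

A caller would create the reloader with the mounted certificate and key paths, pass it to newMetricsServer, and start the server with ListenAndServeTLS("", ""); the empty file arguments are allowed because GetCertificate is set. Checking the modification time on each handshake keeps the sketch short. A file watcher or a periodic reload would serve the same purpose and may suit a high-traffic endpoint better.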