Description of problem:
A ROSA cluster running Konflux is unhealthy and inaccessible to SRE. We were able to SSH directly into the control-plane nodes to troubleshoot, and it appears that the etcd pods repeatedly start up, form a quorum, and then die without a clear cause. As a result, the cluster remains unresponsive.
Version-Release number of selected component (if applicable):
4.15.36
How reproducible:
Very, at the moment, on the affected cluster. It is not clear how to recreate this on a separate cluster.
Steps to Reproduce:
Unknown; no reproduction steps have been identified yet.
Actual results:
The cluster is unresponsive; etcd forms a quorum initially but cannot hold it.
Expected results:
etcd holds quorum after forming it initially
Additional info:
The current theory is that excessive querying from customer workloads may be contributing, but we are still working to prove or disprove this. The main workload is Tekton, which is known to be extremely resource intensive, and the cluster's control plane has been scaled repeatedly to accommodate it.