Bug
Resolution: Unresolved
Undefined
4.17.z
Quality / Stability / Reliability
False
Description of problem:
During the recent alert https://redhat.pagerduty.com/incidents/Q1NZBSK24IKWUF we found that the core issue was the etcd-1 pod not being schedulable on any node: the PVC attached to the pod had a volume node affinity that blocked scheduling onto a different node.

The manual mitigation we applied was (see the sketch below):
1. Ensure that the remaining etcd members are healthy by exec'ing into the pods, checking etcdctl endpoint health, and forcing a defragmentation.
2. Delete the PVC and then delete the etcd-1 pod so that it can be rescheduled on a healthy node.
3. Once the etcd-1 pod is up and healthy, check the health of the etcd cluster again.

The control plane operator did not attempt to fix the issue, and the autoscaler did not add new nodes despite the scheduler reporting: "nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 14 node(s) had volume node affinity conflict, 2 Insufficient memory, 2 Too many pods, 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) had untolerated taint {obo: true}, 7 node(s) didn't match pod anti-affinity rules, 88 node(s) had untolerated taint {hypershift.openshift.io/request-serving-component: true}. preemption: 0/122 nodes are available: 112 Preemption is not helpful for scheduling, 2 node(s) had volume node affinity conflict, 8 node(s) didn't match pod anti-affinity rules."

An upgrade was happening at the same time on the management cluster (MC), which was attempting to replace the node hosting etcd-0.
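For reference, a minimal sketch of the commands behind these steps. The HCP_NAMESPACE variable, the container name etcd, and the PVC name data-etcd-1 are assumptions for illustration (standard StatefulSet volume-claim naming), not values taken from the incident; the etcdctl calls also assume the etcd container already has its ETCDCTL_* TLS environment set.

{code:bash}
# Sketch only -- namespace, container, and PVC names are assumptions,
# adjust them to the affected hosted control plane namespace.
NS="${HCP_NAMESPACE}"

# See why etcd-1 cannot be scheduled (FailedScheduling events).
oc -n "$NS" describe pod etcd-1

# 1. Verify the remaining members are healthy and defragment them.
#    Assumes the etcd container already carries the ETCDCTL_* TLS env vars.
for pod in etcd-0 etcd-2; do
  oc -n "$NS" exec "$pod" -c etcd -- etcdctl endpoint health --cluster
  oc -n "$NS" exec "$pod" -c etcd -- etcdctl defrag
done

# 2. Delete the PVC that pins etcd-1 to the unavailable node, then the pod.
#    The PVC stays Terminating (pvc-protection finalizer) until the pod is gone,
#    so don't block on the delete; the StatefulSet controller recreates both.
oc -n "$NS" delete pvc data-etcd-1 --wait=false
oc -n "$NS" delete pod etcd-1

# 3. Once etcd-1 is Ready again, re-check cluster health.
oc -n "$NS" wait pod/etcd-1 --for=condition=Ready --timeout=10m
oc -n "$NS" exec etcd-0 -c etcd -- etcdctl endpoint status --cluster -w table
{code}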
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
The etcd-1 pod stayed unschedulable because its PVC's volume node affinity conflicted with every available node; neither the control plane operator nor the autoscaler remediated the situation, and manual deletion of the PVC and pod was required.
Expected results:
The control plane operator (and/or the autoscaler) detects the condition and recovers the etcd member without manual intervention.
Additional info: