OpenShift Bugs / OCPBUGS-63412

etcd was unable to schedule one member due to its volume being bound to a specific node.


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.17.z
    • Component: HyperShift
    • Category: Quality / Stability / Reliability

      Description of problem:

      
      During a recent alert (https://redhat.pagerduty.com/incidents/Q1NZBSK24IKWUF) we noticed that the etcd-1 pod could not
      be scheduled on any node. The root cause was that the etcd-1 pod had a PVC attached whose volume was bound to a
      specific node (volume node affinity), which blocked scheduling the pod onto a different node.
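
      For reference, a minimal sketch of how the volume node affinity conflict can be confirmed; the namespace, pod, and
      PV names below are placeholders, not values taken from the incident:
      {code:bash}
      # Placeholder for the hosted control plane namespace on the management cluster.
      HCP_NS="<hosted-control-plane-namespace>"

      # FailedScheduling events on the stuck member; "volume node affinity conflict"
      # points at the PV binding rather than at CPU/memory pressure.
      oc describe pod etcd-1 -n "$HCP_NS"

      # Inspect the PVC and its bound PV; spec.nodeAffinity on the PV shows which
      # node/zone the volume is pinned to.
      oc get pvc -n "$HCP_NS"
      oc get pv <pv-name> -o yaml
      {code}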
      
      The solution we applied was the following (example commands are sketched after this list):
      1. ensure that the remaining etcd members are healthy by exec'ing into the pods, checking etcdctl endpoint health, and forcing a defragmentation
      2. delete the PVC, then delete the etcd-1 pod so that it can be rescheduled on a healthy node
      3. once the etcd-1 pod is up and healthy, check the health of the etcd cluster again
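
      A rough sketch of the commands behind the steps above; pod, container, PVC, and namespace names are placeholders,
      and the etcdctl invocations assume the cert/endpoint environment is already set inside the etcd container:
      {code:bash}
      HCP_NS="<hosted-control-plane-namespace>"   # placeholder

      # 1. Check health of the surviving members and defragment them.
      oc exec -n "$HCP_NS" etcd-0 -c etcd -- etcdctl endpoint health --cluster
      oc exec -n "$HCP_NS" etcd-0 -c etcd -- etcdctl defrag
      oc exec -n "$HCP_NS" etcd-2 -c etcd -- etcdctl defrag

      # 2. Delete the PVC that pins etcd-1 to the old node, then delete the pod so
      #    the StatefulSet recreates both on a schedulable node.
      oc delete pvc "<etcd-1-pvc-name>" -n "$HCP_NS"
      oc delete pod etcd-1 -n "$HCP_NS"

      # 3. Verify the member rejoined and the cluster is healthy again.
      oc exec -n "$HCP_NS" etcd-0 -c etcd -- etcdctl endpoint status --cluster -w table
      oc exec -n "$HCP_NS" etcd-0 -c etcd -- etcdctl endpoint health --cluster
      {code}
      Note that the PVC deletion stays pending on the pvc-protection finalizer until the pod is deleted, which is why
      step 2 pairs the two deletions.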
      
      The control plane operator did not attempt to fix the issue, and the autoscaler did not add new nodes despite the scheduler reporting: "nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 14 node(s) had volume node affinity conflict, 2 Insufficient memory, 2 Too many pods, 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) had untolerated taint {obo: true}, 7 node(s) didn't match pod anti-affinity rules, 88 node(s) had untolerated taint {hypershift.openshift.io/request-serving-component: true}. preemption: 0/122 nodes are available: 112 Preemption is not helpful for scheduling, 2 node(s) had volume node affinity conflict, 8 node(s) didn't match pod anti-affinity rules."
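
      It is unclear why no scale-up was triggered; a sketch of where one might look, assuming a standalone OCP management
      cluster running the default ClusterAutoscaler (the namespace and deployment names are assumptions):
      {code:bash}
      HCP_NS="<hosted-control-plane-namespace>"   # placeholder

      # Scheduler events for the pending member show the full failure message above.
      oc get events -n "$HCP_NS" --field-selector involvedObject.name=etcd-1

      # Autoscaler logs on the management cluster; look for why no node group was
      # considered a valid scale-up target for the pending pod.
      oc logs -n openshift-machine-api deployment/cluster-autoscaler-default
      {code}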
      
      At the same time, an upgrade of the management cluster (MC) was in progress, which was attempting to replace the node running etcd-0.
      
      
          Version-Release number of selected component (if applicable):

          4.17.z

      How reproducible:

      
          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      
          

      Expected results:

      
          

      Additional info:

      
          

              Assignee: Unassigned
              Reporter: Petr Kotas (pkotas)
              QA Contact: Yu Li
              Votes: 0
              Watchers: 3

                Created:
                Updated: