OpenShift Bugs / OCPBUGS-36462

control-plane-machine-set goes Available=False with UnavailableReplicas during etcd scale testing


    • Moderate
    • None
    • 3
    • ETCD Sprint 257
    • 1
    • Rejected
    • False
    • Previously, the health checks for the etcd Operator were not ordered. As a consequence, the health check sometimes failed even though all etcd members were healthy. The health-check failure triggered a scale-down event that caused the Operator to prematurely remove a healthy member. With this release, the health checks in the Operator are ordered. As a result, the health checks correctly reflect the health of etcd members and an incorrect scale-down event does not occur. (link:https://issues.redhat.com/browse/OCPBUGS-36462[*OCPBUGS-36462*])
    • Bug Fix
    • Done

      Description of problem

      Similar to OCPBUGS-20061, but for a different situation:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&name=pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep 'failures match' | sort
      pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling (all) - 15 runs, 60% failed, 33% of failures match = 20% impact
      

      In that test, since ETCD-329, the test suite deletes a control-plane Machine and waits for the ControlPlaneMachineSet controller to scale in a replacement. But in runs like this, the outgoing Node goes Ready=Unknown for reasons that have not yet been diagnosed, that somehow slips past the inertia added in cpmso#294 (maybe the Running guard should be dropped?), and the ClusterOperator goes Available=False, complaining about Missing 1 available replica(s).
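      For anyone poking at a live run, a few read-only oc queries along these lines show the pieces involved (a rough sketch; the label selectors and jsonpath formatting are illustrative, not copied from the test suite):

      $ oc get clusteroperator control-plane-machine-set -o jsonpath='{.status.conditions[?(@.type=="Available")]}{"\n"}'
      $ oc -n openshift-machine-api get controlplanemachineset cluster
      $ oc -n openshift-machine-api get machines -l machine.openshift.io/cluster-api-machine-role=master
      $ oc get nodes -l node-role.kubernetes.io/master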

      It's not clear from the message which replica the operator is worried about (that would be helpful information to include in the message), but I suspect it's the Machine/Node that is in the process of being deleted. Regardless of which replica it is, this does not seem like a situation worth a cluster-admin-midnight-page Available=False alarm.
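      Even without that detail in the message, the ControlPlaneMachineSet status and the per-Machine phases narrow down which replica is being counted as unavailable; something like the following should surface it (a sketch, assuming the usual CPMS status fields and Machine status layout):

      $ oc -n openshift-machine-api get controlplanemachineset cluster -o jsonpath='{.status.replicas} {.status.readyReplicas} {.status.unavailableReplicas}{"\n"}'
      $ oc -n openshift-machine-api get machines -l machine.openshift.io/cluster-api-machine-role=master -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name,DELETING:.metadata.deletionTimestamp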

      Version-Release number of selected component

      Seen in dev-branch CI. I haven't gone back to check older 4.y.

      How reproducible

      CI Search shows a 20% impact; see the query earlier in this report.

      Steps to Reproduce

      Run a batch of pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling jobs and check the CI Search results.
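      Outside of CI, a rough manual approximation (a sketch, not the exact ETCD-329 test procedure, and only suitable for a throwaway test cluster since it deletes a control-plane Machine) is to delete one control-plane Machine and watch whether the ClusterOperator dips to Available=False while the ControlPlaneMachineSet controller provisions the replacement:

      $ MACHINE=$(oc -n openshift-machine-api get machines -l machine.openshift.io/cluster-api-machine-role=master -o jsonpath='{.items[0].metadata.name}')
      $ oc -n openshift-machine-api delete machine "$MACHINE" --wait=false
      $ oc get clusteroperator control-plane-machine-set -w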

      Actual results

      20% impact

      Expected results

      No hits.

              Haseeb Tariq (rhn-coreos-htariq)
              W. Trevor King (trking)
              Ge Liu
              Laura Hinson
              Votes: 0
              Watchers: 10
