Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-37820

[4.16] control-plane-machine-set goes Available=False with UnavailableReplicas during etcd scale testing

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done-Errata
    • Icon: Major Major
    • 4.16.z
    • 4.17
    • Etcd
    • None
    • Moderate
    • None
    • 1
    • ETCD Sprint 257, ETCD Sprint 258
    • 2
    • Rejected
    • False
    • Hide

      None

      Show
      None
    • N/A
    • Release Note Not Required
    • Done

      This is a clone of issue OCPBUGS-36462. The following is the description of the original issue:

      Description of problem

      Similar to OCPBUGS-20061, but for a different situation:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&name=pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep 'failures match' | sort
      pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling (all) - 15 runs, 60% failed, 33% of failures match = 20% impact
      

      In that test, since ETCD-329, the test suite deletes a control-plane Machine and waits for the ControlPlaneMachineSet controller to scale in a replacement. But in runs like this, the outgoing Node goes Ready=Unknown for not-yet-diagnosed reasons, and that somehow misses cpmso#294's inertia (maybe the running guard should be dropped?), and the ClusterOperator goes Available=False complaining about Missing 1 available replica(s).

      It's not clear from the message which replica it's worried about (that would be helpful information to include in the message), but I suspect it's the Machine/Node that's in the deletion process. But regardless of the message, this does not seem like a situation worth a cluster-admin-midnight-page Available=False alarm.

      Version-Release number of selected component

      Seen in dev-branch CI. I haven't gone back to check older 4.y.

      How reproducible

      CI Search shows 20% impact, see my earlier query in this message.

      Steps to Reproduce

      Run a bunch of pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling and check CI Search results.

      Actual results

      20% impact

      Expected results

      No hits.

            tjungblu@redhat.com Thomas Jungblut
            openshift-crt-jira-prow OpenShift Prow Bot
            Ge Liu Ge Liu
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: