  1. OpenShift Bugs
  2. OCPBUGS-20061

control-plane-machine-set goes Available=False with UnavailableReplicas during updates

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.13, 4.12, 4.14, 4.15, 4.16, 4.17
    • None
    • Moderate
    • No
    • CLOUD Sprint 249, CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255
    • 7
    • Rejected
    • False

      Description of problem:

      Possibly reviving OCPBUGS-10771, the control-plane-machine-set ClusterOperator occasionally goes Available=False with reason=UnavailableReplicas. For example, this run includes:

      : [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available	1h34m30s
      {  3 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:
      
      Oct 03 22:03:29.822 - 106s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
      Oct 03 22:08:34.162 - 98s   E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
      Oct 03 22:13:01.645 - 118s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
      

      But those are the nodes rebooting into newer RHCOS, and they do not warrant immediate admin intervention. Teaching the CPMS operator to stay Available=True for this kind of brief hiccup, while still going Available=False when at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
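To make the distinction concrete, here is a minimal Python sketch that parses the three intervals quoted above and flags only those that outlast a grace period. The `GRACE_PERIOD` value and the helper names `parse_interval` and `needs_admin` are hypothetical, chosen for illustration; this is not how the CPMS operator computes its Available condition, and the operator itself may need other signals besides elapsed time.

```python
import re
from datetime import timedelta

# Hypothetical threshold, not taken from the CPMS operator: how long a replica
# may be missing before the condition is treated as needing an admin.
GRACE_PERIOD = timedelta(minutes=10)

# Matches the monitor lines quoted in this report.
INTERVAL_RE = re.compile(
    r"- (?P<seconds>\d+)s\s+E clusteroperator/(?P<operator>\S+) "
    r"condition/(?P<condition>\S+) reason/(?P<reason>\S+) status/(?P<status>\S+)"
)


def parse_interval(line):
    """Extract operator, condition, reason, status, and duration from one line."""
    match = INTERVAL_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["duration"] = timedelta(seconds=int(fields.pop("seconds")))
    return fields


def needs_admin(interval, grace=GRACE_PERIOD):
    """Return True only when an Available=False interval outlasts the grace period."""
    return (
        interval["condition"] == "Available"
        and interval["status"] == "False"
        and interval["duration"] > grace
    )


LINES = [
    "Oct 03 22:03:29.822 - 106s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)",
    "Oct 03 22:08:34.162 - 98s   E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)",
    "Oct 03 22:13:01.645 - 118s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)",
]

intervals = [parse_interval(line) for line in LINES]
for interval in intervals:
    print(interval["reason"], interval["duration"], needs_admin(interval))
```

With the assumed ten-minute grace period, none of the three intervals (106s, 98s, 118s) would be flagged, which matches the expectation that a control-plane node reboot alone should not page anyone.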

      Version-Release number of selected component (if applicable):

      4.15. Possibly all supported versions of the CPMS operator have this exposure.

      How reproducible:

      Looks like many (all?) 4.15 update jobs have near 100% reproducibility for some kind of issue with CPMS going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today. Feel free to push back if you feel that some of these do warrant immediate admin intervention.

      Steps to Reproduce:

      w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
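For anyone who prefers to script the query, the following Python sketch mirrors the command above: it rebuilds the same search URL and applies the same `grep` pattern and sort to text that has already been fetched. The helper names `build_search_url` and `filter_job_lines` are hypothetical, fetching and rendering the page (the `w3m -dump` step) is left out, and Python's `sorted` orders by code point, which may differ from a locale-aware `sort`.

```python
import re
from urllib.parse import urlencode

BASE_URL = "https://search.ci.openshift.org/"

# Same pattern as the grep in the command above.
JOB_LINE_RE = re.compile(r"^periodic-.*4[.]15.*failures match")


def build_search_url(test_name, max_age="48h"):
    """Build the JUnit search URL; '/' is left unescaped to match the original."""
    query = urlencode(
        {"maxAge": max_age, "type": "junit", "search": test_name}, safe="/"
    )
    return f"{BASE_URL}?{query}"


def filter_job_lines(text):
    """Keep the periodic 4.15 summary lines and sort them, like grep | sort."""
    return sorted(line for line in text.splitlines() if JOB_LINE_RE.search(line))


TEST_NAME = (
    "clusteroperator/control-plane-machine-set should not change condition/Available"
)

# Two summary lines copied from this report, plus one hypothetical line that
# the pattern should drop.
SAMPLE = "\n".join(
    [
        "periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact",
        "some other line that should be filtered out",
        "periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 143% of failures match = 91% impact",
    ]
)

print(build_search_url(TEST_NAME))
for line in filter_job_lines(SAMPLE):
    print(line)
```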
      

      Actual results:

      periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 127% of failures match = 78% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 200% of failures match = 95% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 114% of failures match = 89% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 143% of failures match = 91% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 207% of failures match = 86% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 7 runs, 43% failed, 200% of failures match = 86% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 71 runs, 24% failed, 382% of failures match = 92% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 70 runs, 30% failed, 281% of failures match = 84% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 71 runs, 38% failed, 233% of failures match = 89% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 171% of failures match = 84% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 63 runs, 37% failed, 222% of failures match = 81% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
      periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
      periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 100% of failures match = 54% impact
      periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 90% of failures match = 56% impact
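One way to read these summary lines, offered as an interpretation rather than a statement of how search.ci computes them: the impact figure is roughly the failed percentage multiplied by the match percentage. That would explain match percentages above 100%, since the test-case can match in runs that otherwise passed. The Python sketch below parses a line and checks that arithmetic; `parse_result` and `estimated_impact` are hypothetical helper names, and because the published percentages are already rounded, the estimate can drift by a point or two.

```python
import re

RESULT_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_result(line):
    """Split one summary line into the job name and its integer fields."""
    m = RESULT_RE.match(line.strip())
    if m is None:
        return None
    result = {"job": m["job"]}
    for key in ("runs", "failed", "match", "impact"):
        result[key] = int(m[key])
    return result


def estimated_impact(result):
    """Approximate impact (in percent) as failed% * match% / 100."""
    return result["failed"] * result["match"] / 100


LINES = [
    "periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact",
    "periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact",
]

for line in LINES:
    result = parse_result(line)
    print(result["job"], result["impact"], estimated_impact(result))
```

Under this reading, the non-upgrade `e2e-gcp-ovn` job stands out: its impact is far lower than that of the upgrade jobs, consistent with the report's point that the Available=False intervals come from nodes rebooting during updates.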
      

      Expected results:

      CPMS goes Available=False if and only if immediate admin intervention is appropriate.

            ddonati@redhat.com Damiano Donati
            trking W. Trevor King
            Milind Yadav Milind Yadav