OCPBUGS-20061

control-plane-machine-set goes Available=False with UnavailableReplicas during updates


    • Bug
    • Resolution: Done-Errata
    • Major
    • None
    • 4.13, 4.12, 4.14, 4.15, 4.16, 4.17
    • None
    • Moderate
    • No
    • CLOUD Sprint 249, CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255
    • 7
    • Rejected
    • False
    • None
    • * Previously, a node associated with a rebooting machine briefly having a status of `Ready=Unknown` triggered the `UnavailableReplicas` condition in the Control Plane Machine Set Operator. This condition caused the Operator to enter the `Available=False` state and trigger alerts because that state indicates a nonfunctional component that requires immediate administrator intervention. This alert should not have been triggered for the brief and expected unavailability while rebooting. With this release, a grace period for node unreadiness is added to avoid triggering unnecessary alerts. (link:https://issues.redhat.com/browse/OCPBUGS-20061[*OCPBUGS-20061*])
    • Bug Fix
    • Done

      Description of problem:

      Possibly reviving OCPBUGS-10771, the control-plane-machine-set ClusterOperator occasionally goes Available=False with reason=UnavailableReplicas. For example, this run includes:

      : [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available  1h34m30s
      {  3 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:
      
      Oct 03 22:03:29.822 - 106s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
      Oct 03 22:08:34.162 - 98s   E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
      Oct 03 22:13:01.645 - 118s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
      

      But those are the nodes rebooting into newer RHCOS, and they do not warrant immediate admin intervention. Teaching the CPMS operator to stay Available=True for this kind of brief hiccup, while still going Available=False when at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
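
      To make the "brief hiccup" point concrete, the three transitions quoted above can be parsed and their durations compared. This is only a sketch: the regular expression is inferred from the three lines shown in this issue, not from any documented monitor format, and `parse_interval` is a hypothetical helper name.

```python
import re

# Pattern inferred from the three interval lines quoted in this issue;
# it is an assumption, not a documented format.
INTERVAL = re.compile(
    r"^(?P<start>\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d{3}) - (?P<seconds>\d+)s\s+"
    r"E clusteroperator/(?P<operator>\S+) condition/(?P<condition>\S+) "
    r"reason/(?P<reason>\S+) status/(?P<status>\S+) (?P<message>.*)$"
)


def parse_interval(line):
    """Return the fields of one interval line as a dict, or None if it does not match."""
    match = INTERVAL.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["seconds"] = int(fields["seconds"])
    return fields


# The three transitions reported in this issue, copied verbatim.
LINES = [
    "Oct 03 22:03:29.822 - 106s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)",
    "Oct 03 22:08:34.162 - 98s   E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)",
    "Oct 03 22:13:01.645 - 118s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)",
]

intervals = [parse_interval(line) for line in LINES]
durations = [interval["seconds"] for interval in intervals]
longest = max(durations)
```

      Each of the three Available=False intervals lasts under two minutes, which is consistent with one control plane node at a time rebooting during the update.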

      Version-Release number of selected component (if applicable):

      4.15. Possibly all supported versions of the CPMS operator have this exposure.

      How reproducible:

      Looks like many (all?) 4.15 update jobs have near 100% reproducibility for some kind of issue with CPMS going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today. Feel free to push back if you feel that some of these do warrant immediate admin intervention.

      Steps to Reproduce:

      w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
      

      Actual results:

      periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 127% of failures match = 78% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 200% of failures match = 95% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 114% of failures match = 89% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 143% of failures match = 91% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 207% of failures match = 86% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 7 runs, 43% failed, 200% of failures match = 86% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 71 runs, 24% failed, 382% of failures match = 92% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 70 runs, 30% failed, 281% of failures match = 84% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 71 runs, 38% failed, 233% of failures match = 89% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 171% of failures match = 84% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 63 runs, 37% failed, 222% of failures match = 81% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
      periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
      periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 100% of failures match = 54% impact
      periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 90% of failures match = 56% impact
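
      The percentages in these lines can be read as follows: "failed" is relative to all runs, "of failures match" is relative to failed runs, and "impact" is the share of all runs that match. A match rate above 100% then simply means that more runs matched the search than were counted as failed, presumably because the search also hits runs that did not fail overall. The sketch below checks that reading against a few lines from the list above; the regular expression and the `summarize` helper are hypothetical, and the run counts it derives are reconstructed from rounded percentages.

```python
import re

# Pattern inferred from the result lines quoted in this issue (an assumption).
RESULT = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def summarize(line):
    """Reconstruct failed and matching run counts from one result line."""
    m = RESULT.match(line.strip())
    if m is None:
        return None
    runs = int(m["runs"])
    failed_runs = round(runs * int(m["failed"]) / 100)
    matching_runs = round(failed_runs * int(m["match"]) / 100)
    return {
        "job": m["job"],
        "runs": runs,
        "failed_runs": failed_runs,
        "matching_runs": matching_runs,
        "reported_impact": int(m["impact"]),
        # Impact recomputed as matching runs over all runs, in percent.
        "computed_impact": round(100 * matching_runs / runs),
    }


heterogeneous = summarize(
    "periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact"
)
gcp_ovn = summarize(
    "periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact"
)
```

      For the heterogeneous-upgrade job this gives 19 runs, 8 failed runs, and 18 matching runs, so 18 of 19 runs (95%) saw CPMS go Available=False even though most of them were not counted as failed. The non-upgrade gcp-ovn job gives 1 matching run out of 6 (17%).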
      

      Expected results:

      CPMS goes Available=False if and only if immediate admin intervention is appropriate.
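
      The release note on this issue describes the fix as a grace period for node unreadiness. The sketch below illustrates that idea only; it is not the operator's code. The function name, the 5-minute value, and the treatment of every non-True Ready status as "unready" are assumptions made for illustration, and the real duration is not stated in this issue.

```python
from datetime import timedelta

# Hypothetical grace period; the value the operator actually uses is not
# given in this issue.
GRACE_PERIOD = timedelta(minutes=5)


def counts_as_available(ready_status, unready_for, grace=GRACE_PERIOD):
    """Decide whether a control plane replica should still count as available.

    ready_status: the node's Ready condition status ("True", "False", "Unknown").
    unready_for:  how long the node has been in a non-True Ready state.
    """
    if ready_status == "True":
        return True
    # A node that only recently stopped reporting Ready (for example while
    # rebooting) is tolerated until the grace period runs out.
    return unready_for < grace


# Outages of the length reported in this issue stay within the assumed grace period.
brief_outages = [timedelta(seconds=s) for s in (106, 98, 118)]
tolerated = [counts_as_available("Unknown", outage) for outage in brief_outages]

# A node that stays unready well past the grace period is counted as unavailable.
stuck = counts_as_available("Unknown", timedelta(minutes=30))
```

      Under these assumptions a reboot-length outage leaves the replica counted as available, while a node that stays unready past the grace period still drives Available=False, which matches the expected result stated above.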

              ddonati@redhat.com Damiano Donati
              trking W. Trevor King
              Milind Yadav Milind Yadav