Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-48325

[release-4.12] control-plane-machine-set goes Available=False with UnavailableReplicas during updates

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Major Major
    • 4.12.z
    • 4.13, 4.12, 4.14, 4.15, 4.16, 4.17
    • None
    • Moderate
    • No
    • CLOUD Sprint 265
    • 1
    • False
    • Hide

      None

      Show
      None
    • Hide
      * Previously, during node reboots, especially during update operations, the node that interacts with the rebooting machine entered a `Ready=Unknown` state for a short amount of time. This situation caused the Control Plane Machine Set Operator to enter an `UnavailableReplicas` condition and then an `Available=false` state. The `Available=false` state triggers alerts that demand urgent action, but in this case, intervention was only required for a short period of time until the node rebooted. With this release, a grace period for node unreadiness is provided where if a node enters an unready state, the Control Plane Machine Set Operator does not instantly enter an `UnavailableReplicas` condition or an `Available=false` state. (link:https://issues.redhat.com/browse/OCPBUGS-34971[*OCPBUGS-34971*]).
      Show
      * Previously, during node reboots, especially during update operations, the node that interacts with the rebooting machine entered a `Ready=Unknown` state for a short amount of time. This situation caused the Control Plane Machine Set Operator to enter an `UnavailableReplicas` condition and then an `Available=false` state. The `Available=false` state triggers alerts that demand urgent action, but in this case, intervention was only required for a short period of time until the node rebooted. With this release, a grace period for node unreadiness is provided where if a node enters an unready state, the Control Plane Machine Set Operator does not instantly enter an `UnavailableReplicas` condition or an `Available=false` state. (link: https://issues.redhat.com/browse/OCPBUGS-34971 [* OCPBUGS-34971 *]).
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-48259. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-48211. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-20061. The following is the description of the original issue:

      Description of problem:

      Possibly reviving OCPBUGS-10771, the control-plane-machine-set ClusterOperator occasionally goes Available=False with reason=UnavailableReplicas. For example, this run includes:

      : [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available expand_less	1h34m30s
      {  3 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:
      
      Oct 03 22:03:29.822 - 106s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
      Oct 03 22:08:34.162 - 98s   E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
      Oct 03 22:13:01.645 - 118s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
      

      But those are the nodes rebooting into newer RHCOS, and do not warrant immediate admin intervention. Teaching the CPMS operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where least part of the component is non-functional, and that the condition requires immediate administrator intervention would make it easier for admins and SREs operating clusters to identify when intervention was required.

      Version-Release number of selected component (if applicable):

      4.15. Possibly all supported versions of the CPMS operator have this exposure.

      How reproducible:

      Looks like many (all?) 4.15 update jobs have near 100% reproducibility for some kind of issue with CPMS going Available=False, see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today, feel free to push back if you feel that some of these do warrant admin immediate admin intervention.

      Steps to Reproduce:

      w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
      

      Actual results:

      periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 127% of failures match = 78% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 200% of failures match = 95% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 114% of failures match = 89% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 143% of failures match = 91% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 207% of failures match = 86% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 7 runs, 43% failed, 200% of failures match = 86% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 71 runs, 24% failed, 382% of failures match = 92% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 70 runs, 30% failed, 281% of failures match = 84% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 71 runs, 38% failed, 233% of failures match = 89% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 171% of failures match = 84% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 63 runs, 37% failed, 222% of failures match = 81% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
      periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
      periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 100% of failures match = 54% impact
      periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 90% of failures match = 56% impact
      

      Expected results:

      CPMS goes Available=False if and only if immediate admin intervention is appropriate.

              ddonati@redhat.com Damiano Donati
              openshift-crt-jira-prow OpenShift Prow Bot
              Milind Yadav Milind Yadav
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Created:
                Updated: