OCPBUGS-24228

machine-config ClusterOperator should not blip Available=False on brief missing HTTP content-type


    • Moderate
    • No
    • MCO Sprint 250, MCO Sprint 251
    • 2
    • Rejected
    • False

      Description of problem:

      Seen in 4.15 update CI:

      : [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
      Run #0: Failed	50m53s
      {  1 unexpected clusteroperator state transitions during e2e test run 
      
      Nov 28 21:08:32.700 - 6s    E clusteroperator/machine-config condition/Available reason/MachineConfigDaemonFailed status/False Cluster not available for [{operator 4.15.0-0.nightly-2023-11-28-101923}]: failed to apply machine config daemon manifests: rpc error: code = Unknown desc = malformed header: missing HTTP content-type}
      

      While the Kube API server, if that's what's missing the header, is supposed to always be available, an issue that only persists for 6s is not long enough to warrant immediate admin intervention. Teaching the machine-config operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
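One way to picture the requested behavior is a grace period: a failure only flips the condition once it has persisted past some threshold. The sketch below is an illustration only, not the operator's implementation (the machine-config operator is not written in shell). The names GRACE_SECONDS and report_available and the 120-second default are hypothetical, and a pure duration threshold is just one possible policy.

```shell
# GRACE_SECONDS is a hypothetical threshold, not a value taken from the operator.
GRACE_SECONDS=${GRACE_SECONDS:-120}

# report_available FAILING_SECONDS
# Prints the Available status to report for a sync failure that has persisted
# for FAILING_SECONDS (0 means the most recent sync succeeded).
report_available() {
  if [ "$1" -ge "$GRACE_SECONDS" ]; then
    echo False   # long-lived failure: worth summoning an admin
  else
    echo True    # brief hiccup: keep reporting Available=True
  fi
}

report_available 6     # the 6s blip from the CI run above
report_available 600   # a failure that has lasted ten minutes
```

Under these assumptions the 6s blip stays Available=True, while a ten-minute failure still goes Available=False.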

      Version-Release number of selected component (if applicable):

      $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/machine-config.*condition/Available.*status/False.*missing+HTTP+content-type' | grep 'failures match'
      periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 79 runs, 38% failed, 3% of failures match = 1% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 50 runs, 18% failed, 11% of failures match = 2% impact
      periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 6 runs, 33% failed, 50% of failures match = 17% impact
      periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 60 runs, 65% failed, 3% of failures match = 2% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
      periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 79 runs, 68% failed, 2% of failures match = 1% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 6% failed, 33% of failures match = 2% impact 
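The impact figure in each line above is consistent with multiplying the job's failure rate by the share of failures that match the search, then rounding to a whole percent. A minimal sketch under that assumption follows; the function name impact_pct is hypothetical, and the search tool presumably works from raw run counts rather than the rounded percentages used here.

```shell
# impact_pct FAILED_PCT MATCH_PCT
# Prints FAILED_PCT * MATCH_PCT / 100, rounded to the nearest whole percent,
# using integer arithmetic only.
impact_pct() {
  echo $(( ($1 * $2 + 50) / 100 ))
}

impact_pct 38 3    # 4.16-from-4.15 aws-ovn line: 38% failed, 3% of failures match
impact_pct 33 50   # metal-ipi-sdn-bm line: 33% failed, 50% of failures match
```

With the rounded inputs from the listing, these two calls reproduce the reported 1% and 17% impact values.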

      The impact rates are low enough that I haven't checked older 4.y. And it's possible that some of those matches have the operator going Available=False for other reasons besides missing HTTP content-type:

      $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/machine-config.*condition/Available.*status/' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/\([^ ]*\)[^:]*: \(.*\)|\1 \3 \2 \4|' | grep -v '.notAfter: Required value' | sort | uniq -c | sort -n
            1 machine-config False MachineConfigDaemonFailed failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)] Cluster not available for [{operator 4.15.0-0.okd-2023-11-27-134927}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
            1 machine-config False MachineConfigDaemonFailed failed to apply machine config daemon manifests: Operation cannot be fulfilled on daemonsets.apps "machine-config-daemon": the object has been modified; please apply your changes to the latest version and try again
            1 machine-config False MachineConfigPoolsFailed rpc error: code = Unknown desc = malformed header: missing HTTP content-type
            1 machine-config False MachineConfigServerFailed failed to apply machine config server manifests: rpc error: code = Internal desc = server closed the stream without sending trailers
            1 machine-config False MachineConfigServerFailed failed to apply machine config server manifests: rpc error: code = Unknown desc = malformed header: missing HTTP content-type
            1 machine-config False MachineOSBuilderFailed failed to apply machine os builder manifests: rpc error: code = Unavailable desc = the connection is draining
            2 machine-config False MachineConfigControllerFailed failed to apply machine config controller manifests: rpc error: code = Unknown desc = malformed header: missing HTTP content-type
            2 machine-config False MachineOSBuilderFailed failed to apply machine os builder manifests: rpc error: code = Unknown desc = malformed header: missing HTTP content-type
            3 machine-config False MachineConfigDaemonFailed failed to apply machine config daemon manifests: rpc error: code = Unknown desc = malformed header: missing HTTP content-type
      
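To see what the sed step in the pipeline above does, it can be applied to the single transition line quoted in the problem description. The wrapper name normalize_transition is hypothetical; the sed program is copied unchanged from the command above.

```shell
# Reorders one transition line into "operator status reason message",
# dropping the timestamp and the text between the status and the first ": ".
normalize_transition() {
  sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/\([^ ]*\)[^:]*: \(.*\)|\1 \3 \2 \4|'
}

line='Nov 28 21:08:32.700 - 6s    E clusteroperator/machine-config condition/Available reason/MachineConfigDaemonFailed status/False Cluster not available for [{operator 4.15.0-0.nightly-2023-11-28-101923}]: failed to apply machine config daemon manifests: rpc error: code = Unknown desc = malformed header: missing HTTP content-type'

printf '%s\n' "$line" | normalize_transition
```

The result has the same shape as the tallied lines: the per-release "Cluster not available for [...]" prefix is stripped, so identical failures from different payloads collapse into one bucket for uniq -c.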

      I'm excluding notAfter: Required value there, because that's already tracked in OCPBUGS-22364.

      How reproducible:

      2% impact for periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade looks like the highest impact among the jobs with double-digit run counts and fairly boring job coverage (e.g. no 4.14-to-4.15 minor-version bump or rollback).

      Steps to Reproduce:

      Run periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade a bunch of times, watching the machine-config ClusterOperator's Available condition.
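Outside of the CI monitor, the condition could be watched by hand with a small polling loop along these lines. This is a sketch, assuming an `oc` client already logged in to the cluster under test. The function name poll_available, its two arguments, and the OC override variable are hypothetical, and a coarse polling interval can miss blips as short as the 6s one reported here.

```shell
# poll_available COUNT INTERVAL
# Polls the machine-config ClusterOperator COUNT times, INTERVAL seconds apart,
# and prints a timestamped line whenever the Available status changes.
# OC defaults to the real `oc` client; set it to a stub command for dry runs.
poll_available() {
  count=$1
  interval=$2
  prev=""
  i=0
  while [ "$i" -lt "$count" ]; do
    cur=$(${OC:-oc} get clusteroperator machine-config \
      -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')
    if [ "$cur" != "$prev" ]; then
      printf '%s Available=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$cur"
      prev=$cur
    fi
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

For example, `poll_available 3600 1` would sample once a second for about an hour and log only the transitions.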

      Actual results:

      Some very brief blips of Available=False that self-resolve before an admin could possibly respond to the summons.

      Expected results:

      No quickly-resolving blips in CI. No long runs of Available=False for issues that don't seem worth summoning an admin. Still going Available=False for outages that need immediate admin response.

            djoshy David Joshy
            trking W. Trevor King
            Sergio Regidor de la Rosa Sergio Regidor de la Rosa