[OCPBUGS-23746] openshift-apiserver ClusterOperator should not blip Available=False on brief missing HTTP content-type - Red Hat Issue Tracker

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.15, 4.16, 4.17, 4.18
Component/s: openshift-apiserver
Labels:
- api-triage-inprogress

Severity:
Moderate
Regression:
No
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

Hide

None

Show
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

: [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available expand_less
Run #0: Failed expand_less	1h28m25s
{  1 unexpected clusteroperator state transitions during e2e test run 

Nov 22 21:47:32.876 - 1s    E clusteroperator/openshift-apiserver condition/Available reason/APIServices_Error status/False APIServicesAvailable: rpc error: code = Unknown desc = malformed header: missing HTTP content-type}

While the Kube API server, if that's what's missing the header, is supposed to always be available, an issue that only persists for 1s is not long enough to warrant immediate admin intervention. Teaching the openshift-apiserver operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where least part of the component is non-functional, and that the condition requires immediate administrator intervention would make it easier for admins and SREs operating clusters to identify when intervention was required.

Version-Release number of selected component (if applicable):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ibmcloud-ovn-multi-ppc64le (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ibmcloud-ovn-multi-s390x (all) - 4 runs, 25% failed, 200% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 5 runs, 100% failed, 40% of failures match = 40% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 5 runs, 20% failed, 200% of failures match = 40% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 50 runs, 56% failed, 21% of failures match = 12% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 17% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 13% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 6% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 50 runs, 16% failed, 63% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-from-stable-4.13-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 18% failed, 11% of failures match = 2% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-ovn-etcd-scaling (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-ibmcloud-csi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 15 runs, 47% failed, 14% of failures match = 7% impact

The impact rates are low enough that I haven't checked older 4.y. And it's possible that some of those matches have the operator going Available=False for other reasons besides APIServices_Error:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/openshift-apiserver.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
      2 openshift-apiserver APIServerDeployment_NoPod
      2 openshift-apiserver APIServerDeployment_PreconditionNotFulfilled
     19 openshift-apiserver APIServices_Error
     22 openshift-apiserver APIServerDeployment_NoDeployment

How reproducible:

12% impact for periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade looks like the highest impact among the jobs with double-digit run counts.

Steps to Reproduce:

Run periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade a bunch of times watching the openshift-apiserver ClusterOperator's Available condition.

Actual results:

Some very brief blips of Available=False that self-resolve before an admin could possibly resolve to the summons.

Expected results:

No quickly-resolving blips in CI. No long runs of Available=False for issues that don't seem worth summoning an admin. Still going Available=False for outages that need immediate admin response.

is related to

OCPBUGS-24228 machine-config ClusterOperator should not blip Available=False on brief missing HTTP content-type

Closed

relates to

OCPBUGS-44887 [CI] [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available

Closed

OTA-362 CI: fail update suite if any ClusterOperator go Available=False

Closed

Details

Description

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates

Hide