-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
4.15
-
Moderate
-
No
-
Rejected
-
False
-
Description of problem:
Reviving bugzilla#2010539, the authentication ClusterOperator occasionally blips Available=False with reason=WellKnown_NotReady. For example, this run includes:
: [bz-apiserver-auth] clusteroperator/authentication should not change condition/Available 47m21s
{ 1 unexpected clusteroperator state transitions during e2e test run. These did not match any known exceptions, so they cause this test-case to fail:
Oct 03 19:11:20.502 - 245ms E clusteroperator/authentication condition/Available reason/WellKnown_NotReady status/False WellKnownAvailable: The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://10.0.0.3:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.3:6443: i/o timeout
While a dial timeout for the Kube API server isn't fantastic, an issue that only persists for 245ms is not long enough to warrant immediate admin intervention. Teaching the authentication operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
OCPBUGS-32089 tracks narrowly denoising WellKnown_NotReady. This ticket tracks more generic Available=False denoising.
Version-Release number of selected component (if applicable):
4.8, 4.10, and 4.15. Likely all supported versions of the authentication operator have this exposure.
How reproducible:
Looks like 10 to 50% of 4.15 runs have some kind of issue with authentication going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today. Feel free to push back if you feel that some of these do warrant immediate admin intervention.
Steps to Reproduce:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/authentication+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
Actual results:
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 44% failed, 13% of failures match = 6% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 67% failed, 17% of failures match = 11% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-heterogeneous (all) - 18 runs, 56% failed, 30% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ovn-serial-aws-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-serial-ovn-ppc64le-powervs (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 67% failed, 25% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 50% failed, 33% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 86% of failures match = 36% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 21% failed, 76% of failures match = 16% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn-techpreview-serial (all) - 7 runs, 29% failed, 100% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 28% failed, 36% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 39% failed, 123% of failures match = 48% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 71 runs, 49% failed, 80% of failures match = 39% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 75% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 7 runs, 100% failed, 57% of failures match = 57% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 34% failed, 4% of failures match = 1% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-azure-sdn (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial (all) - 7 runs, 43% failed, 67% of failures match = 29% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-serial-ovn-ipv6 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 100% failed, 57% of failures match = 57% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 12 runs, 58% failed, 14% of failures match = 8% impact
Digging into reason and message frequency in 4.15-related update CI:
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's/[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*/x.x.x.x/g;s|[.]apps[.][^/]*|.apps.../|g' | sort | uniq -c | sort -n
1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout
1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
1 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
1 authentication APIServices_Error rpc error: code = Unavailable desc = the connection is draining
1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp: lookup oauth-openshift.apps.../
1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp x.x.x.x:443: connect: connection refused
1 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://[fd02::410f]:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
1 Nov 28 09:09:40.407 - 1s E clusteroperator/authentication condition/Available reason/APIServerDeployment_PreconditionNotFulfilled status/False
2 authentication APIServerDeployment_NoPod no .openshift-oauth-apiserver pods available on any node.
2 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2 authentication APIServices_Error rpc error: code = Unknown desc = malformed header: missing HTTP content-type
4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout
6 authentication OAuthServerDeployment_NoDeployment deployment/openshift-authentication: could not be retrieved
7 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"\nAPIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
7 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: i/o timeout (Client.Timeout exceeded while awaiting headers)
8 authentication APIServerDeployment_NoPod no apiserver.openshift-oauth-apiserver pods available on any node.
9 authentication APIServerDeployment_NoDeployment deployment/openshift-oauth-apiserver: could not be retrieved
9 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": EOF
11 authentication WellKnown_NotReady The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://x.x.x.x:6443/.well-known/oauth-authorization-server: dial tcp x.x.x.x:6443: i/o timeout
23 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
26 authentication APIServices_Error "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
29 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
29 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
30 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: connect: connection refused
34 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
And simplifying by looking only at reason:
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
1 authentication APIServerDeployment_PreconditionNotFulfilled
6 authentication OAuthServerDeployment_NoDeployment
8 authentication APIServerDeployment_NoDeployment
10 authentication APIServerDeployment_NoPod
11 authentication WellKnown_NotReady
36 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
43 authentication APIServices_PreconditionNotReady
66 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable
95 authentication APIServices_Error
Expected results:
Authentication goes Available=False if and only if immediate admin intervention is appropriate.
- is cloned by
-
OCPBUGS-32089 Authentication blips Available=False with WellKnown_NotReady
- Closed
- relates to
-
OTA-362 CI: fail update suite if any ClusterOperator go Available=False
- Closed