[OCPBUGS-20056] Single short-lived operand blip shouldn't cause authentication operator Available=False

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.15
Component/s: apiserver-auth
Labels:
- triaged

Severity: Moderate
Regression: No
Release Blocker: Rejected
Blocked: False
Blocked Reason: None

Description of problem:

Reviving bugzilla#2010539, the authentication ClusterOperator occasionally blips Available=False with reason=WellKnown_NotReady. For example, this run includes:

: [bz-apiserver-auth] clusteroperator/authentication should not change condition/Available 47m21s
{  1 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Oct 03 19:11:20.502 - 245ms E clusteroperator/authentication condition/Available reason/WellKnown_NotReady status/False WellKnownAvailable: The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://10.0.0.3:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.3:6443: i/o timeout

While a dial timeout for the Kube API server isn't fantastic, an issue that only persists for 245ms is not long enough to warrant immediate admin intervention. Teaching the authentication operator to stay Available=True for this kind of brief hiccup, while still going Available=False when at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
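One way to tolerate a brief hiccup like this is to retry the probe a few times before reporting unavailability. The following is a minimal shell sketch of that idea only, not the operator's actual implementation; the function name `probe_with_retries`, the attempt count, and the delay are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the operator): run a probe command up
# to $1 times, sleeping $2 seconds between attempts. It succeeds as soon
# as one attempt succeeds and fails only if every attempt fails.
probe_with_retries() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only wait if another attempt is still coming.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `probe_with_retries 3 1 curl -ksf --max-time 5 https://10.0.0.3:6443/.well-known/oauth-authorization-server` would report failure only after three consecutive failed requests, so a single short-lived timeout would not be surfaced on its own.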

~~OCPBUGS-32089~~ tracks narrowly denoising WellKnown_NotReady. This ticket tracks more generic Available=False denoising.

Version-Release number of selected component (if applicable):

4.8, 4.10, and 4.15. Likely all supported versions of the authentication operator have this exposure.

How reproducible:

Looks like 10 to 50% of 4.15 runs have some kind of issue with authentication going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today. Feel free to push back if you feel that some of these do warrant immediate admin intervention.

Steps to Reproduce:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/authentication+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort

Actual results:

periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 44% failed, 13% of failures match = 6% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 67% failed, 17% of failures match = 11% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-heterogeneous (all) - 18 runs, 56% failed, 30% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ovn-serial-aws-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-serial-ovn-ppc64le-powervs (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 67% failed, 25% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 50% failed, 33% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 86% of failures match = 36% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 21% failed, 76% of failures match = 16% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn-techpreview-serial (all) - 7 runs, 29% failed, 100% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 28% failed, 36% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 39% failed, 123% of failures match = 48% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 71 runs, 49% failed, 80% of failures match = 39% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 75% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 7 runs, 100% failed, 57% of failures match = 57% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 34% failed, 4% of failures match = 1% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-azure-sdn (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial (all) - 7 runs, 43% failed, 67% of failures match = 29% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-serial-ovn-ipv6 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 100% failed, 57% of failures match = 57% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 12 runs, 58% failed, 14% of failures match = 8% impact

Digging into reason and message frequency in 4.15-related update CI:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's/[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*/x.x.x.x/g;s|[.]apps[.][^/]*|.apps.../|g' | sort | uniq -c | sort -n
      1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout
      1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
      1 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
      1 authentication APIServices_Error rpc error: code = Unavailable desc = the connection is draining
      1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp: lookup oauth-openshift.apps.../
      1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp x.x.x.x:443: connect: connection refused
      1 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://[fd02::410f]:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      1 Nov 28 09:09:40.407 - 1s    E clusteroperator/authentication condition/Available reason/APIServerDeployment_PreconditionNotFulfilled status/False
      2 authentication APIServerDeployment_NoPod no .openshift-oauth-apiserver pods available on any node.
      2 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
      2 authentication APIServices_Error rpc error: code = Unknown desc = malformed header: missing HTTP content-type
      4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
      4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout
      6 authentication OAuthServerDeployment_NoDeployment deployment/openshift-authentication: could not be retrieved
      7 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"\nAPIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
      7 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: i/o timeout (Client.Timeout exceeded while awaiting headers)
      8 authentication APIServerDeployment_NoPod no apiserver.openshift-oauth-apiserver pods available on any node.
      9 authentication APIServerDeployment_NoDeployment deployment/openshift-oauth-apiserver: could not be retrieved
      9 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": EOF
     11 authentication WellKnown_NotReady The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://x.x.x.x:6443/.well-known/oauth-authorization-server: dial tcp x.x.x.x:6443: i/o timeout
     23 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
     26 authentication APIServices_Error "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
     29 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
     29 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
     30 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: connect: connection refused
     34 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

And simplifying by looking only at reason:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
      1 authentication APIServerDeployment_PreconditionNotFulfilled
      6 authentication OAuthServerDeployment_NoDeployment
      8 authentication APIServerDeployment_NoDeployment
     10 authentication APIServerDeployment_NoPod
     11 authentication WellKnown_NotReady
     36 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
     43 authentication APIServices_PreconditionNotReady
     66 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable
     95 authentication APIServices_Error
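To see what the reason-only sed expression does without querying the CI search service, it can be run against a single sample line. The sketch below uses a shortened copy of the event line from the description; the wrapper name `reduce_to_reason` is hypothetical.

```shell
# Shortened sample junit context line, taken from the description above.
line='Oct 03 19:11:20.502 - 245ms E clusteroperator/authentication condition/Available reason/WellKnown_NotReady status/False WellKnownAvailable: The well-known endpoint is not yet available'

# Hypothetical wrapper around the same sed expression used in the pipeline:
# keep only the operator name and the reason, dropping timestamp and message.
reduce_to_reason() {
    sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|'
}

printf '%s\n' "$line" | reduce_to_reason
# prints: authentication WellKnown_NotReady
```

Piping many such lines through `sort | uniq -c | sort -n` then yields the per-reason counts shown above.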

Expected results:

Authentication goes Available=False if and only if immediate admin intervention is appropriate.

is cloned by

OCPBUGS-32089 Authentication blips Available=False with WellKnown_NotReady

Closed

relates to

OTA-362 CI: fail update suite if any ClusterOperator go Available=False

Closed

links to

openshift/cluster-authentication-operator#664: OCPBUGS-20056: wellknown-readiness: perform several attempts to connect before going unavailable

Petr Muller added a comment - 2025/01/27 10:23 AM

$ curl -s 'https://search.dptools.openshift.org/search?maxAge=48h&type=junit&name=4.18.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
      3 authentication APIServices_PreconditionNotReady
      3 authentication OAuthServerDeployment_NoDeployment
      5 authentication WellKnown_NotReady
      8 authentication APIServices_Error
     14 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
     16 authentication APIServerDeployment_NoDeployment
     89 authentication APIServerDeployment_PreconditionNotFulfilled

OpenShift Jira Bot added a comment - 2025/01/27 12:20 AM

This bug is being closed because it has not had any activity in the past 3 months. While it represents a valid problem, leaving such bugs open provides a false indication that they will be addressed. Please reopen the bug if you have additional context that would help us better understand what needs to be done.

W. Trevor King added a comment - 2024/10/28 9:00 PM

Still a problem in 4.18:

$ curl -s 'https://search.dptools.openshift.org/search?maxAge=48h&type=junit&name=4.18.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
      1 authentication APIServerDeployment_NoPod
      1 authentication APIServices_PreconditionNotReady
      1 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable
      1 authentication WellKnown_NotReady
      6 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
     35 authentication APIServerDeployment_PreconditionNotFulfilled

OpenShift Jira Bot added a comment - 2024/10/27 11:40 PM

This bug is being closed because it has not had any activity in the past 3 months. While it represents a valid problem, leaving such bugs open provides a false indication that they will be addressed. Please reopen the bug if you have additional context that would help us better understand what needs to be done.

Petr Muller added a comment - 2024/06/21 2:00 PM - edited

Linking several tests that can fail because of an Available=False blip, so that this card shows up in Sippy:

[sig-arch][Early] Managed cluster should [apigroup:config.openshift.io] start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable before beginning upgrade [Early][Suite:upgrade]
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

W. Trevor King added a comment - 2024/06/21 1:54 PM - edited

Checking in on the current 4.17 contributors:

$ curl -s 'https://search.dptools.openshift.org/search?maxAge=48h&type=junit&name=4.17.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
      1 authentication APIServerDeployment_PreconditionNotFulfilled
      1 authentication ReadyIngressNodes_NoReadyIngressNodes
      4 authentication OAuthServerDeployment_NoDeployment
      5 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
      9 authentication APIServerDeployment_NoPod
     10 authentication APIServerDeployment_NoDeployment
     20 authentication APIServices_PreconditionNotReady
     93 authentication APIServices_Error

And sorting by portions of the message details that seem relevant:

$ curl -s 'https://search.dptools.openshift.org/search?maxAge=48h&type=junit&context=0&name=4.17.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | grep -o '503, err = the server is currently unable to handle the request\|Timeout exceeded while awaiting headers\|have no addresses with port name\|PreconditionNotReady\|deployment/[a-z-]*: could not be retrieved\|no [^ ]* pods available on any node' | sort | uniq -c | sort -n
      1 no .openshift-oauth-apiserver pods available on any node
      4 deployment/openshift-authentication: could not be retrieved
      6 Timeout exceeded while awaiting headers
      8 no apiserver.openshift-oauth-apiserver pods available on any node
     10 deployment/openshift-oauth-apiserver: could not be retrieved
     40 PreconditionNotReady
     58 have no addresses with port name
     67 503, err = the server is currently unable to handle the request

So I'd guess retries here and/or here would help as the next leg of this effort. Did you want me to spin that out into a separate bug, like we did with ~~OCPBUGS-32089~~?

W. Trevor King added a comment - 2024/04/10 10:11 PM

I've spun out ~~OCPBUGS-32089~~ to cover just WellKnown_NotReady. That way we have the new bug to land the auth-operator#664 improvement on. I've retitled this bug to be more clearly generic, and it's still the one referenced by the test-suite exception.

Scott Dodson added a comment - 2024/02/15 2:14 PM

Correct, we have strategic initiatives aimed at getting to a point where we don't have a bunch of negative signal during upgrades. Service Delivery is telling us that the signal emitted during upgrades is so full of no-action-required alerts or negative operator conditions that they have to silence all monitoring, which creates blind spots for the rare real problem during upgrades.

Petr Muller added a comment - 2024/02/15 12:24 PM

The justification is that operators that go Available=False without good reason (its contract says "If this is false, it means there is an outage. Someone is probably getting paged") cause a lot of noise, especially during upgrades. It is one of the major problems for OSD, per Chistoph Blecker:

> So the biggest alerts by far that are painful are ClusterOperatorDown and ClusterOperatorDegraded. In particular the authentication and network operators flap a ton during the upgrade

See also OSD-13696, ~~OTA-700~~ and ~~OCPSTRAT-835~~. Upgrades are noisy and customers hate that. Fake blips make noise and make testing harder (~~OTA-362~~, ~~OTA-1167~~)

Stanislav Láznička (Inactive) added a comment - 2024/02/15 8:01 AM

We're going to need a better justification. Moved back to Normal.

Assignee: Unassigned
Reporter: W. Trevor King
QA Contact: Deepak Punia (Inactive)
Votes: 0
Watchers: 9
Created: 2023/10/03 8:05 PM
Updated: 2025/01/27 10:23 AM
