OpenShift Bugs · OCPBUGS-20056

Single short-lived operand blip shouldn't cause authentication operator Available=False

Bug · Priority: Major · Resolution: Unresolved · Affects Version: 4.15 · Component: apiserver-auth · Severity: Moderate
      Description of problem:

      Reviving bugzilla#2010539, the authentication ClusterOperator occasionally blips Available=False with reason=WellKnown_NotReady. For example, this run includes:

      : [bz-apiserver-auth] clusteroperator/authentication should not change condition/Available	47m21s
      {  1 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:
      
      Oct 03 19:11:20.502 - 245ms E clusteroperator/authentication condition/Available reason/WellKnown_NotReady status/False WellKnownAvailable: The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://10.0.0.3:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.3:6443: i/o timeout
      

      While a dial timeout for the Kube API server isn't fantastic, an issue that only persists for 245ms is not long enough to warrant immediate admin intervention. Teaching the authentication operator to stay Available=True through this kind of brief hiccup, while still going Available=False when at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is actually needed.

      OCPBUGS-32089 tracks narrowly denoising WellKnown_NotReady.  This ticket tracks more generic Available=False denoising.
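
      To illustrate the kind of hold-off being requested, here is a minimal Go sketch: keep reporting Available=True until a failure has persisted past a threshold, so a 245ms blip never surfaces while a real outage still does. The type, method, and 30-second window below are hypothetical illustrations, not the authentication operator's actual code:

      package main

      import (
          "fmt"
          "time"
      )

      // availabilityDebouncer remembers when the operand first started
      // failing, so short blips can be ignored. Hypothetical type, not
      // taken from the operator's source.
      type availabilityDebouncer struct {
          holdOff      time.Duration // how long a failure must persist before surfacing
          failingSince time.Time     // zero value while the operand is healthy
      }

      // Observe takes a raw health-check result and returns the Available
      // status to report right now.
      func (d *availabilityDebouncer) Observe(healthy bool, now time.Time) bool {
          if healthy {
              d.failingSince = time.Time{} // any success resets the clock
              return true
          }
          if d.failingSince.IsZero() {
              d.failingSince = now // first failure: start the clock
          }
          // Stay Available=True until the failure outlives the hold-off.
          return now.Sub(d.failingSince) < d.holdOff
      }

      func main() {
          d := &availabilityDebouncer{holdOff: 30 * time.Second}
          t0 := time.Now()
          fmt.Println(d.Observe(false, t0))                           // true: blip just started
          fmt.Println(d.Observe(false, t0.Add(245*time.Millisecond))) // true: a 245ms blip stays Available
          fmt.Println(d.Observe(false, t0.Add(45*time.Second)))       // false: persistent outage surfaces
      }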

      Version-Release number of selected component (if applicable):

      4.8, 4.10, and 4.15. Likely all supported versions of the authentication operator have this exposure.

      How reproducible:

      Looks like 10 to 50% of 4.15 runs have some kind of issue with authentication going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today; feel free to push back if you feel that some of these do warrant immediate admin intervention.

      Steps to Reproduce:

      $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/authentication+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
      

      Actual results:

      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 44% failed, 13% of failures match = 6% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 67% failed, 17% of failures match = 11% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-heterogeneous (all) - 18 runs, 56% failed, 30% of failures match = 17% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ovn-serial-aws-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-serial-ovn-ppc64le-powervs (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 67% failed, 25% of failures match = 17% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 50% failed, 33% of failures match = 17% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 86% of failures match = 36% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 21% failed, 76% of failures match = 16% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn-techpreview-serial (all) - 7 runs, 29% failed, 100% of failures match = 29% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 28% failed, 36% of failures match = 10% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 39% failed, 123% of failures match = 48% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 71 runs, 49% failed, 80% of failures match = 39% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 75% of failures match = 43% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 7 runs, 100% failed, 57% of failures match = 57% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 34% failed, 4% of failures match = 1% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-azure-sdn (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial (all) - 7 runs, 43% failed, 67% of failures match = 29% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-serial-ovn-ipv6 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
      periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
      periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
      periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 100% failed, 57% of failures match = 57% impact
      periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 12 runs, 58% failed, 14% of failures match = 8% impact
      

      Digging into reason and message frequency in 4.15-related update CI:

      $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's/[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*/x.x.x.x/g;s|[.]apps[.][^/]*|.apps.../|g' | sort | uniq -c | sort -n
            1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout
            1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
            1 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
            1 authentication APIServices_Error rpc error: code = Unavailable desc = the connection is draining
            1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp: lookup oauth-openshift.apps.../
            1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp x.x.x.x:443: connect: connection refused
            1 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://[fd02::410f]:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
            1 Nov 28 09:09:40.407 - 1s    E clusteroperator/authentication condition/Available reason/APIServerDeployment_PreconditionNotFulfilled status/False
            2 authentication APIServerDeployment_NoPod no .openshift-oauth-apiserver pods available on any node.
            2 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
            2 authentication APIServices_Error rpc error: code = Unknown desc = malformed header: missing HTTP content-type
            4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
            4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout
            6 authentication OAuthServerDeployment_NoDeployment deployment/openshift-authentication: could not be retrieved
            7 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"\nAPIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
            7 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: i/o timeout (Client.Timeout exceeded while awaiting headers)
            8 authentication APIServerDeployment_NoPod no apiserver.openshift-oauth-apiserver pods available on any node.
            9 authentication APIServerDeployment_NoDeployment deployment/openshift-oauth-apiserver: could not be retrieved
            9 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": EOF
           11 authentication WellKnown_NotReady The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://x.x.x.x:6443/.well-known/oauth-authorization-server: dial tcp x.x.x.x:6443: i/o timeout
           23 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
           26 authentication APIServices_Error "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
           29 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
           29 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
           30 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: connect: connection refused
           34 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

      And simplifying by looking only at reason:

       $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
            1 authentication APIServerDeployment_PreconditionNotFulfilled
            6 authentication OAuthServerDeployment_NoDeployment
            8 authentication APIServerDeployment_NoDeployment
           10 authentication APIServerDeployment_NoPod
           11 authentication WellKnown_NotReady
           36 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
           43 authentication APIServices_PreconditionNotReady
           66 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable
           95 authentication APIServices_Error
      

       

      Expected results:

      Authentication goes Available=False if and only if immediate admin intervention is appropriate.

            Comments:

            Petr Muller added a comment:

            $ curl -s 'https://search.dptools.openshift.org/search?maxAge=48h&type=junit&name=4.18.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
                  3 authentication APIServices_PreconditionNotReady
                  3 authentication OAuthServerDeployment_NoDeployment
                  5 authentication WellKnown_NotReady
                  8 authentication APIServices_Error
                 14 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
                 16 authentication APIServerDeployment_NoDeployment
                 89 authentication APIServerDeployment_PreconditionNotFulfilled

            OpenShift Jira Bot added a comment:

            This bug is being closed because it has not had any activity in the past 3 months. While it represents a valid problem, leaving such bugs open provides a false indication that they will be addressed. Please reopen the bug if you have additional context that would help us better understand what needs to be done.

            W. Trevor King added a comment:

            Still a problem in 4.18:

            $ curl -s 'https://search.dptools.openshift.org/search?maxAge=48h&type=junit&name=4.18.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
                  1 authentication APIServerDeployment_NoPod
                  1 authentication APIServices_PreconditionNotReady
                  1 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable
                  1 authentication WellKnown_NotReady
                  6 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
                 35 authentication APIServerDeployment_PreconditionNotFulfilled

            OpenShift Jira Bot added a comment:

            This bug is being closed because it has not had any activity in the past 3 months. While it represents a valid problem, leaving such bugs open provides a false indication that they will be addressed. Please reopen the bug if you have additional context that would help us better understand what needs to be done.

            Petr Muller added a comment (edited):

            Linking several tests that can fail because of an Available=False blip, so that this card shows up in Sippy:

            [sig-arch][Early] Managed cluster should [apigroup:config.openshift.io] start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
            [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable before beginning upgrade [Early][Suite:upgrade]
            [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
            [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

            W. Trevor King added a comment (edited):

            Checking in on the current 4.17 contributors:

            $ curl -s 'https://search.dptools.openshift.org/search?maxAge=48h&type=junit&name=4.17.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
                  1 authentication APIServerDeployment_PreconditionNotFulfilled
                  1 authentication ReadyIngressNodes_NoReadyIngressNodes
                  4 authentication OAuthServerDeployment_NoDeployment
                  5 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
                  9 authentication APIServerDeployment_NoPod
                 10 authentication APIServerDeployment_NoDeployment
                 20 authentication APIServices_PreconditionNotReady
                 93 authentication APIServices_Error

            And sorting by portions of the message details that seem relevant:

            $ curl -s 'https://search.dptools.openshift.org/search?maxAge=48h&type=junit&context=0&name=4.17.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | grep -o '503, err = the server is currently unable to handle the request\|Timeout exceeded while awaiting headers\|have no addresses with port name\|PreconditionNotReady\|deployment/[a-z-]*: could not be retrieved\|no [^ ]* pods available on any node' | sort | uniq -c | sort -n
                  1 no .openshift-oauth-apiserver pods available on any node
                  4 deployment/openshift-authentication: could not be retrieved
                  6 Timeout exceeded while awaiting headers
                  8 no apiserver.openshift-oauth-apiserver pods available on any node
                 10 deployment/openshift-oauth-apiserver: could not be retrieved
                 40 PreconditionNotReady
                 58 have no addresses with port name
                 67 503, err = the server is currently unable to handle the request

            So I'd guess retries here and/or here would help as the next leg of this effort. Did you want me to spin that out into a separate bug, like we did with OCPBUGS-32089?
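
            A minimal Go sketch of the retry idea suggested in the comment above, assuming the availability probe is a plain HTTP GET against a healthz endpoint. checkEndpoint and checkWithRetries are hypothetical names, not the operator's real helpers; the point is that the operator would only flip Available=False after every attempt fails:

            package main

            import (
                "context"
                "fmt"
                "net/http"
                "time"
            )

            // checkEndpoint returns nil if the healthz URL answers 200 OK.
            func checkEndpoint(ctx context.Context, url string) error {
                req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
                if err != nil {
                    return err
                }
                resp, err := http.DefaultClient.Do(req)
                if err != nil {
                    return err
                }
                defer resp.Body.Close()
                if resp.StatusCode != http.StatusOK {
                    return fmt.Errorf("unexpected status %d", resp.StatusCode)
                }
                return nil
            }

            // checkWithRetries reports failure only after every attempt fails,
            // so a single refused connection or timeout stays invisible.
            func checkWithRetries(ctx context.Context, url string, attempts int, backoff time.Duration) error {
                var lastErr error
                for i := 0; i < attempts; i++ {
                    if lastErr = checkEndpoint(ctx, url); lastErr == nil {
                        return nil // one success is enough to stay Available=True
                    }
                    time.Sleep(backoff) // a fuller version would also watch ctx.Done()
                }
                return fmt.Errorf("still failing after %d attempts: %w", attempts, lastErr)
            }

            func main() {
                ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
                defer cancel()
                // Hypothetical URL, for illustration only.
                if err := checkWithRetries(ctx, "https://oauth-openshift.example.com/healthz", 3, 2*time.Second); err != nil {
                    fmt.Println("would report Available=False:", err)
                }
            }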

            W. Trevor King added a comment:

            I've spun out OCPBUGS-32089 to cover just WellKnown_NotReady. That way we have the new bug to land the auth-operator#664 improvement on. I've retitled this bug to be more clearly generic, and it's still the one referenced by the test-suite exception.

            Scott Dodson added a comment:

            Correct, we have strategic initiatives aimed at getting to a point where we don't have a bunch of negative signal during upgrades. Service Delivery is telling us that the signal emitted during upgrades is so full of no-action-required alerts and negative operator conditions that they have to silence all monitoring, which creates blind spots for the rare real problem during upgrades.

            Petr Muller added a comment:

            The justification is that operators that go Available=False without good reason (the contract says "If this is false, it means there is an outage. Someone is probably getting paged.") cause a lot of noise, especially during upgrades. It is one of the major problems for OSD; Christoph Blecker:

            > So the biggest alerts by far that are painful are ClusterOperatorDown and ClusterOperatorDegraded. In particular the authentication and network operators flap a ton during the upgrade

            See also OSD-13696, OTA-700, and OCPSTRAT-835. Upgrades are noisy and customers hate that. Fake blips make noise and make testing harder (OTA-362, OTA-1167).

            Stanislav Láznička (Inactive) added a comment:

            We're going to need a better justification. Moved back to Normal.

              Assignee: Unassigned
              Reporter: W. Trevor King (trking)
              QA Contact: Deepak Punia (Inactive)
              Votes: 0
              Watchers: 9