OpenShift Request For Enhancement / RFE-3024

Notify OCP users that the platform is performing an upgrade via alerts or notification banners


      Proposed title of this feature request:

      --> Notify OCP users that the cluster is currently upgrading, via a notification banner in the UI or an alert that fires for the duration of the upgrade.

      What is the nature and description of the request?

      --> Display a notification/banner throughout the OCP web UI saying that the cluster is currently being upgraded.

      --> Or have an alert firing for all users of OCP stating that the cluster is undergoing an upgrade.
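      A rough illustration only, not the requested implementation: today a cluster administrator could script something like the sketch below, reading the standard Progressing condition from ClusterVersion and raising a cluster-wide console banner through the existing ConsoleNotification resource while an update is in flight. The "upgrade-in-progress" name and the banner wording are made-up placeholders.

      #!/bin/bash
      # Sketch: show a web-console banner while ClusterVersion reports Progressing=True.
      progressing=$(oc get clusterversion version \
        -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}')

      if [ "${progressing}" = "True" ]; then
        # ConsoleNotification (console.openshift.io/v1) renders a banner across the OCP web UI.
        oc apply -f - <<EOF
      apiVersion: console.openshift.io/v1
      kind: ConsoleNotification
      metadata:
        name: upgrade-in-progress
      spec:
        text: "The cluster is currently being upgraded; pods may be rescheduled while nodes drain."
        location: BannerTop
      EOF
      else
        # Remove the banner once the update is no longer progressing.
        oc delete consolenotification upgrade-in-progress --ignore-not-found
      fi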

      Why does the customer need this? (List the business requirements here)

      --> When cluster administrators upgrade the cluster, this can impact the teams using it. For example, Maven pods can be evicted from a node being upgraded, leading to a failed build and causing teams to waste time investigating why their build failed. Applications are restarted unexpectedly when they are evicted from a node being upgraded, causing teams to waste time investigating why their application restarted. Applications may also experience issues after the upgrade has finished because of incompatibilities.

       

      List any affected packages or components.

      --> None specifically; this concerns the UX of developers working on the cluster while it is being upgraded.

       

       

       

        Attachments:
          1. screenshot-1.png (17 kB)
          2. screenshot-2.png (32 kB)
          3. screenshot-3.png (20 kB)

            W. Trevor King added a comment:

            Trying to understand the issues folks are facing:

            For example maven pods can be evicted from the node being upgraded leading to a failed build, causing teams to waste time investigating why their build failed.

            How long do these builds take? In OpenShift CI, we have builds and CI jobs that run for hours and that we don't want evicted, so we set a maxUnavailable: 0 PodDisruptionBudget to say "please don't evict this pod, it will wrap up soon, and you can finish draining then". Nodes take longer to drain, but :shrug: that's what we're asking for.
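            For concreteness, a PodDisruptionBudget along those lines can be created in one command; the my-builds namespace and the app=maven-build label below are hypothetical placeholders for whatever actually selects the build pods:

            $ oc create poddisruptionbudget maven-build-pdb \
                --namespace=my-builds \
                --selector=app=maven-build \
                --max-unavailable=0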

            Applications will be restarted unexpectedly when they are evicted from the node being upgraded, causing teams to waste time investigating why their application restarted.

            What happens if they don't notice until the update completes and the alerts and banners are gone? Do they have access to ClusterVersion to check the cluster's recent update history? Do they have access to node events to look for the machine-config operator talking about draining and rebooting nodes? This is a 4.11.1 to 4.11.2 CI run, and events on a drained pod look like:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1562356644304654336/artifacts/e2e-aws-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version" and .reason == "Killing") | .ts = .firstTimestamp // .metadata.creationTimestamp] | sort_by(.ts)[(length - 1)]'
            {
              "apiVersion": "v1",
              "count": 2,
              "eventTime": null,
              "firstTimestamp": "2022-08-24T09:55:54Z",
              "involvedObject": {
                "apiVersion": "v1",
                "fieldPath": "spec.containers{cluster-version-operator}",
                "kind": "Pod",
                "name": "cluster-version-operator-8bbb8df8-wgbz5",
                "namespace": "openshift-cluster-version",
                "resourceVersion": "32430",
                "uid": "a9abb5a9-c3c1-40f5-b344-f49e46b3e042"
              },
              "kind": "Event",
              "lastTimestamp": "2022-08-24T09:55:57Z",
              "message": "Stopping container cluster-version-operator",
              "metadata": {
                "creationTimestamp": "2022-08-24T09:55:54Z",
                "name": "cluster-version-operator-8bbb8df8-wgbz5.170e3f56dd063148",
                "namespace": "openshift-cluster-version",
                "resourceVersion": "59267",
                "uid": "93fbbdbd-d626-407c-8864-dbc755b27ba5"
              },
              "reason": "Killing",
              "reportingComponent": "",
              "reportingInstance": "",
              "source": {
                "component": "kubelet",
                "host": "ip-10-0-245-238.us-west-2.compute.internal"
              },
              "type": "Normal",
              "ts": "2022-08-24T09:55:54Z"
            }
            

            I agree that "why did the kubelet do that" is unclear from the event alone.  But the node-linked events are there in the default namespace:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1562356644304654336/artifacts/e2e-aws-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "default" and .involvedObject.name == "ip-10-0-245-238.us-west-2.compute.internal") | .ts = .firstTimestamp // .metadata.creationTimestamp] | sort_by(.ts)[] | .ts + " " + .involvedObject.name + " " + .reason + ": " + .message' | grep T09:55
            2022-08-24T09:55:49Z ip-10-0-245-238.us-west-2.compute.internal ConfigDriftMonitorStopped: Config Drift Monitor stopped
            2022-08-24T09:55:49Z ip-10-0-245-238.us-west-2.compute.internal Cordon: Cordoned node to apply update
            2022-08-24T09:55:49Z ip-10-0-245-238.us-west-2.compute.internal Drain: Draining node to update config.
            2022-08-24T09:55:54Z ip-10-0-245-238.us-west-2.compute.internal NodeNotSchedulable: Node ip-10-0-245-238.us-west-2.compute.internal status is now: NodeNotSchedulable
            

            Applications may experience issues after the upgrade was finished because of incompatibility

            This is another situation where update history (e.g. from ClusterVersion) seems more important than mid-update alerts/banners. And we work hard to try to make this sort of thing an up-front decision, with things like the 4.8-to-4.9 admin acks. Is this concern about API changes like that (in which case maybe the admin-ack flow needs improving), or is it about "any time something changes, there's a risk that some incompat leaked in" and you bump into it on patch updates as well?
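            And for the history angle: anyone who can read ClusterVersion can reconstruct what happened after the fact, without needing to catch a mid-update banner. A jsonpath sketch over the standard .status.history entries:

            $ oc get clusterversion version -o jsonpath='{range .status.history[*]}{.startedTime}{"  "}{.completionTime}{"  "}{.version}{"  "}{.state}{"\n"}{end}'

            Each line shows when an update started and completed, the target version, and whether it reached Completed or was left Partial.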

