OpenShift Request For Enhancement / RFE-3024

Notify OCP users that the platform is performing an upgrade via alerts or notification banners


      Proposed title of this feature request:

      --> Notify OCP users that the cluster is currently upgrading, via a notification banner in the UI or an alert that fires for the duration of the upgrade.

      What is the nature and description of the request?

      --> Display a notification/banner throughout the OCP web UI saying that the cluster is currently being upgraded.

      --> Or have an alert firing for all users of OCP stating that the cluster is undergoing an upgrade.
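      A rough illustration only, not the requested implementation: today a cluster administrator could script something like the sketch below, reading the standard Progressing condition from ClusterVersion and raising a cluster-wide console banner through the existing ConsoleNotification resource while an update is in flight. The "upgrade-in-progress" name and the banner wording are made-up placeholders.

      #!/bin/bash
      # Sketch: show a web-console banner while ClusterVersion reports Progressing=True.
      progressing=$(oc get clusterversion version \
        -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}')

      if [ "${progressing}" = "True" ]; then
        # ConsoleNotification (console.openshift.io/v1) renders a banner across the OCP web UI.
        oc apply -f - <<EOF
      apiVersion: console.openshift.io/v1
      kind: ConsoleNotification
      metadata:
        name: upgrade-in-progress
      spec:
        text: "The cluster is currently being upgraded; pods may be rescheduled while nodes drain."
        location: BannerTop
      EOF
      else
        # Remove the banner once the update is no longer progressing.
        oc delete consolenotification upgrade-in-progress --ignore-not-found
      fi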

      Why does the customer need this? (List the business requirements here)

      --> When cluster administrators upgrade the cluster, this can impact the teams using it. For example, Maven pods can be evicted from a node being upgraded, leading to a failed build and causing teams to waste time investigating why their build failed. Applications are restarted unexpectedly when they are evicted from a node being upgraded, causing teams to waste time investigating why their application restarted. Applications may also experience issues after the upgrade has finished because of incompatibilities.

       

      List any affected packages or components.

      --> None specifically; this concerns the UX of developers working on the cluster while it is being upgraded.

       

       

       

        Attachments:
          1. screenshot-1.png (17 kB)
          2. screenshot-2.png (32 kB)
          3. screenshot-3.png (20 kB)

            W. Trevor King added a comment:

            Trying to understand the issues folks are facing:

            For example maven pods can be evicted from the node being upgraded leading to a failed build, causing teams to waste time investigating why their build failed.

            How long do these builds take? In OpenShift CI, we have builds and CI jobs that run for hours and that we don't want evicted, so we set a maxUnavailable: 0 PodDisruptionBudget to say "please don't evict this pod, it will wrap up soon, and you can finish draining then". Nodes take longer to drain, but :shrug: that's what we're asking for.
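            For concreteness, a PodDisruptionBudget along those lines can be created in one command; the my-builds namespace and the app=maven-build label below are hypothetical placeholders for whatever actually selects the build pods:

            $ oc create poddisruptionbudget maven-build-pdb \
                --namespace=my-builds \
                --selector=app=maven-build \
                --max-unavailable=0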

            Applications will be restarted unexpectedly when they are evicted from the node being upgraded, causing teams to waste time investigating why their application restarted.

            What happens if they don't notice until the update completes and the alerts and banners are gone? Do they have access to ClusterVersion to check the cluster's recent update history? Do they have access to node events to look for the machine-config operator talking about draining and rebooting nodes? This is a 4.11.1 to 4.11.2 CI run, and events on a drained pod look like:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1562356644304654336/artifacts/e2e-aws-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version" and .reason == "Killing") | .ts = .firstTimestamp // .metadata.creationTimestamp] | sort_by(.ts)[(length - 1)]'
            {
              "apiVersion": "v1",
              "count": 2,
              "eventTime": null,
              "firstTimestamp": "2022-08-24T09:55:54Z",
              "involvedObject": {
                "apiVersion": "v1",
                "fieldPath": "spec.containers{cluster-version-operator}",
                "kind": "Pod",
                "name": "cluster-version-operator-8bbb8df8-wgbz5",
                "namespace": "openshift-cluster-version",
                "resourceVersion": "32430",
                "uid": "a9abb5a9-c3c1-40f5-b344-f49e46b3e042"
              },
              "kind": "Event",
              "lastTimestamp": "2022-08-24T09:55:57Z",
              "message": "Stopping container cluster-version-operator",
              "metadata": {
                "creationTimestamp": "2022-08-24T09:55:54Z",
                "name": "cluster-version-operator-8bbb8df8-wgbz5.170e3f56dd063148",
                "namespace": "openshift-cluster-version",
                "resourceVersion": "59267",
                "uid": "93fbbdbd-d626-407c-8864-dbc755b27ba5"
              },
              "reason": "Killing",
              "reportingComponent": "",
              "reportingInstance": "",
              "source": {
                "component": "kubelet",
                "host": "ip-10-0-245-238.us-west-2.compute.internal"
              },
              "type": "Normal",
              "ts": "2022-08-24T09:55:54Z"
            }
            

            I agree that "why did the kubelet do that" is unclear from the event alone.  But the node-linked events are there in the default namespace:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1562356644304654336/artifacts/e2e-aws-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "default" and .involvedObject.name == "ip-10-0-245-238.us-west-2.compute.internal") | .ts = .firstTimestamp // .metadata.creationTimestamp] | sort_by(.ts)[] | .ts + " " + .involvedObject.name + " " + .reason + ": " + .message' | grep T09:55
            2022-08-24T09:55:49Z ip-10-0-245-238.us-west-2.compute.internal ConfigDriftMonitorStopped: Config Drift Monitor stopped
            2022-08-24T09:55:49Z ip-10-0-245-238.us-west-2.compute.internal Cordon: Cordoned node to apply update
            2022-08-24T09:55:49Z ip-10-0-245-238.us-west-2.compute.internal Drain: Draining node to update config.
            2022-08-24T09:55:54Z ip-10-0-245-238.us-west-2.compute.internal NodeNotSchedulable: Node ip-10-0-245-238.us-west-2.compute.internal status is now: NodeNotSchedulable
            

            Applications may experience issues after the upgrade was finished because of incompatibility

            This is another situation where update history (e.g. from ClusterVersion) seems more important than mid-update alerts/banners. And we work hard to try to make this sort of thing an up-front decision, with things like the 4.8-to-4.9 admin acks. Is this concern about API changes like that (in which case maybe the admin-ack flow needs improving), or is it about "any time something changes, there's a risk that some incompat leaked in" and you bump into it on patch updates as well?
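            And for the history angle: anyone who can read ClusterVersion can reconstruct what happened after the fact, without needing to catch a mid-update banner. A jsonpath sketch over the standard .status.history entries:

            $ oc get clusterversion version -o jsonpath='{range .status.history[*]}{.startedTime}{"  "}{.completionTime}{"  "}{.version}{"  "}{.state}{"\n"}{end}'

            Each line shows when an update started and completed, the target version, and whether it reached Completed or was left Partial.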

