Type: Feature Request
Resolution: Done
Priority: Minor
Proposed title of this feature request:
--> Inform OCP users that the cluster is currently upgrading, via a notification banner in the UI or an alert that fires for the duration of the upgrade.
What is the nature and description of the request?
--> Display a notification/banner throughout the OCP web UI saying that the cluster is currently being upgraded.
--> Or have an alert fire for all users of OCP stating that the cluster is undergoing an upgrade.
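A rough sketch of what such an alert could look like, assuming the cluster-version operator's cluster_version metric with the type="updating" label is available on the cluster (this is an illustration only, not the proposed implementation; metric name and namespace would need verifying):
$ cat <<'EOF' | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-upgrade-in-progress
  namespace: openshift-monitoring    # namespace chosen for illustration
spec:
  groups:
  - name: cluster-upgrade
    rules:
    - alert: ClusterUpgradeInProgress
      expr: cluster_version{type="updating"} > 0    # assumes this CVO metric; verify on your cluster
      labels:
        severity: info
      annotations:
        summary: The cluster is currently being upgraded.
EOF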
Why does the customer need this? (List the business requirements here)
--> When cluster administrators perform a cluster upgrade, it can impact the teams using the cluster. For example, Maven pods can be evicted from the node being upgraded, leading to a failed build and causing teams to waste time investigating why their build failed. Applications are restarted unexpectedly when they are evicted from the node being upgraded, causing teams to waste time investigating why their application restarted. Applications may also experience issues after the upgrade has finished because of incompatibilities.
List any affected packages or components.
--> None; this concerns the UX of developers working on the cluster while it undergoes an upgrade.
is blocked by:
- OTA-768 Notify users when platform is undergoing upgrade (Closed)
- CONSOLE-3252 Inform OCP users that the platform is performing an upgrade (Closed)
Trying to understand the issues folks are facing:
How long do these builds take? In OpenShift CI, we have builds and CI jobs that run for hours and that we don't want evicted, so we set a maxUnavailable: 0 PodDisruptionBudget to say "please don't evict this pod, it will wrap up soon, and you can finish draining then". Nodes take longer to drain, but :shrug: that's what we're asking for.
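For reference, a minimal sketch of such a budget, assuming a hypothetical app=maven-build label on the build pods and a hypothetical my-builds namespace:
$ cat <<'EOF' | oc apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: maven-build-no-eviction
  namespace: my-builds              # hypothetical namespace
spec:
  maxUnavailable: 0                 # ask the drain to wait for these pods to finish
  selector:
    matchLabels:
      app: maven-build              # hypothetical label on the build pods
EOF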
What happens if they don't notice until the update completes and the alerts and banners are gone? Do they have access to ClusterVersion to check the cluster's recent update history? Do they have access to node events to look for the machine-config operator talking about draining and rebooting nodes? This is a 4.11.1 to 4.11.2 CI run, and events on a drained pod look like:
I agree that "why did the kubelet do that" is unclear from the event alone. But the node-linked events are there in the default namespace:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1562356644304654336/artifacts/e2e-aws-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "default" and .involvedObject.name == "ip-10-0-245-238.us-west-2.compute.internal") | .ts = .firstTimestamp // .metadata.creationTimestamp] | sort_by(.ts)[] | .ts + " " + .involvedObject.name + " " + .reason + ": " + .message' | grep T09:55 2022-08-24T09:55:49Z ip-10-0-245-238.us-west-2.compute.internal ConfigDriftMonitorStopped: Config Drift Monitor stopped 2022-08-24T09:55:49Z ip-10-0-245-238.us-west-2.compute.internal Cordon: Cordoned node to apply update 2022-08-24T09:55:49Z ip-10-0-245-238.us-west-2.compute.internal Drain: Draining node to update config. 2022-08-24T09:55:54Z ip-10-0-245-238.us-west-2.compute.internal NodeNotSchedulable: Node ip-10-0-245-238.us-west-2.compute.internal status is now: NodeNotSchedulable
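The same node-scoped events can also be pulled from a live cluster; a rough equivalent would be (node name is just the example from above):
$ oc get events -n default --field-selector involvedObject.kind=Node,involvedObject.name=ip-10-0-245-238.us-west-2.compute.internal --sort-by=.lastTimestamp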
This is another situation where update history (e.g. from ClusterVersion) seems more important than mid-update alerts/banners. And we work hard to try to make this sort of thing an up-front decision, with things like the 4.8-to-4.9 admin acks. Is this concern about API changes like that (in which case maybe the admin-ack flow needs improving), or is it about "any time something changes, there's a risk that some incompat leaked in", which you would bump into on patch updates as well?
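For the history angle, the recent update history is recorded on the ClusterVersion object; a quick way to read it, for example:
$ oc get clusterversion version -o json | jq -r '.status.history[] | .startedTime + " -> " + (.completionTime // "in progress") + " " + .version + " (" + .state + ")"'
# or, for the in-progress status during an update:
$ oc adm upgrade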