-
Feature
-
Resolution: Unresolved
-
Major
-
None
-
None
-
Product / Portfolio Work
-
None
-
89% To Do, 11% In Progress, 0% Done
-
False
-
None
-
False
-
L
-
None
-
-
-
None
-
-
-
None
-
None
-
Undefined
Feature Overview (aka. Goal Summary)
This feature aims to improve the user experience for cluster upgrades by providing more accurate and reliable status information for Cluster Operators.
To ensure that operators behave predictably during upgrades and other cluster lifecycle events, and thus reduce confusion about the upgrade process.
This will be achieved by enforcing new rules for operator behavior and tracking compliance to address common issues reported by customers.
A successful upgrade path for a given ClusterOperator should follow a specific pattern:
- Before Upgrade starts, Operators are in a healthy, stable state: Available=True, Progressing=False, Degraded=False.
- When Upgrade starts the operator should become Progressing=True as it works to apply the new version.
It should not become Degraded=True or Available=False.
- After Upgrade, the operator should return to a state of Progressing=False, with Available=True and Degraded=False.
Goals (aka. expected user outcomes)
As a cluster-admin I want to get accurate information about the status of cluster operators and monitor the upgrade process without encountering false alarms.
- ClusterOperator must not report Degrade=True or Available=False during the course of a normal upgrade. OCP bugs haven been filed in this area [1].
- Operators MUST go progressing when transitioning between versions (We will need to decide how to enforce it. Ideas are here).
- Operators MUST NOT re-enter progressing state when simply observing node lifecycle events such as scaleup/scaledown or reboots – primarily an issue for operators that observe DaemonSets continuously: bugs coming from OTA-1637.
- OCP bugs will be filed for the cluster operator that take too long to upgrade OTA-1626.
Eventually, with these changes we should have Accurate status reporting, clear and clean upgrade progress information, Stable Operator status
The bugs in JIRA Dashboard : https://issues.redhat.com/secure/Dashboard.jspa?selectPageId=12390305 should be fixed to help improve UX of cluster upgrade.
New changed API rule
https://github.com/openshift/api/pull/2469/files
- A component must not report Available=False or Degrade=True during the course of a normal upgrade.
- A version change is a config change. Operators must go Progressing=True when transitioning between versions.
- Operators should not report Progressing only because DaemonSets owned by them are adjusting to a new node from cluster scaleup or a node rebooting from cluster upgrade.
- A component in a cluster with less than 250 nodes must complete a version change within a limited period of time: 90 minutes for Machine Config Operator and 20 minutes for others.
Deployment considerations{}
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Both |
Classic (standalone cluster) | Applicable |
Hosted control planes | Applicable |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | Applicable |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All |
Operator compatibility | All core OpenShift operators |
Backport needed (list applicable versions) | TBD based on release schedule and customer demand. |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | The OpenShift Console Status command |
Other (please specify) | N/A |
Documentation Considerations
explain the new operator status behavior during upgrades. Degraded or Unavailable status during an upgrade is now a sign of a problem, not just a temporary state. mention the time limits for operator updates.
Background:
This Feature is a continuation of OCPSTRAT-835.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.
Action for each component team
- Deliver the fixes for the relevant OCP bugs to improve UX of cluster upgrade.
JIRA Dashboard : https://issues.redhat.com/secure/Dashboard.jspa?selectPageId=12390305
References
- clones
-
OCPSTRAT-835 Improve upgrades - Reduce False Positives status from operators
-
- Closed
-
- is blocked by
-
OTA-1644 Improve upgrades - Deliver the fixes for the bugs about the sad cases of ClusterOperator status
-
- In Progress
-
- links to