OpenShift Container Platform (OCP) Strategy / OCPSTRAT-2484

Improve upgrade experience - fix false alarms in ClusterOperator status



      Feature Overview (aka. Goal Summary)

      This feature aims to improve the user experience for cluster upgrades by providing more accurate and reliable status information for Cluster Operators.
      It also aims to ensure that operators behave predictably during upgrades and other cluster lifecycle events, which reduces confusion about the upgrade process.
      Both will be achieved by enforcing new rules for operator behavior and tracking compliance, to address common issues reported by customers.

      A successful upgrade path for a given ClusterOperator should follow a specific pattern:

      1. Before the upgrade starts, operators are in a healthy, stable state: Available=True, Progressing=False, Degraded=False.
      2. When the upgrade starts, the operator should become Progressing=True as it works to apply the new version.
        It should not become Degraded=True or Available=False.
      3. After the upgrade, the operator should return to Progressing=False, with Available=True and Degraded=False.
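
      The pattern above can be sketched as a small check over a ClusterOperator's reported conditions. This is only an illustration: the phase names ("before", "during", "after"), the `EXPECTED` table, and the `find_violations` helper are hypothetical and not part of any OpenShift API. The sketch also assumes that a missing condition is treated as "Unknown" and flagged, which is stricter than the text above requires.

```python
from typing import Dict, List

# Expected condition status per upgrade phase, copied from the three steps above.
# The phase keys are hypothetical labels chosen for this sketch.
EXPECTED: Dict[str, Dict[str, str]] = {
    "before": {"Available": "True", "Progressing": "False", "Degraded": "False"},
    "during": {"Available": "True", "Progressing": "True", "Degraded": "False"},
    "after": {"Available": "True", "Progressing": "False", "Degraded": "False"},
}


def find_violations(phase: str, conditions: Dict[str, str]) -> List[str]:
    """Return one message per condition that deviates from the expected pattern.

    `conditions` maps a condition type (e.g. "Degraded") to its status string.
    Assumption: a condition that is absent counts as "Unknown" and is reported.
    """
    violations = []
    for ctype, want in EXPECTED[phase].items():
        got = conditions.get(ctype, "Unknown")
        if got != want:
            violations.append(f"{ctype}={got} (expected {want} {phase} upgrade)")
    return violations


if __name__ == "__main__":
    # An operator that goes Degraded=True mid-upgrade breaks the pattern.
    snapshot = {"Available": "True", "Progressing": "True", "Degraded": "True"}
    for message in find_violations("during", snapshot):
        print(message)
```

      A check like this could run against condition snapshots collected at each phase of an upgrade; any non-empty result would point at a candidate false alarm or a real problem to investigate.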

      Goals (aka. expected user outcomes)

      As a cluster admin, I want accurate information about the status of cluster operators so that I can monitor the upgrade process without encountering false alarms.

      • A ClusterOperator must not report Degraded=True or Available=False during the course of a normal upgrade. OCP bugs have been filed in this area [1].
      • Operators MUST report Progressing=True when transitioning between versions (we will need to decide how to enforce it; ideas are here).
      • Operators MUST NOT re-enter the Progressing=True state when simply observing node lifecycle events such as scaleup/scaledown or reboots. This is primarily an issue for operators that observe DaemonSets continuously; the related bugs come from OTA-1637.
      • OCP bugs will be filed for cluster operators that take too long to upgrade (OTA-1626).

      Eventually, these changes should yield accurate status reporting, clear and clean upgrade progress information, and stable operator status.

      The bugs in the JIRA dashboard (https://issues.redhat.com/secure/Dashboard.jspa?selectPageId=12390305) should be fixed to improve the UX of cluster upgrades.

      New and changed API rules

      https://github.com/openshift/api/pull/2469/files

      • A component must not report Available=False or Degraded=True during the course of a normal upgrade.
      • A version change is a config change. Operators must go Progressing=True when transitioning between versions.
      • Operators should not report Progressing only because DaemonSets owned by them are adjusting to a new node from cluster scaleup or a node rebooting from cluster upgrade.
      • A component in a cluster with fewer than 250 nodes must complete a version change within a limited period of time: 90 minutes for the Machine Config Operator and 20 minutes for all others.
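
      The time-limit rule in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the enforcement mechanism itself: the function names are hypothetical, the operator name "machine-config" is assumed to identify the Machine Config Operator, and clusters with 250 or more nodes are treated as having no limit because the rule above only covers clusters with fewer than 250 nodes.

```python
from datetime import timedelta
from typing import Optional

NODE_COUNT_THRESHOLD = 250             # rule applies only below this node count
MCO_LIMIT = timedelta(minutes=90)      # limit for the Machine Config Operator
DEFAULT_LIMIT = timedelta(minutes=20)  # limit for every other component
MCO_NAME = "machine-config"            # assumption: name used to identify the MCO


def version_change_limit(operator: str, node_count: int) -> Optional[timedelta]:
    """Return the allowed duration of a version change, or None if no limit applies."""
    if node_count >= NODE_COUNT_THRESHOLD:
        # The rule as written does not cover clusters of this size.
        return None
    return MCO_LIMIT if operator == MCO_NAME else DEFAULT_LIMIT


def exceeded_limit(operator: str, node_count: int, elapsed: timedelta) -> bool:
    """Return True when the elapsed version-change time is over the applicable limit."""
    limit = version_change_limit(operator, node_count)
    return limit is not None and elapsed > limit


if __name__ == "__main__":
    # A non-MCO operator taking 25 minutes on a 100-node cluster is over its limit.
    print(exceeded_limit("etcd", 100, timedelta(minutes=25)))
```

      Returning None for larger clusters keeps "no rule applies" distinct from "within the limit", so a caller can decide separately how to treat clusters the rule does not cover.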

      Deployment considerations

      Deployment considerations | List applicable specific needs (N/A = not applicable)
      Self-managed, managed, or both | Both
      Classic (standalone cluster) | Applicable
      Hosted control planes | Applicable
      Multi node, Compact (three node), or Single node (SNO), or all | All
      Connected / Restricted Network | Applicable
      Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All
      Operator compatibility | All core OpenShift operators
      Backport needed (list applicable versions) | TBD based on release schedule and customer demand.
      UI need (e.g. OpenShift Console, dynamic plugin, OCM) | The OpenShift Console; Status command
      Other (please specify) | N/A

      Documentation Considerations

      Explain the new operator status behavior during upgrades: a Degraded or Unavailable status during an upgrade is now a sign of a problem, not just a temporary state. Mention the time limits for operator updates.

      Background:

      This Feature is a continuation of OCPSTRAT-835.
      Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This feature tracks the epics required to get that work done.

      Action for each component team

      1. Deliver the fixes for the relevant OCP bugs to improve the UX of cluster upgrades.
        JIRA dashboard: https://issues.redhat.com/secure/Dashboard.jspa?selectPageId=12390305

      References

              rh-ee-smodeel Subin M
              rh-ee-smodeel Subin M
              None
              Hongkai Liu, Scott Dodson, W. Trevor King
              Scott Dodson Scott Dodson
              None
              Avani Bhatt Avani Bhatt
              Eric Rich Eric Rich
              Hongkai Liu Hongkai Liu