  1. OpenShift Over the Air
  2. OTA-700

Ensure availability of all HA components during upgrades


    • Epic
    • Resolution: Done
    • Major
    • In Progress
    • OCPSTRAT-835 - Improve upgrades - Reduce False Positives status from operators
    • Impediment
    • 0% To Do, 0% In Progress, 100% Done
    • Undefined

      Epic Goal

      • Eliminate the gap between measured availability and Available=true
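
      One way to make this gap concrete is to compare, sample by sample, what an external probe observed with what the operator reported. The following is a minimal sketch under stated assumptions: both series are sampled at the same instants, and the names `GapReport` and `availability_gap` are hypothetical and not part of any existing CI tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GapReport:
    measured: float      # fraction of samples where the external probe succeeded
    reported: float      # fraction of samples where Available=True was reported
    false_alarms: int    # operator reported unavailable, but the probe succeeded
    missed_outages: int  # operator reported available, but the probe failed


def availability_gap(probe_ok: Sequence[bool],
                     reported_available: Sequence[bool]) -> GapReport:
    """Compare probe results with reported availability.

    Assumption: element i of both sequences refers to the same instant.
    """
    if len(probe_ok) != len(reported_available):
        raise ValueError("both series must cover the same sample instants")
    if not probe_ok:
        raise ValueError("need at least one sample")

    n = len(probe_ok)
    pairs = list(zip(probe_ok, reported_available))
    false_alarms = sum(1 for probe, rep in pairs if probe and not rep)
    missed_outages = sum(1 for probe, rep in pairs if not probe and rep)
    return GapReport(
        measured=sum(probe_ok) / n,
        reported=sum(reported_available) / n,
        false_alarms=false_alarms,
        missed_outages=missed_outages,
    )
```

      In this sketch the two fractions can be equal while the series still disagree, which is why the mismatches are counted in both directions. The goal stated above corresponds to both counts being zero.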

      Why is this important?

      • Today it is not uncommon, even in CI jobs, for multiple operators to blip through Degraded=True or Available=False conditions
      • If our CI jobs do this, we should assume that things will be even worse in customer environments, which have higher levels of chaos
      • Multiple customers have told us that they pursued rolling back upgrades because the cluster reported that portions of it were Degraded or Unavailable when they actually were not
      • Since our product is self-hosted, we can reasonably expect that the instability we experience on our platform workloads (kube-apiserver, console, authentication, service availability) will also impact customer workloads that run exactly the same way; we're just better at detecting it.


      Scenarios

      1. In all of the following, assume standard 3-master/0-worker or 3-master/2+-worker topologies
      2. Add/update CI jobs which ensure that operators report Degraded=False and Available=True for 100% of the duration of the upgrade
      3. Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (e.g., metal's DHCP server is a singleton)
      4. Address all identified issues
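
      The check described in item 2 can be sketched as a scan over condition samples collected during the upgrade window. This is an illustration only: the `Sample` record, the `find_blips` function, and the way samples are collected are assumptions and do not describe the actual CI implementation. The sketch reads "100%" strictly, so an `Unknown` status also counts as a violation.

```python
from typing import Iterable, List, NamedTuple


class Sample(NamedTuple):
    time: float      # seconds since some reference point (hypothetical unit)
    operator: str    # operator name, e.g. "kube-apiserver"
    condition: str   # condition type, e.g. "Available" or "Degraded"
    status: str      # condition status: "True", "False", or "Unknown"


# The status each condition is expected to hold for the whole upgrade.
EXPECTED = {"Available": "True", "Degraded": "False"}


def find_blips(samples: Iterable[Sample], start: float, end: float) -> List[Sample]:
    """Return every sample inside [start, end] that violates EXPECTED.

    Condition types not listed in EXPECTED (such as "Progressing") are ignored.
    """
    return [
        s for s in samples
        if start <= s.time <= end
        and s.condition in EXPECTED
        and s.status != EXPECTED[s.condition]
    ]


if __name__ == "__main__":
    observed = [
        Sample(10.0, "console", "Available", "True"),
        Sample(20.0, "console", "Available", "False"),      # blip
        Sample(25.0, "authentication", "Degraded", "True"),  # blip
        Sample(30.0, "console", "Progressing", "True"),      # ignored
        Sample(99.0, "console", "Available", "False"),       # outside window
    ]
    for blip in find_blips(observed, start=0.0, end=60.0):
        print(f"{blip.operator}: {blip.condition}={blip.status} at t={blip.time}")
```

      A job built along these lines would fail whenever the returned list is non-empty, which matches the requirement that the expected conditions hold for the full duration of the upgrade.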

      Acceptance Criteria

      • openshift/enhancements CONVENTIONS outlines these requirements
      • CI - Release blocking jobs include these new/updated tests
      • Release Technical Enablement - N/A; if we do this, we should need no docs
      • No outstanding identified issues

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      1. Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure that all teams addressed them. That list is in this query. Teams will be asked to address everything on this list as a 4.9 blocker+ bug, and we will re-evaluate status closer to the 4.9 code freeze to see which may be deferred to 4.10.

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • DEV - Tests in place
      • DEV - No outstanding failing tests
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

        QE Tracker Sub-task Closed Undefined Unassigned
        TE Tracker Sub-task Closed Undefined Unassigned

            trking W. Trevor King
            rhn-support-sdodson Scott Dodson
            Jian Li Jian Li