  1. OpenShift Over the Air
  2. OTA-700

Ensure availability of all HA components during upgrades

Details

    • Epic
    • Resolution: Done
    • Major
    • None
    • None
    • None
    • Ensure availability of all HA components during upgrades
    • False
    • False
    • In Progress
    • OCPSTRAT-835 - Improve upgrades - Reduce False Positives status from operators
    • Impediment
    • 100
    • 100%
    • Undefined

    Description

      Epic Goal

      • Eliminate the gap between measured availability and Available=true
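
      One way to picture the gap, as a minimal sketch only: pair each external probe result with the Available status the operator reported at the same instant, then count the disagreements. The sample shape, the function name availability_gap, and the idea of a paired probe are assumptions made for illustration; the epic does not prescribe how availability is measured.

```python
from typing import Dict, List, Tuple

# Each sample is (probe_succeeded, reported_available), taken at the same instant.
# This pairing is a hypothetical simplification, not an existing CI data format.
Sample = Tuple[bool, bool]


def availability_gap(samples: List[Sample]) -> Dict[str, float]:
    """Compare measured availability with the reported Available condition."""
    if not samples:
        raise ValueError("at least one sample is required")
    total = len(samples)
    measured = sum(1 for probe_ok, _ in samples if probe_ok) / total
    reported = sum(1 for _, available in samples if available) / total
    # Operator reported Available=False while the probe succeeded.
    false_alarms = sum(1 for probe_ok, available in samples if probe_ok and not available)
    # Operator reported Available=True while the probe failed.
    missed_outages = sum(1 for probe_ok, available in samples if not probe_ok and available)
    return {
        "measured": measured,
        "reported": reported,
        "false_alarms": false_alarms,
        "missed_outages": missed_outages,
    }


if __name__ == "__main__":
    observed = [(True, True), (True, False), (False, True), (True, True)]
    print(availability_gap(observed))
```

      In this framing the goal is for both disagreement counts to reach zero: no false alarms and no missed outages.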

      Why is this important?

      • Today it is not uncommon, even in CI jobs, for multiple operators to blip through Degraded=True or Available=False conditions
      • If our CI jobs do this, we should assume that things will be even worse in customer environments, which operate with higher levels of chaos
      • We have had multiple customers express that they've pursued rolling back upgrades because the cluster is telling them that portions of the cluster are Degraded or Unavailable when they're actually not
      • Since our product is self-hosted, we can reasonably expect that the instability we experience on our platform workloads (kube-apiserver, console, authentication, service availability) will also impact customer workloads that run exactly the same way: we're just better at detecting it.

      Scenarios

      1. In all of the following, assume standard 3 master 0 worker or 3 master 2+ worker topologies
      2. Add/update CI jobs which ensure 100% Degraded=False and Available=True for the duration of the upgrade
      3. Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (e.g., metal's DHCP server is a singleton)
      4. Address all identified issues
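
      The check in scenario 2 could be sketched roughly as follows. This is an illustrative outline, not the actual CI test: the Sample shape, the find_violations helper, and the choice to treat a missing condition as "Unknown" are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List

# Required status for each condition for the whole upgrade (per scenario 2).
REQUIRED = {"Available": "True", "Degraded": "False"}


@dataclass
class Sample:
    """One observation of a ClusterOperator's conditions (hypothetical shape)."""
    timestamp: float             # seconds since the upgrade started
    operator: str                # e.g. "kube-apiserver"
    conditions: Dict[str, str]   # condition type -> status string


def find_violations(samples: List[Sample]) -> List[str]:
    """Return one message per sample and condition that deviates from REQUIRED."""
    violations = []
    for sample in sorted(samples, key=lambda s: s.timestamp):
        for cond, want in REQUIRED.items():
            # Assumption: a condition that is absent counts as "Unknown",
            # which is treated as a violation.
            got = sample.conditions.get(cond, "Unknown")
            if got != want:
                violations.append(
                    f"t={sample.timestamp:g}s {sample.operator}: {cond}={got}, want {want}"
                )
    return violations


if __name__ == "__main__":
    timeline = [
        Sample(0, "kube-apiserver", {"Available": "True", "Degraded": "False"}),
        Sample(30, "console", {"Available": "False", "Degraded": "False"}),
    ]
    for line in find_violations(timeline):
        print(line)
```

      A job built this way would fail the upgrade run whenever the returned list is non-empty, which matches the "100%" wording in scenario 2.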

      Acceptance Criteria

      • openshift/enhancements CONVENTIONS outlines these requirements
      • CI - Release blocking jobs include these new/updated tests
      • Release Technical Enablement - N/A; if we do this, we should need no docs
      • No outstanding identified issues

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      1. Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure that all teams addressed them. That list is in the query below. Teams will be asked to address everything on this list as a 4.9 blocker+ bug, and we will re-evaluate status closer to the 4.9 code freeze to see which may be deferred to 4.10.
        https://bugzilla.redhat.com/buglist.cgi?columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_severity%2Ctarget_release%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate&f1=longdesc&f2=cf_environment&j_top=OR&list_id=12012976&o1=casesubstring&o2=casesubstring&query_based_on=ClusterOperator%20conditions&query_format=advanced&v1=should%20not%20change%20condition%2F&v2=should%20not%20change%20condition%2F

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • DEV - Tests in place
      • DEV - No outstanding failing tests
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

            People

              trking W. Trevor King
              rhn-support-sdodson Scott Dodson
              Jian Li Jian Li