  1. OCP Technical Release Team
  2. TRT-1578

Ensure all HA components are not degraded by design during upgrades


    • Epic
    • Resolution: Unresolved
    • Normal
    • In Progress
    • OCPSTRAT-1064 - Improve upgrades - phase 3 - Control plane & worker node independence
    • Impediment
    • 16
    • 16%
    • Undefined

      Epic Goal

      • Eliminate the gap between measured availability and Available=true
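
      One way to make this gap concrete is to compare what an external probe measured with what the operator reported at the same moment. The sketch below is illustrative only: the function name `availability_gap` and the shape of the observations (pairs of probe result and reported Available status) are assumptions, not part of any existing CI tooling.

```python
def availability_gap(observations):
    """Return the fraction of observations where the probe and the operator disagree.

    observations: list of (probe_ok, reported_available) boolean pairs.
    Hypothetical input shape, chosen only for this illustration.
    """
    if not observations:
        return 0.0
    mismatches = sum(
        1 for probe_ok, reported_available in observations
        if probe_ok != reported_available
    )
    return mismatches / len(observations)


# Four observations; in the second one the probe succeeded
# while the operator reported Available=False.
example = [(True, True), (True, False), (True, True), (False, False)]
gap = availability_gap(example)
```

      A gap of zero would mean the reported condition always matched the measurement; any other value points at observations worth investigating.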

      Why is this important?

      • Today it's not uncommon, even for CI jobs, to have multiple operators that blip through either Degraded=True or Available=False conditions
      • If our CI jobs do this, we should assume that things will be even worse in customer environments, which have higher levels of chaos
      • Multiple customers have told us that they pursued rolling back upgrades because the cluster reported portions of itself as Degraded or Unavailable when they actually were not
      • Since our product is self-hosted, we can reasonably expect that the instability we experience on our platform workloads (kube-apiserver, console, authentication, service availability) will also impact customer workloads that run exactly the same way: we're just better at detecting it.

      Scenarios

      1. In all of the following, assume standard topologies: 3 masters with 0 workers, or 3 masters with 2+ workers
      2. Add/update CI jobs which ensure 100% Degraded=False and Available=True for the duration of the upgrade
      3. Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (e.g., metal's DHCP server is a singleton)
      4. Address all identified issues
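
      The check in scenario 2 can be sketched as an invariant over condition samples collected during an upgrade. This is a minimal sketch under stated assumptions: the `ConditionSample` record, the `find_violations` helper, and the sample data are all hypothetical and do not reflect the actual CI test implementation. It also assumes that a status of Unknown counts as a violation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionSample:
    """One observation of a ClusterOperator condition (hypothetical record shape)."""
    operator: str
    type: str     # "Degraded" or "Available"
    status: str   # "True", "False", or "Unknown"
    time: float   # seconds since the upgrade started


# The status each condition must hold for the whole upgrade.
REQUIRED = {"Degraded": "False", "Available": "True"}


def find_violations(samples):
    """Return every sample whose status differs from the required one.

    Assumption: any status other than the required one, including
    "Unknown", is treated as a violation.
    """
    return [
        s for s in samples
        if s.type in REQUIRED and s.status != REQUIRED[s.type]
    ]


# Made-up samples: one operator blips through Degraded=True at t=120.
samples = [
    ConditionSample("kube-apiserver", "Available", "True", 0.0),
    ConditionSample("kube-apiserver", "Degraded", "False", 0.0),
    ConditionSample("authentication", "Degraded", "True", 120.0),
    ConditionSample("authentication", "Degraded", "False", 150.0),
]
violations = find_violations(samples)
```

      A job enforcing the 100% requirement would fail whenever `violations` is non-empty, and could report the operator name and time of each blip.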

      Acceptance Criteria

      • openshift/enhancements CONVENTIONS outlines these requirements
      • CI - Release blocking jobs include these new/updated tests
      • Release Technical Enablement - N/A; if we do this, we should need no docs
      • No outstanding identified issues

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      1. Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure all teams addressed them. That list is in the query below. Teams will be asked to address everything on this list as a 4.9 blocker+ bug, and we will re-evaluate status closer to 4.9 code freeze to see which may be deferred to 4.10.
        https://bugzilla.redhat.com/buglist.cgi?columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_severity%2Ctarget_release%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate&f1=longdesc&f2=cf_environment&j_top=OR&list_id=12012976&o1=casesubstring&o2=casesubstring&query_based_on=ClusterOperator%20conditions&query_format=advanced&v1=should%20not%20change%20condition%2F&v2=should%20not%20change%20condition%2F

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • DEV - Tests in place
      • DEV - No outstanding failing tests
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

        1. QE Tracker: Sub-task, New, Undefined, Unassigned
        2. TE Tracker: Sub-task, New, Undefined, Unassigned

            trking W. Trevor King
            rhn-support-sdodson Scott Dodson