OpenShift Over the Air / OTA-339

Optimize & Model Minor Version Upgrade Duration


    • Type: Epic
    • Resolution: Done
    • Priority: Normal
    • Optimize Minor Version Upgrade Duration
    • To Do
    • OCPPLAN-5484 - OpenShift 4 EUS to EUS upgrades
    • 0% To Do, 0% In Progress, 100% Done

      Epic Goal

      • Record accurate expected durations for each minor upgrade step, excluding the worker MachineConfigPool (MCP) rollout
      • Identify areas for improvement and measure those improvements

      Why is this important?

      • We've ignored upgrade duration for the last several releases, to the point that we've had to relax existing CI tests that validated that upgrades complete within a certain time period
      • When customers prepare for an EUS upgrade from 4.6 to 4.10, it's important that it's as fast as possible and that we can predict for them how long the upgrade should take
      • We know there are portions of the upgrade which are slower than necessary, including:
        • DaemonSets which roll out serially when a canary followed by a more parallel rollout would be safe
        • Graceful shutdowns now releasing leases so that new controllers can immediately obtain a lease (MCO, others?)
        • Fatter images?
        • Higher cross-operator parallelization?
        • Nothing should be off the table as long as it makes upgrades faster while introducing no additional risk
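
      As a rough illustration of the DaemonSet point above, the sketch below compares a strictly serial rollout with a canary followed by parallel batches. The function names, the fixed per-pod rollout time, and the batch size are hypothetical assumptions for illustration, not measurements from any cluster; it also assumes every pod in a batch updates concurrently and takes the same time.

```python
import math


def serial_rollout_seconds(nodes: int, per_pod_seconds: float) -> float:
    """One pod at a time: total time grows linearly with node count."""
    return nodes * per_pod_seconds


def canary_then_batched_seconds(
    nodes: int, per_pod_seconds: float, batch_size: int
) -> float:
    """Update a single canary pod first, then the rest in parallel batches.

    Assumes (hypothetically) that all pods in a batch finish in
    per_pod_seconds, so each batch costs the same as one pod.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if nodes <= 0:
        return 0.0
    remaining = nodes - 1  # everything except the canary
    batches = math.ceil(remaining / batch_size)
    return (1 + batches) * per_pod_seconds


if __name__ == "__main__":
    nodes, per_pod = 12, 30.0  # illustrative values only
    print("serial:", serial_rollout_seconds(nodes, per_pod))
    print("canary + batches of 4:", canary_then_batched_seconds(nodes, per_pod, 4))
```

      In a real cluster the batch size would correspond roughly to the DaemonSet's `spec.updateStrategy.rollingUpdate.maxUnavailable` setting; whether a larger value is safe depends on the component.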

      Scenarios

      1. ...

      Acceptance Criteria

      • CI - Need a periodic job which tracks the upgrade waterfall of a cluster of moderate size with moderate workload so that we can track this at a macro level and identify when we regress
      • Stories on the boards for components which require optimization
      • A report of common issues for which we should amend design patterns to identify and prevent introduction of slack
      • A report which measures our success, e.g., "On a cluster with 12 nodes operating with 800 pods, we've been able to reduce the upgrade time by 7 minutes in each minor version upgrade across 4.6, 4.7, 4.8, 4.9, and 4.10"
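
      One way the periodic job and the success report above could be approached is to compare per-step durations from an upgrade waterfall against a recorded baseline. This is a minimal sketch only: the step names, the durations, the 10% tolerance, and both function names are hypothetical and do not come from any existing CI job.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.10
) -> List[str]:
    """Return steps whose current duration exceeds baseline by more than tolerance."""
    regressed = []
    for step, base_minutes in baseline.items():
        cur_minutes = current.get(step)
        if cur_minutes is not None and cur_minutes > base_minutes * (1 + tolerance):
            regressed.append(step)
    return sorted(regressed)


def total_minutes_saved(baseline: Dict[str, float], current: Dict[str, float]) -> float:
    """Sum of (baseline - current) over baseline steps; missing steps count as unchanged."""
    return sum(base - current.get(step, base) for step, base in baseline.items())


if __name__ == "__main__":
    # Hypothetical per-step durations in minutes for one minor upgrade.
    baseline = {"etcd": 10.0, "kube-apiserver": 15.0, "network": 12.0}
    current = {"etcd": 9.0, "kube-apiserver": 18.0, "network": 6.0}
    print("regressed steps:", find_regressions(baseline, current))
    print("minutes saved:", total_minutes_saved(baseline, current))
```

      Tracking regressions per step rather than only the total keeps a slowdown in one component from being masked by a gain in another.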

      Dependencies (internal and external)

      1. To be populated as soon as we identify any areas for improvements

      Previous Work (Optional):

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              Assignee: Unassigned
              Reporter: Scott Dodson (rhn-support-sdodson)