  1. OpenShift Over the Air
  2. OTA-810

Draft Admin Centered Upgrade Documentation Phase 3

    • Epic
    • Resolution: Done
    • Major
    • openshift-4.14
    • None
    • None
    • Draft Admin Centered Upgrade Documentation Phase 3
    • BU Product Work
    • False
    • False
    • To Do
    • OCPSTRAT-180 - Improve upgrades - phase 1
    • Impediment
    • 0% To Do, 0% In Progress, 100% Done
    • Undefined

      OCP/Telco Definition of Done
      Epic Template descriptions and documentation.

      Epic Goal

      • Revamp our Upgrade Documentation to include an appropriate level of detail for admins

      Why is this important?

      • Currently, admins have nothing that explains how upgrades actually work, so they panic when things don't go perfectly
      • We do not sufficiently explain, at least within the context of the upgrade docs, the differences between the Degraded and Available statuses
      • We do not explain order of operations
      • We do not explain the protections built into the platform that guard against total cluster failure, i.e. halting when components do not return to healthy state within exp
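
To make the Degraded versus Available distinction concrete, here is a minimal sketch that reduces one ClusterOperator's conditions to a coarse label. It assumes the conditions arrive as a list of dicts with `type` and `status` keys, as in a ClusterOperator's `status.conditions`; the function name `summarize_conditions` and the label strings are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def summarize_conditions(conditions: List[Dict[str, str]]) -> str:
    """Return a coarse severity label for one ClusterOperator's conditions."""
    # Index conditions by type, e.g. {"Available": "True", "Degraded": "False"}.
    status = {c["type"]: c["status"] for c in conditions}

    if "Available" not in status:
        # Nothing reported yet, so no conclusion can be drawn.
        return "unknown"
    if status["Available"] == "False":
        # Available=False: the component is not functional right now.
        return "unavailable"
    if status.get("Degraded") == "True":
        # Degraded=True: still available, but not at its desired state.
        return "degraded"
    if status.get("Progressing") == "True":
        # Progressing=True: a change is still rolling out.
        return "progressing"
    return "healthy"


# Illustrative input: the operator is mid-rollout and reports Degraded.
example = [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "True"},
    {"type": "Degraded", "status": "True"},
]
print(summarize_conditions(example))
```

Available=False is checked first because it is the more severe signal: a Degraded operator may still be serving requests, while an unavailable one is not.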

      Scenarios

      1. Move channel management out into its own chapter
      2. Explain or link to existing documentation which addresses the differences between Degraded=True and Available=False
      3. Explain Upgradeable=False conditions and other aspects of the upgrade preflight strategy that Operators should use to indicate when it's unsafe to upgrade
      4. Explain basics of how the upgrade is applied
        1. CVO fetches release image
        2. CVO updates operators in the following order
        3. Each operator is expected to monitor for success
        4. Provide example ordering of manifests and command to extract release specific manifests and infer the ordering
      5. Explain how operators indicate problems and generic processes for investigating them
      6. Explain the special role of MCO and MCP mechanisms such as pausing pools
      7. Provide some basic guidance on control plane upgrade duration, that is, excluding worker pool rollout duration (90-120 minutes is normal)
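
For the manifest-ordering item in scenario 4, the following is a minimal sketch of how an ordering could be inferred from manifest filenames once they have been extracted to a directory (for example with `oc adm release extract --to=<dir> <release image>`). It assumes the files follow a `0000_<runlevel>_<component>_<name>` naming convention. The regular expression, the helper name `group_by_run_level`, and the sample filenames are illustrative assumptions, not taken from a specific release.

```python
import re
from typing import Dict, Iterable, List

# Assumed naming convention: 0000_<runlevel>_<component>_<rest>
MANIFEST_RE = re.compile(
    r"^0000_(?P<runlevel>\d+)_(?P<component>[^_]+)_(?P<rest>.+)$"
)


def group_by_run_level(filenames: Iterable[str]) -> Dict[int, List[str]]:
    """Group manifest filenames by run level, lowest run level first."""
    groups: Dict[int, List[str]] = {}
    # Sorting first keeps filenames within a run level in lexical order.
    for name in sorted(filenames):
        match = MANIFEST_RE.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        groups.setdefault(int(match.group("runlevel")), []).append(name)
    # Rebuild the dict so iteration follows ascending run level.
    return dict(sorted(groups.items()))


# Illustrative filenames only; a real release contains many more manifests.
names = [
    "0000_50_cluster-ingress-operator_00-namespace.yaml",
    "0000_03_config-operator_01_proxy.crd.yaml",
    "0000_80_machine-config-operator_04_deployment.yaml",
    "0000_50_cluster-ingress-operator_02-deployment.yaml",
    "release-metadata",
]
for run_level, files in group_by_run_level(names).items():
    print(run_level, files)
```

In practice the list of names could come from `os.listdir()` on the extraction directory. The grouping only reflects filename order and says nothing about how long each step takes or whether steps run in parallel.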

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • ...

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      1. There was an effort to write up how to use MachineConfig Pools to partition and optimize worker rollout in https://issues.redhat.com/browse/OTA-375
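
As a small illustration of the pool-pausing mechanism mentioned in the scenarios, here is a sketch that builds an `oc patch` command to set `spec.paused` on a MachineConfigPool. It assumes the pool is paused through a JSON merge patch on that field; the helper name `pause_pool_command` and the pool name `worker` are illustrative assumptions.

```python
import json
from typing import List


def pause_pool_command(pool: str, paused: bool) -> List[str]:
    """Build an argv list that pauses or unpauses a MachineConfigPool."""
    # JSON merge patch that touches only spec.paused.
    patch = json.dumps({"spec": {"paused": paused}})
    return [
        "oc", "patch", f"machineconfigpool/{pool}",
        "--type", "merge",
        "--patch", patch,
    ]


# Print the command rather than running it, so the sketch has no side effects.
print(" ".join(pause_pool_command("worker", True)))
```

Returning an argv list rather than a shell string means the command can be passed to `subprocess.run()` without any shell quoting concerns.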

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

            lmohanty@redhat.com Lalatendu Mohanty
            rhn-support-sdodson Scott Dodson
            Evgeni Vakhonin Evgeni Vakhonin