OpenShift Over the Air / OTA-743

What impact does upgrade edge gating have on the cluster fleet?


    • Type: Epic
    • Resolution: Won't Do
    • Priority: Normal
    • Epic Name: Impact of upgrade edge gating
    • Status: To Do

      As a member of the OTA team

      I understand the impact of our new edge gating automation (graph-data#2059, example)

      So that I can either improve it or stand in awe of its awesomeness.

       

      Background:

      After a few years of managing the upgrade graph we've learned to be careful with how we introduce new edges.  Without care we can lead customers to a temporary "dead-end" release.  A dead-end release is an errata release that would normally allow you to upgrade to the next minor but "surprisingly" does not yet.  Consider the following fictitious example:

      1. Current update recommendations are:
        1. A.1 -> A.2
        2. A.1 -> A.3
        3. A.2 -> A.3
        4. A.2 -> B.1
        5. But no update recommendation from A.1 to B.1, because A.1 is missing a pre-update guard that was added in A.2.
        6. And no update recommendation from A.3 to B.1, because A.3 includes a newer fix, which would regress when updating to B.1.
      2. A customer installed version A.1.  They want to upgrade to version B.
      3. In order to get there they must first upgrade to a newer version of A since A.1 has no paths to B.
      4. They upgrade to A.3 since that's the latest version of A available and the stable upgrades from A to B have been out for months.
      5. Surprisingly, there's no upgrade available from A.3 to B yet because, at the last minute, the engineering team discovered a problem in the latest B.2 release.  That release was actually "tombstoned", i.e. never released.  (The sketch after this list models this graph.)
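
      A minimal sketch of the fictitious graph above (the edge table, the reaches_minor helper, and the literal version names are all illustrative; this is not show-edges.py or any real tooling):

        # Fictitious update recommendations from the example above.
        edges = {
            "A.1": {"A.2", "A.3"},
            "A.2": {"A.3", "B.1"},
            "A.3": set(),  # no edge to B.1: it would regress A.3's newer fix
        }

        def reaches_minor(start, minor_prefix, graph):
            """Depth-first search: can we reach any release in minor_prefix from start?"""
            stack, seen = [start], set()
            while stack:
                node = stack.pop()
                if node in seen:
                    continue
                seen.add(node)
                if node.startswith(minor_prefix + "."):
                    return True
                stack.extend(graph.get(node, ()))
            return False

        for release in sorted(edges):
            ok = reaches_minor(release, "B", edges)
            print(f"{release}: {'can reach B' if ok else 'dead end for B'}")
        # Output:
        # A.1: can reach B
        # A.2: can reach B
        # A.3: dead end for B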

      With OTA-694 now landed, the previously-manual practice for avoiding the scenario above is automated.  If a customer first changes to the B channel before starting their upgrade, we instead recommend upgrading from A.1 to A.2.  A.2 does have an upgrade path to a release in version B, so they are just one more click away from getting there (roughly the behavior modeled in the sketch below).
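
      A hypothetical illustration of what that gating buys the customer, reusing edges and reaches_minor from the sketch above (best_hop is an assumption for illustration, not the OTA-694 implementation):

        def best_hop(current, target_minor, graph):
            """Among current's direct update targets that stay in the same minor,
            prefer the newest one that still has a path into target_minor.
            (Real code would need proper version ordering, not string max.)"""
            same_minor = current.split(".")[0]
            candidates = [
                v for v in graph.get(current, ())
                if v.startswith(same_minor + ".") and reaches_minor(v, target_minor, graph)
            ]
            return max(candidates, default=None)

        print(best_hop("A.1", "B", edges))  # A.2 (not the dead-end A.3)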

      Challenges:

      • To analyze the historical upgrade graph, you can use show-edges.py.  It's all in Git, so you can also get at it directly without going through Python.
      • It will be easy to make this problem much harder than it needs to be.  To consider the impact of our automation, I believe we only need to look at upgrades where:
        • The cluster was in a 4.y.z, in a *-4.(y+n) channel for some n ≥ 1 (e.g. 4.6.50 in eus-4.8).
        • The cluster updated to their 4.y's then-tip 4.y.z' (e.g. 4.6.60.  Doesn't actually need to have been the tip, the dead-end could be multiple releases long, but in practice, it seems unlikely that folks are selecting specific non-tip dead-end targets).
        • That target 4.y.z' had no update available to 4.(y+1) at that time (so it was then a dead end).  A filtering sketch for these criteria follows this list.
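
      A rough sketch of that filter, assuming we already have per-cluster update records; the UpdateEvent shape, the channel parsing, and the had_edge_to_next_minor lookup are all assumptions for illustration, not existing telemetry or graph-data APIs:

        import re
        from dataclasses import dataclass

        @dataclass
        class UpdateEvent:
            # Hypothetical per-cluster record of one update.
            from_version: str   # e.g. "4.6.50"
            to_version: str     # e.g. "4.6.60"
            channel: str        # e.g. "eus-4.8"
            timestamp: str      # when the update happened

        def minor(version):
            """Return the 4.y minor as an int, e.g. "4.6.50" -> 6."""
            return int(version.split(".")[1])

        def channel_minor(channel):
            """Return the channel's 4.y minor, e.g. "eus-4.8" -> 8, or None."""
            m = re.search(r"-4\.(\d+)$", channel)
            return int(m.group(1)) if m else None

        def is_dead_end_upgrade(event, had_edge_to_next_minor):
            """True when the cluster was aiming at a later minor (channel minor > y),
            stayed within 4.y, and the target had no edge into 4.(y+1) at that time.
            had_edge_to_next_minor(version, timestamp) is a placeholder for a lookup
            against historical graph-data (e.g. via show-edges.py or git history)."""
            y = minor(event.from_version)
            target_channel_minor = channel_minor(event.channel)
            if target_channel_minor is None or target_channel_minor <= y:
                return False  # no signal of intent to move to a later minor
            if minor(event.to_version) != y:
                return False  # they already left 4.y, so no dead end for them
            return not had_edge_to_next_minor(event.to_version, event.timestamp)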

      Note:

      This is a potential project for Maaz.  I propose he discuss it with the team at the August 15th team meeting.

      Definition of done:

      • Find out how many customers hit this issue before we changed the graph automation: they upgraded to the tip of a 4.y.z stream and ended up in a place with no upgrade path to 4.(y+1).  We can assume that customers in the 4.(y+1) channel or the 4.(y+2) EUS channels have the intent of upgrading to the next minor version.
      • Find out how many times we have added versions to the channels (as cincinnati-graph-data commits) where there was no upgrade path to the next minor version; a rough sketch of that scan follows this list.
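
      A very rough sketch of the commit scan, assuming the cincinnati-graph-data layout of per-channel YAML files under channels/ with a top-level versions list; the dead-end test (is_dead_end_at) is left as a placeholder, since reconstructing the graph as of each commit is the hard part (show-edges.py and git history would feed it):

        import subprocess
        import yaml  # PyYAML

        REPO = "cincinnati-graph-data"  # path to a local clone; an assumption

        def git(*args):
            return subprocess.run(["git", "-C", REPO, *args],
                                  capture_output=True, text=True, check=True).stdout

        def versions_at(commit, channel_file):
            """Versions listed in a channel file at a given commit (empty if absent)."""
            try:
                raw = git("show", f"{commit}:{channel_file}")
            except subprocess.CalledProcessError:
                return set()
            return set((yaml.safe_load(raw) or {}).get("versions", []))

        def added_versions(commit, channel_file):
            """Versions newly added to the channel by this commit."""
            return versions_at(commit, channel_file) - versions_at(f"{commit}~1", channel_file)

        def count_dead_end_additions(channel_file, is_dead_end_at):
            """Count commits that added at least one version with no path to the next
            minor at that time.  is_dead_end_at(version, commit) is a placeholder for
            a check against the historical graph."""
            commits = git("log", "--format=%H", "--", channel_file).split()
            count = 0
            for commit in commits:
                added = added_versions(commit, channel_file)
                if any(is_dead_end_at(v, commit) for v in added):
                    count += 1
            return count

        # e.g. count_dead_end_additions("channels/eus-4.8.yaml", my_dead_end_check)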

              Assignee: Maaz Shaikh (rh-ee-mashaikh, inactive)
              Reporter: Brenton Leanhardt (rh-ee-bleanhar)