Type: Epic
Resolution: Won't Do
Priority: Normal
Summary: Impact of upgrade edge gating
Status: To Do
As a member of the OTA team
I want to understand the impact of our new edge gating automation (graph-data#2059, example)
So that I can either improve it or stand in awe of its awesomeness.
Background:
After a few years of managing the upgrade graph, we've learned to be careful with how we introduce new edges. Without care, we can lead customers to a temporary "dead-end" release. A dead-end release is an errata release that would normally allow you to upgrade to the next minor but "surprisingly" does not yet. Consider the following fictitious example (sketched in code after the list):
- Current update recommendations are:
- A.1 -> A.2
- A.1 -> A.3
- A.2 -> A.3
- A.2 -> B.1
- But no update recommendation from A.1 to B.1, because A.1 is missing a pre-update guard that was added in A.2.
- And no update recommendation from A.3 to B.1, because A.3 includes a newer fix, which would regress when updating to B.1.
- A customer installed version A.1. They want to upgrade to version B.
- In order to get there, they must first upgrade to a newer version of A, since A.1 has no update recommendation directly into B.
- They upgrade to A.3 since that's the latest version of A available and the stable upgrades from A to B have been out for months.
- Surprisingly, there's no upgrade available from A.3 to B yet because, at the last minute, the engineering team discovered a problem in the latest B.2 release. That release was "tombstoned", i.e. never actually released.
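To make the scenario concrete, here is a minimal Python sketch of the fictitious graph above. The versions and edges are only the illustrative ones from this example, not real graph data:
{code:python}
# The fictitious update recommendations above, as a plain adjacency map.
edges = {
    "A.1": ["A.2", "A.3"],
    "A.2": ["A.3", "B.1"],
    "A.3": [],  # the dead end: no recommendation into B yet
    "B.1": [],
}

def can_reach_b(start):
    """Return True if some B.* release is reachable from `start` via any
    number of recommended updates."""
    stack, seen = [start], set()
    while stack:
        version = stack.pop()
        if version in seen:
            continue
        seen.add(version)
        if version.startswith("B."):
            return True
        stack.extend(edges.get(version, []))
    return False

for version in ("A.1", "A.2", "A.3"):
    print(version, "can eventually reach B:", can_reach_b(version))
# A.1 -> True (via A.2), A.2 -> True, A.3 -> False: hopping to A.3 first strands the cluster.
{code}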
With the automation that has now landed (OTA-694), we've automated a previously manual practice to avoid the scenario above. If a customer first switches to the B channel before starting their upgrade, we would instead upgrade them from A.1 to A.2. A.2 does have an upgrade path to a release in version B, so they are just one more click away from getting there.
Challenges:
- To analyze the historical upgrade graph, you can use show-edges.py. It's all in Git, so you can also get at it directly without going through Python.
- It will be easy to make this problem much harder than it needs to be. To consider the impact of our automation, I believe we only need to look at upgrades where (see the sketch after this list):
- The cluster was on a 4.y.z release, in a *-4.(y+) channel, i.e. a channel targeting a later minor (e.g. 4.6.50 in eus-4.8).
- The cluster updated to their 4.y's then-tip 4.y.z' (e.g. 4.6.60; it doesn't actually need to have been the tip, since the dead end could span multiple releases, but in practice it seems unlikely that folks are selecting specific non-tip dead-end targets).
- That target 4.y.z' had no update available to 4.(y+) at that time (so it was then a dead end).
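A rough sketch of that filter, in Python. The record shape (from_version, to_version, channel, and the update targets offered afterwards) is a hypothetical stand-in; the real inputs would come from telemetry plus the historical graph data in Git:
{code:python}
from dataclasses import dataclass

@dataclass
class UpdateEvent:
    from_version: str    # e.g. "4.6.50"
    to_version: str      # e.g. "4.6.60"
    channel: str         # e.g. "eus-4.8"
    targets_after: list  # update targets the graph offered from to_version at that time

def minor(version: str) -> int:
    # "4.6.50" -> 6
    return int(version.split(".")[1])

def is_dead_end_hop(event: UpdateEvent) -> bool:
    """True when a cluster in a later-minor channel updated within its 4.y
    and then had no update available into 4.(y+1) or later."""
    channel_minor = int(event.channel.rsplit("-", 1)[1].split(".")[1])
    in_later_channel = channel_minor > minor(event.from_version)
    same_minor_hop = minor(event.from_version) == minor(event.to_version)
    no_exit = all(minor(t) <= minor(event.to_version) for t in event.targets_after)
    return in_later_channel and same_minor_hop and no_exit

# Example matching the bullets above: 4.6.50 in eus-4.8 updates to 4.6.60,
# and 4.6.60 only offers 4.6.* targets, so it is a dead-end hop.
print(is_dead_end_hop(UpdateEvent("4.6.50", "4.6.60", "eus-4.8", ["4.6.61"])))  # True
{code}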
Note:
This is a potential project for Maaz. I propose he discuss it with the team at the August 15th team meeting.
Definition of done:
- Find out how many customers faced the issue of upgrading to the tip of a 4.y.z stream (before we changed the graph automation) and ended up in a place where there is no upgrade path to 4.(y+1). We can assume that customers in the 4.(y+1) channel or 4.(y+2) EUS channel have the intent of upgrading to the next minor version.
- We should find out how many times we have added versions to the channels (as cincinnati-graph-data commits) where there is no upgrade path to the next minor version (see the sketch below).
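A very rough sketch of how that count could be driven from the Git history. The helpers versions_added_by(commit) and edges_at(commit) are hypothetical: they would have to parse the channel data at each commit and reconstruct the update edges available at that point.
{code:python}
import subprocess

def commits(repo="cincinnati-graph-data"):
    # All commits in the graph-data repository, oldest first.
    out = subprocess.run(
        ["git", "-C", repo, "log", "--reverse", "--format=%H"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()

def minor(version):
    return int(version.split(".")[1])

def count_dead_end_commits(versions_added_by, edges_at):
    count = 0
    for commit in commits():
        edges = edges_at(commit)  # {from_version: [to_version, ...]}
        added = versions_added_by(commit)
        if any(
            not any(minor(t) > minor(v) for t in edges.get(v, []))
            for v in added
        ):
            count += 1  # this commit added at least one dead-end version
    return count
{code}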
is caused by: OTA-694 stabilization-bot: do not include 4.(y-1).z and earlier until they can get to 4.y (Closed)