Type: Epic
Resolution: Unresolved
Summary: Improve experience of stalled MCP rollouts
Status: To Do
Parent Feature: OCPSTRAT-180 - Improve upgrades - phase 1
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Epic Goal
- Admins know why pools fail to drain successfully during a rollout
- Admins may define a policy that ensures pool rollouts continue despite a failure to drain a node
Why is this important?
- The parent Feature links to summaries of customer feedback sessions where this has come up
- Today it's not uncommon for MachineConfigPools other than the master pool to fail to roll out due to a blocked drain operation:
  - The drain can be blocked by workload policy such as PodDisruptionBudget (PDB) constraints
  - The drain can be blocked by pods which fail to terminate due to product defects
- Today, debugging this is roughly as follows (see the command sketch after this list):
  - Observe Degraded pools (4.7) or the Upgradeable=False condition (4.8)
  - Identify the hung MCP
  - Identify the cordoned nodes
  - Identify the relevant machine-config-daemon (MCD) pod
  - Review the MCD logs, then decide on a next action:
    - oc delete pod --force
    - wait longer
    - reboot the node?
    - call support and ask for help
- Most admins can't get from point A to point C on their own, and we don't provide documentation walking them through those steps either
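A minimal sketch of that triage flow with the oc CLI (node and pod names are placeholders):

    # Find pools that are stuck (DEGRADED=True, or UPDATING=True for a long time)
    oc get mcp

    # Find cordoned nodes (SchedulingDisabled) in the hung pool
    oc get nodes

    # Find the machine-config-daemon pod running on the stuck node
    oc get pods -n openshift-machine-config-operator -o wide | grep <node-name>

    # Review its logs to see which pod is blocking the drain
    oc logs -n openshift-machine-config-operator <machine-config-daemon-pod>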
Scenarios
- MCP rollout hung:
  - The MCP clearly indicates which node and pod are inhibiting further progress
  - The console surfaces this detail
- Admin defines a policy which ensures the pool rollout completes (an illustrative sketch follows this list)
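As a purely illustrative sketch of what that policy could look like (no such API exists today; the fields below are invented for discussion), it might be a knob on the MachineConfigPool:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: worker
    spec:
      # Hypothetical field: how long to wait for a node to drain
      drainTimeout: 1h
      # Hypothetical field: what to do when the timeout expires,
      # e.g. continue past pods we deleted that never terminated,
      # but never override a PDB
      drainTimeoutAction: ForceIfNoPDB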
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- ...
Dependencies (internal and external)
- ...
Previous Work (Optional):
- 4.7 made pools go Degraded when draining a node took longer than 1hr
- 4.8 added admin notice by way of setting Upgradeable=False with Reason: PoolUpdating, Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
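One way to inspect that 4.8 condition directly (standard jsonpath against the machine-config ClusterOperator):

    oc get clusteroperator machine-config \
      -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}'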
Open questions:
- Is it really safe to let the admin apply a policy that overrides either PDB-blocked evictions or pods that fail to terminate?
OpenShift Dedicated's Managed Upgrade Operator forces completion after waiting 1 hour, and anecdotally they've never had complaints.
However, products like OCS rely on PDBs to prevent data loss; if we start overriding PDBs, it's very likely that customers will lose data and be very unhappy with us.
We should probably limit ourselves to first making the blockage obvious and allowing the admin to intervene in both cases. Then consider moving forward automatically only in the scenario where there is no PDB protecting the pod, i.e. cases where we've deleted the pod but it hasn't terminated successfully (sketched below).
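A rough sketch of how the two cases could be told apart today with stock oc and jq (assumes jq is installed):

    # Pods we've already asked to delete but which never terminated
    # (deletionTimestamp is set, yet the pod is still listed)
    oc get pods -A -o json | jq -r '.items[]
      | select(.metadata.deletionTimestamp != null)
      | "\(.metadata.namespace)/\(.metadata.name)"'

    # PDBs that currently allow zero disruptions, i.e. would block an eviction
    oc get pdb -A -o json | jq -r '.items[]
      | select(.status.disruptionsAllowed == 0)
      | "\(.metadata.namespace)/\(.metadata.name)"'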
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>