Type: Epic
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Epic Name:
Improve experience of stalled MCP rollouts
Blocked:
False
Ready:
False
Dev Approval:
?
Discussed with Team:
No
Docs Approval:
?
Epic Status:
To Do
Feature Link:
OCPSTRAT-180 - Improve upgrades - phase 1
PM Approval:
?
Parent Link:
OCPSTRAT-180 - Improve upgrades - phase 1
QE Approval:
?
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Release Note Text:
Undefined


Epic Goal

Admins know why pools fail to drain successfully during a rollout
Admins may define policy which ensures pool rollouts continue despite failure to drain a node

Why is this important?

The parent Feature links to summaries of customer feedback sessions where this has come up
Today it's not uncommon for MachineConfigPools other than the master pool to fail to roll out because of a blocked drain operation

- Can be triggered by workload policy such as PDB constraints
- Can be triggered by pods which fail to terminate due to product defects
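The PDB case can be checked mechanically. The following is a minimal sketch, not the MCO's implementation: it assumes pod and PodDisruptionBudget objects have already been loaded as dictionaries (for example from `-o json` output), handles only `matchLabels` selectors, and treats a PDB as blocking when `status.disruptionsAllowed` is 0. The function names are hypothetical.

```python
def selector_matches(match_labels: dict, pod_labels: dict) -> bool:
    """Return True when every selector key/value pair is present on the pod."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


def pods_blocked_by_pdb(pods: list, pdbs: list) -> list:
    """List (namespace, pod, pdb) triples where a PDB currently forbids eviction.

    Simplifications (assumptions of this sketch):
    - only spec.selector.matchLabels is evaluated, matchExpressions are ignored
    - PDBs with an empty selector are skipped
    - a missing status.disruptionsAllowed is treated as 0 (blocking)
    """
    blocked = []
    for pod in pods:
        meta = pod["metadata"]
        for pdb in pdbs:
            if pdb["metadata"]["namespace"] != meta["namespace"]:
                continue
            selector = pdb.get("spec", {}).get("selector") or {}
            match_labels = selector.get("matchLabels") or {}
            if not match_labels:
                continue
            allowed = pdb.get("status", {}).get("disruptionsAllowed", 0)
            if allowed == 0 and selector_matches(match_labels, meta.get("labels", {})):
                blocked.append((meta["namespace"], meta["name"], pdb["metadata"]["name"]))
    return blocked
```

Run against the pods on a cordoned node, a non-empty result points at the PDBs that would need attention before the drain can proceed.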
Today, debugging such a failure goes roughly as follows:
- Observe degraded pools (4.7) or the Upgradeable=False condition (4.8)
- Identify the hung MCP
- Identify cordoned nodes
- Identify the relevant MCD (machine-config-daemon) pod
- Review the MCD logs and decide on the next action
  - oc delete pod --force
  - wait longer
  - reboot node?
  - call support, ask for help
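The middle of that walkthrough (cordoned node, then the MCD pod on it) can be sketched as a small helper. This is an illustration under assumptions, not existing tooling: it expects the `items` lists from `oc get nodes -o json` and `oc get pods -n openshift-machine-config-operator -o json`, and it assumes MCD pods carry the label `k8s-app=machine-config-daemon`. The function names are hypothetical.

```python
MCD_NAMESPACE = "openshift-machine-config-operator"
# Assumption: label used to select machine-config-daemon pods.
MCD_LABEL = ("k8s-app", "machine-config-daemon")


def cordoned_nodes(nodes: list) -> list:
    """Names of nodes marked unschedulable (cordoned)."""
    return [
        n["metadata"]["name"]
        for n in nodes
        if n.get("spec", {}).get("unschedulable", False)
    ]


def mcd_pod_for_node(pods: list, node_name: str):
    """Name of the MCD pod scheduled on node_name, or None if not found."""
    key, value = MCD_LABEL
    for pod in pods:
        meta = pod["metadata"]
        if (
            meta.get("namespace") == MCD_NAMESPACE
            and meta.get("labels", {}).get(key) == value
            and pod.get("spec", {}).get("nodeName") == node_name
        ):
            return meta["name"]
    return None


def triage(nodes: list, pods: list) -> dict:
    """Map each cordoned node to the MCD pod whose logs should be reviewed."""
    return {node: mcd_pod_for_node(pods, node) for node in cordoned_nodes(nodes)}
```

Each returned pod name is the one whose logs the admin would review before choosing among the actions listed above.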
Most admins can't get from point A to point C, and we don't provide documentation that talks them through those steps either

Scenarios

MCP rollout hung
MCP clearly indicates which node and pod are inhibiting further progress
Console surfaces this detail
Admin defines policy which ensures pool rollout completes

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

...

Previous Work (Optional):

4.7 made pools go Degraded when draining a node took longer than 1hr
4.8 added admin notice by way of setting Upgradeable=False with Reason: PoolUpdating, Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Open questions:

Is it really safe to allow the admin to apply policy that overrides either PDB-blocked evictions or evictions of pods that fail to terminate?

OpenShift Dedicated's Managed Upgrade Operator forces completion after waiting 1 hour, and anecdotally there have been no complaints.
However, products such as OCS rely on PDBs to prevent data loss. If we start overriding PDBs, it's very likely that customers will lose data and be very unhappy with us.

We should probably limit ourselves at first to making the problem obvious and allowing the admin to intervene in both cases. Then consider moving forward only in the scenario where there is no PDB protecting the pod, i.e., cases where we've deleted the pod but it hasn't terminated successfully.
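The proposed split between the two cases can be written down as a small decision function. This is only a sketch of the reasoning in this ticket, not an agreed design: the action names are hypothetical, and "deleted but not terminated" is approximated by the presence of `metadata.deletionTimestamp` on the pod.

```python
REPORT_ONLY = "report-only"                  # hypothetical action name
FORCE_ELIGIBLE = "eligible-for-forced-completion"  # hypothetical action name


def next_action(pod: dict, pdb_protected: bool) -> str:
    """Classify a pod that is holding up a drain.

    - PDB-protected pods are only ever reported; overriding a PDB risks data loss.
    - Unprotected pods that were deleted but have not terminated are the only
      candidates for policy-driven forced completion.
    - Anything else is reported so the admin can intervene.
    """
    terminating = pod.get("metadata", {}).get("deletionTimestamp") is not None
    if pdb_protected:
        return REPORT_ONLY
    if terminating:
        return FORCE_ELIGIBLE
    return REPORT_ONLY
```

Keeping PDB-protected pods on the report-only path reflects the data-loss concern raised above; only the second branch would ever act without the admin.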

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

is related to

RFE-1858 [RFE] Config to allow setting node drain timeout for node upgrades (Refinement)
MCO-118 Sort Node Updates by Zone and Age (Closed)
MCO-749 Improve experience of stalled MCP rollouts (Closed)
RFE-3134 Warnings for blocking PodDisruptionBudgets (Closed)
