- Type: Bug
- Resolution: Done
- Priority: Critical
- Version(s): 4.13, 4.12
- Severity: Important
- Release Note Type: Bug Fix
- Status: Done
TRT-594 investigates failed CI upgrade runs caused by the KubePodNotReady alert firing. The case was a pod being skipped over for scheduling across two successive master node update/restarts. The case was determined to be valid, so the ask is to make the monitoring aware that master nodes are restarting and that scheduling may be delayed. Assuming we don't want to change the existing tolerance for the non-master-restart cases, could we suppress the alert during those restarts and fall back to a second alert with increased tolerances that applies only while they are in progress, provided we have metrics indicating that a restart is underway? Or something similar, if there are better ways to handle it.
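One possible shape for this, sketched below as a PrometheusRule: keep the existing KubePodNotReady rule quiet while any master node is cordoned (using the kube-state-metrics series kube_node_spec_unschedulable and kube_node_role as a rough "a master is restarting" signal), and add a second rule with a wider `for:` window that still covers those restarts. The rule and alert names, the 45m tolerance, and the simplified base expression are illustrative assumptions, not the shipped definitions.

```yaml
# Illustrative sketch only: names, durations, and the simplified base
# expression are assumptions; the shipped KubePodNotReady rule is more involved.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-pod-not-ready-master-update   # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: kube-pod-not-ready
    rules:
    # Normal-path alert: suppressed while any master node is cordoned,
    # which is what happens while the machine-config operator updates it.
    - alert: KubePodNotReady
      expr: |
        (
          sum by (namespace, pod) (
            max by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown"})
          ) > 0
        )
        unless on ()
        (
          count(
            kube_node_spec_unschedulable == 1
            and on (node) kube_node_role{role="master"}
          ) > 0
        )
      for: 15m
      labels:
        severity: warning
    # Fallback alert with a wider tolerance that still fires if a pod stays
    # unready across the whole rolling restart of the masters.
    - alert: KubePodNotReadyDuringMasterUpdate   # hypothetical name
      expr: |
        sum by (namespace, pod) (
          max by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown"})
        ) > 0
      for: 45m
      labels:
        severity: warning
```

An Alertmanager inhibition rule keyed on a "masters are updating" alert would be another way to express the same suppression without touching the alert expression itself.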
The scenario is:
- A master node (1) is out of service during the upgrade
- A pod (A) is created but cannot be scheduled due to anti-affinity rules, as the other nodes already host a pod of that definition (a sketch of this kind of workload follows the list)
- A second pod (B) from the same definition is created after the first
- Pod (A) attempts scheduling again but fails, as master node (1) is still updating
- Master node (1) completes updating
- Pod (B) attempts scheduling and succeeds
- The next master node (2) begins updating
- Pod (A) cannot be scheduled on the next attempt(s), as the active master nodes already have pods placed and master node (2) is unavailable
- Master node (2) completes updating
- Pod (A) is scheduled
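For context on the anti-affinity step above, here is a minimal sketch of the kind of workload that behaves this way: three replicas pinned to control-plane nodes with a required (hard) anti-affinity on hostname, so each master can host at most one replica and a replacement pod has nowhere to schedule while a master is draining. All names are placeholders, not the actual affected component.

```yaml
# Placeholder names; this only illustrates the scheduling constraints
# described in the scenario, not the actual affected workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-control-plane-operand
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-control-plane-operand
  template:
    metadata:
      labels:
        app: example-control-plane-operand
    spec:
      # Restrict the replicas to master nodes and tolerate the master taint.
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      affinity:
        podAntiAffinity:
          # Hard anti-affinity: no two replicas may share a node, so with
          # three masters and one of them cordoned for an update, a
          # replacement pod stays Pending until that master is back.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: example-control-plane-operand
            topologyKey: kubernetes.io/hostname
      containers:
      - name: operand
        image: registry.example.com/example/operand:latest
```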
- blocks: OCPBUGS-4431 KubePodNotReady - Increase Tolerance During Master Node Restarts (Closed)
- is cloned by: OCPBUGS-4431 KubePodNotReady - Increase Tolerance During Master Node Restarts (Closed)
- links to