OpenShift Bugs / OCPBUGS-2260

KubePodNotReady - Increase Tolerance During Master Node Restarts


Details

    • Bug
    • Resolution: Done
    • Critical
    • None
    • 4.13, 4.12
    • Monitoring
    • Important
    • Approved
    • False
    • Reducing the alerting noise during upgrades is important enough that we want this to be fixed before GA.
    • Previously, the Kubernetes scheduler could skip scheduling certain pods for a node that received multiple restart operations. {product-title} {product-version} counteracts this issue by including the `KubePodNotScheduled` alert for pods that cannot be scheduled within 30 minutes. (link:https://issues.redhat.com/browse/OCPBUGS-2260[*OCPBUGS-2260*])
    • Bug Fix
    • Done

    Description

      TRT-594 investigates failed CI upgrade runs caused by the KubePodNotReady alert firing. In that case a pod was repeatedly skipped for scheduling across two successive master node updates/restarts. The scenario was determined to be valid behavior, so the ask is to make the monitoring aware that master nodes are restarting and that scheduling may therefore be delayed. Assuming we do not want to change the existing tolerance for cases where no master node is restarting, could we suppress the alert during those restarts and fall back to a second alert with an increased tolerance for that window, provided we have metrics indicating that a restart is in progress? Alternative approaches are welcome if there is a better way to handle this.
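      The sketch below shows one possible shape for such an increased-tolerance alert, expressed as a PrometheusRule manifest with a 30 minute `for` duration matching the tolerance mentioned in the release note above. It is illustrative only: the rule and alert names are hypothetical, the expression assumes the kube_pod_status_unschedulable metric from kube-state-metrics is available, and it is not claimed to be the rule actually shipped for this fix.

      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        name: pod-scheduling-tolerance-sketch    # hypothetical name
        namespace: openshift-monitoring
      spec:
        groups:
        - name: scheduling-tolerance.rules
          rules:
          - alert: KubePodNotScheduledSketch      # hypothetical alert name
            # Fire only after a pod has been unschedulable for 30 minutes,
            # long enough to ride out successive master node restarts.
            expr: |
              max by (namespace, pod) (
                kube_pod_status_unschedulable{job="kube-state-metrics"}
              ) == 1
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been unschedulable for 30 minutes.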

      The scenario is:

      • A master node (1) is out of service during the upgrade
      • A pod (A) is created but cannot be scheduled because of anti-affinity rules: the other eligible nodes already host a pod from the same definition (see the deployment sketch after this list)
      • A second pod (B) from the same definition is created after the first
      • Pod (A) attempts scheduling but fails because master node (1) is still updating
      • Master node (1) completes updating
      • Pod (B) attempts scheduling and succeeds
      • The next master node (2) begins updating
      • Pod (A) cannot be scheduled on the next attempt(s) because the active master nodes already have pods placed and master node (2) is unavailable
      • Master node (2) completes updating
      • Pod (A) is scheduled
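      The anti-affinity constraint in the scenario can be illustrated with a manifest like the sketch below: required pod anti-affinity spreads one replica per control-plane node, so while one master node is restarting a replacement replica has no eligible node and stays Pending. All names, the namespace, and the image are hypothetical placeholders, not taken from the failing CI jobs.

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: example-controller          # hypothetical name
        namespace: example-namespace      # hypothetical namespace
      spec:
        replicas: 3                       # one replica per master node
        selector:
          matchLabels:
            app: example-controller
        template:
          metadata:
            labels:
              app: example-controller
          spec:
            nodeSelector:
              node-role.kubernetes.io/master: ""
            tolerations:
            - key: node-role.kubernetes.io/master
              operator: Exists
              effect: NoSchedule
            affinity:
              podAntiAffinity:
                # Each replica must land on a different node; if one master
                # is down, a recreated replica cannot be scheduled anywhere.
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      app: example-controller
                  topologyKey: kubernetes.io/hostname
            containers:
            - name: controller
              image: registry.example.com/controller:latest   # placeholder image
              resources: {}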


            People

              dfitzmau@redhat.com Darragh Fitzmaurice
              rh-ee-fbabcock Forrest Babcock
              Junqi Zhao Junqi Zhao
