OCPBUGS-4431: KubePodNotReady - Increase Tolerance During Master Node Restarts


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Target Version: 4.12.z
    • Affects Version/s: 4.13, 4.12
    • Component/s: Monitoring
    • Severity: Important
    • Release Note Text:
      Previously, cluster administrators could not distinguish between a pod being not ready because of a scheduling issue and a pod being not ready because the `kubelet` could not start a pod. In both cases, the `KubePodNotReady` alert would start.

      The {product-title} {product-version} release improves the clarity of pod alerts in the following ways:

      * The `KubePodNotScheduled` alert now starts when a pod is not ready because of a scheduling issue.
      * The `KubePodNotReady` alert starts when the `kubelet` cannot start a pod.

      (link:https://issues.redhat.com/browse/OCPBUGS-4431[*OCPBUGS-4431*])
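      To check which of the two alerts is active on a given cluster, one option is to read the alert list from the Prometheus HTTP API. The sketch below is only an illustration and is not part of this fix; the PROM_URL and PROM_TOKEN environment variables are assumed placeholders for a reachable Prometheus or Thanos Querier endpoint and a bearer token that is allowed to read alerts.

      #!/usr/bin/env python3
      """Minimal sketch: list firing/pending pod alerts via the Prometheus HTTP API."""
      import os

      import requests

      PROM_URL = os.environ["PROM_URL"]      # assumed: e.g. a Thanos Querier route
      PROM_TOKEN = os.environ["PROM_TOKEN"]  # assumed: a token with permission to read alerts

      resp = requests.get(
          f"{PROM_URL}/api/v1/alerts",
          headers={"Authorization": f"Bearer {PROM_TOKEN}"},
          timeout=30,
      )
      resp.raise_for_status()

      # Print only the two alerts discussed in this release note.
      for alert in resp.json()["data"]["alerts"]:
          labels = alert["labels"]
          if labels.get("alertname") in ("KubePodNotReady", "KubePodNotScheduled"):
              print(f'{labels["alertname"]}: {labels.get("namespace")}/{labels.get("pod")} '
                    f'state={alert["state"]}')

      With the split described above, a pod that is stuck for scheduling reasons shows up under KubePodNotScheduled, while a pod the kubelet cannot start keeps firing KubePodNotReady.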

      This is a clone of issue OCPBUGS-2260. The following is the description of the original issue:

      TRT-594 investigates failed CI upgrade runs caused by the KubePodNotReady alert firing. The case was a pod getting skipped over for scheduling across two successive master node updates/restarts. The case was determined to be valid, so the ask is to make the monitoring aware that master nodes are restarting and that scheduling may be delayed. Presuming we don't want to change the existing tolerance for the non-master-restart cases, could we suppress the alert during those restarts and fall back to a second alert with increased tolerances only for that window, provided we have metrics indicating that a restart is in progress? Or similar, if there are better ways to handle it.
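      As one possible reading of "metrics indicating we are restarting", the sketch below checks whether any control-plane node is currently cordoned by evaluating a PromQL expression through the Prometheus query API. This is only an illustration of the kind of signal the ask refers to: the metric and label names (kube_node_spec_unschedulable, kube_node_role{role="master"}) are assumptions about what kube-state-metrics exposes, not the rule this issue ended up shipping, and PROM_URL/PROM_TOKEN are placeholders.

      #!/usr/bin/env python3
      """Minimal sketch: detect a cordoned control-plane node as a restart-in-progress signal."""
      import os

      import requests

      PROM_URL = os.environ["PROM_URL"]      # assumed placeholder
      PROM_TOKEN = os.environ["PROM_TOKEN"]  # assumed placeholder

      # Value 1 for every master/control-plane node currently marked unschedulable
      # (metric and label names are assumptions about the kube-state-metrics version in use).
      QUERY = (
          'kube_node_spec_unschedulable '
          '* on(node) group_left() kube_node_role{role="master"} == 1'
      )

      resp = requests.get(
          f"{PROM_URL}/api/v1/query",
          params={"query": QUERY},
          headers={"Authorization": f"Bearer {PROM_TOKEN}"},
          timeout=30,
      )
      resp.raise_for_status()

      cordoned = [s["metric"].get("node", "<unknown>") for s in resp.json()["data"]["result"]]
      if cordoned:
          print("control-plane update likely in progress on:", ", ".join(cordoned))
      else:
          print("no cordoned control-plane nodes")

      An alert rule or inhibition keyed on a signal like this could widen or suppress the pod-readiness tolerance only while such a node exists, which is the behaviour the description asks about.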

      The scenario is as follows (a sketch of a workload with this kind of anti-affinity follows the list):

      • A master node (1) is out of service during upgrade
      • A pod (A) is created but cannot be scheduled due to anti-affinity rules, as the other nodes already host a pod of that definition
      • A second pod (B) from the same definition is created after the first
      • Pod (A) attempts scheduling but fails as the master (1) node is still updating
      • Master (1) node completes updating
      • Pod (B) attempts scheduling and succeeds
      • Next Master (2) node begins updating
      • Pod (A) cannot be scheduled on the next attempt(s) because the active master nodes already have pods placed and the next master (2) node is unavailable
      • Master (2) node completes updating
      • Pod (A) is scheduled
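      To make the anti-affinity constraint above concrete, here is a minimal sketch of a workload that reproduces the pattern, written against the Python kubernetes client. The deployment name, namespace, image, and the node-role.kubernetes.io/master label/taint keys are illustrative assumptions rather than details taken from the failing CI job: with one replica per control-plane node and hard anti-affinity, a replacement pod created during an update has nowhere to schedule until the node it is waiting for returns.

      #!/usr/bin/env python3
      """Minimal sketch: a 3-replica workload pinned to control-plane nodes with hard anti-affinity."""
      from kubernetes import client, config

      manifest = {
          "apiVersion": "apps/v1",
          "kind": "Deployment",
          "metadata": {"name": "antiaffinity-demo", "namespace": "default"},
          "spec": {
              "replicas": 3,  # one pod per control-plane node
              "selector": {"matchLabels": {"app": "antiaffinity-demo"}},
              "template": {
                  "metadata": {"labels": {"app": "antiaffinity-demo"}},
                  "spec": {
                      # Assumed label/taint names; they vary across OpenShift versions.
                      "nodeSelector": {"node-role.kubernetes.io/master": ""},
                      "tolerations": [{
                          "key": "node-role.kubernetes.io/master",
                          "operator": "Exists",
                          "effect": "NoSchedule",
                      }],
                      # Hard anti-affinity: never co-locate two replicas on one node,
                      # which is what forces pod (A) to wait for a master to come back.
                      "affinity": {"podAntiAffinity": {
                          "requiredDuringSchedulingIgnoredDuringExecution": [{
                              "labelSelector": {"matchLabels": {"app": "antiaffinity-demo"}},
                              "topologyKey": "kubernetes.io/hostname",
                          }]
                      }},
                      "containers": [{
                          "name": "sleep",
                          "image": "registry.access.redhat.com/ubi8/ubi-minimal",
                          "command": ["sleep", "infinity"],
                      }],
                  },
              },
          },
      }

      config.load_kube_config()  # or load_incluster_config() when run inside the cluster
      client.AppsV1Api().create_namespaced_deployment(namespace="default", body=manifest)

      Recreating one of these pods while its node is cordoned for an update produces a pod waiting on an unavailable control-plane node, similar to the timeline described in the list above.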

            Assignee: Simon Pasquier (spasquie@redhat.com)
            Reporter: OpenShift Prow Bot (openshift-crt-jira-prow)
            QA Contact: Junqi Zhao
            Votes: 0
            Watchers: 5
