The customer has multiple clusters running OpenShift v4.16.36 and v4.16.38.
They run a large fleet of Ansible pods on these clusters. After upgrading from 4.15 to 4.16.36/38, they started observing pods periodically getting stuck in the ContainerCreating state.
On several occasions, nodes go into NotReady state and are later replaced by MachineHealthCheck.
This is happening fleet-wide. The customer is using Azure Red Hat OpenShift.
We can see periodic spikes in etcd, and the application logs indicate:
`"error":"leader election lost"`
Checking the API calls made on the cluster over the last 2 days, the following results were concerning:
| user_username | verb | occurrences |
| --- | --- | --- |
| system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount | update | 1820155 |
| system:serviceaccount:kube-system:deployment-controller | update | 1212919 |
| system:serviceaccount:openshift-machine-config-operator:machine-config-operator | get | 1077703 |
| system:serviceaccount:openshift-gitops-operator:openshift-gitops-operator-controller-manager | update | 895522 |
| system:serviceaccount:aap-jobs:aap-jobs | get | 760996 |
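For reference, here is a minimal sketch of how per-user/per-verb counts like the table above can be derived from the kube-apiserver audit logs. The directory layout and file naming are assumptions (e.g. audit logs gathered separately via `oc adm must-gather -- /usr/bin/gather_audit_logs`); adjust the path to match the actual archive.

```python
#!/usr/bin/env python3
"""Count API calls per (username, verb) from kube-apiserver audit logs.

Assumption: the argument points at a directory of *.log / *.log.gz audit files,
where each line is one JSON-encoded Kubernetes audit event.
"""
import gzip
import json
import sys
from collections import Counter
from pathlib import Path

# Hypothetical default path inside an unpacked must-gather; adjust as needed.
audit_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("audit_logs/kube-apiserver")

counts = Counter()
for path in audit_dir.rglob("*"):
    if path.suffix not in (".log", ".gz"):
        continue
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", errors="replace") as fh:
        for line in fh:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or partial lines
            username = event.get("user", {}).get("username", "<unknown>")
            verb = event.get("verb", "<unknown>")
            counts[(username, verb)] += 1

# Print the top callers, mirroring the table above.
for (username, verb), n in counts.most_common(20):
    print(f"{n:>10}  {verb:<8}  {username}")
```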
I am attaching a must-gather and an sosreport from the cluster as well.
Could this be a regression of https://issues.redhat.com/browse/OCPBUGS-48696?