Type: Bug
Resolution: Obsolete
Priority: Minor
Fix Version/s: 4.11.z
Affects Version/s: 4.11.z
Component/s: RHCOS
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:
A cluster was upgrading from 4.10.45 to 4.11.25, but the upgrade got stuck: the machine-config operator went down because one of the machine-config-daemon pods was crashing with:

F0206 23:04:46.540997  883811 start.go:154] Failed to initialize single run daemon: error initializing rpm-ostree: error running systemctl start rpm-ostreed: Job for rpm-ostreed.service failed because the control process exited with error code.
See "systemctl status rpm-ostreed.service" and "journalctl -xe" for details.
: exit status 1

trking also found this in the affected node's logs:

$ zgrep -o ' rpm-ostree[[].*' ip-10-44-225-172.ec2.internal.log.gz | sed 's/^[^:]*: //' | sort | uniq -c | sort -n | tail
   2 In idle state; will auto-exit in 62 seconds
   2 In idle state; will auto-exit in 64 seconds
   2 Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful
   3 In idle state; will auto-exit in 63 seconds
   3 Locked sysroot
   3 Reading config file '/etc/rpm-ostreed.conf'
   3 Unlocked sysroot
   5 In idle state; will auto-exit in 60 seconds
   7 In idle state; will auto-exit in 61 seconds
   68 error: The maximum number of active connections for UID 0 has been reached
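
For anyone who wants to repeat this tally on other node logs without the shell pipeline, here is a rough Python sketch of the same counting. It assumes journal-style lines of the form "... rpm-ostree[PID]: message"; the function names are made up for illustration and are not part of any OpenShift tooling.

```python
import gzip
import re
from collections import Counter

# Assumption: each relevant line contains " rpm-ostree[<pid>]: <message>".
# The capture group keeps only the message, like the sed step in the pipeline.
RPM_OSTREE_RE = re.compile(r" rpm-ostree\[[^\]]*\]: (.*)")


def count_rpm_ostree_messages(lines):
    """Count identical rpm-ostree messages in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RPM_OSTREE_RE.search(line)
        if match:
            # Drop trailing whitespace so padded lines count as the same message.
            counts[match.group(1).rstrip()] += 1
    return counts


def count_from_gzip(path):
    """Read a gzip-compressed node log and tally its rpm-ostree messages."""
    with gzip.open(path, "rt", errors="replace") as fh:
        return count_rpm_ostree_messages(fh)
```

Calling count_from_gzip("ip-10-44-225-172.ec2.internal.log.gz").most_common(10) should list the ten most frequent messages, most frequent first (the reverse of the `sort -n | tail` ordering above).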

This is similar to https://bugzilla.redhat.com/show_bug.cgi?id=2111817#c22. However, https://github.com/openshift/machine-config-operator/pull/3292/files means that the workaround should already have been applied, and we (managed OpenShift SRE) should not have received an alert for the machine-config operator going down.

Draining the affected worker node did help: the MCD pod was able to start running again.

Version-Release number of selected component (if applicable):

4.11.25 (while upgrading from 4.10.45)

Assignee: Unassigned

Reporter: Karthik Perumal

Need Info From: Karthik Perumal

Contributors: None

QA Contact: Rio Liu

Doc Contact: None

Votes: 0

Watchers: 10

Created: 2023/02/07 12:49 AM

Updated: 2025/07/28 5:36 AM

Resolved: 2024/03/13 8:01 AM
