OCPBUGS-50011: 4.17 Failed worker reboot in HA topology prevents cluster deployment completion


      * Previously, when a worker node was trying to join a cluster, the rendezvous node rebooted before the process completed. Because the worker node could not communicate as expected with the rendezvous node, the installation did not succeed. With this release, a patch fixes the race condition that caused the rendezvous node to reboot prematurely, and the issue is resolved.
      (link:https://issues.redhat.com/browse/OCPBUGS-41811 [*OCPBUGS-41811*])
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-41811. The following is the description of the original issue:

      Component Readiness has found a potential regression in the following test:

      operator conditions network

      Probability of significant regression: 99.42%

      Sample (being evaluated) Release: 4.17
      Start Time: 2024-09-04T00:00:00Z
      End Time: 2024-09-11T23:59:59Z
      Success Rate: 60.00%
      Successes: 6
      Failures: 4
      Flakes: 0

      Base (historical) Release: 4.16
      Start Time: 2024-05-28T00:00:00Z
      End Time: 2024-06-27T23:59:59Z
      Success Rate: 100.00%
      Successes: 22
      Failures: 0
      Flakes: 0

      View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=metal&Platform=metal&Scheduler=default&SecurityMode=default&Suite=parallel&Suite=parallel&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-28%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=Networking%20%2F%20cluster-network-operator&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20metal%20parallel%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-09-11%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-09-04%2000%3A00%3A00&testId=Operator%20results%3A4b5f6af893ad5577904fbaec3254506d&testName=operator%20conditions%20network
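
      The 99.42% figure is consistent with a one-sided Fisher's exact test on the pass/fail counts above. The short sketch below reproduces it from those counts; treat the test choice as an assumption about what Component Readiness computes, not a statement of its exact implementation.

      # Sketch: reproduce the regression probability from the counts above.
      # Assumption: the Component Readiness score roughly matches a one-sided
      # Fisher's exact test on the sample vs. base pass/fail counts.
      from scipy.stats import fisher_exact

      table = [
          [6, 4],    # 4.17 sample: 6 successes, 4 failures
          [22, 0],   # 4.16 base: 22 successes, 0 failures
      ]

      # alternative="less": is the sample's success/failure odds ratio lower
      # than the base's, i.e. did the sample regress?
      _, p_value = fisher_exact(table, alternative="less")

      print(f"p-value: {p_value:.4f}")
      print(f"probability of regression: {(1 - p_value) * 100:.2f}%")
      # With these counts this prints roughly 99.42%, matching the report.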

      The bug is being opened against the component where we see the regression. However, 4 out of 10 jobs failed the install with missing worker nodes.

      Slack thread

      Had a look at the agent-gather and found only the worker ones (as expected, since they did not join the bootstrap). From the journal, the workers were able to fetch the ignition, i.e.
      ...
      set 09 21:19:27 worker-0 assisted-installer[2823]: time="2024-09-09T19:19:27Z" level=info msg="Getting ignition from https://192.168.111.5:22623/config/worker"
      ...
      and were ready to reboot:
      ...
      set 09 21:20:31 worker-0 assisted-installer[2823]: time="2024-09-09T19:20:31Z" level=info msg="Uploading logs and reporting status before rebooting the node 2cc1856b-c4f5-4e3e-9117-128cf97e1d15 for cluster abe52fa7-9e7e-465e-a5c5-457ecb49bb70"
      ...
      But the reboot never happened, and the workers remained stuck (thus never completing the join procedure). The reason they got stuck is not yet clear.
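
      For anyone re-checking a similar agent-gather, a quick sanity check is to grep the worker journal for the two markers quoted above and confirm that nothing follows the reboot announcement. The sketch below does that; the journal path is hypothetical and depends on how the gather archive is laid out.

      # Sketch: list the assisted-installer journal lines that match the two
      # markers above. The file path comes from the command line and is
      # hypothetical; point it at the extracted worker journal text.
      import re
      import sys

      MARKERS = (
          re.compile(r"Getting ignition from https://"),
          re.compile(r"Uploading logs and reporting status before rebooting"),
      )

      def scan(journal_path: str) -> None:
          """Print assisted-installer lines matching the markers of interest."""
          with open(journal_path, encoding="utf-8", errors="replace") as journal:
              for line in journal:
                  if "assisted-installer" in line and any(m.search(line) for m in MARKERS):
                      print(line.rstrip())

      if __name__ == "__main__":
          scan(sys.argv[1])  # e.g. python scan_journal.py worker-0-journal.txt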
      
      
      
