Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.15.z, 4.16.0
Component/s: Installer / Agent based installation
Labels:
- triaged

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Important
Regression:
No

Target Backport Versions:

4.15.z, 4.16.z
Target Version:

4.16.z
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Priority Data:
PX Impact Score:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:

* Previously, the `assisted-installer` did not reload new data from the `assisted-service` when the `assisted-installer` checked control plane nodes for readiness and a conflict existed with a write operation from the `assisted-installer-controller`. This conflict prevented the `assisted-installer` from detecting a node that was marked by the `assisted-installer-controller` as `Ready` because the `assisted-installer` relied on older information. With this release, the `assisted-installer` can receive the newest information from the `assisted-service`, so that the `assisted-installer` can accurately detect the status of each node. (link:https://issues.redhat.com/browse/OCPBUGS-37167[*OCPBUGS-37167*])


Escape Reason:
Escape Impact:
Corrective Measures:
SDLC stage when should've been found:
None

This is a clone of issue OCPBUGS-36779. The following is the description of the original issue:
—
Previously, in OCPBUGS-32105, we fixed a bug where a race between the assisted-installer and the assisted-installer-controller to mark a Node as Joined would result in 30+ minutes of (unlogged) retries by the former if the latter won. This was indistinguishable from the installation process hanging, and it would eventually time out.

This bug has been fixed, but we were unable to reproduce the circumstances that caused it.

However, a reproduction by the customer reveals another problem. We now correctly retry checking the control plane nodes for readiness if we encounter a conflict with another write from assisted-installer-controller, but we never reload fresh data from assisted-service: data that would show the host has already been updated and thus prevent us from trying to update it again. Therefore, we continue to get a conflict on every retry. (This is at least now logged, so we can see what is happening.)
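
The retry behaviour described above can be illustrated with a toy model. This is only a sketch, not the assisted-installer code (which is a separate project): `FakeService`, `mark_ready_hosts`, `simulate`, the host name `master-1`, and the stage labels are all hypothetical, and the rule that the service rejects a second "Joined" update is an assumption made for illustration.

```python
class Conflict(Exception):
    """The service rejected a write because the host was already updated."""


class FakeService:
    """Hypothetical in-memory stand-in for assisted-service (not its real API)."""

    def __init__(self, hosts):
        self._stages = dict(hosts)

    def list_hosts(self):
        # Return a copy: callers get a snapshot, as they would from a remote API.
        return dict(self._stages)

    def mark_joined(self, host):
        # Assumed rule for this sketch: updating an already-Joined host conflicts.
        if self._stages[host] == "Joined":
            raise Conflict(host)
        self._stages[host] = "Joined"


def mark_ready_hosts(service, inventory, ready_hosts, reload_on_conflict, max_retries=5):
    """Try to mark every ready host as Joined; return the number of conflicts."""
    conflicts = 0
    for _ in range(max_retries):
        # Decide what still needs updating based on the inventory we hold.
        pending = [h for h in ready_hosts if inventory[h] != "Joined"]
        if not pending:
            break
        for host in pending:
            try:
                service.mark_joined(host)
                inventory[host] = "Joined"
            except Conflict:
                conflicts += 1
                if reload_on_conflict:
                    # Fresh data shows the host is already Joined,
                    # so the next pass has nothing left to do.
                    inventory = service.list_hosts()
    return conflicts


def simulate(reload_on_conflict):
    service = FakeService({"master-1": "Rebooting"})
    inventory = service.list_hosts()   # installer reads the inventory
    service.mark_joined("master-1")    # controller wins the race
    return mark_ready_hosts(service, inventory, ["master-1"], reload_on_conflict)


if __name__ == "__main__":
    print("conflicts with stale inventory:", simulate(reload_on_conflict=False))
    print("conflicts with reload on conflict:", simulate(reload_on_conflict=True))
```

In this model, the stale variant conflicts on every retry until it gives up, while the reloading variant conflicts once, sees that the host is already updated, and stops.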

This also suggests a potential way to reproduce the problem: whenever one control plane node has booted to the point that the assisted-installer-controller is running before the second control plane node has booted to the point that the Node is marked as ready in the k8s API, there is a possibility of a race. There is in fact no need for the write from assisted-installer-controller to come in the narrow window between when assisted-installer reads vs. writes to the assisted-service API, because assisted-installer is always using a stale read.
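
To make the point about the window concrete, here is a small timeline sketch. It assumes a simplified model with integer "ticks" and hypothetical function names; it is not derived from the real components. The installer reads the inventory at `read_tick` and writes at `installer_write_tick`, and the controller writes at `controller_write_tick`.

```python
def installer_conflicts(read_tick, controller_write_tick, installer_write_tick):
    """Return True if the installer's write hits a conflict in this toy timeline."""
    if controller_write_tick <= read_tick:
        # The installer's read already shows the controller's update,
        # so it has no reason to write at all.
        return False
    # The read predates the controller's update; the write conflicts
    # if the controller got there first.
    return controller_write_tick < installer_write_tick


def conflict_window(read_tick, installer_write_tick):
    """List the controller write times that lead to a conflict."""
    return [
        t for t in range(installer_write_tick + 1)
        if installer_conflicts(read_tick, t, installer_write_tick)
    ]


if __name__ == "__main__":
    # Stale read taken once at tick 0, write attempted at tick 10.
    print("stale read:", conflict_window(read_tick=0, installer_write_tick=10))
    # Read taken shortly before the write.
    print("fresh read:", conflict_window(read_tick=8, installer_write_tick=10))
```

With a read taken once at the start, almost any controller write time before the installer's write produces a conflict. With a read taken just before the write, only the ticks strictly between the two do.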

blocks

OCPBUGS-38003 3rd master still not joining to the cluster on ABI

Closed

clones

OCPBUGS-36779 3rd master still not joining to the cluster on ABI

Closed

is blocked by

OCPBUGS-36779 3rd master still not joining to the cluster on ABI

Closed

is cloned by

OCPBUGS-38003 3rd master still not joining to the cluster on ABI

Closed

split from

OCPBUGS-32105 The third master is not joining to the cluster on an Agent Based Installations

Closed

links to

openshift/assisted-installer#881: [release-4.16] OCPBUGS-37167: Reload host inventory on conflict

RHBA-2024:5107 OpenShift Container Platform 4.16.z bug fix update


Assignee: Zane Bitter
Reporter: OpenShift Prow Bot
QA Contact: Manoj Hans
Need Info From: None
Votes: 0
Watchers: 4
Created: 2024/07/17 4:27 AM
Updated: 2025/07/22 5:39 AM
Resolved: 2024/08/13 9:55 AM
