Type: Bug
Resolution: Done-Errata
Affects Versions: 4.16.z, 4.18.z, 4.19
Severity: Important
Description of problem:
When applying a CGU with two clusters and a max remediation concurrency of 1, where the first cluster is expected to fail because it is powered off, the second cluster should still succeed. However, it does not: .status.clusters.<should succeed>.currentPolicy is stuck at NonCompliant even though the policy itself is compliant. The TALM pod logs show a repeating panic during reconcile, which appears to be the source of the issue.
The panic comes from a line in the reconcile code: the powered-off cluster has no entry in CurrentBatchRemediationProgress, so the lookup returns nil, and setting the FirstCompliantAt field on that nil entry is a nil pointer dereference.
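A minimal Go sketch of the failure mode and the guard that avoids it; the ClusterRemediationProgress type and field names here are illustrative stand-ins matching the names in this report, not TALM's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// ClusterRemediationProgress is an illustrative stand-in for the per-cluster
// progress record; the real TALM type differs.
type ClusterRemediationProgress struct {
	FirstCompliantAt time.Time
}

func main() {
	// The powered-off cluster never started remediating, so it has no entry
	// in the current batch's progress map.
	progress := map[string]*ClusterRemediationProgress{
		"healthy-cluster": {},
	}

	for _, cluster := range []string{"powered-off-cluster", "healthy-cluster"} {
		entry := progress[cluster] // nil for the missing cluster

		// Buggy pattern (panics when entry is nil):
		//   entry.FirstCompliantAt = time.Now()
		//
		// Guarded pattern that avoids the nil pointer dereference:
		if entry == nil {
			fmt.Printf("%s: no remediation progress entry, skipping\n", cluster)
			continue
		}
		entry.FirstCompliantAt = time.Now()
		fmt.Printf("%s: FirstCompliantAt recorded\n", cluster)
	}
}
```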
Version-Release number of selected component (if applicable):
Latest brew versions of TALM for 4.16.z, 4.18.z, and 4.19.
How reproducible:
Always; observed in 4.16, 4.18, and 4.19 CI after the change was introduced.
Steps to Reproduce:
1. Create a CGU with a max concurrency of 1 and two clusters, where the first cluster is powered off (an illustrative manifest is sketched after this list).
2. Wait for the second cluster to complete (it never does, even though its policies are all compliant).
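For illustration, a reproduction manifest expressed as a small Go program that prints the CGU; the apiVersion, kind, and spec fields follow the documented ClusterGroupUpgrade CRD, while the metadata name, namespace, cluster names, and policy name are placeholder assumptions:

```go
package main

import "fmt"

// cguManifest follows the documented ClusterGroupUpgrade CRD shape; the
// cluster names, namespace, and policy name are placeholders.
const cguManifest = `apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-nil-deref-repro
  namespace: default
spec:
  clusters:
  - powered-off-cluster   # expected to time out
  - healthy-cluster       # expected to still succeed
  enable: true
  managedPolicies:
  - example-policy
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
`

func main() {
	// Pipe the output to `oc apply -f -` to create the CGU.
	fmt.Print(cguManifest)
}
```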
Actual results:
Neither batch succeeds, and the TALM pod logs show a repeating panic.
Expected results:
The first batch fails due to timeout, but the second batch succeeds.
Additional info:
- is caused by: OCPBUGS-54348 "TALM Soak Annotation evaluates single "FirstCompliantAt" per CGU instead of per Policy" (Closed)
- relates to: OCPBUGS-54978 "CGU says completed and all clusters compliant when first batch times out" (Closed)
- links to: RHEA-2025:145129 "OpenShift Container Platform 4.19.0 CNF vRAN extras update"