Type: Bug
Resolution: Done-Errata
Affects Versions: 4.16.z, 4.18.z, 4.19
Severity: Important
Description of problem:
When applying a CGU with two clusters and a max remediation concurrency of 1, where the first cluster is expected to fail because it is powered off, the second cluster should still succeed. However, it does not: .status.clusters.<should succeed>.currentPolicy is stuck at NonCompliant even though the policy itself is compliant. The TALM pod logs show a repeating panic during reconcile, which appears to be the source of the issue.
The panic comes from a line in the reconcile code: the powered-off cluster has no entry in CurrentBatchRemediationProgress, so the lookup returns nil, and setting the FirstCompliantAt field on that nil entry is a nil pointer dereference.
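A minimal Go sketch of the failure mode and the guard that avoids it; the ClusterRemediationProgress type and field names here are illustrative stand-ins matching the names in this report, not TALM's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// ClusterRemediationProgress is an illustrative stand-in for the per-cluster
// progress record; the real TALM type differs.
type ClusterRemediationProgress struct {
	FirstCompliantAt time.Time
}

func main() {
	// The powered-off cluster never started remediating, so it has no entry
	// in the current batch's progress map.
	progress := map[string]*ClusterRemediationProgress{
		"healthy-cluster": {},
	}

	for _, cluster := range []string{"powered-off-cluster", "healthy-cluster"} {
		entry := progress[cluster] // nil for the missing cluster

		// Buggy pattern (panics when entry is nil):
		//   entry.FirstCompliantAt = time.Now()
		//
		// Guarded pattern that avoids the nil pointer dereference:
		if entry == nil {
			fmt.Printf("%s: no remediation progress entry, skipping\n", cluster)
			continue
		}
		entry.FirstCompliantAt = time.Now()
		fmt.Printf("%s: FirstCompliantAt recorded\n", cluster)
	}
}
```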
Version-Release number of selected component (if applicable):
Latest brew versions of TALM for 4.16.z, 4.18.z, and 4.19.
How reproducible:
Always; observed in 4.16, 4.18, and 4.19 CI after the change was introduced.
Steps to Reproduce:
1. Create a CGU with a max concurrency of 1 and two clusters, where the first cluster is powered off (an illustrative manifest is sketched after this list).
2. Wait for the second cluster to complete (it never does, even though its policies are all compliant).
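For illustration, a reproduction manifest expressed as a small Go program that prints the CGU; the apiVersion, kind, and spec fields follow the documented ClusterGroupUpgrade CRD, while the metadata name, namespace, cluster names, and policy name are placeholder assumptions:

```go
package main

import "fmt"

// cguManifest follows the documented ClusterGroupUpgrade CRD shape; the
// cluster names, namespace, and policy name are placeholders.
const cguManifest = `apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-nil-deref-repro
  namespace: default
spec:
  clusters:
  - powered-off-cluster   # expected to time out
  - healthy-cluster       # expected to still succeed
  enable: true
  managedPolicies:
  - example-policy
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
`

func main() {
	// Pipe the output to `oc apply -f -` to create the CGU.
	fmt.Print(cguManifest)
}
```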
Actual results:
Neither batch succeeds, and the TALM pod logs show a repeating panic.
Expected results:
The first batch fails due to timeout, but the second batch succeeds.
Additional info:
- is caused by: OCPBUGS-54348 "TALM Soak Annotation evaluates single "FirstCompliantAt" per CGU instead of per Policy" (Closed)
- relates to: OCPBUGS-54978 "CGU says completed and all clusters compliant when first batch times out" (Closed)
- links to: RHEA-2025:145129 "OpenShift Container Platform 4.19.0 CNF vRAN extras update"