-
Bug
-
Resolution: Not a Bug
-
Undefined
-
None
-
4.13.z
-
Important
-
No
-
CNF RAN Sprint 238, CNF RAN Sprint 239, CNF RAN Sprint 240
-
3
-
False
-
-
Description of problem:
When multiple clusters are specified in a CGU and one cluster is offline, policies are not remediated on the operational cluster.
Version-Release number of selected component (if applicable):
TALM 4.13.1, TALM 4.12.4
How reproducible:
Always
Steps to Reproduce:
1. Configure a hub cluster with two managed clusters.
2. Create a CGU which:
   - Specifies both clusters.
   - Concurrency = 1
   - Timeout = 9
3. Power-off first cluster in specified list.
4. Enable CGU
Actual results:
CGU times out on both clusters.
Expected results:
CGU times out on first cluster. Second cluster completes successfully before the CGU times out.
Additional info:
Hub logs, cluster config can be found here: https://drive.google.com/drive/folders/1fFIeUO9X6h-o9OTGtAQc87ptT89sAFsh?usp=sharing

This happens consistently in CI automation. Running the same automated test as a one-off outside of CI gives inconsistent results with some passes.

--- Printing CGU spec - talm-test: generated-cgu-multi-spokes-one-unavailable :
backup: false
precaching: false
enable: true
clusters:
- worker-0
- worker-1
clusterselector: []
clusterlabelselectors: []
remediationstrategy:
  canaries: []
  maxconcurrency: 1
  timeout: 9
managedpolicies:
- generated-policy-multi-spokes-one-unavailable
blockingcrs: []
actions:
  beforeenable:
    addclusterlabels: {}
    deleteclusterlabels: {}
  aftercompletion:
    addclusterlabels:
      talmcomplete: ""
    deleteclusterlabels: {}
    deleteobjects: true
batchtimeoutaction: ""

--- Printing CGU status - talm-test: generated-cgu-multi-spokes-one-unavailable :
placementbindings: []
placementrules: []
copiedpolicies: []
conditions:
- type: ClustersSelected
  status: "True"
  observedgeneration: 0
  lasttransitiontime: "2023-06-17T01:45:37-04:00"
  reason: ClusterSelectionCompleted
  message: All selected clusters are valid
- type: Validated
  status: "True"
  observedgeneration: 0
  lasttransitiontime: "2023-06-17T01:45:37-04:00"
  reason: ValidationCompleted
  message: Completed validation
- type: Progressing
  status: "False"
  observedgeneration: 0
  lasttransitiontime: "2023-06-17T01:55:37-04:00"
  reason: TimedOut
  message: Policy remediation took too long
- type: Succeeded
  status: "False"
  observedgeneration: 0
  lasttransitiontime: "2023-06-17T01:55:37-04:00"
  reason: TimedOut
  message: Policy remediation took too long
remediationplan:
- - worker-0
- - worker-1
managedpoliciesns:
  generated-policy-multi-spokes-one-unavailable: talm-test
saferesourcenames: {}
managedpoliciesforupgrade:
- name: generated-policy-multi-spokes-one-unavailable
  namespace: talm-test
managedpoliciescompliantbeforeupgrade: []
managedpoliciescontent: {}
clusters:
- name: worker-0
  state: timedout
  currentpolicy:
    name: generated-policy-multi-spokes-one-unavailable
    status: NonCompliant
- name: worker-1
  state: timedout
  currentpolicy:
    name: generated-policy-multi-spokes-one-unavailable
    status: NonCompliant
status:
  startedat: "2023-06-17T01:45:37-04:00"
  completedat: "2023-06-17T01:55:37-04:00"
  currentbatch: 0
  currentbatchstartedat: "0001-01-01T00:00:00Z"
  currentbatchremediationprogress: {}
precaching: null
backup: null
computedmaxconcurrency: 1
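
For triage, it may help to compare the timestamps in the status dump against the configured timeout. The following is only a rough sketch: it assumes the `timeout` value is expressed in minutes, and the helper names (`elapsed_minutes`, `timed_out_clusters`) are hypothetical and not part of TALM. All values are copied by hand from the spec and status printed above.

```python
from datetime import datetime

# Values copied from the CGU spec and status dumps in this report.
TIMEOUT_MINUTES = 9  # remediationstrategy.timeout (assumed to be minutes)
STARTED_AT = "2023-06-17T01:45:37-04:00"    # status.startedat
COMPLETED_AT = "2023-06-17T01:55:37-04:00"  # status.completedat
CLUSTER_STATES = {"worker-0": "timedout", "worker-1": "timedout"}


def elapsed_minutes(started_at: str, completed_at: str) -> float:
    """Return the wall-clock minutes between two ISO 8601 timestamps."""
    start = datetime.fromisoformat(started_at)
    end = datetime.fromisoformat(completed_at)
    return (end - start).total_seconds() / 60


def timed_out_clusters(states: dict) -> list:
    """Return the sorted names of clusters whose state is 'timedout'."""
    return sorted(name for name, state in states.items() if state == "timedout")


if __name__ == "__main__":
    elapsed = elapsed_minutes(STARTED_AT, COMPLETED_AT)
    print(f"configured timeout: {TIMEOUT_MINUTES} min")
    print(f"elapsed between startedat and completedat: {elapsed} min")
    print(f"clusters in state timedout: {timed_out_clusters(CLUSTER_STATES)}")
```

The sketch only restates what the dump already records: how long the CGU ran relative to its timeout, and which clusters ended in the `timedout` state. It does not model how TALM divides the timeout between batches.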