  1. OpenShift Bugs
  2. OCPBUGS-15217

Multi-Batch CGU Fails to Remediate Policy When One Cluster is Offline


    • Important
    • No
    • CNF RAN Sprint 238, CNF RAN Sprint 239, CNF RAN Sprint 240
    • 3
    • False

      Description of problem:

      When multiple clusters are specified in a CGU and one of them is offline, policies are not remediated on the cluster that is still operational.

      Version-Release number of selected component (if applicable):

      TALM 4.13.1, TALM 4.12.4

      How reproducible:

      Always

      Steps to Reproduce:

      1. Configure a hub cluster with two managed clusters.
      2. Create a CGU which:
         - Specifies both clusters. 
         - Concurrency = 1
         - Timeout = 9
      3. Power off the first cluster in the specified list.
      4. Enable the CGU.
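
The batching implied by step 2 can be sketched as follows. This is a minimal illustration, not TALM's implementation: it assumes clusters are taken in listed order and grouped into batches of at most `maxconcurrency`, with no canaries. The function name `build_remediation_plan` is hypothetical. For the two clusters in this report it yields the same plan that appears under `remediationplan` in the CGU status.

```python
from typing import List


def build_remediation_plan(clusters: List[str], max_concurrency: int) -> List[List[str]]:
    """Split clusters, in listed order, into batches of at most max_concurrency.

    Hypothetical helper for illustration only; canaries are not modelled.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return [
        clusters[i:i + max_concurrency]
        for i in range(0, len(clusters), max_concurrency)
    ]


# Values taken from the CGU spec in this report.
plan = build_remediation_plan(["worker-0", "worker-1"], max_concurrency=1)
print(plan)  # [['worker-0'], ['worker-1']]
```

With a concurrency of 1, each cluster lands in its own batch, so the offline cluster (worker-0) is attempted before the operational one (worker-1).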

      Actual results:

      CGU times out on both clusters.

      Expected results:

      CGU times out on first cluster. Second cluster completes successfully before the CGU times out.
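
The expected outcome can be expressed as a small model. This is a sketch under stated assumptions, not a description of how TALM actually schedules batches: it assumes batches run one after another and that each batch receives an equal share of the overall timeout. The function `simulate_cgu`, the state name `complete`, and the 2-unit remediation time for worker-1 are all hypothetical; only `timedout` and the timeout value of 9 come from this report.

```python
from typing import Dict, List, Optional


def simulate_cgu(
    plan: List[List[str]],
    total_timeout: float,
    remediation_time: Dict[str, Optional[float]],
) -> Dict[str, str]:
    """Return a per-cluster state for a sequential, batch-by-batch rollout.

    remediation_time maps a cluster name to the time it needs to become
    compliant, or None if the cluster is unreachable. All times use the same
    unit as total_timeout.
    """
    # Assumption: the overall timeout is divided evenly across batches.
    batch_budget = total_timeout / len(plan)
    states: Dict[str, str] = {}
    for batch in plan:
        for cluster in batch:
            needed = remediation_time.get(cluster)
            if needed is not None and needed <= batch_budget:
                states[cluster] = "complete"  # hypothetical state name
            else:
                states[cluster] = "timedout"
    return states


states = simulate_cgu(
    plan=[["worker-0"], ["worker-1"]],
    total_timeout=9,
    # worker-0 is powered off; worker-1's remediation time is made up.
    remediation_time={"worker-0": None, "worker-1": 2},
)
print(states)  # {'worker-0': 'timedout', 'worker-1': 'complete'}
```

Under this model, the offline cluster only consumes its own batch's share of the timeout, which leaves time for the second batch. The actual results above show both clusters ending in `timedout` instead.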

      Additional info:

      Hub logs, cluster config can be found here: https://drive.google.com/drive/folders/1fFIeUO9X6h-o9OTGtAQc87ptT89sAFsh?usp=sharing
      
      This happens consistently in CI automation. Running the same automated test as a one-off outside of CI gives inconsistent results with some passes.

      --- Printing CGU spec - talm-test: generated-cgu-multi-spokes-one-unavailable :
      backup: false
      precaching: false
      enable: true
      clusters:
      - worker-0
      - worker-1
      clusterselector: []
      clusterlabelselectors: []
      remediationstrategy:
        canaries: []
        maxconcurrency: 1
        timeout: 9
      managedpolicies:
      - generated-policy-multi-spokes-one-unavailable
      blockingcrs: []
      actions:
        beforeenable:
          addclusterlabels: {}
          deleteclusterlabels: {}
        aftercompletion:
          addclusterlabels:
            talmcomplete: ""
          deleteclusterlabels: {}
          deleteobjects: true
      batchtimeoutaction: ""
      
      --- Printing CGU status - talm-test: generated-cgu-multi-spokes-one-unavailable :
      placementbindings: []
      placementrules: []
      copiedpolicies: []
      conditions:
      - type: ClustersSelected
        status: "True"
        observedgeneration: 0
        lasttransitiontime: "2023-06-17T01:45:37-04:00"
        reason: ClusterSelectionCompleted
        message: All selected clusters are valid
      - type: Validated
        status: "True"
        observedgeneration: 0
        lasttransitiontime: "2023-06-17T01:45:37-04:00"
        reason: ValidationCompleted
        message: Completed validation
      - type: Progressing
        status: "False"
        observedgeneration: 0
        lasttransitiontime: "2023-06-17T01:55:37-04:00"
        reason: TimedOut
        message: Policy remediation took too long
      - type: Succeeded
        status: "False"
        observedgeneration: 0
        lasttransitiontime: "2023-06-17T01:55:37-04:00"
        reason: TimedOut
        message: Policy remediation took too long
      remediationplan:
      - - worker-0
      - - worker-1
      managedpoliciesns:
        generated-policy-multi-spokes-one-unavailable: talm-test
      saferesourcenames: {}
      managedpoliciesforupgrade:
      - name: generated-policy-multi-spokes-one-unavailable
        namespace: talm-test
      managedpoliciescompliantbeforeupgrade: []
      managedpoliciescontent: {}
      clusters:
      - name: worker-0
        state: timedout
        currentpolicy:
          name: generated-policy-multi-spokes-one-unavailable
          status: NonCompliant
      - name: worker-1
        state: timedout
        currentpolicy:
          name: generated-policy-multi-spokes-one-unavailable
          status: NonCompliant
      status:
        startedat: "2023-06-17T01:45:37-04:00"
        completedat: "2023-06-17T01:55:37-04:00"
        currentbatch: 0
        currentbatchstartedat: "0001-01-01T00:00:00Z"
        currentbatchremediationprogress: {}
      precaching: null
      backup: null
      computedmaxconcurrency: 1 
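
A few facts can be read directly off the status dump above. The sketch below copies the relevant values into a Python dictionary (the dictionary layout is an assumption made for illustration, not a TALM API) and computes the elapsed time and the list of timed-out clusters.

```python
from datetime import datetime

# Values copied from the CGU status printed above.
status = {
    "clusters": [
        {"name": "worker-0", "state": "timedout"},
        {"name": "worker-1", "state": "timedout"},
    ],
    "startedat": "2023-06-17T01:45:37-04:00",
    "completedat": "2023-06-17T01:55:37-04:00",
    "currentbatch": 0,
    "currentbatchstartedat": "0001-01-01T00:00:00Z",
}

# Zero-valued timestamp exactly as it is printed in the dump.
ZERO_TIME = "0001-01-01T00:00:00Z"

started = datetime.fromisoformat(status["startedat"])
completed = datetime.fromisoformat(status["completedat"])
elapsed_minutes = (completed - started).total_seconds() / 60

timed_out = [c["name"] for c in status["clusters"] if c["state"] == "timedout"]
batch_start_is_zero = status["currentbatchstartedat"] == ZERO_TIME

print(elapsed_minutes)      # 10.0
print(timed_out)            # ['worker-0', 'worker-1']
print(batch_start_is_zero)  # True
```

The CGU ran for 10 minutes between `startedat` and `completedat`, both clusters ended in `timedout`, and `currentbatchstartedat` holds the zero timestamp with `currentbatch` at 0. Whether the zero timestamp means the second batch never started, or the fields were reset after the timeout, cannot be determined from this dump alone.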

              jche@redhat.com Jun Chen
              josclark@redhat.com Joshua Clark