OpenShift Bugs / OCPBUGS-55041

CGU says completed and all clusters compliant when first batch times out

    • Bug
    • Resolution: Unresolved
    • Undefined
    • 4.17.z, 4.16.z, 4.18.z, 4.19
    • TALM Operator
    • Quality / Stability / Reliability
    • Moderate

      This is a clone of issue OCPBUGS-55036. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-54988. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-54978. The following is the description of the original issue:

      Description of problem:

          Although the panic no longer occurs and the CGU applies policies in batches as expected, the status is now reported differently and in a misleading way. It appears that CurrentBatchRemediationProgress is set to nil when the CGU completes, and the conditions now show Completed instead of TimedOut even when the first batch timed out. The message also states that all clusters are compliant with all policies, even when this is untrue.

      Version-Release number of selected component (if applicable):

          Seen in CI with TALM from brew versions v4.19.0-38, v4.18.1-14, and v4.16.4-9

      How reproducible:

          Always

      Steps to Reproduce:

          1. Create a CGU with a max concurrency of 1 and two clusters, where the first cluster is powered off.
          2. Wait for the first cluster/batch to time out and the second cluster/batch to succeed.
          3. Check the conditions or run oc get on the CGU. It reports Completed with the message "All clusters are compliant with all the managed policies" even though .status.clusters[0].currentPolicy.status is NonCompliant.
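
The check in step 3 can be scripted. The following is a minimal Python sketch, assuming the CGU has been dumped with oc get cgu -n talm-test talm-cgu -o json and parsed into a dict. The helper name find_status_mismatch, the name key on each cluster entry, and the sample cluster names are assumptions for illustration. Only .status.clusters[].currentPolicy.status and the Succeeded condition come from this report.

```python
def find_status_mismatch(cgu: dict) -> list[str]:
    """Return clusters that are NonCompliant although the CGU reports Completed."""
    status = cgu.get("status", {})
    # Conditions are assumed to be a list of dicts carrying "type" and "reason".
    succeeded = next(
        (c for c in status.get("conditions", []) if c.get("type") == "Succeeded"),
        None,
    )
    if succeeded is None or succeeded.get("reason") != "Completed":
        return []  # the CGU does not claim completion, so nothing is misleading
    mismatched = []
    for index, cluster in enumerate(status.get("clusters", [])):
        policy = cluster.get("currentPolicy") or {}
        if policy.get("status") == "NonCompliant":
            # "name" is an assumed key; fall back to the list position.
            mismatched.append(cluster.get("name", f"clusters[{index}]"))
    return mismatched


# Hypothetical sample shaped like the status described in this report.
sample = {
    "status": {
        "conditions": [
            {
                "type": "Succeeded",
                "reason": "Completed",
                "message": "All clusters are compliant with all the managed policies",
            }
        ],
        "clusters": [
            {"name": "spoke1", "currentPolicy": {"status": "NonCompliant"}},
            {"name": "spoke2"},
        ],
    }
}
print(find_status_mismatch(sample))
```

A non-empty result means the Completed status contradicts the per-cluster data, which is the symptom reported here.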

      Actual results:

          Condition type Succeeded has reason Completed

      Expected results:

          Condition type Succeeded has reason TimedOut
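
The expected behavior can be stated as a small rule. This is a hedged sketch of the expectation in this report, not TALM's actual implementation. The function name expected_succeeded_reason and its input (the last known policy status per cluster once the CGU has finished) are assumptions for illustration.

```python
def expected_succeeded_reason(policy_statuses: list[str]) -> str:
    """Reason this report expects on the Succeeded condition of a finished CGU."""
    # Per this report, a cluster left NonCompliant at the end means its batch
    # timed out, so the CGU should not be reported as Completed.
    if any(status == "NonCompliant" for status in policy_statuses):
        return "TimedOut"
    return "Completed"


# First cluster powered off (NonCompliant), second cluster remediated.
print(expected_succeeded_reason(["NonCompliant", "Compliant"]))
```

For the scenario in the reproduction steps, the rule yields TimedOut, whereas the CGU currently reports Completed.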

      Additional info:

      [klaskosk@klaskosk-thinkpadp1gen3 ~]$ KUBECONFIG=~/kniqe16-kubeconfig oc get cgu -n talm-test talm-cgu
      NAME       AGE   STATE       DETAILS
      talm-cgu   11m   Completed   All clusters are compliant with all the managed policies

      The oc get output above shows how misleading the reported status is. The collected logs are also available in a Google Drive folder.

              jche@redhat.com Jun Chen
              openshift-crt-jira-prow OpenShift Prow Bot
              Kirsten Laskoski Kirsten Laskoski