OCPBUGS-1025

[tracker] cluster-monitoring-config race condition between Observability and du profile


Details

    • Proposed
    • False
      3/3: telco reviewed for 4.13
      12/7: ACM story is ready to test, see ACM-1933
      12/1: green for 4.12, the ACM story for 2.6.z will be ready tomorrow (though ACM is asking for help to test it)
      Tracker for ACM stories ACM-1753 (2.5.z) & ACM-1933 (2.6.z).

    Description

      Description of problem:

      While deploying many SNOs, more than half showed the common-config policy as NonCompliant because Observability also modifies the cluster-monitoring-config ConfigMap. Whether the policy ends up Compliant or NonCompliant depends on the order in which the two modify the ConfigMap.
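The order dependence described above can be illustrated with a toy model. This is only a sketch under loud assumptions: it assumes each component overwrites the whole `config.yaml` value and that compliance is a plain equality check against the policy's desired content. The names `WRITES`, `apply_writes`, and `compliance`, and both placeholder strings, are hypothetical and do not come from ACM or the policy controller.

```python
# Toy model of a last-writer-wins race on a single ConfigMap key.
# All names and values are hypothetical placeholders, not real configs.
WRITES = {
    "policy": "policy-desired-config",
    "observability": "observability-rewritten-config",
}


def apply_writes(order):
    """Apply whole-value writes in the given order; the last write wins."""
    configmap = {"config.yaml": ""}
    for writer in order:
        configmap["config.yaml"] = WRITES[writer]
    return configmap


def compliance(configmap):
    """Assumed check: the stored value must equal the policy's desired value."""
    if configmap["config.yaml"] == WRITES["policy"]:
        return "Compliant"
    return "NonCompliant"


for order in (["observability", "policy"], ["policy", "observability"]):
    print(" then ".join(order), "->", compliance(apply_writes(order)))
```

In this model the same two writes yield Compliant when the policy writes last and NonCompliant when Observability writes last, which matches the mixed results seen across the SNOs.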

      Version-Release number of selected component (if applicable):

      HUB OCP 4.11.2
      SNO OCP 4.9.46
      ACM 2.6 RC2 - 2.6.0-DOWNSTREAM-2022-08-26-01-33-09
      Observability enabled

      How reproducible:

      Always

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      # oc get policy -n sno00007
      NAME                                                REMEDIATION ACTION   COMPLIANCE STATE   AGE
      ztp-common.common-config-policy                     inform               NonCompliant       46h
      ztp-common.common-subscriptions-policy              inform               Compliant          46h
      ztp-group-du-sno.du-upgrade-platform-upgrade        inform               Compliant          25h
      ztp-group-du-sno.du-upgrade-platform-upgrade-prep   inform               Compliant          25h
      ztp-group.group-du-sno-config-log-policy            inform               Compliant          46h
      ztp-group.group-du-sno-config-policy                inform               Compliant          46h
      ztp-group.group-du-sno-config-storage-policy        inform               Compliant          46h 

      Expected results:

      All policies to be compliant

      Additional info:

      Originally, openshift-monitoring was thought to be modifying the ConfigMap, and the original bug was opened as OCPBUGS-870: https://issues.redhat.com/browse/OCPBUGS-870

       

      The managedFields in the output below show that endpoint-monitoring-operator modified the ConfigMap last (16:14:53Z, after config-policy-controller at 16:12:13Z), which made the policy fall out of compliance:

      # oc --kubeconfig=/root/hv-vm/sno/manifests/sno00007/kubeconfig get cm -n openshift-monitoring cluster-monitoring-config -o yaml --show-managed-fields=true
      apiVersion: v1
      data:
        config.yaml: |
          alertmanagerMain:
            nodeSelector: null
            resources: null
            tolerations: null
            volumeClaimTemplate: null
          enableUserWorkload: null
          grafana:
            nodeSelector: null
            tolerations: null
          http: null
          k8sPrometheusAdapter: null
          kubeStateMetrics: null
          openshiftStateMetrics: null
          prometheusK8s:
            additionalAlertManagerConfigs:
            - apiVersion: v2
              bearerToken:
                key: token
                name: observability-alertmanager-accessor
              pathPrefix: /
              scheme: https
              staticConfigs:
              - alertmanager-open-cluster-management-observability.apps.bm-stage.rdu2.scalelab.redhat.com
              tlsConfig:
                ServerName: ""
                ca:
                  key: service-ca.crt
                  name: hub-alertmanager-router-ca
                insecureSkipVerify: false
            externalLabels:
              cluster: 3f2759fc-42e3-4851-8099-5f5ad646f171
            logLevel: ""
            nodeSelector: null
            remoteWrite: null
            resources: null
            retention: 24h
            tolerations: null
            volumeClaimTemplate: null
          prometheusOperator: null
          telemeterClient: null
          thanosQuerier: null
      kind: ConfigMap
      metadata:
        creationTimestamp: "2022-09-06T16:12:13Z"
        managedFields:
        - apiVersion: v1
          fieldsType: FieldsV1
          fieldsV1:
            f:data: {}
          manager: config-policy-controller
          operation: Update
          time: "2022-09-06T16:12:13Z"
        - apiVersion: v1
          fieldsType: FieldsV1
          fieldsV1:
            f:data:
              f:config.yaml: {}
          manager: endpoint-monitoring-operator
          operation: Update
          time: "2022-09-06T16:14:53Z"
        name: cluster-monitoring-config
        namespace: openshift-monitoring
        resourceVersion: "21926"
        uid: d2cad672-4f6e-40d2-8d1b-f760668e48bf
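To check many clusters for the same pattern, the last writer can be read from `metadata.managedFields` programmatically. The following is a minimal sketch, assuming the ConfigMap has been fetched as JSON (for example with `-o json --show-managed-fields=true` instead of `-o yaml`). The helper names `parse_time` and `last_writer` are hypothetical, and `SAMPLE` is a trimmed copy of the managedFields shown above.

```python
import json
from datetime import datetime


def parse_time(ts):
    """Parse a Kubernetes UTC timestamp such as 2022-09-06T16:12:13Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def last_writer(configmap):
    """Return (manager, time) of the most recent managedFields entry, or None."""
    entries = configmap.get("metadata", {}).get("managedFields", [])
    if not entries:
        return None
    latest = max(entries, key=lambda entry: parse_time(entry["time"]))
    return latest["manager"], latest["time"]


# Trimmed from the ConfigMap output in this report.
SAMPLE = json.loads("""
{
  "metadata": {
    "name": "cluster-monitoring-config",
    "namespace": "openshift-monitoring",
    "managedFields": [
      {"manager": "config-policy-controller", "operation": "Update",
       "time": "2022-09-06T16:12:13Z"},
      {"manager": "endpoint-monitoring-operator", "operation": "Update",
       "time": "2022-09-06T16:14:53Z"}
    ]
  }
}
""")

print(last_writer(SAMPLE))
# -> ('endpoint-monitoring-operator', '2022-09-06T16:14:53Z')
```

Running this against each SNO's ConfigMap would show which clusters had endpoint-monitoring-operator as the final writer.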
       


            People

              smeduri1@redhat.com Subbarao Meduri
              akrzos@redhat.com Alex Krzos
              Alex Krzos Alex Krzos