OpenShift Bugs / OCPBUGS-65953

Rendered cluster-monitoring-config cm is mangled, causes prometheus-k8s-0 pod restart loop

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.21
    • Component: GitOps ZTP
    • Severity: Moderate

      Description of problem:

      
      With the telco-ran source CR ReduceMonitoringFootprint.yaml applied to a spoke running
      OCP 4.20 via ZTP/TALM 4.21 (IBU target host), the rendered ConfigMap is mangled:
      
      apiVersion: v1
      data:
        config.yaml: "alertmanagerMain:\n  enabled: false\ntelemeterClient:\n  enabled:
          false\nnodeExporter:\n  collectors:\n    buddyinfo: {}\n    cpufreq: {}\n    ksmd:
          {}\n    mountstats: {}\n    netclass: {}\n    netdev: {}\n    processes: {}\n
          \   systemd: {}\n    tcpstat: {}\nprometheusK8s:\n  additionalAlertmanagerConfigs:\n
          \ - apiVersion: v2\n    bearerToken:\n      key: token\n      name: observability-alertmanager-accessor\n
          \   scheme: https\n    staticConfigs:\n    - \n    tlsConfig:\n      ca:\n        key:
          service-ca.crt\n        name: hub-alertmanager-router-ca\n      insecureSkipVerify:
          false\n  externalLabels:\n    managed_cluster: b9ca26b2-55ed-4ec8-826b-3917eb8e24c7\n
          \ retention: 24h\n"
      kind: ConfigMap
      metadata:
        creationTimestamp: "2025-11-24T23:01:14Z"
        name: cluster-monitoring-config
        namespace: openshift-monitoring
        resourceVersion: "12472"
        uid: 0e5629d5-8da9-4c48-b001-75c3ad713bb5
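      
      For reference, the embedded config.yaml can be dumped straight from the live ConfigMap to see
      exactly what the cluster-monitoring-operator receives (a minimal sketch using standard oc
      jsonpath escaping; note the empty staticConfigs list item in the output, which may be what
      breaks the Prometheus configuration):
      
      # On the spoke: print only the embedded monitoring configuration
      $ oc -n openshift-monitoring get configmap cluster-monitoring-config \
          -o jsonpath='{.data.config\.yaml}'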
      
      ######
      
      When this CR is applied on a spoke via ZTP/TALM 4.20, the ConfigMap appears to render correctly,
      and the prometheus-k8s-0 pod is stable.
      apiVersion: v1
      data:
        config.yaml: |
          alertmanagerMain:
            enabled: false
          telemeterClient:
            enabled: false
          prometheusK8s:
            retention: 24h
      kind: ConfigMap
      metadata:
        annotations:
          ran.openshift.io/ztp-deploy-wave: "1"
        creationTimestamp: "2025-11-24T23:19:41Z"
        name: cluster-monitoring-config
        namespace: openshift-monitoring
        resourceVersion: "21867"
        uid: 2c762218-020b-40a1-8916-1225287b96e1
      
      With ZTP 4.21, if the CR is completely overridden in the PolicyGenTemplate (PGT), i.e. like this:
      
          - fileName: ReduceMonitoringFootprint.yaml
            data:
              config.yaml: |
                alertmanagerMain:
                  enabled: false
                telemeterClient:
                  enabled: false
                prometheusK8s:
                  retention: 24h
      
      and the prometheus-k8s-0 pod is deleted after the policy change updates the
      cluster-monitoring-config ConfigMap in openshift-monitoring to match the 4.20 rendering above,
      then the pod starts and remains stable.
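      
      A minimal sketch of that recovery step on the spoke (pod and namespace names as above):
      
      # After the corrected ConfigMap lands, restart the Prometheus pod and watch it settle
      $ oc -n openshift-monitoring delete pod prometheus-k8s-0
      $ oc -n openshift-monitoring get pod prometheus-k8s-0 -w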
      
      
          

      Version-Release number of selected component (if applicable):

      
      In both cases the spoke is deployed with OCP 4.20.
      
      The hubs for the working and non-working spoke environments have these in common:
      $ oc get clusterversions.config.openshift.io 
      NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.21.0-ec.3   True        False         3d6h    Cluster version is 4.21.0-ec.3
      [kni@kni-qe-82 dgonyier]$ source ./csv-list 
      advanced-cluster-management.v2.15.0
      multicluster-engine.v2.10.0
      openshift-gitops-operator.v1.18.1
      packageserver
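      
      (The ./csv-list helper is not shown; assuming it lists installed operator CSVs via OLM, an
      equivalent command would be something like:)
      
      $ oc get csv -A -o custom-columns=NAME:.metadata.name --no-headers | sort -u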
      
      Differences between the hubs:
      working:
      ZTP 4.20
      TALM 4.20.1
      
      non-working:
      ZTP 4.21
      TALM 4.21.0
      
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Deploy a spoke with the RAN DU profile, using the hub CSV versions for the failing case shown above
          2. Review the monitoring policy on the hub and the live CR on the spoke (see the sketch below)
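      
      For step 2, something along these lines works (policy and namespace names are
      environment-specific and shown only as placeholders):
      
      # On the hub: locate the generated monitoring policy and inspect its rendered object definition
      $ oc get policies -A | grep -i monitoring
      $ oc -n <cluster-namespace> get policy <policy-name> -o yaml
      
      # On the spoke: check the live ConfigMap and the Prometheus pod
      $ oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml
      $ oc -n openshift-monitoring get pod prometheus-k8s-0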
          

      Actual results:

      The rendered cluster-monitoring-config ConfigMap on the spoke is mangled and the prometheus-k8s-0 pod goes into a restart loop
          

      Expected results:

      The ConfigMap renders cleanly and the prometheus-k8s-0 pod remains stable
          

      Additional info:

      
      Currently the source CR is identical between main and release-4.20 branches.
      
      I am not sure whether this is due to ZTP 4.21 or TALM 4.21.
      
      Workarounds:
      don't use the ReduceMonitoringFootprint.yaml CR
      OR
      override the `data:` field in the policy template with the working config as described above.
          

              Assignee: Sabbir Hasan (sahasan@redhat.com)
              Reporter: Dwaine Gonyier (rhn-support-dgonyier)