-
Bug
-
Resolution: Done
-
Critical
-
None
-
1
-
False
-
None
-
False
-
Observability Sprint 2023-08
-
Important
-
No
Description of problem:
While deploying 3000+ SNOs with ACM and ZTP, we occasionally found clusters reporting the common-config-policy as noncompliant. The violation is that the cluster-monitoring-config was modified. It appears that OBS modified it by re-rendering the config file through some sort of YAML serializer, which outputs unexpected null values. Because the cluster-monitoring-config is simply a YAML file inserted into a ConfigMap as a single string, ACM policy finds it noncompliant when it compares the modified string against the expected string. Furthermore, it appears that some configuration may have been dropped. The fix should be for OBS not to re-render the config file at all, and to leave it alone when the annotation that prevents OBS from rolling out alerting configuration exists.
Version-Release number of selected component (if applicable):
2.7.0-DOWNSTREAM-2023-01-16-18-27-49
Hub OCP 4.11.19
SNO OCP 4.10.32
How reproducible:
Steps to Reproduce:
- ...
Actual results:
Expected results:
Additional info:
In ACM 2.7 large-scale testing, Run 13, we found that cluster sno02400 showed a modified config file:
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno02400/kubeconfig get cm -n openshift-monitoring cluster-monitoring-config -o yaml
apiVersion: v1
data:
  config.yaml: |
    alertmanagerMain:
      nodeSelector: null
      resources: null
      tolerations: null
      volumeClaimTemplate: null
    enableUserWorkload: null
    grafana:
      nodeSelector: null
      tolerations: null
    http: null
    k8sPrometheusAdapter: null
    kubeStateMetrics: null
    openshiftStateMetrics: null
    prometheusK8s:
      additionalAlertManagerConfigs: null
      externalLabels: null
      logLevel: ""
      nodeSelector: null
      remoteWrite: null
      resources: null
      retention: 24h
      tolerations: null
      volumeClaimTemplate: null
    prometheusOperator: null
    telemeterClient: null
    thanosQuerier: null
kind: ConfigMap
metadata:
  creationTimestamp: "2023-01-18T09:01:55Z"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
  resourceVersion: "65919"
  uid: c210447a-d5bf-40bb-b2af-a3f1f48ed548
Compare this to an unmodified config file:
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00001/kubeconfig get cm -n openshift-monitoring cluster-monitoring-config -o yaml
apiVersion: v1
data:
  config.yaml: |
    grafana:
      enabled: false
    alertmanagerMain:
      enabled: false
    prometheusK8s:
      retention: 24h
kind: ConfigMap
metadata:
  creationTimestamp: "2023-01-18T04:59:53Z"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
  resourceVersion: "61261"
  uid: ad28cf77-5a2e-48e9-a0e7-5c6330415e70
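To make the difference between the two outputs concrete, here is a minimal Python sketch that compares the two config.yaml payloads. It is not part of ACM or OBS; the helper names flatten and set_values are hypothetical, the nesting of the sno02400 payload is assumed from the key order in the output above, and the tiny parser only handles nested mappings with scalar values, which is all these two payloads contain. It shows that the strings differ (which is what the policy check sees), counts the null values that the re-render introduced, and lists the settings that were dropped.

```python
# Payload from sno00001 (unmodified), copied from the output above.
EXPECTED = """\
grafana:
  enabled: false
alertmanagerMain:
  enabled: false
prometheusK8s:
  retention: 24h
"""

# Payload from sno02400 (re-rendered); nesting is an assumption based on key order.
ACTUAL = """\
alertmanagerMain:
  nodeSelector: null
  resources: null
  tolerations: null
  volumeClaimTemplate: null
enableUserWorkload: null
grafana:
  nodeSelector: null
  tolerations: null
http: null
k8sPrometheusAdapter: null
kubeStateMetrics: null
openshiftStateMetrics: null
prometheusK8s:
  additionalAlertManagerConfigs: null
  externalLabels: null
  logLevel: ""
  nodeSelector: null
  remoteWrite: null
  resources: null
  retention: 24h
  tolerations: null
  volumeClaimTemplate: null
prometheusOperator: null
telemeterClient: null
thanosQuerier: null
"""


def flatten(text):
    """Map dotted key paths to scalar values (hypothetical helper).

    Supports only indented nested mappings with scalar leaves.
    """
    result = {}
    stack = []  # (indent, key) for each enclosing mapping
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        key, _, value = line.strip().partition(":")
        # Leave any mapping that is not a parent of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        value = value.strip()
        if value:
            result[".".join([k for _, k in stack] + [key])] = value
        else:
            stack.append((indent, key))
    return result


def set_values(flat):
    """Keep only entries that carry a setting (not null, not an empty string)."""
    return {k: v for k, v in flat.items() if v not in ("null", '""')}


expected = set_values(flatten(EXPECTED))
actual = set_values(flatten(ACTUAL))

# Settings present in the expected config but missing or changed after the re-render.
dropped = {k: v for k, v in expected.items() if actual.get(k) != v}
# Number of null values introduced by the re-render.
null_count = sum(1 for v in flatten(ACTUAL).values() if v == "null")

print("strings identical:", EXPECTED == ACTUAL)
print("null values in re-rendered config:", null_count)
print("dropped settings:", dropped)
```

Under these assumptions, the only setting that survives the re-render is prometheusK8s.retention, while both enabled: false settings are missing, which matches the observation that some configuration may have been dropped.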
- is related to
-
ACM-4538 (ACM 2.6) Configuration for Observability needed
- Closed
-
ACM-5623 (ACM 2.7) Configuration for Observability needed
- Closed
-
ACM-5624 (ACM 2.8) Configuration for Observability needed
- Closed
- relates to
-
OCPBUGS-1025 [tracker]cluster-monitoring-config race condition between Observability and du profile
- ON_QA