Bug
Resolution: Unresolved
4.21
Description of problem:
The existing RAN ReduceMonitoringFootprint source CR causes the informDuValidator.yaml validator policy to fail in MNO deployments.
Version-Release number of selected component (if applicable):
OCP 4.21/ACM 2.15+
How reproducible:
Always
Steps to Reproduce:
1. Deploy a hub cluster without Observability enabled.
2. Deploy an MNO spoke cluster with GitOps ZTP using the RAN RDS source CRs, which include the ReduceMonitoringFootprint CR.
3. Observe the deployment progress.
Actual results:
The spoke deployment fails because the ztp-group.group-du-standard-validator-v4.21-du-policy remains NonCompliant:
[kni@registry.kni-qe-97 ~]$ oc get policy -A
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE
kni-qe-96 ztp-common.common-v4.21-config-policy enforce Compliant 3h40m
kni-qe-96 ztp-common.common-v4.21-subscriptions-policy enforce Compliant 3h40m
kni-qe-96 ztp-group.group-du-standard-v4.21-config-policy enforce Compliant 3h40m
kni-qe-96 ztp-group.group-du-standard-validator-v4.21-du-policy enforce NonCompliant 3h40m
kni-qe-96 ztp-site.kni-qe-96-v4.21-config-policy enforce Compliant 3h40m
[kni@registry.kni-qe-97 ~]$ oc get mcp -A
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-b5a73d79f66578acee5849a74aa265b6 True False False 3 3 3 0 4h12m
worker rendered-worker-ea79df029175635fb9e102030187d7f6 False True True 2 1 1 1 4h12m
[kni@registry.kni-qe-97 ~]$ oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller
Defaulted container "machine-config-controller" out of: machine-config-controller, kube-rbac-proxy
I0210 14:45:42.694076 1 drain_controller.go:162] evicting pod openshift-monitoring/prometheus-k8s-0
E0210 14:45:42.704400 1 drain_controller.go:162] error when evicting pods/"prometheus-k8s-0" -n "openshift-monitoring" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0210 14:45:47.704695 1 drain_controller.go:162] evicting pod openshift-monitoring/prometheus-k8s-0
I0210 14:45:47.704773 1 drain_controller.go:192] node worker-1.kni-qe-96.telcoqe.eng.rdu2.dc.redhat.com: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"prometheus-k8s-0" -n "openshift-monitoring": global timeout reached: 1m30s
E0210 14:45:47.705385 1 event.go:442] "Could not construct reference, will not report event" err="no kind is registered for the type v1.Node in scheme \"github.com/openshift/client-go/machineconfiguration/clientset/versioned/scheme/register.go:15\"" object="&Node{ObjectMeta:{worker-1.kni-qe-96.telcoqe.eng.rdu2.dc.redhat.com 063c387a-2b5c-45c5-a8b9-c6aef36b3ab8 90741 0 2026-02-10 10:52:58 +0000 UTC <nil> <nil> map[beta.kubernetes.io/arch:arm64 beta.kubernetes.io/os:linux kubernetes.io/arch:arm64 kubernetes.io/hostname:worker-1.kni-qe-96.telcoqe.eng.rdu2.dc.redhat.com kubernetes.io/os:linux node-role.kubernetes.io/worker: node.openshift.io/os_id:rhel sriovnetwork.openshift.io/device-plugin:Enabled] map[k8s.ovn.org/host-cidrs:[\"10.6.159.15/24\",\"2620:52:9:169f::1001/64\"] k8s.ovn.org/l3-gateway-config:{\"default\":{\"mode\":\"shared\",\"bridge-id\":\"br-ex\",\"interface-id\":\"br-ex_worker-1.kni-qe-96.telcoqe.eng.rdu2.dc.redhat.com\",\"mac-address\":\"b8:e9:24:80:74:0e\",\"ip-addresses\":[\"10.6.159.15/24\",\"2620:52:9:169f::1001/64\"],\"next-hops\":[\"10.6.159.254\",\"2620:52:9:169f::1\"],\"node-port-enable\":\"true\",\"vlan-id\":\"0\"}} k8s.ovn.org/layer2-topology-version:2.0 k8s.ovn.org/node-chassis-id:58f4f60e-2fb1-4794-9ff6-133f614de79c k8s.ovn.org/node-encap-ips:[\"10.6.159.15\"] k8s.ovn.org
During the node update, the node fails to drain because the prometheus-k8s pod cannot be evicted (PodDisruptionBudget violation). As a result, the worker MachineConfigPool never finishes updating, and the ztp-group.group-du-standard-validator-v4.21-du-policy never becomes compliant.
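The eviction error above is a PodDisruptionBudget violation: the drain cannot proceed while evicting prometheus-k8s-0 would drop the number of ready replicas below the budget. A PDB of roughly this shape blocks the drain (a sketch only; the exact names and values shipped by openshift-monitoring are an assumption, not taken from this cluster):

```yaml
# Hypothetical sketch of the blocking PDB; verify on the spoke with:
#   oc get pdb -n openshift-monitoring
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-k8s
  namespace: openshift-monitoring
spec:
  minAvailable: 1          # eviction is refused if it would leave fewer ready pods
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus
```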
Expected results:
The MNO spoke deployment should succeed with the ReduceMonitoringFootprint CR applied when ACM Observability is not enabled on the hub.
Additional info:
With the following patch, the MNO deployment works as expected:
- path: source-crs/cluster-tuning/monitoring-configuration/ReduceMonitoringFootprint.yaml
  patches:
    - data:
        config.yaml: |
          alertmanagerMain:
            enabled: false
          telemeterClient:
            enabled: false
          prometheusK8s:
            retention: 24h
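For context, the patch replaces data.config.yaml of the ReduceMonitoringFootprint source CR, so the CR applied to the spoke ends up as a cluster-monitoring-config ConfigMap of roughly this shape (the metadata below follows the standard cluster-monitoring-config convention and is an assumption, not copied from the shipped source CR):

```yaml
# Sketch of the rendered CR after patching (metadata assumed)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      enabled: false
    telemeterClient:
      enabled: false
    prometheusK8s:
      retention: 24h
```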
Applying the CR without the patch works as expected in SNO deployments.
Clones:
- OCPBUGS-63008: "Setting mco-disable-alerting to true makes ACM to remove a URL from hub-info secret" (Closed)