-
Bug
-
Resolution: Done-Errata
-
Normal
-
None
-
premerge
-
+
-
No
-
MCO Sprint 250
-
1
-
False
-
-
No Doc Update
-
Release Note Not Required
-
In Progress
Description of problem:
When we remove the coreos-bootimages ConfigMap in a loop to force an error while updating the boot image in the machineset, the error occurs but the MCCBootImageUpdateError alert is not triggered.
Version-Release number of selected component (if applicable):
Pre-merge testing in: https://github.com/openshift/machine-config-operator/pull/4194
How reproducible:
Always
Steps to Reproduce:
1. Without scaling down any operator, just run this command in background or in another shell to remove the coreos-bootimages in case it is recreated by CVO.
   $ watch -n 0.1 oc delete cm coreos-bootimages
2. Patch a machineset
   $ oc -n openshift-machine-api patch machineset.machine $(oc -n openshift-machine-api get machineset.machine -ojsonpath='{.items[0].metadata.name}') --type json -p '[{"op": "add", "path": "/spec/template/spec/providerSpec/value/disks/0/image", "value": "fake-image"}]'
   machineset.machine.openshift.io/sergidor-alarm2-lchlz-worker-a patched
   $ oc get machineset.machine sergidor-alarm2-lchlz-worker-a -o yaml |grep fake
   image: fake-image
3. After more than 30 minutes no alert is raised, but the machineset has not been correctly patched either
   $ date; alerts |grep alertname
   Thu Feb 22 01:40:08 PM UTC 2024
   "alertname": "Watchdog",
   "alertname": "AlertmanagerReceiversNotConfigured",
   "alertname": "TechPreviewNoUpgrade",
   $ oc get machineset.machine sergidor-alarm2-lchlz-worker-a -o yaml |grep fake
   image: fake-image
Actual results:
The machineset is not updated, but the alert is not fired.
Expected results:
Since the machineset could not be correctly patched after 30 minutes, an alert should have been raised.
Additional info:
If we have a look at the MCO controller logs we can see that:
   I0222 13:39:05.426850 1 machine_set_boot_image_controller.go:289] configMap coreos-bootimages added, reconciling all machine sets
   I0222 13:39:05.426907 1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
   I0222 13:39:05.426927 1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
   I0222 13:39:05.426937 1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
   I0222 13:39:05.426940 1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
It seems that every now and then the ConfigMap is recreated by CVO and the failure happens here:
https://github.com/openshift/machine-config-operator/blob/54a4a499c9dc00796a02d524b5b4490dc16856d9/pkg/controller/machine-set-boot-image/machine_set_boot_image_controller.go#L375
   versionHashFromCM, versionHashFound := configMap.Data[ctrlcommon.MCOVersionHashKey]
   if !versionHashFound {
       klog.Infof("failed to find mco version hash in %s configmap, sync will exit to wait for the MCO upgrade to complete", ctrlcommon.BootImagesConfigMapName)
       return nil
   }
It returns nil and resets the metric. Since the metric is intermittently reset, the alert is never fired.
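To illustrate why an intermittent reset keeps the alert from firing, here is a minimal, self-contained simulation. It is not the MCO code and not Prometheus itself: alertFires and simulate are hypothetical helpers, one tick is assumed to be one rule evaluation per minute, and the 30-tick threshold is an assumption taken from the 30 minutes mentioned in the steps above. It only models the usual behaviour of an alert with a hold period, where the pending timer restarts whenever the condition clears.

```go
package main

import "fmt"

// alertFires reports whether an alert with a hold period would fire,
// given the metric value observed at each evaluation tick. forTicks is
// the number of consecutive non-zero evaluations required (assumption:
// one tick per minute, so 30 ticks stands for 30 minutes).
func alertFires(samples []float64, forTicks int) bool {
	pending := 0
	for _, v := range samples {
		if v > 0 {
			pending++
			if pending >= forTicks {
				return true
			}
		} else {
			// Condition cleared: the pending timer starts over.
			pending = 0
		}
	}
	return false
}

// simulate builds a hypothetical metric series. Every sync is treated as
// a failure (metric = 1), except on ticks divisible by resetEvery, which
// stand for the early "return nil" path that resets the metric to 0.
// resetEvery == 0 means the metric is never reset.
func simulate(ticks, resetEvery int) []float64 {
	samples := make([]float64, 0, ticks)
	for i := 1; i <= ticks; i++ {
		if resetEvery > 0 && i%resetEvery == 0 {
			samples = append(samples, 0)
		} else {
			samples = append(samples, 1)
		}
	}
	return samples
}

func main() {
	// Steady failure: the condition holds for 30 consecutive ticks.
	fmt.Println(alertFires(simulate(60, 0), 30)) // true
	// Reset every 10 ticks: the condition never holds for more than 9.
	fmt.Println(alertFires(simulate(60, 10), 30)) // false
}
```

Under these assumptions, a single reset inside each 30-minute window is enough to keep the alert pending forever, which matches the behaviour observed in this report.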
- links to
-
RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update