[OCPBUGS-29857] MCCBootImageUpdateError alert is not fired in some scenarios

Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: None
Affects Version/s: premerge
Component/s: Machine Config Operator
Labels:
- mco-triaged
- pre-merge-tested

Regression:
No
Sprint:
MCO Sprint 250
sprint_count:
1
Blocked:
False
Blocked Reason:
None
Release Note Text:
No Doc Update
Release Note Type:
Release Note Not Required
Release Note Status:
In Progress
Target Version:
4.16.0

Description of problem:

When we remove the coreos-bootimages configmap in a loop to force an error while updating the boot image in the machineset, the error occurs but the MCCBootImageUpdateError alert is not triggered.

Version-Release number of selected component (if applicable):

Pre-merge testing in:
https://github.com/openshift/machine-config-operator/pull/4194

How reproducible:

Always

Steps to Reproduce:

    1. Without scaling down any operator, run this command in the background or in another shell so that the coreos-bootimages configmap is removed again whenever CVO recreates it.
$ watch -n 0.1 oc delete cm coreos-bootimages


    2. Patch a machineset
$ oc -n openshift-machine-api patch machineset.machine $(oc -n openshift-machine-api get machineset.machine -ojsonpath='{.items[0].metadata.name}') --type json -p '[{"op": "add", "path": "/spec/template/spec/providerSpec/value/disks/0/image", "value": "fake-image"}]'
machineset.machine.openshift.io/sergidor-alarm2-lchlz-worker-a patched


$ oc get machineset.machine sergidor-alarm2-lchlz-worker-a -o yaml |grep fake
            image: fake-image

    3. After more than 30 minutes no alert is raised, even though the machineset still has not been updated with the correct boot image
$ date; alerts   |grep alertname
Thu Feb 22 01:40:08 PM UTC 2024
        "alertname": "Watchdog",
        "alertname": "AlertmanagerReceiversNotConfigured",
        "alertname": "TechPreviewNoUpgrade",

$ oc get machineset.machine sergidor-alarm2-lchlz-worker-a -o yaml |grep fake
            image: fake-image

Actual results:

The machineset is not updated, but the alert is not fired.

Expected results:

Since the machineset could not be correctly patched after 30 minutes, an alert should have been raised.

Additional info:

The MCO controller logs show the following:

I0222 13:39:05.426850       1 machine_set_boot_image_controller.go:289] configMap coreos-bootimages added, reconciling all machine sets
I0222 13:39:05.426907       1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
I0222 13:39:05.426927       1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
I0222 13:39:05.426937       1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
I0222 13:39:05.426940       1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete

It seems that every now and then the configmap is recreated by CVO and the failure happens here:

https://github.com/openshift/machine-config-operator/blob/54a4a499c9dc00796a02d524b5b4490dc16856d9/pkg/controller/machine-set-boot-image/machine_set_boot_image_controller.go#L375

	versionHashFromCM, versionHashFound := configMap.Data[ctrlcommon.MCOVersionHashKey]
	if !versionHashFound {
		klog.Infof("failed to find mco version hash in %s configmap, sync will exit to wait for the MCO upgrade to complete", ctrlcommon.BootImagesConfigMapName)
		return nil
	}

The sync returns nil at this point, which resets the metric. Since the metric is intermittently reset, the alert never fires.
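
To illustrate why an intermittent reset can keep the alert from firing, here is a minimal Go sketch. It is not MCO code: `alertFires`, the sample timelines, and the tick counts are hypothetical. It assumes the alert only fires once the error metric has stayed set for a number of consecutive evaluations (standing in for the 30-minute window above), and that every sync returning nil clears the metric.

```go
package main

import "fmt"

// alertFires reports whether an alert that requires forTicks consecutive
// failing evaluations would ever fire. Each entry in syncFailed models one
// sync: true means the sync returned an error (metric set), false means it
// returned nil (metric reset). This is a simplified model, not the real rule.
func alertFires(syncFailed []bool, forTicks int) bool {
	streak := 0
	for _, failed := range syncFailed {
		if failed {
			streak++
		} else {
			// A nil return resets the metric, so the pending window restarts.
			streak = 0
		}
		if streak >= forTicks {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical timeline: every fourth sync hits the early "return nil".
	intermittent := []bool{
		true, true, true, false,
		true, true, true, false,
		true, true, true, false,
	}
	// Hypothetical timeline: the error is reported on every sync.
	steady := []bool{true, true, true, true, true, true}

	fmt.Println(alertFires(intermittent, 5)) // prints "false"
	fmt.Println(alertFires(steady, 5))       // prints "true"
}
```

In this model the failing streak never exceeds three ticks in the intermittent case, so a five-tick window is never satisfied, which matches the behaviour observed in this report.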

relates to

MCO-994 Update Boot Images for GCP GA (Release Pending)

MCO-1039 Pre-merge Testing (Closed)

links to

https://github.com/openshift/machine-config-operator/pull/4194

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update

Assignee: David Joshy

Reporter: Sergio Regidor de la Rosa

QA Contact: Sergio Regidor de la Rosa

Votes: 0

Watchers: 3

Created: 2024/02/22 4:24 PM

Updated: 2024/06/27 11:39 AM

Resolved: 2024/06/27 11:39 AM
