OpenShift Bugs: OCPBUGS-29857

MCCBootImageUpdateError alert is not fired in some scenarios

      Description of problem:

      
      When we remove the coreos-bootimages ConfigMap in a loop to force an error while updating the boot image in the machineset, the error happens but the MCCBootImageUpdateError alert is not triggered.
      
      
          

      Version-Release number of selected component (if applicable):

      Pre-merge testing in:
      https://github.com/openshift/machine-config-operator/pull/4194
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Without scaling down any operator, run this command in the background or in another shell so that the coreos-bootimages ConfigMap is removed again whenever the CVO recreates it.
      $ watch -n 0.1 oc delete cm coreos-bootimages
      
      
          2. Patch a machineset
      $ oc -n openshift-machine-api patch machineset.machine $(oc -n openshift-machine-api get machineset.machine -ojsonpath='{.items[0].metadata.name}') --type json -p '[{"op": "add", "path": "/spec/template/spec/providerSpec/value/disks/0/image", "value": "fake-image"}]'
      machineset.machine.openshift.io/sergidor-alarm2-lchlz-worker-a patched
      
      
      $ oc get machineset.machine sergidor-alarm2-lchlz-worker-a -o yaml |grep fake
                  image: fake-image
      
          3. After more than 30 minutes no alert is raised, even though the machineset has not been updated with the correct boot image (it still contains the fake image)
      $ date; alerts   |grep alertname
      Thu Feb 22 01:40:08 PM UTC 2024
              "alertname": "Watchdog",
              "alertname": "AlertmanagerReceiversNotConfigured",
              "alertname": "TechPreviewNoUpgrade",
      
      $ oc get machineset.machine sergidor-alarm2-lchlz-worker-a -o yaml |grep fake
                  image: fake-image
          

      Actual results:

      The machineset is not updated, but the alert is not fired.
      
          

      Expected results:

      Since the machineset could not be correctly patched after 30 minutes, an alert should have been raised.
      
      
          

      Additional info:

      If we look at the MCO controller logs, we can see the following:
      
      I0222 13:39:05.426850       1 machine_set_boot_image_controller.go:289] configMap coreos-bootimages added, reconciling all machine sets
      I0222 13:39:05.426907       1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
      I0222 13:39:05.426927       1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
      I0222 13:39:05.426937       1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
      I0222 13:39:05.426940       1 machine_set_boot_image_controller.go:375] failed to find mco version hash in coreos-bootimages configmap, sync will exit to wait for the MCO upgrade to complete
      
      It seems that every now and then the ConfigMap is recreated by the CVO, and the sync then exits here:
      
      https://github.com/openshift/machine-config-operator/blob/54a4a499c9dc00796a02d524b5b4490dc16856d9/pkg/controller/machine-set-boot-image/machine_set_boot_image_controller.go#L375
      
      	versionHashFromCM, versionHashFound := configMap.Data[ctrlcommon.MCOVersionHashKey]
      	if !versionHashFound {
      		klog.Infof("failed to find mco version hash in %s configmap, sync will exit to wait for the MCO upgrade to complete", ctrlcommon.BootImagesConfigMapName)
      		return nil
      	}
      
      Because the sync returns nil in this case, it is treated as a success and the error metric is reset. Since the metric is reset intermittently, the alert never fires.
      
          

              djoshy David Joshy
              sregidor@redhat.com Sergio Regidor de la Rosa
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa