OpenShift Bugs · OCPBUGS-57796

Cluster manages bootimages despite explicit bootimages in installconfig


    • Quality / Stability / Reliability
    • Moderate
    • Done
    • Bug Fix
      Previously, if a user specified a custom boot image for Amazon Web Services (AWS) or Google Cloud Platform (GCP), the Machine Config Operator (MCO) would overwrite it with the default managed image during installation. With this release, manifest generation was added for the MCO configuration that disables default boot image management during installation if a custom image is specified. (link:https://issues.redhat.com/browse/OCPBUGS-57796[OCPBUGS-57796])

      This is a clone of issue OCPBUGS-57348. The following is the description of the original issue:

      Description of problem

      OpenShift 4.19 introduced default boot-image management for AWS and GCP. The problem is that if users set a custom boot image, such as a marketplace image or an otherwise customized image, the MCO will overwrite the user-specified image with the managed one at some point during the install phase (see the Additional info section below for more about timing).

      Losing the image would be a problem for marketplace images, and custom images may carry required assets, such as CAs, whose loss could result in failures.

      Version-Release number of selected component (if applicable):

      4.19

      How reproducible

      The MCO's boot image updates to MachineSets are always reproducible.

      Whether initial compute Machines are impacted depends on a race between the MCO updating the MachineSet's boot images and the Machine API using the MachineSet to create any initial compute Machines, as described in the Additional info section below.

      Steps to reproduce

      1. Set a custom boot image in either the default or compute machine pool (control-plane boot image customization is safe until ControlPlaneMachineSet boot image management is delivered via MCO-1007).
        1. aws: platform.aws.amiID
        2. gcp: platform.gcp.osImage
      2. Perform the installation.
      3. Check the MachineSet for an updated boot image.
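      For reference, a custom AWS boot image in a compute machine pool might be declared as follows. This is a hypothetical install-config.yaml fragment, and the AMI ID is a placeholder, not a real image:

```yaml
# Hypothetical install-config.yaml fragment. Only platform.aws.amiID is
# relevant here; the AMI ID is a placeholder.
compute:
- name: worker
  replicas: 3
  platform:
    aws:
      amiID: ami-0123456789abcdef0
```

      For GCP, the analogous machine-pool field is platform.gcp.osImage, per the steps above.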

      On GCP clusters, checking MachineSet boot images looks like:

      $ oc -n openshift-machine-api get -o jsonpath='{range .items[*]}{range .spec.template.spec.providerSpec.value.disks[*]}{.image}{"\n"}{end}{end}' machinesets.machine.openshift.io | sort | uniq -c
      

      On AWS clusters, checking MachineSet boot images looks like:

      $ oc -n openshift-machine-api get -o jsonpath='{range .items[*]}{.spec.template.spec.providerSpec.value.ami}{"\n"}{end}' machinesets.machine.openshift.io | sort | uniq -c
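      When only saved artifacts are available rather than a live cluster, the same extraction can be sketched with jq against a dumped machinesets.json. The inlined sample data and AMI ID below are hypothetical; the field path matches the AWS providerSpec layout queried above.

```shell
# Offline sketch of the MachineSet boot-image check, using jq on a saved
# machinesets.json. The inlined sample and its AMI ID are hypothetical.
cat > /tmp/machinesets.json <<'EOF'
{"items":[
  {"spec":{"template":{"spec":{"providerSpec":{"value":{"ami":{"id":"ami-0123456789abcdef0"}}}}}}},
  {"spec":{"template":{"spec":{"providerSpec":{"value":{"ami":{"id":"ami-0123456789abcdef0"}}}}}}}
]}
EOF
# Count distinct boot images across all MachineSets; an unexpected value
# here indicates the MCO rewrote a boot image.
jq -r '.items[].spec.template.spec.providerSpec.value.ami.id' /tmp/machinesets.json | sort | uniq -c
```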
      

      Actual results

      custom boot image is overwritten in machinesets

      Expected results

      custom boot image is maintained in machineset

      Additional info

      rhn-support-sdodson pointed out that a minimum boot image will be enforced in the cluster (RFE-6216), so if we decide that users are on the hook for managing boot images when they specify custom ones, we will need to document that they must keep those images updated. But this issue will need to be sorted out regardless of installer behavior.

      As an example of the MCO-boot-image-update vs. Machine API Machine-creation race, https://amd64.ocp.releases.ci.openshift.org/ > 4.19.0-0.nightly-2025-06-06-163527 > rosa-classic-sts-conformance > Artifacts > ... > gather-extra artifacts and must-gather artifacts:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1931028069485645824/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/configmaps.json | jq -r '.items[] | select(.metadata | .namespace == "kube-system" and .name == "cluster-config-v1").data["install-config"]' | yaml2json | jq -r '.compute[].platform.aws.amiID'
      ami-0e97cca5690da89da
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1931028069485645824/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/machinesets.json | jq -r '.items[] | select(.metadata.generation > 1) | .metadata.name + " " + (.metadata.generation | tostring) + " " + .spec.template.spec.providerSpec.value.ami.id'
      ci-rosa-s-vpvk-pwmbq-worker-us-west-2a 2 ami-0b29d41f2ed6b8c94
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1931028069485645824/artifacts/e2e-rosa-sts-ovn/gather-must-gather/artifacts/must-gather.tar | tar -xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a1ad408471607f7b402d6e1be8b4606a4605f408ae9ecb763e7d8adc776e46d1/namespaces/openshift-machine-api/machine.openshift.io/machinesets/ci-rosa-s-vpvk-pwmbq-worker-us-west-2a.yaml | yaml2json | jq -r '[.metadata.managedFields[] | select(.subresource == null) | .time + " " + .operation + " " + .manager + " " + (.fieldsV1 | tostring[:100])] | sort[]'
      2025-06-06T17:36:28Z Update cluster-bootstrap {"f:metadata":{"f:labels":{".":{},"f:hive.openshift.io/machine-pool":{},"f:hive.openshift.io/managed
      2025-06-06T17:41:46Z Update machine-config-controller {"f:spec":{"f:template":{"f:spec":{"f:providerSpec":{"f:value":{"f:ami":{"f:id":{}}}}}}}}
      2025-06-06T17:41:59Z Update machine-controller-manager {"f:metadata":{"f:annotations":{".":{},"f:capacity.cluster-autoscaler.kubernetes.io/labels":{},"f:ma
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1931028069485645824/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/machines.json | jq -r '.items[] | (.metadata | .creationTimestamp + " " + .name) + " " + .spec.providerSpec.value.ami.id' | sort | grep worker
      2025-06-06T17:41:50Z ci-rosa-s-vpvk-pwmbq-worker-us-west-2a-95rxf ami-0b29d41f2ed6b8c94
      2025-06-06T17:41:50Z ci-rosa-s-vpvk-pwmbq-worker-us-west-2a-gb78t ami-0b29d41f2ed6b8c94
      

      So:

      • The install-config requested ami-0e97cca5690da89da.
      • At 17:36:28, cluster-bootstrap pushes the installer-created MachineSet into the cluster.
      • At 17:41:46, the MCO updates the MachineSet's boot image to the stock OCP AMI for that region: ami-0b29d41f2ed6b8c94.
      • At 17:41:50, the Machine API creates the first compute Machines using the stock ami-0b29d41f2ed6b8c94.

      It seems like this race could easily go the other way, and the initial compute could have come up under the install-config-preferred boot image. But the check for "have we fixed the bug?" shouldn't hinge on the boot image used for the compute Machines, it should look at the MachineSet to see "did the MCO clobber the MachineSet boot image at all?", as described in the Expected results section.
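      That MachineSet-level check can be sketched as a tiny script: compare the AMI requested in the install-config with the AMI currently in the MachineSet, and flag any clobbering. The two values are inlined here from the example above; on a live cluster they would come from the cluster-config-v1 ConfigMap and the MachineSet, as in the earlier commands.

```shell
# Sketch of the "did the MCO clobber the MachineSet boot image?" check.
# Values are inlined from the example above; on a real cluster, extract
# them with the oc/jq commands shown earlier.
requested='ami-0e97cca5690da89da'   # install-config compute[].platform.aws.amiID
actual='ami-0b29d41f2ed6b8c94'      # MachineSet providerSpec.value.ami.id
if [ "$requested" = "$actual" ]; then
  echo "preserved: $actual"
else
  echo "clobbered: requested $requested but MachineSet has $actual"
fi
```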

              Patrick Dillon (padillon)
              OpenShift Prow Bot (openshift-crt-jira-prow)
              Gaoyun Pei