OpenShift Bugs / OCPBUGS-54266

Unexpected on-disk state - osimageurl mismatch

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Fix Version/s: None
    • Affects Version/s: 4.14.z
    • Component/s: RHCOS

      Description of problem:

      After updating a NodePool with a new ConfigMap, the NodePool is stuck in a ReconcileError loop.
          

      Version-Release number of selected component (if applicable):

      4.14.z (machine-config-daemon v4.14.0-202407100410.p0.gea6671e, per the daemon logs below)

      How reproducible:

      This is the first time I've run into this issue. I've updated the NodePool with ConfigMaps in the past, but this is the first time I've seen it degrade the node.
          

      Steps to Reproduce:

          1. Create a ConfigMap wrapping the MachineConfig (a sketch of the shape follows below)
          2. Update the NodePool to reference the newly created ConfigMap
          3. Observe the NodePool report ReconcileError and the node go Degraded
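      For reference, a minimal sketch of what steps 1-2 look like; the ConfigMap name, namespace, and MachineConfig contents here are hypothetical, not the exact ones used:

      {code:none}
      # Hypothetical ConfigMap wrapping a MachineConfig. HyperShift expects the
      # MachineConfig YAML under the "config" key, in the management-cluster
      # namespace that holds the HostedCluster resources.
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: custom-machineconfig      # hypothetical name
        namespace: clusters             # hypothetical namespace
      data:
        config: |
          apiVersion: machineconfiguration.openshift.io/v1
          kind: MachineConfig
          metadata:
            name: 99-worker-custom
            labels:
              machineconfiguration.openshift.io/role: worker
          spec:
            config:
              ignition:
                version: 3.2.0
      ---
      # The NodePool then references the ConfigMap by name:
      apiVersion: hypershift.openshift.io/v1beta1
      kind: NodePool
      spec:
        config:
          - name: custom-machineconfig
      {code}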

      Actual results:

      The NodePool returns the following:
      {code:java}
      Warning  ReconcileError  53m (x16 over 35d)  nodepool-controller  failed to get config: failed to check if we manage haproxy ignition config: failed to look up image metadata for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa: failed to obtain image configuration for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa: cannot retrieve image configuration for manifest sha256:bc428181ba542fba9d93aaf3ff362cf64c0dc7f45bc566d79d7b084ab04614d7 in manifest list sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa: received unexpected HTTP status: 502 Bad Gateway
      {code}
      
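      The 502 can be checked independently of the controller; a minimal sketch, assuming skopeo (or oc) and a pull secret are available on a host with the same network path to the registry:

      {code:none}
      # Inspect the manifest list directly; a repeatable 502 here points at the
      # registry/proxy path rather than at the nodepool-controller itself.
      skopeo inspect --authfile ./pull-secret.json \
        docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa

      # Equivalent check with oc, resolving one manifest from the list:
      oc image info --filter-by-os=linux/amd64 -a ./pull-secret.json \
        quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa
      {code}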

      The hosted-cluster-config-operator returns the following:

      {"level":"info","ts":"2025-03-26T15:37:26Z","msg":"discovered mco image","controller":"inplaceupgrader","object":{"name":"nodepool-prod-stc-infosec-dc-rdu2-1","namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2"},"namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2","name":"nodepool-prod-stc-infosec-dc-rdu2-1","reconcileID":"6e6aaa5c-46eb-4174-8925-9832200e731b","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e93f4635bf0ffa1796b375bab78ac2644c6a07d001ce8441a6294dfc6fe3488"}
      {"level":"info","ts":"2025-03-26T15:37:26Z","msg":"Reconciled MachineSet","controller":"inplaceupgrader","object":{"name":"nodepool-prod-stc-infosec-dc-rdu2-1","namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2"},"namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2","name":"nodepool-prod-stc-infosec-dc-rdu2-1","reconcileID":"6e6aaa5c-46eb-4174-8925-9832200e731b","result":"unchanged"}
      {"level":"error","ts":"2025-03-26T15:37:26Z","msg":"Reconciler error","controller":"inplaceupgrader","object":{"name":"nodepool-prod-stc-infosec-dc-rdu2-1","namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2"},"namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2","name":"nodepool-prod-stc-infosec-dc-rdu2-1","reconcileID":"6e6aaa5c-46eb-4174-8925-9832200e731b","error":"degraded node found, cannot progress in-place upgrade. Degraded reason: disk validation failed: expected target osImageURL \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8\", have \"\" (\"b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f\")","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/...
      

      The machine-config-daemon is reporting the following:

      > oc logs -f machine-config-daemon-hosted-worker-024.nodes.prod.psi.rdu2.redhat.com
      I0326 15:38:06.332528    5640 start.go:61] Version: v4.14.0-202407100410.p0.gea6671e.assembly.stream.el8-dirty (ea6671e27ec6683aa986b7072b022f0393a0e15a)
      I0326 15:38:06.332568    5640 update.go:1995] Running: mount --rbind /run/secrets /rootfs/run/secrets
      I0326 15:38:06.338393    5640 update.go:1995] Running: mount --rbind /usr/bin /rootfs/run/machine-config-daemon-bin
      I0326 15:38:06.340814    5640 daemon.go:457] container is rhel8, target is rhel9
      I0326 15:38:06.500197    5640 daemon.go:525] Invoking re-exec /run/bin/machine-config-daemon
      I0326 15:38:06.534097    5640 start.go:61] Version: v4.14.0-202407100410.p0.gea6671e.assembly.stream.el8-dirty (ea6671e27ec6683aa986b7072b022f0393a0e15a)
      I0326 15:38:06.534299    5640 update.go:1995] Running: systemctl daemon-reload
      I0326 15:38:06.845491    5640 rpm-ostree.go:88] Enabled workaround for bug 2111817
      E0326 15:38:06.845593    5640 rpm-ostree.go:284] Merged secret file could not be validated; defaulting to cluster pull secret <nil>
      I0326 15:38:06.845615    5640 rpm-ostree.go:262] Linking ostree authfile to /var/lib/kubelet/config.json
      I0326 15:38:07.089993    5640 daemon.go:270] Booted osImageURL:  (414.92.202305090606-0) b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f
      I0326 15:38:07.091273    5640 start.go:124] overriding kubernetes api to https://api.prod-stc-infosec-dc-rdu2.ocp-hub.prod.psi.redhat.com:6443
      W0326 15:38:07.112659    5640 controller_context.go:111] unable to get owner reference (falling back to namespace): unable to setup event recorder as "POD_NAME" env variable is not set and there are no pods
      I0326 15:38:07.112873    5640 simple_featuregate_reader.go:171] Starting feature-gate-detector
      I0326 15:38:07.124098    5640 writer.go:88] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig
      I0326 15:38:07.124210    5640 daemon.go:777] Starting MachineConfigDaemon - Hypershift
      I0326 15:38:08.171605    5640 daemon.go:890] Successfully read current/desired Config
      I0326 15:38:08.208520    5640 daemon.go:899] No diff detected. Assuming a previous update was completed. Checking on-disk state.
      E0326 15:38:09.315855    5640 writer.go:226] Marking Degraded due to: disk validation failed: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8", have "" ("b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f")
      I0326 15:38:11.225232    5640 daemon.go:890] Successfully read current/desired Config
      I0326 15:38:11.260954    5640 daemon.go:899] No diff detected. Assuming a previous update was completed. Checking on-disk state.
      E0326 15:38:12.713085    5640 writer.go:226] Marking Degraded due to: disk validation failed: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8", have "" ("b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f")
      I0326 15:38:28.768536    5640 daemon.go:890] Successfully read current/desired Config
      I0326 15:38:28.803236    5640 daemon.go:899] No diff detected. Assuming a previous update was completed. Checking on-disk state.
      E0326 15:38:29.555588    5640 writer.go:226] Marking Degraded due to: disk validation failed: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8", have "" ("b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f")
      I0326 15:39:29.609959    5640 daemon.go:890] Successfully read current/desired Config
      I0326 15:39:29.646590    5640 daemon.go:899] No diff detected. Assuming a previous update was completed. Checking on-disk state.
      E0326 15:39:30.894835    5640 writer.go:226] Marking Degraded due to: disk validation failed: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8", have "" ("b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f")
      
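      Note that the "Booted osImageURL" line above is empty: the node is booted into the plain ostree commit b2b90997... with no container-image origin, while the desired config expects the sha256:6a255bb9... image, hence the "expected ... have \"\"" mismatch. A minimal sketch of how to see what the daemon compares against, from a debug shell on the node (field name as emitted by recent rpm-ostree releases):

      {code:none}
      # Empty/null for a deployment that is not backed by a container image,
      # i.e. the attempted rebase never became the booted OS.
      rpm-ostree status --json | jq -r '.deployments[] | select(.booted) | ."container-image-reference"'
      {code}
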
      Expected results:

      I expect the NodePool not to be degraded and the updated MachineConfig to roll out to all nodes.
          

      Additional info:

      I went through https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit?tab=t.0#heading=h.kcqrbb1p1rpj and applied the force file, which rebooted the node, to no avail.
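
      (The force file referenced there is the machine-config-daemon's on-disk validation bypass; what was applied amounts to the following, from a debug shell on the node:)

      {code:none}
      # Tells the machine-config-daemon to skip on-disk state validation and
      # re-apply the desired config; the daemon reboots the node afterwards.
      touch /run/machine-config-daemon-force
      {code}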

      The following was captured after applying the force file, once the node had auto-rebooted:

      sh-5.1# rpm-ostree status
      State: idle
      Warning: failed to finalize previous deployment
               error: failed to run semodule: Child process exited with code 127
               check `journalctl -b -1 -u ostree-finalize-staged.service`
      Deployments:
      * b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f
          Version: 414.92.202305090606-0 (2023-05-09T06:09:26Z)
      sh-5.1# journalctl -b -1 -u ostree-finalize-staged.service
      Mar 26 15:32:14 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Finished OSTree Finalize Staged Deployment.
      Mar 26 15:33:02 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Stopping OSTree Finalize Staged Deployment...
      Mar 26 15:33:02 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899516]: Finalizing staged deployment
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899516]: Copying /etc changes: 19 modified, 0 removed, 578 added
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899516]: Copying /etc changes: 19 modified, 0 removed, 578 added
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899524]: semodule: error while loading shared libraries: /lib64/libsemanage.so.2: invalid ELF header
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899516]: error: failed to run semodule: Child process exited with code 127
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: ostree-finalize-staged.service: Control process exited, code=exited, status=1/FAILURE
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: ostree-finalize-staged.service: Failed with result 'exit-code'.
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Stopped OSTree Finalize Staged Deployment.
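
      The semodule failure above looks like the proximate error: during finalization, /lib64/libsemanage.so.2 (as resolved inside the staged deployment) is not a valid ELF object, so the SELinux policy rebuild fails and the staged deployment is discarded, leaving the node on the old commit. One hedged corruption check, using the deployment path from the Rebase journal further down (the directory may already have been garbage-collected after the failed finalization):

      {code:none}
      # An intact shared object starts with the ELF magic 7f 45 4c 46 ("\x7fELF").
      file /ostree/deploy/rhcos/deploy/27551d8ffb2ffbc3af714f5e8cdada699c02b7ae6b3f7fe3a5a16024a43aab2e.4/usr/lib64/libsemanage.so.2
      head -c 4 /ostree/deploy/rhcos/deploy/27551d8ffb2ffbc3af714f5e8cdada699c02b7ae6b3f7fe3a5a16024a43aab2e.4/usr/lib64/libsemanage.so.2 | od -An -tx1
      {code}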
      
      sh-5.1# journalctl -b -1 -u rpm-ostreed.service
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Starting rpm-ostree System Management Daemon...
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: Reading config file '/etc/rpm-ostreed.conf'
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: In idle state; will auto-exit in 64 seconds
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Started rpm-ostree System Management Daemon.
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: client(id:cli dbus:1.2846 unit:crio-38580971eeeb66b069f7936bb7f9738476da228cd5a3c41928c065745f774134.scope uid:0) added; new total=1
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: client(id:cli dbus:1.2846 unit:crio-38580971eeeb66b069f7936bb7f9738476da228cd5a3c41928c065745f774134.scope uid:0) vanished; remaining=0
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: In idle state; will auto-exit in 64 seconds
      Mar 26 13:52:25 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: rpm-ostreed.service: Deactivated successfully.
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Starting rpm-ostree System Management Daemon...
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: Reading config file '/etc/rpm-ostreed.conf'
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: In idle state; will auto-exit in 63 seconds
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Started rpm-ostree System Management Daemon.
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: client(id:cli dbus:1.3004 unit:crio-39b8158b8fae369b8e6ec77a002294da372e365e88f4a808d9f831e2748f4efc.scope uid:0) added; new total=1
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: client(id:cli dbus:1.3004 unit:crio-39b8158b8fae369b8e6ec77a002294da372e365e88f4a808d9f831e2748f4efc.scope uid:0) vanished; remaining=0
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: In idle state; will auto-exit in 64 seconds
      Mar 26 15:11:37 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: rpm-ostreed.service: Deactivated successfully.
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Starting rpm-ostree System Management Daemon...
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Reading config file '/etc/rpm-ostreed.conf'
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: In idle state; will auto-exit in 61 seconds
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Started rpm-ostree System Management Daemon.
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:machine-config-operator dbus:1.3048 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) added; new total=1
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Loaded sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Locked sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Initiated txn Cleanup for client(id:machine-config-operator dbus:1.3048 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0>
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Process [pid: 898548 uid: 0 unit: crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope] connected to transaction progress
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Unlocked sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Process [pid: 898548 uid: 0 unit: crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope] disconnected from transaction progress
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:machine-config-operator dbus:1.3048 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) vanished; remaining=0
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: In idle state; will auto-exit in 61 seconds
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:cli dbus:1.3049 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) added; new total=1
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:cli dbus:1.3049 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) vanished; remaining=0
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: In idle state; will auto-exit in 62 seconds
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:machine-config-operator dbus:1.3050 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) added; new total=1
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Loaded sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Locked sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Initiated txn Rebase for client(id:machine-config-operator dbus:1.3050 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0)>
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Process [pid: 898569 uid: 0 unit: crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope] connected to transaction progress
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Fetching ostree-unverified-image:docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154>
      Mar 26 15:32:14 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Created new deployment /ostree/deploy/rhcos/deploy/27551d8ffb2ffbc3af714f5e8cdada699c02b7ae6b3f7fe3a5a16024a43aab2e.4
      Mar 26 15:32:14 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: sanitycheck(/usr/bin/true) successful
      Mar 26 15:32:14 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Pruned container image layers: 0
      Mar 26 15:32:15 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos successful
      Mar 26 15:32:15 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Unlocked sysroot
      Mar 26 15:32:15 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Process [pid: 898569 uid: 0 unit: crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope] disconnected from transaction progress
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:machine-config-operator dbus:1.3050 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) vanished; remaining=0
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: In idle state; will auto-exit in 61 seconds
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Stopping rpm-ostree System Management Daemon...
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: rpm-ostreed.service: Deactivated successfully.
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Stopped rpm-ostree System Management Daemon.
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: rpm-ostreed.service: Consumed 3.116s CPU time.
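
      One way to retry the failed rebase by hand and watch finalization, assuming the same target digest as in the logs above:

      {code:none}
      rpm-ostree rebase ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8
      systemctl reboot
      # After the reboot, check whether finalization succeeded this time:
      journalctl -b -1 -u ostree-finalize-staged.service
      {code}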
      
       

              Assignee: Unassigned
              Reporter: rhit_amohamed (Ahmed Mohamed)
              QA Contact: Jie Zhao