OpenShift Bugs / OCPBUGS-54266

Unexpected on-disk state - osimageurl mismatch

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Fix Version/s: None
    • Affects Version/s: 4.14.z
    • Component/s: RHCOS

      Description of problem:

      After updating a NodePool with a new ConfigMap, the NodePool is stuck in a ReconcileError loop.
          

      Version-Release number of selected component (if applicable):

      4.14.z (machine-config-daemon v4.14.0-202407100410.p0.gea6671e, per the daemon logs below)

      How reproducible:

      This is the first time I've run into this issue. I've updated the NodePool with ConfigMaps in the past, but this is the first time I've seen it degrade the node.
          

      Steps to Reproduce:

          1. Create a ConfigMap wrapping the MachineConfig (a sketch of the shape follows below)
          2. Update the NodePool to reference the newly created ConfigMap
          3. Observe the NodePool report ReconcileError and the node go Degraded
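      For reference, a minimal sketch of what steps 1-2 look like; the ConfigMap name, namespace, and MachineConfig contents here are hypothetical, not the exact ones used:

      {code:none}
      # Hypothetical ConfigMap wrapping a MachineConfig. HyperShift expects the
      # MachineConfig YAML under the "config" key, in the management-cluster
      # namespace that holds the HostedCluster resources.
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: custom-machineconfig      # hypothetical name
        namespace: clusters             # hypothetical namespace
      data:
        config: |
          apiVersion: machineconfiguration.openshift.io/v1
          kind: MachineConfig
          metadata:
            name: 99-worker-custom
            labels:
              machineconfiguration.openshift.io/role: worker
          spec:
            config:
              ignition:
                version: 3.2.0
      ---
      # The NodePool then references the ConfigMap by name:
      apiVersion: hypershift.openshift.io/v1beta1
      kind: NodePool
      spec:
        config:
          - name: custom-machineconfig
      {code}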

      Actual results:

      The NodePool returns the following:
      {code:java}
      Warning  ReconcileError  53m (x16 over 35d)  nodepool-controller  failed to get config: failed to check if we manage haproxy ignition config: failed to look up image metadata for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa: failed to obtain image configuration for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa: cannot retrieve image configuration for manifest sha256:bc428181ba542fba9d93aaf3ff362cf64c0dc7f45bc566d79d7b084ab04614d7 in manifest list sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa: received unexpected HTTP status: 502 Bad Gateway
      {code}
      
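      The 502 can be checked independently of the controller; a minimal sketch, assuming skopeo (or oc) and a pull secret are available on a host with the same network path to the registry:

      {code:none}
      # Inspect the manifest list directly; a repeatable 502 here points at the
      # registry/proxy path rather than at the nodepool-controller itself.
      skopeo inspect --authfile ./pull-secret.json \
        docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa

      # Equivalent check with oc, resolving one manifest from the list:
      oc image info --filter-by-os=linux/amd64 -a ./pull-secret.json \
        quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:17dad53230cc8f95e1b993cc34a77351fb77bd903387661f0df41fff72746daa
      {code}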

      The hosted-cluster-config-operator returns the following:

      {"level":"info","ts":"2025-03-26T15:37:26Z","msg":"discovered mco image","controller":"inplaceupgrader","object":{"name":"nodepool-prod-stc-infosec-dc-rdu2-1","namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2"},"namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2","name":"nodepool-prod-stc-infosec-dc-rdu2-1","reconcileID":"6e6aaa5c-46eb-4174-8925-9832200e731b","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e93f4635bf0ffa1796b375bab78ac2644c6a07d001ce8441a6294dfc6fe3488"}
      {"level":"info","ts":"2025-03-26T15:37:26Z","msg":"Reconciled MachineSet","controller":"inplaceupgrader","object":{"name":"nodepool-prod-stc-infosec-dc-rdu2-1","namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2"},"namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2","name":"nodepool-prod-stc-infosec-dc-rdu2-1","reconcileID":"6e6aaa5c-46eb-4174-8925-9832200e731b","result":"unchanged"}
      {"level":"error","ts":"2025-03-26T15:37:26Z","msg":"Reconciler error","controller":"inplaceupgrader","object":{"name":"nodepool-prod-stc-infosec-dc-rdu2-1","namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2"},"namespace":"prod-stc-infosec-dc-rdu2-prod-stc-infosec-dc-rdu2","name":"nodepool-prod-stc-infosec-dc-rdu2-1","reconcileID":"6e6aaa5c-46eb-4174-8925-9832200e731b","error":"degraded node found, cannot progress in-place upgrade. Degraded reason: disk validation failed: expected target osImageURL \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8\", have \"\" (\"b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f\")","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/...
      

      The machine-config-daemon is reporting the following:

      > oc logs -f machine-config-daemon-hosted-worker-024.nodes.prod.psi.rdu2.redhat.com
      I0326 15:38:06.332528    5640 start.go:61] Version: v4.14.0-202407100410.p0.gea6671e.assembly.stream.el8-dirty (ea6671e27ec6683aa986b7072b022f0393a0e15a)
      I0326 15:38:06.332568    5640 update.go:1995] Running: mount --rbind /run/secrets /rootfs/run/secrets
      I0326 15:38:06.338393    5640 update.go:1995] Running: mount --rbind /usr/bin /rootfs/run/machine-config-daemon-bin
      I0326 15:38:06.340814    5640 daemon.go:457] container is rhel8, target is rhel9
      I0326 15:38:06.500197    5640 daemon.go:525] Invoking re-exec /run/bin/machine-config-daemon
      I0326 15:38:06.534097    5640 start.go:61] Version: v4.14.0-202407100410.p0.gea6671e.assembly.stream.el8-dirty (ea6671e27ec6683aa986b7072b022f0393a0e15a)
      I0326 15:38:06.534299    5640 update.go:1995] Running: systemctl daemon-reload
      I0326 15:38:06.845491    5640 rpm-ostree.go:88] Enabled workaround for bug 2111817
      E0326 15:38:06.845593    5640 rpm-ostree.go:284] Merged secret file could not be validated; defaulting to cluster pull secret <nil>
      I0326 15:38:06.845615    5640 rpm-ostree.go:262] Linking ostree authfile to /var/lib/kubelet/config.json
      I0326 15:38:07.089993    5640 daemon.go:270] Booted osImageURL:  (414.92.202305090606-0) b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f
      I0326 15:38:07.091273    5640 start.go:124] overriding kubernetes api to https://api.prod-stc-infosec-dc-rdu2.ocp-hub.prod.psi.redhat.com:6443
      W0326 15:38:07.112659    5640 controller_context.go:111] unable to get owner reference (falling back to namespace): unable to setup event recorder as "POD_NAME" env variable is not set and there are no pods
      I0326 15:38:07.112873    5640 simple_featuregate_reader.go:171] Starting feature-gate-detector
      I0326 15:38:07.124098    5640 writer.go:88] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig
      I0326 15:38:07.124210    5640 daemon.go:777] Starting MachineConfigDaemon - Hypershift
      I0326 15:38:08.171605    5640 daemon.go:890] Successfully read current/desired Config
      I0326 15:38:08.208520    5640 daemon.go:899] No diff detected. Assuming a previous update was completed. Checking on-disk state.
      E0326 15:38:09.315855    5640 writer.go:226] Marking Degraded due to: disk validation failed: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8", have "" ("b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f")
      I0326 15:38:11.225232    5640 daemon.go:890] Successfully read current/desired Config
      I0326 15:38:11.260954    5640 daemon.go:899] No diff detected. Assuming a previous update was completed. Checking on-disk state.
      E0326 15:38:12.713085    5640 writer.go:226] Marking Degraded due to: disk validation failed: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8", have "" ("b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f")
      I0326 15:38:28.768536    5640 daemon.go:890] Successfully read current/desired Config
      I0326 15:38:28.803236    5640 daemon.go:899] No diff detected. Assuming a previous update was completed. Checking on-disk state.
      E0326 15:38:29.555588    5640 writer.go:226] Marking Degraded due to: disk validation failed: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8", have "" ("b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f")
      I0326 15:39:29.609959    5640 daemon.go:890] Successfully read current/desired Config
      I0326 15:39:29.646590    5640 daemon.go:899] No diff detected. Assuming a previous update was completed. Checking on-disk state.
      E0326 15:39:30.894835    5640 writer.go:226] Marking Degraded due to: disk validation failed: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8", have "" ("b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f")
      
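      Note that the "Booted osImageURL" line above is empty: the node is booted into the plain ostree commit b2b90997... with no container-image origin, while the desired config expects the sha256:6a255bb9... image, hence the "expected ... have \"\"" mismatch. A minimal sketch of how to see what the daemon compares against, from a debug shell on the node (field name as emitted by recent rpm-ostree releases):

      {code:none}
      # Empty/null for a deployment that is not backed by a container image,
      # i.e. the attempted rebase never became the booted OS.
      rpm-ostree status --json | jq -r '.deployments[] | select(.booted) | ."container-image-reference"'
      {code}
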
      Expected results:

      I expect the NodePool not to be degraded and the updated MachineConfig to roll out to all nodes.
          

      Additional info:

      I went through https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit?tab=t.0#heading=h.kcqrbb1p1rpj and applied the force file, which rebooted the node, to no avail.
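
      (The force file referenced there is the machine-config-daemon's on-disk validation bypass; what was applied amounts to the following, from a debug shell on the node:)

      {code:none}
      # Tells the machine-config-daemon to skip on-disk state validation and
      # re-apply the desired config; the daemon reboots the node afterwards.
      touch /run/machine-config-daemon-force
      {code}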

      The following was captured after applying the force file, once the node had auto-rebooted:

      sh-5.1# rpm-ostree status
      State: idle
      Warning: failed to finalize previous deployment
               error: failed to run semodule: Child process exited with code 127
               check `journalctl -b -1 -u ostree-finalize-staged.service`
      Deployments:
      * b2b90997a32bae13c7d91be179737d37ac81607cc45f749fb6ce75dbc9bcd62f
          Version: 414.92.202305090606-0 (2023-05-09T06:09:26Z)
      sh-5.1# journalctl -b -1 -u ostree-finalize-staged.service
      Mar 26 15:32:14 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Finished OSTree Finalize Staged Deployment.
      Mar 26 15:33:02 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Stopping OSTree Finalize Staged Deployment...
      Mar 26 15:33:02 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899516]: Finalizing staged deployment
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899516]: Copying /etc changes: 19 modified, 0 removed, 578 added
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899516]: Copying /etc changes: 19 modified, 0 removed, 578 added
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899524]: semodule: error while loading shared libraries: /lib64/libsemanage.so.2: invalid ELF header
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com ostree[899516]: error: failed to run semodule: Child process exited with code 127
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: ostree-finalize-staged.service: Control process exited, code=exited, status=1/FAILURE
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: ostree-finalize-staged.service: Failed with result 'exit-code'.
      Mar 26 15:33:03 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Stopped OSTree Finalize Staged Deployment.
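
      The semodule failure above looks like the proximate error: during finalization, /lib64/libsemanage.so.2 (as resolved inside the staged deployment) is not a valid ELF object, so the SELinux policy rebuild fails and the staged deployment is discarded, leaving the node on the old commit. One hedged corruption check, using the deployment path from the Rebase journal further down (the directory may already have been garbage-collected after the failed finalization):

      {code:none}
      # An intact shared object starts with the ELF magic 7f 45 4c 46 ("\x7fELF").
      file /ostree/deploy/rhcos/deploy/27551d8ffb2ffbc3af714f5e8cdada699c02b7ae6b3f7fe3a5a16024a43aab2e.4/usr/lib64/libsemanage.so.2
      head -c 4 /ostree/deploy/rhcos/deploy/27551d8ffb2ffbc3af714f5e8cdada699c02b7ae6b3f7fe3a5a16024a43aab2e.4/usr/lib64/libsemanage.so.2 | od -An -tx1
      {code}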
      
      sh-5.1# journalctl -b -1 -u rpm-ostreed.service
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Starting rpm-ostree System Management Daemon...
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: Reading config file '/etc/rpm-ostreed.conf'
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: In idle state; will auto-exit in 64 seconds
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Started rpm-ostree System Management Daemon.
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: client(id:cli dbus:1.2846 unit:crio-38580971eeeb66b069f7936bb7f9738476da228cd5a3c41928c065745f774134.scope uid:0) added; new total=1
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: client(id:cli dbus:1.2846 unit:crio-38580971eeeb66b069f7936bb7f9738476da228cd5a3c41928c065745f774134.scope uid:0) vanished; remaining=0
      Mar 26 13:51:20 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[836319]: In idle state; will auto-exit in 64 seconds
      Mar 26 13:52:25 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: rpm-ostreed.service: Deactivated successfully.
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Starting rpm-ostree System Management Daemon...
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: Reading config file '/etc/rpm-ostreed.conf'
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: In idle state; will auto-exit in 63 seconds
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Started rpm-ostree System Management Daemon.
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: client(id:cli dbus:1.3004 unit:crio-39b8158b8fae369b8e6ec77a002294da372e365e88f4a808d9f831e2748f4efc.scope uid:0) added; new total=1
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: client(id:cli dbus:1.3004 unit:crio-39b8158b8fae369b8e6ec77a002294da372e365e88f4a808d9f831e2748f4efc.scope uid:0) vanished; remaining=0
      Mar 26 15:10:32 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[884229]: In idle state; will auto-exit in 64 seconds
      Mar 26 15:11:37 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: rpm-ostreed.service: Deactivated successfully.
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Starting rpm-ostree System Management Daemon...
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Reading config file '/etc/rpm-ostreed.conf'
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: In idle state; will auto-exit in 61 seconds
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Started rpm-ostree System Management Daemon.
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:machine-config-operator dbus:1.3048 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) added; new total=1
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Loaded sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Locked sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Initiated txn Cleanup for client(id:machine-config-operator dbus:1.3048 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0>
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Process [pid: 898548 uid: 0 unit: crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope] connected to transaction progress
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Unlocked sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Process [pid: 898548 uid: 0 unit: crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope] disconnected from transaction progress
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:machine-config-operator dbus:1.3048 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) vanished; remaining=0
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: In idle state; will auto-exit in 61 seconds
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:cli dbus:1.3049 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) added; new total=1
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:cli dbus:1.3049 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) vanished; remaining=0
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: In idle state; will auto-exit in 62 seconds
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:machine-config-operator dbus:1.3050 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) added; new total=1
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Loaded sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Locked sysroot
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Initiated txn Rebase for client(id:machine-config-operator dbus:1.3050 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0)>
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Process [pid: 898569 uid: 0 unit: crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope] connected to transaction progress
      Mar 26 15:32:10 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Fetching ostree-unverified-image:docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154>
      Mar 26 15:32:14 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Created new deployment /ostree/deploy/rhcos/deploy/27551d8ffb2ffbc3af714f5e8cdada699c02b7ae6b3f7fe3a5a16024a43aab2e.4
      Mar 26 15:32:14 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: sanitycheck(/usr/bin/true) successful
      Mar 26 15:32:14 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Pruned container image layers: 0
      Mar 26 15:32:15 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos successful
      Mar 26 15:32:15 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Unlocked sysroot
      Mar 26 15:32:15 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: Process [pid: 898569 uid: 0 unit: crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope] disconnected from transaction progress
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: client(id:machine-config-operator dbus:1.3050 unit:crio-e4a835e92e855aabb0198cd4db56c70f7269398921c1a764d808ffe44834a665.scope uid:0) vanished; remaining=0
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com rpm-ostree[898552]: In idle state; will auto-exit in 61 seconds
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Stopping rpm-ostree System Management Daemon...
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: rpm-ostreed.service: Deactivated successfully.
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: Stopped rpm-ostree System Management Daemon.
      Mar 26 15:32:16 hosted-worker-024.nodes.prod.psi.rdu2.redhat.com systemd[1]: rpm-ostreed.service: Consumed 3.116s CPU time.
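
      One way to retry the failed rebase by hand and watch finalization, assuming the same target digest as in the logs above:

      {code:none}
      rpm-ostree rebase ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a255bb9116f3362a9800d607bcd6184c5124bd50a41b4b8c1707cdc663154e8
      systemctl reboot
      # After the reboot, check whether finalization succeeded this time:
      journalctl -b -1 -u ostree-finalize-staged.service
      {code}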
      
       

              Assignee: Unassigned
              Reporter: rhit_amohamed (Ahmed Mohamed)
              QA Contact: Jie Zhao