Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: 4.13.z
Component/s: Machine Config Operator
Labels:
- 4.13
- bug
- mco
- mco-triaged
- update

Severity:
Important
Regression:
No
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

Description of problem:

After updating to OpenShift Container Platform 4.13.4, scaling up OpenShift Container Platform 4 - Node(s) fails: the newly provisioned OpenShift Container Platform 4 - Node gets stuck with the error below.

Jul 05 11:47:16 new-node-0 clever_pare[2118]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] Skipping interface ens5
Jul 05 11:47:16 new-node-0 clever_pare[2118]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] No changes.
Jul 05 11:47:16 new-node-0 podman[2106]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] Skipping interface ens5
Jul 05 11:47:16 new-node-0 podman[2106]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] No changes.
Jul 05 11:47:16 new-node-0 podman[2106]: std::io::Error: No such file or directory (os error 2)
Jul 05 11:47:16 new-node-0 clever_pare[2118]: std::io::Error: No such file or directory (os error 2)
Jul 05 11:47:16 new-node-0 clever_pare[2118]: W0705 11:47:16.013513       1 firstboot_complete_machineconfig.go:63] error: failed to persist network interfaces: failed to run nmstatectl: exit status 1
Jul 05 11:47:16 new-node-0 podman[2106]: W0705 11:47:16.013513       1 firstboot_complete_machineconfig.go:63] error: failed to persist network interfaces: failed to run nmstatectl: exit status 1
Jul 05 11:47:16 new-node-0 podman[2106]: I0705 11:47:16.013525       1 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry
Jul 05 11:47:16 new-node-0 clever_pare[2118]: I0705 11:47:16.013525       1 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry
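To check other nodes for the same signature, the journal can be scanned for the two messages shown above. The following is a minimal sketch, not part of the original report; the function name `find_persist_nic_failure` is hypothetical, and the match strings are copied from the excerpt.

```python
# Substrings copied from the journal excerpt in this report.
PERSIST_FAILURE = "failed to persist network interfaces: failed to run nmstatectl"
ENOENT = "No such file or directory (os error 2)"


def find_persist_nic_failure(journal_lines):
    """Scan journal lines for the failure signature.

    Returns a tuple (saw_enoent, failure_lines):
      saw_enoent    -- True if the "os error 2" message appears anywhere
      failure_lines -- the lines reporting the nmstatectl persist failure
    """
    saw_enoent = False
    failure_lines = []
    for line in journal_lines:
        if ENOENT in line:
            saw_enoent = True
        if PERSIST_FAILURE in line:
            failure_lines.append(line.rstrip("\n"))
    return saw_enoent, failure_lines
```

The input can be any iterable of strings, for example the lines of a saved journal dump from the affected Node.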

This appears to be the same problem that was tracked and fixed in https://issues.redhat.com/browse/OCPBUGS-14298 (the fix was part of OpenShift Container Platform 4.13.4). So while the upgrade to OpenShift Container Platform 4.13.4 completed successfully, newly scaled OpenShift Container Platform 4 - Node(s) are now failing because of that issue.

 - After manually creating /etc/systemd/network on the problematic OpenShift Container Platform 4 - Node, the OpenShift Container Platform 4 - Node eventually joins the OpenShift Container Platform 4 - Cluster and reports Ready state.
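The manual workaround amounts to making sure the directory exists before the next retry. A minimal sketch of that step follows; the helper name `ensure_dir` and the 0o755 mode are assumptions for illustration, and on a real Node this would need to run as root.

```python
import os

# Directory named in this report; its absence matches the "os error 2" message.
NETWORK_DIR = "/etc/systemd/network"


def ensure_dir(path=NETWORK_DIR, mode=0o755):
    """Create `path` (and missing parents) if it does not exist.

    Returns True if the directory was created, False if it already existed.
    The 0o755 mode is an assumption, not taken from the report.
    """
    if os.path.isdir(path):
        return False
    os.makedirs(path, mode=mode, exist_ok=True)
    return True
```

Calling `ensure_dir()` a second time is a no-op, so it is safe to run on Nodes that already have the directory.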

When the AMI in the MachineSet is updated to the AMI for OpenShift Container Platform 4.13.4, scaling new OpenShift Container Platform 4 - Node(s) works without issue. But this change in the MachineSet should not be required, as it would be a massive effort for all OpenShift Container Platform 4 - Clusters updating to OpenShift Container Platform 4.13.4 and beyond.

 - Also, the OpenShift Container Platform 4 - Node is running the Red Hat Enterprise Linux - CoreOS version specified in the AMI of the MachineSet, which is OpenShift Container Platform 4.11. So it hits the problem while still on that image, not after the OpenShift Container Platform 4.13.4 update has been applied to it.
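To see which AMI a MachineSet will boot new Nodes from, the ID can be read from the MachineSet object. The sketch below assumes an AWS MachineSet already parsed from JSON (for example from `oc get machineset <name> -n openshift-machine-api -o json`) with the AMI stored under `spec.template.spec.providerSpec.value.ami.id`; the function name is hypothetical.

```python
def machineset_ami_id(machineset):
    """Return the AMI ID from a parsed AWS MachineSet dict, or None if absent.

    Walks spec.template.spec.providerSpec.value.ami.id one key at a time so
    that a missing level yields None instead of raising KeyError.
    """
    node = machineset
    for key in ("spec", "template", "spec", "providerSpec", "value", "ami", "id"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node
```

Comparing the returned ID across MachineSets shows which ones still reference the older image.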

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.13.4

How reproducible:

Unknown

Steps to Reproduce:

1. Update an OpenShift Container Platform 4 - Cluster on AWS from OpenShift Container Platform 4.11 to 4.13.4
2. Scale up an additional Machine via the MachineSet

Actual results:

The OpenShift Container Platform 4 - Node is stuck in Provisioned state and never becomes Ready because of the error below, found in the system journal.

Jul 05 11:47:16 new-node-0 clever_pare[2118]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] Skipping interface ens5
Jul 05 11:47:16 new-node-0 clever_pare[2118]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] No changes.
Jul 05 11:47:16 new-node-0 podman[2106]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] Skipping interface ens5
Jul 05 11:47:16 new-node-0 podman[2106]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] No changes.
Jul 05 11:47:16 new-node-0 podman[2106]: std::io::Error: No such file or directory (os error 2)
Jul 05 11:47:16 new-node-0 clever_pare[2118]: std::io::Error: No such file or directory (os error 2)
Jul 05 11:47:16 new-node-0 clever_pare[2118]: W0705 11:47:16.013513       1 firstboot_complete_machineconfig.go:63] error: failed to persist network interfaces: failed to run nmstatectl: exit status 1
Jul 05 11:47:16 new-node-0 podman[2106]: W0705 11:47:16.013513       1 firstboot_complete_machineconfig.go:63] error: failed to persist network interfaces: failed to run nmstatectl: exit status 1
Jul 05 11:47:16 new-node-0 podman[2106]: I0705 11:47:16.013525       1 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry
Jul 05 11:47:16 new-node-0 clever_pare[2118]: I0705 11:47:16.013525       1 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry
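Machines affected by this can be listed by filtering on the Provisioned phase. The sketch below assumes a Machine list parsed from JSON (for example from `oc get machines -n openshift-machine-api -o json`) where each item reports its phase in `status.phase`; the function name is hypothetical.

```python
def provisioned_machines(machine_list):
    """Return the names of Machines whose status.phase is "Provisioned".

    `machine_list` is a parsed Kubernetes List object with an "items" array.
    Items without a status or phase are skipped.
    """
    names = []
    for item in machine_list.get("items", []):
        phase = (item.get("status") or {}).get("phase")
        if phase == "Provisioned":
            names.append((item.get("metadata") or {}).get("name"))
    return names
```

A Machine normally passes through Provisioned briefly, so this only indicates a problem when a Machine stays in that phase.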

Expected results:

The problem matches the one tracked in https://issues.redhat.com/browse/OCPBUGS-14298, which is considered resolved, so it is not clear why newly created OpenShift Container Platform 4 - Node(s) still experience it. Updating the MachineSet with the OpenShift Container Platform 4.13.4 AMI does resolve the issue, but this approach is not considered feasible for a fleet of multiple OpenShift Container Platform 4 - Clusters.

Additional info:

is related to

OCPBUGS-14298 Upgrade to OCP 4.13.0 stuck due to machine-config error 'failed to run- nmstatectl: exit status 1'

Closed

links to

Scale-up of OpenShift Container Platform 4 - Node is stuck post OpenShift Container Platform 4.13.4 update

Assignee: Team MCO

Reporter: Simon Reber

QA Contact: Sergio Regidor de la Rosa

Votes: 0

Watchers: 4

Created: 2023/07/07 7:37 AM

Updated: 2023/08/07 5:36 PM

Resolved: 2023/08/07 5:36 PM
