  1. OpenShift Bugs
  2. OCPBUGS-15924

Scale-up of OpenShift Container Platform 4 - Node is stuck post OpenShift Container Platform 4.13.4 update

    • Important
    • No
    • Rejected
    • False

      Description of problem:

      After the update to OpenShift Container Platform 4.13.4, scaling up OpenShift Container Platform 4 - Node(s) fails: the newly provisioned OpenShift Container Platform 4 - Node gets stuck with the error below.
      
      Jul 05 11:47:16 new-node-0 clever_pare[2118]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] Skipping interface ens5
      Jul 05 11:47:16 new-node-0 clever_pare[2118]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] No changes.
      Jul 05 11:47:16 new-node-0 podman[2106]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] Skipping interface ens5
      Jul 05 11:47:16 new-node-0 podman[2106]: [2023-07-05T11:47:16Z INFO  nmstatectl::persist_nic] No changes.
      Jul 05 11:47:16 new-node-0 podman[2106]: std::io::Error: No such file or directory (os error 2)
      Jul 05 11:47:16 new-node-0 clever_pare[2118]: std::io::Error: No such file or directory (os error 2)
      Jul 05 11:47:16 new-node-0 clever_pare[2118]: W0705 11:47:16.013513       1 firstboot_complete_machineconfig.go:63] error: failed to persist network interfaces: failed to run nmstatectl: exit status 1
      Jul 05 11:47:16 new-node-0 podman[2106]: W0705 11:47:16.013513       1 firstboot_complete_machineconfig.go:63] error: failed to persist network interfaces: failed to run nmstatectl: exit status 1
      Jul 05 11:47:16 new-node-0 podman[2106]: I0705 11:47:16.013525       1 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry
      Jul 05 11:47:16 new-node-0 clever_pare[2118]: I0705 11:47:16.013525       1 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry
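To check a saved journal for this failure, something like the following could be used. It is only a sketch: the regular expression and the function name `find_persist_nic_failures` are illustrative assumptions built from the excerpt above, not part of the MCO or nmstate.

```python
import re

# Signature copied from the journal excerpt above. The regex is an
# illustrative assumption, not something shipped with the MCO.
FAILURE_RE = re.compile(
    r"firstboot_complete_machineconfig\.go:\d+\] error: "
    r"failed to persist network interfaces: (?P<reason>.+)$"
)


def find_persist_nic_failures(journal_lines):
    """Return (unit, reason) pairs for each journal line with the failure."""
    hits = []
    for line in journal_lines:
        match = FAILURE_RE.search(line)
        if match is None:
            continue
        # Journal layout: "<Mon> <day> <time> <host> <unit>[<pid>]: <message>"
        fields = line.split()
        unit = fields[4].rstrip(":") if len(fields) > 4 else "unknown"
        hits.append((unit, match.group("reason")))
    return hits


sample = [
    "Jul 05 11:47:16 new-node-0 podman[2106]: std::io::Error: No such file or directory (os error 2)",
    "Jul 05 11:47:16 new-node-0 podman[2106]: W0705 11:47:16.013513       1 firstboot_complete_machineconfig.go:63] error: failed to persist network interfaces: failed to run nmstatectl: exit status 1",
    "Jul 05 11:47:16 new-node-0 podman[2106]: I0705 11:47:16.013525       1 firstboot_complete_machineconfig.go:64] Sleeping 1 minute for retry",
]
failures = find_persist_nic_failures(sample)
print(failures)
```

Only the warning line matches. The preceding `std::io::Error` line and the retry notice are ignored on purpose, since on their own they do not identify this failure.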
      
      This appears to be the same problem that was tracked and fixed in https://issues.redhat.com/browse/OCPBUGS-14298 (the fix was part of OpenShift Container Platform 4.13.4). So while the update to OpenShift Container Platform 4.13.4 completed successfully, newly scaled OpenShift Container Platform 4 - Node(s) are still failing because of that issue.
      
       - When manually creating /etc/systemd/network on the problematic OpenShift Container Platform 4 - Node, the OpenShift Container Platform 4 - Node will eventually join the OpenShift Container Platform 4 - Cluster and report Ready state.
      
      When the AMI in the MachineSet is updated to the AMI for OpenShift Container Platform 4.13.4, scaling new OpenShift Container Platform 4 - Node(s) works without issue. But this change in the MachineSet should not be required, as it would be a massive effort for all OpenShift Container Platform 4 - Clusters updating to OpenShift Container Platform 4.13.4 and beyond.
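For reference, the MachineSet change described above could be expressed as a JSON merge patch along these lines. This is a minimal sketch: the helper name `build_ami_patch` is hypothetical, the AMI ID is a placeholder rather than a real OpenShift Container Platform 4.13.4 AMI, and the field path assumes the usual AWS MachineSet layout (`spec.template.spec.providerSpec.value.ami.id`).

```python
import json


def build_ami_patch(ami_id):
    """Build a JSON merge patch that sets the AWS AMI of a MachineSet."""
    if not ami_id.startswith("ami-"):
        raise ValueError(f"not an AWS AMI ID: {ami_id!r}")
    # Assumed field path for AWS MachineSets:
    # spec.template.spec.providerSpec.value.ami.id
    return {
        "spec": {
            "template": {
                "spec": {"providerSpec": {"value": {"ami": {"id": ami_id}}}}
            }
        }
    }


# Placeholder AMI ID for illustration only; look up the real one for the
# cluster's region and release.
patch = build_ami_patch("ami-0123456789abcdef0")
print(json.dumps(patch))
```

The printed JSON could then be passed to `oc patch machineset <name> -n openshift-machine-api --type=merge -p '<json>'`. Only Machines created after the patch would use the new AMI, and the step would have to be repeated for every MachineSet in every cluster, which is the effort the report objects to.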
      
       - Also, the OpenShift Container Platform 4 - Node is running the Red Hat Enterprise Linux - CoreOS version specified in the AMI of the MachineSet, which is the one for OpenShift Container Platform 4.11. So it is experiencing the problem on that version, and not after the OpenShift Container Platform 4.13.4 update was applied.
      

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.13.4
      

      How reproducible:

      Unknown
      

      Steps to Reproduce:

      1. Update an OpenShift Container Platform 4 - Cluster on AWS from OpenShift Container Platform 4.11 to 4.13.4
      2. Scale up a MachineSet to provision an additional Machine
      

      Actual results:

      OpenShift Container Platform 4 - Node is stuck in Provisioned state and never becomes Ready because of the error found in the system journal. The journal output is identical to the excerpt in the description above (failed to persist network interfaces: failed to run nmstatectl: exit status 1).
      

      Expected results:

      The problem is the same as the one tracked in https://issues.redhat.com/browse/OCPBUGS-14298, which is considered resolved, so newly created OpenShift Container Platform 4 - Node(s) should not hit it after the update. It is not clear why they still do. Updating the MachineSet to the OpenShift Container Platform 4.13.4 AMI does resolve the issue, but this approach is not considered feasible for a fleet of multiple OpenShift Container Platform 4 - Clusters.
      

      Additional info:

      
      

              team-mco Team MCO
              rhn-support-sreber Simon Reber
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa