OpenShift Bugs / OCPBUGS-74938

Kubelet and NetworkManager do not start automatically on any node after reboot, leaving nodes stuck in NotReady

    • Bug
    • Resolution: Unresolved
    • Critical
    • 4.17.z, 4.18.z
    • RHCOS

      Description of problem:

          After OpenShift worker nodes are rebooted, NetworkManager does not start automatically, so the node remains in the NotReady state. Without networking, the node cannot pull images or rejoin the cluster.
          Manual intervention is required to regain access to the node and start NetworkManager, which restores partial functionality. Even then, kubelet does not start automatically, and the node never recovers on its own, even after an extended wait.
          This behavior has been observed consistently across multiple clusters, indicating a systemic issue rather than a one-off node failure.
          The issue appears similar to https://issues.redhat.com/browse/OCPBUGS-36198, suggesting a possible regression or a related condition involving the Machine Config Operator and the NetworkManager initialization logic.

      Version-Release number of selected component (if applicable):

          4.17.z, 4.18.z

      How reproducible:

          Not reproducible in our own environment.

      Steps to Reproduce:

          1. Reproducible only in the customer environment.
          

      Actual results:

          • NetworkManager does not start automatically after reboot.
          • Node remains stuck in NotReady.
          • Images are not pulled due to lack of networking.
          • Manual recovery steps required:
              1. Reset core user password.
              2. SSH into the node.
              3. Manually start NetworkManager:
                 systemctl start NetworkManager
          • After NetworkManager starts, image pulls begin.
          • kubelet still does not start automatically, even after waiting for days.
          • Node never recovers without further manual intervention.
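
A minimal diagnostic sketch for the symptoms listed above, assuming shell access to the affected node has already been regained. The helper name `classify_unit` is hypothetical; it only interprets the words printed by `systemctl is-enabled` and `systemctl is-active`. The hints it prints are suggestions for where to look next, not a confirmed root cause.

```shell
#!/bin/sh
# classify_unit is a hypothetical helper: it maps the words printed by
# `systemctl is-enabled UNIT` and `systemctl is-active UNIT` to a short hint.
classify_unit() {
    u=$1
    en=$2
    ac=$3
    case "$en/$ac" in
        enabled/active) echo "$u: enabled and running" ;;
        enabled/*)      echo "$u: enabled but $ac; check 'journalctl -b -u $u'" ;;
        */active)       echo "$u: running but '$en'; may not start on next boot" ;;
        *)              echo "$u: '$en' and $ac; not started at boot" ;;
    esac
}

# Run as root on the node. Both systemctl subcommands exit non-zero for
# disabled/inactive units, so failures are tolerated and empty output
# (for example, no systemd available) is reported as "unknown".
for unit in NetworkManager.service kubelet.service; do
    enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || true
    active=$(systemctl is-active "$unit" 2>/dev/null) || true
    classify_unit "$unit" "${enabled:-unknown}" "${active:-unknown}"
done
```

Attaching the output for both units, together with `journalctl -b -u NetworkManager` and `journalctl -b -u kubelet` from the failed boot, would show whether the units are disabled, masked, or enabled but never started.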

      Expected results:

          After reboot:
          • NetworkManager should start automatically.
          • kubelet should start automatically once networking is available.
          • Node should transition back to Ready state without manual intervention.

          Nodes should recover fully after reboot, as expected in a production OpenShift cluster.

      Additional info:

          Jan 30 09:09:00 node.example.com systemd[1]: Cleans NetworkManager state generated by dracut was skipped because of an unmet condition check (ConditionPathExists=/var/lib/mco/nm-clean-initrd-state).
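
The condition named in this log entry can be checked directly on the node. The following is a minimal sketch, assuming the journal line has the shape shown above; `condition_path` is a hypothetical helper name. Whether the missing marker file is related to NetworkManager not starting is not established by this report.

```shell
#!/bin/sh
# condition_path (hypothetical helper) prints the path named in a
# "(ConditionPathExists=...)" fragment of a systemd journal line, if any.
condition_path() {
    printf '%s\n' "$1" | sed -n 's/.*(ConditionPathExists=\([^)]*\)).*/\1/p'
}

# The journal line from this report, joined onto one line.
line='Jan 30 09:09:00 node.example.com systemd[1]: Cleans NetworkManager state generated by dracut was skipped because of an unmet condition check (ConditionPathExists=/var/lib/mco/nm-clean-initrd-state).'

path=$(condition_path "$line")
if [ -e "$path" ]; then
    echo "$path exists"
else
    echo "$path is missing, so the unit's condition is unmet"
fi
```

Run on an affected node, this reports whether the marker file is present at the time of the check, which can be compared against a healthy node from the same cluster.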

              Unassigned
              rhn-support-vismishr Vishvranjan Mishra
              Sergio Regidor de la Rosa