OCPBUGS-36198

kubelet does not start after reboot due to dependency issue

      Previously, for clusters upgraded from older versions of OCP, enabling `kdump` on an OVN-enabled cluster sometimes prevented the node from rejoining the cluster or returning to the `Ready` state.
      This fix removes stale data from older OCP versions and ensures this stale data is always cleaned up. The node can now start correctly and rejoin the cluster.
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-33694. The following is the description of the original issue:

      Description of problem:

      kubelet does not start after reboot due to dependency issue

      Version-Release number of selected component (if applicable):

       OCP 4.14.23
        

      How reproducible:

          Every time in the customer's environment

      Steps to Reproduce:

          1. Upgrade an OVN-based OpenShift cluster that has kdump enabled to OCP 4.14.23.
          2. Check the status of the kubelet and crio services after the node reboots.
          
          

      Actual results:

          The kubelet and crio services are in the inactive (dead) state and do not start automatically after reboot; manual intervention is needed.
      
      $ cat sos_commands/crio/systemctl_status_crio 
      ○ crio.service - Container Runtime Interface for OCI (CRI-O)
           Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; preset: disabled)
          Drop-In: /etc/systemd/system/crio.service.d
                   └─01-kubens.conf, 05-mco-ordering.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf
           Active: inactive (dead)
             Docs: https://github.com/cri-o/cri-o
      
      $ cat sos_commands/openshift/systemctl_status_kubelet
      ○ kubelet.service - Kubernetes Kubelet
           Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
          Drop-In: /etc/systemd/system/kubelet.service.d
                   └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
           Active: inactive (dead)
      

      Expected results:

          kubelet and crio should start automatically.

      Additional info:

      I think the recent patch that makes services wait until kdump starts has introduced an ordering cycle.
      
      https://github.com/openshift/machine-config-operator/pull/4213/files
      
      May 09 19:12:05 network01 systemd[1]: network-online.target: Found dependency on kdump.service/start
      May 09 19:13:48 network01 systemd[1]: ovs-configuration.service: Found ordering cycle on kdump.service/start
      May 09 19:13:48 network01 systemd[1]: ovs-configuration.service: Job kdump.service/start deleted to break ordering cycle starting with ovs-configuration.service/start
      May 12 21:20:57 network01 systemd[1]: node-valid-hostname.service: Found dependency on kdump.service/start
      May 12 21:21:00 network01 kdumpctl[1389]: kdump: kexec: loaded kdump kernel
      May 12 21:21:00 network01 kdumpctl[1389]: kdump: Starting kdump: [OK]
      May 12 21:25:28 network01 systemd[1]: kdump.service: Found ordering cycle on network-online.target/start
      May 12 21:25:28 network01 systemd[1]: kdump.service: Found dependency on node-valid-hostname.service/start
      May 12 21:25:28 network01 systemd[1]: kdump.service: Found dependency on ovs-configuration.service/start
      May 12 21:25:28 network01 systemd[1]: kdump.service: Found dependency on kdump.service/start
      May 12 21:25:28 network01 systemd[1]: kdump.service: Job network-online.target/start deleted to break ordering cycle starting with kdump.service/start
      May 12 21:25:31 network01 kdumpctl[1284]: kdump: kexec: loaded kdump kernel
      May 12 21:25:31 network01 kdumpctl[1284]: kdump: Starting kdump: [OK]
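
A minimal sketch for triaging a log like the one above, assuming the journal has been saved to a text file (for example from a sosreport). The function names are hypothetical helpers, not part of systemd or OpenShift. The first function lists the jobs systemd deleted; the second uses `tsort` from coreutils, which exits non-zero when its input contains a loop, to confirm that a chain of units forms a cycle. The edge direction is an assumption: pairs are written in the order the log lists the dependencies.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print each job that systemd deleted to break an ordering cycle.
# Reads journal text on stdin (e.g. `journalctl -b` output or a saved log).
deleted_jobs() {
    sed -n 's/.*: Job \([^ ]*\) deleted to break ordering cycle.*/\1/p' | sort -u
}

# Succeed (exit 0) if the "unit unit" pairs on stdin contain a loop.
# tsort cannot produce a total order for cyclic input and exits non-zero.
has_ordering_cycle() {
    ! tsort >/dev/null 2>&1
}
```

For example, `journalctl -b | deleted_jobs` on an affected node would be expected to print the deleted jobs, and piping the four unit pairs from the May 12 entries (kdump.service, network-online.target, node-valid-hostname.service, ovs-configuration.service, and back to kdump.service) into `has_ordering_cycle` should succeed.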
      
      To break a cycle, systemd deletes one of the jobs that is part of the cycle, so the corresponding service is not started.
      Disabling kdump and rebooting the node works around the problem: kubelet and crio then start automatically.
      
      # systemctl disable kdump
      
      # systemctl reboot
      
      Make sure that systemctl list-jobs shows no pending jobs. Once all jobs have completed, check the status of kubelet.
      
      # systemctl list-jobs
      
      # systemctl status kubelet
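
The post-reboot check above could be scripted roughly as follows. This is a sketch under stated assumptions: it runs as root on the affected node, the function names are hypothetical, and it assumes that `systemctl list-jobs --no-legend` prints one line per queued job and nothing when the queue is empty.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Count queued jobs from `systemctl list-jobs --no-legend` output on stdin.
# Blank lines are ignored; empty input yields 0.
pending_job_count() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Poll until the job queue is empty, then show the kubelet status.
# Assumption: --no-legend output is empty when no jobs are queued.
wait_then_check_kubelet() {
    while [ "$(systemctl list-jobs --no-legend | pending_job_count)" -gt 0 ]; do
        sleep 5
    done
    systemctl status kubelet --no-pager
}
```

Calling `wait_then_check_kubelet` after `systemctl disable kdump` and a reboot would report the kubelet state only once systemd has finished its startup jobs.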

              team-mco Team MCO
              openshift-crt-jira-prow OpenShift Prow Bot
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa