Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.14.z
Affects Version/s: 4.14.z
Component/s: Machine Config Operator
Labels:
- cee.neXT
- osintegration

Regression:
No
Story Points:
1
Sprint:
255 - Integration & Delivery, 256 - Integration & Delivery
sprint_count:
2
Blocked:
False
Blocked Reason:

None
Release Note Text:

Previously, for clusters upgraded from older versions of OCP, enabling `kdump` on an OVN-enabled cluster sometimes prevented the node from rejoining the cluster or returning to the `Ready` state. This fix removes problematic stale data from older OCP versions and ensures this kind of stale data is always cleaned up. The node can now start correctly and rejoin the cluster.
Release Note Type:
Bug Fix
Release Note Status:
Done
RH Private Keywords:
Target Version:

4.14.z
Target Backport Versions:

4.14.z, 4.15.z, 4.16.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-36258~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-36198~~. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-33694. The following is the description of the original issue:
—
Description of problem:

kubelet does not start after reboot due to dependency issue

Version-Release number of selected component (if applicable):

 OCP 4.14.23

How reproducible:

    Every time in the customer's environment

Steps to Reproduce:

    1. Upgrade an OVN-based OpenShift cluster with kdump enabled to OCP 4.14.23.
    2. After the nodes reboot, check the kubelet and crio status.

Actual results:

    The kubelet and crio services are inactive (dead) and do not start automatically after reboot; manual intervention is needed.

$ cat sos_commands/crio/systemctl_status_crio 
○ crio.service - Container Runtime Interface for OCI (CRI-O)
     Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; preset: disabled)
    Drop-In: /etc/systemd/system/crio.service.d
             └─01-kubens.conf, 05-mco-ordering.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf
     Active: inactive (dead)
       Docs: https://github.com/cri-o/cri-o

$ cat sos_commands/openshift/systemctl_status_kubelet 
○ kubelet.service - Kubernetes Kubelet
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
     Active: inactive (dead)
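
The same check can be scripted when many sosreports need to be reviewed. The following is a minimal sketch, assuming the status text has the layout shown above; the names `ACTIVE_RE` and `active_state` are hypothetical helpers, not part of any OpenShift tooling.

```python
import re

# Matches the "Active:" line of `systemctl status` output,
# for example "Active: inactive (dead)".
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)(?:\s+\(([^)]+)\))?", re.MULTILINE)


def active_state(status_text):
    """Return (state, substate) from systemctl status text, or None if absent."""
    match = ACTIVE_RE.search(status_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Excerpt of the kubelet status shown above.
sample = """\
○ kubelet.service - Kubernetes Kubelet
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
     Active: inactive (dead)
"""
print(active_state(sample))  # ('inactive', 'dead')
```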

Expected results:

    kubelet and crio should start automatically.

Additional info:

I suspect that the recent patch to wait until kdump starts has introduced an ordering cycle.

https://github.com/openshift/machine-config-operator/pull/4213/files

May 09 19:12:05 network01 systemd[1]: network-online.target: Found dependency on kdump.service/start
May 09 19:13:48 network01 systemd[1]: ovs-configuration.service: Found ordering cycle on kdump.service/start
May 09 19:13:48 network01 systemd[1]: ovs-configuration.service: Job kdump.service/start deleted to break ordering cycle starting with ovs-configuration.service/start
May 12 21:20:57 network01 systemd[1]: node-valid-hostname.service: Found dependency on kdump.service/start
May 12 21:21:00 network01 kdumpctl[1389]: kdump: kexec: loaded kdump kernel
May 12 21:21:00 network01 kdumpctl[1389]: kdump: Starting kdump: [OK]
May 12 21:25:28 network01 systemd[1]: kdump.service: Found ordering cycle on network-online.target/start
May 12 21:25:28 network01 systemd[1]: kdump.service: Found dependency on node-valid-hostname.service/start
May 12 21:25:28 network01 systemd[1]: kdump.service: Found dependency on ovs-configuration.service/start
May 12 21:25:28 network01 systemd[1]: kdump.service: Found dependency on kdump.service/start
May 12 21:25:28 network01 systemd[1]: kdump.service: Job network-online.target/start deleted to break ordering cycle starting with kdump.service/start
May 12 21:25:31 network01 kdumpctl[1284]: kdump: kexec: loaded kdump kernel
May 12 21:25:31 network01 kdumpctl[1284]: kdump: Starting kdump: [OK]
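
To spot which jobs systemd dropped, the journal can be filtered for the "deleted to break ordering cycle" message. This is a rough sketch that assumes the message wording seen in the log lines above; `DELETED_RE` and `deleted_jobs` are hypothetical names used only for illustration.

```python
import re

# Matches lines such as:
# "... systemd[1]: kdump.service: Job network-online.target/start deleted to break ordering cycle ..."
DELETED_RE = re.compile(
    r"systemd\[1\]: (?P<unit>\S+): Job (?P<job>\S+) deleted to break ordering cycle"
)


def deleted_jobs(journal_text):
    """Return (reporting_unit, deleted_job) pairs found in journal text."""
    return [(m.group("unit"), m.group("job")) for m in DELETED_RE.finditer(journal_text)]


# Two lines copied from the log above.
log = """\
May 09 19:13:48 network01 systemd[1]: ovs-configuration.service: Job kdump.service/start deleted to break ordering cycle starting with ovs-configuration.service/start
May 12 21:25:28 network01 systemd[1]: kdump.service: Job network-online.target/start deleted to break ordering cycle starting with kdump.service/start
"""
for unit, job in deleted_jobs(log):
    print(f"{unit}: systemd dropped {job}")
```

In these two boots systemd dropped a different job each time (kdump.service/start, then network-online.target/start).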

To break a cycle, systemd deletes one of the jobs that is part of the cycle, so the corresponding service is not started.
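
The cycle can be illustrated with a small graph search. This is only a sketch: the edges are an assumption taken from the order in which systemd printed the chain at 21:25:28, and the depth-first search is a generic cycle finder, not systemd's actual algorithm.

```python
def find_cycle(graph, start):
    """Return the first cycle reachable from start as a list of nodes, or None."""
    path, on_path, visited = [], set(), set()

    def visit(node):
        if node in on_path:
            # Reached a node already on the current path: that closes a cycle.
            return path[path.index(node):] + [node]
        if node in visited:
            return None
        visited.add(node)
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(node)
        return None

    return visit(start)


# Assumed ordering chain, in the order reported in the journal.
ordering = {
    "kdump.service": ["network-online.target"],
    "network-online.target": ["node-valid-hostname.service"],
    "node-valid-hostname.service": ["ovs-configuration.service"],
    "ovs-configuration.service": ["kdump.service"],
}
print(" -> ".join(find_cycle(ordering, "kdump.service")))

# Dropping one job from the graph, as systemd did with network-online.target/start,
# removes the cycle but also means that unit is not started in this transaction.
dropped = "network-online.target"
broken = {
    unit: [dep for dep in deps if dep != dropped]
    for unit, deps in ordering.items()
    if unit != dropped
}
print(find_cycle(broken, "kdump.service"))  # None
```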
  Disabling kdump and rebooting the node works around the issue: kubelet and crio then start automatically.

# systemctl disable kdump

# systemctl reboot

Make sure systemctl list-jobs shows no pending jobs. Once all jobs have completed, check the status of kubelet.

# systemctl list-jobs

# systemctl status kubelet
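
The pending-jobs check above can also be scripted. This sketch assumes the usual four-column row layout of systemctl list-jobs (job ID, unit, type, state); `parse_jobs` and `pending_jobs` are hypothetical helper names, and the sample text is made up for illustration rather than captured from the affected node.

```python
import subprocess


def parse_jobs(output):
    """Return the unit names listed in `systemctl list-jobs` output."""
    jobs = []
    for line in output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; header and summary lines do not match.
        if len(fields) >= 4 and fields[0].isdigit():
            jobs.append(fields[1])
    return jobs


def pending_jobs():
    """Query systemd for queued jobs; requires a host running systemd."""
    result = subprocess.run(
        ["systemctl", "list-jobs", "--no-legend"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_jobs(result.stdout)


# Hypothetical sample output, not taken from the affected node.
sample = "123 kubelet.service start waiting\n124 crio.service start running\n"
print(parse_jobs(sample))  # ['kubelet.service', 'crio.service']
print(parse_jobs(""))      # []
```

An empty list from `pending_jobs()` would indicate that the job queue has drained and that the kubelet status is worth checking.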

blocks

OCPBUGS-38295 kubelet does not start after reboot due to dependency issue

Closed

clones

OCPBUGS-36258 kubelet does not start after reboot due to dependency issue

Closed

is blocked by

OCPBUGS-36258 kubelet does not start after reboot due to dependency issue

Closed

is cloned by

OCPBUGS-38295 kubelet does not start after reboot due to dependency issue

Closed

links to

openshift/machine-config-operator#4447: [release-4.14] OCPBUGS-36356: daemon/update: disable systemd unit before overwriting

RHBA-2024:4329 OpenShift Container Platform 4.14.z bug fix update


Assignee: Team MCO

Reporter: OpenShift Prow Bot

QA Contact: Sergio Regidor de la Rosa

Doc Contact: Shane Lovern

Votes: 0

Watchers: 5

Created: 2024/06/30 10:29 PM

Updated: 2024/08/09 10:18 PM

Resolved: 2024/07/11 11:55 AM
