-
Bug
-
Resolution: Can't Do
-
Normal
-
None
-
4.12.z
-
Important
-
No
-
OCPNODE Sprint 246 (Blue)
-
1
-
False
-
-
-
Description of problem:
We run a test that validates that a kdump can be generated. After provoking a kernel crash (kdump) on an SNO cluster with a Telco profile, the node rebooted as expected, but after the reboot none of the OCP pods could start.
Version-Release number of selected component (if applicable):
4.12.37
How reproducible:
It only happened once. We will try to reproduce it.
Steps to Reproduce:
1. echo 1 > /proc/sys/kernel/sysrq
2. echo "c" > /proc/sysrq-trigger
3. Wait for the node to restart
Actual results:
After the reboot, oc commands fail because the API server is unavailable, and the OCP pods are not present.
Expected results:
After reboot, OCP recovers successfully
Additional info:
System impact: A race condition during kdump generation appears to have broken the system. I did an extra reboot, but the OCP pods still cannot start.

CRI-O service status:

[core@cloudransno-site1 ~]$ sudo systemctl status crio.service
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf, 90-container-mount-namespace.conf
   Active: active (running) since Mon 2023-10-09 08:07:58 UTC; 20min ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 4986 (crio)
    Tasks: 9
   Memory: 43.0M
      CPU: 341ms
   CGroup: /system.slice/crio.service
           └─4986 /usr/bin/crio

Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.060677274Z" level=info msg="Conmon does support the --sync option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.060695173Z" level=info msg="Conmon does support the --log-global-size-max option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.066464266Z" level=info msg="Conmon does support the --sync option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.066480683Z" level=info msg="Conmon does support the --log-global-size-max option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.072012525Z" level=info msg="Conmon does support the --sync option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.072028811Z" level=info msg="Conmon does support the --log-global-size-max option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.192167568Z" level=info msg="Found CNI network multus-cni-network (type=multus) at /etc/kubernetes/cni/net.d/00-multus.conf"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.192188565Z" level=info msg="Updated default CNI network name to multus-cni-network"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.250251124Z" level=warning msg="Error encountered when checking whether cri-o should wipe containers: open /var/run/crio/version: no such file or directory"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.251933354Z" level=info msg="Serving metrics on :9537 via HTTP"

However, the file is present on the machine:

[core@cloudransno-site1 ~]$ cat /var/run/crio/version
"1.25.4-4.rhaos4.12.gitb9319a2.el8+unknown"
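A note on the wipe-check warning: /var/run is a tmpfs, so a version file written on the previous boot is gone on the first start after any reboot; the copy visible afterward would have been written by the currently running crio process. A minimal sketch of that check (assumed logic for illustration, not CRI-O's actual code; a temp dir stands in for the real /var/run/crio path):

```shell
# Hedged sketch: CRI-O records its version in a state file and reads it on the
# next start to decide whether containers should be wiped. If the file is
# absent (e.g. first start after a reboot), no comparison is possible and a
# warning is logged; the file is then rewritten for the next start.
dir=$(mktemp -d)
version_file="$dir/version"

check_wipe() {
  if [ ! -f "$version_file" ]; then
    # Mirrors the log line above: previous version unknown, nothing to compare.
    echo "warning: open $version_file: no such file or directory"
    return 1
  fi
  echo "previous version: $(cat "$version_file")"
}

check_wipe || true                 # first start after the reboot: file absent, warning printed
echo '"1.25.4"' > "$version_file"  # the running process then writes its version
check_wipe                         # a later check finds the file present
```

Under this reading, the warning alone is expected after a reboot and may be a red herring for the pod-startup failure.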