-
Bug
-
Resolution: Can't Do
-
Normal
-
None
-
4.12.z
-
Important
-
No
-
OCPNODE Sprint 246 (Blue)
-
1
-
False
-
-
-
Description of problem:
We run a test that validates that a kdump can be generated. After provoking a kernel crash (kdump) on an SNO cluster with a Telco profile, the node rebooted as expected, but after the reboot none of the OCP pods could start.
Version-Release number of selected component (if applicable):
4.12.37
How reproducible:
It only happened once. We will try to reproduce it.
Steps to Reproduce:
1. echo 1 > /proc/sys/kernel/sysrq
2. echo "c" > /proc/sysrq-trigger
3. Wait for the node to restart
Actual results:
After the reboot, oc commands fail because the API server is unavailable, and the OCP pods are not present.
Expected results:
After reboot, OCP recovers successfully
Additional info:
System impact: A race condition during kdump generation appears to have broken the system. I did an extra reboot, but the OCP pods still cannot start.

CRI-O service status:

[core@cloudransno-site1 ~]$ sudo systemctl status crio.service
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf, 90-container-mount-namespace.conf
   Active: active (running) since Mon 2023-10-09 08:07:58 UTC; 20min ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 4986 (crio)
    Tasks: 9
   Memory: 43.0M
      CPU: 341ms
   CGroup: /system.slice/crio.service
           └─4986 /usr/bin/crio

Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.060677274Z" level=info msg="Conmon does support the --sync option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.060695173Z" level=info msg="Conmon does support the --log-global-size-max option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.066464266Z" level=info msg="Conmon does support the --sync option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.066480683Z" level=info msg="Conmon does support the --log-global-size-max option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.072012525Z" level=info msg="Conmon does support the --sync option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.072028811Z" level=info msg="Conmon does support the --log-global-size-max option"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.192167568Z" level=info msg="Found CNI network multus-cni-network (type=multus) at /etc/kubernetes/cni/net.d/00-multus.conf"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.192188565Z" level=info msg="Updated default CNI network name to multus-cni-network"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.250251124Z" level=warning msg="Error encountered when checking whether cri-o should wipe containers: open /var/run/crio/version: no such file or directory"
Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.251933354Z" level=info msg="Serving metrics on :9537 via HTTP"

However, the file is present on the machine:

[core@cloudransno-site1 ~]$ cat /var/run/crio/version
"1.25.4-4.rhaos4.12.gitb9319a2.el8+unknown"
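A note on the wipe-check warning: /var/run is a tmpfs, so a version file written on the previous boot is gone on the first start after any reboot; the copy visible afterward would have been written by the currently running crio process. A minimal sketch of that check (assumed logic for illustration, not CRI-O's actual code; a temp dir stands in for the real /var/run/crio path):

```shell
# Hedged sketch: CRI-O records its version in a state file and reads it on the
# next start to decide whether containers should be wiped. If the file is
# absent (e.g. first start after a reboot), no comparison is possible and a
# warning is logged; the file is then rewritten for the next start.
dir=$(mktemp -d)
version_file="$dir/version"

check_wipe() {
  if [ ! -f "$version_file" ]; then
    # Mirrors the log line above: previous version unknown, nothing to compare.
    echo "warning: open $version_file: no such file or directory"
    return 1
  fi
  echo "previous version: $(cat "$version_file")"
}

check_wipe || true                 # first start after the reboot: file absent, warning printed
echo '"1.25.4"' > "$version_file"  # the running process then writes its version
check_wipe                         # a later check finds the file present
```

Under this reading, the warning alone is expected after a reboot and may be a red herring for the pod-startup failure.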