Data Foundation Bugs / DFBUGS-112

[2209298] Enable must-gather or provide steps to collect sosreports and/or kernel vmcore for openshift nodes


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: odf-4.18
    • Affects Version/s: odf-4.13
    • Component/s: must-gather

      +++ This bug was initially created as a clone of Bug #2159791 +++

      This BZ was created to track the following request from the BZ it was cloned from.

      As the issue involved a worker node getting stuck on potentially outstanding IO, it is requested that we have the ability to gather more data from a particular node, specifically:

      • sosreports
      • kernel vmcore (a hedged kdump sketch follows this list)
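
      For the vmcore part, one way to get there is to enable kdump on the affected RHCOS node ahead of time. The steps below are a hedged sketch, not the official procedure; the crashkernel size is an assumption and the exact values should be confirmed against the OpenShift kdump documentation.

          # Sketch: enable kdump on one RHCOS node so a vmcore is written to
          # /var/crash on the next crash. crashkernel size is an assumption.
          oc debug node/<node>                           # debug pod on the affected node
          chroot /host                                   # switch into the host filesystem
          rpm-ostree kargs --append='crashkernel=256M'   # reserve memory for the crash kernel
          systemctl enable kdump.service                 # arm kdump for the next boot
          systemctl reboot
          # If the node hangs again, a dump can be forced over the console via sysrq
          # (echo c > /proc/sysrq-trigger), assuming sysrq is enabled.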

      Context from the cloned BZ is provided below:

      — Additional comment from Ilya Dryomov on 2023-05-19 17:13:00 EDT —

      (In reply to Shyamsundar from comment #45)
      > Also, from the process failing to be killed, the stacks from /proc were from
      > when I looked at the system, which potentially leads to stuck IOs that
      > causes the process not to exit.

      I understand and I'm prepared to believe that the root cause is a system lockup caused by some stuck I/O.

      >
      > The next class of errors in dmesg occur at around
      > "2023-05-13T16:26:11,235898+00:00 TCP: request_sock_TCP: Possible SYN
      > flooding on port [::]:9283. Sending cookies."

      I'm struggling to come up with an explanation for how a pretty chatty node could suddenly go completely radio silent for over two days other than a system lockup. The usual approach to debugging these is taking at least a sosreport and often a kernel vmcore as well.

      To somewhat compensate for the lack of sosreports, we should make ODF must-gather slurp the contents of /sys/bus/rbd and /sys/kernel/debug/ceph directories, with all subdirectories, on all nodes where the kernel client can be running (== where CSI plugin pods are running). This would give us a peek into the I/O queue from the kernel client perspective. Yati, is that something you can take on?
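
      A minimal sketch of what that collection step could look like, assuming the standard openshift-storage namespace and the app=csi-rbdplugin / app=csi-cephfsplugin pod labels (both are assumptions for illustration, not a description of the actual ODF must-gather implementation):

          #!/bin/bash
          # Sketch: dump kernel RBD/CephFS client state from every node that runs a
          # CSI plugin pod. BASE_COLLECTION_PATH, namespace and labels are assumptions.
          BASE_COLLECTION_PATH="${BASE_COLLECTION_PATH:-/must-gather}"

          nodes=$(oc get pods -n openshift-storage \
                    -l 'app in (csi-rbdplugin,csi-cephfsplugin)' \
                    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u)

          for node in ${nodes}; do
              out_dir="${BASE_COLLECTION_PATH}/nodes/${node}/kernel-ceph"
              mkdir -p "${out_dir}"
              # sysfs/debugfs files report size 0, so cat each file rather than tar
              # the tree; missing directories (module not loaded) are skipped silently.
              oc debug "node/${node}" -- chroot /host sh -c '
                  for f in $(find /sys/bus/rbd /sys/kernel/debug/ceph -type f 2>/dev/null); do
                      echo "===== ${f} ====="
                      cat "${f}" 2>/dev/null
                  done' > "${out_dir}/rbd-ceph-debugfs.txt" 2> "${out_dir}/errors.log"
          done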

      >
      > So in the interim is the issue related to potential "vSphere datastore
      > filling up" causing the stuck IO? Would this show up in the OSD or other
      > Ceph logs?

      I don't see anything out of the ordinary in the OSD logs (the log level is low though). Still, the setup recovered after at least one of the monitors and one of the OSDs got replaced and the cluster pretty much restarted due to that. The restart could have cleared up some bad state or caused something to get resent – it's hard to tell at this point.

      — Additional comment from Ilya Dryomov on 2023-05-19 17:15:19 EDT —

      Hi Yati,

      We have a must-gather RFE, see the previous comment.

      — Additional comment from yati padia on 2023-05-21 23:57:10 EDT —

      Hey Ilya,

      Yeah, sure, that should be fine, but before that we need to verify whether it affects the must-gather runtime.
      We are already facing lots of issues in must-gather. Please feel free to open a bug/RFE against must-gather.
      I will take care of it.

      Thanks,
      Yati

      — Additional comment from Shyamsundar on 2023-05-23 08:08:12 EDT —

      SOS reports can be taken on a case-by-case basis using https://access.redhat.com/solutions/4387261
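
      For reference, the usual RHCOS flow from that article looks roughly like the sketch below (hedged; the exact sos invocation and output location can vary by release):

          # Hedged sketch of the RHCOS sosreport flow from the linked article.
          oc debug node/<node>      # start a debug pod on the node in question
          chroot /host              # switch into the host filesystem
          toolbox                   # launch the support-tools container
          sos report                # older releases use `sosreport`
          # The archive is typically written under /host/var/tmp on the node and can
          # then be copied off the node for attachment to the case.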

      It was applied here to generate a sosreport for a different BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2151493#c47

      I am forking the additional must-gather request into a separate BZ.

    • Assignee: Yati Padia (ypadia@redhat.com)
    • Reporter: Shyam Ranganathan (srangana@redhat.com)
    • Votes: 0
    • Watchers: 9