Data Foundation Bugs / DFBUGS-112

[2209298] Enable must-gather or provide steps to collect sosreports and/or kernel vmcore for openshift nodes


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: odf-4.18
    • Affects Version/s: odf-4.13
    • Component/s: must-gather

      +++ This bug was initially created as a clone of Bug #2159791 +++

      This BZ was created to track the following request from the BZ it was cloned from.

      As the issue involved a worker node getting stuck on potentially outstanding IO, it is requested that we have the ability to gather more data from a particular node, specifically:

      • sosreports
      • kernel vmcore (a hedged kdump sketch follows this list)
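
      For the vmcore part, one way to get there is to enable kdump on the affected RHCOS node ahead of time. The steps below are a hedged sketch, not the official procedure; the crashkernel size is an assumption and the exact values should be confirmed against the OpenShift kdump documentation.

          # Sketch: enable kdump on one RHCOS node so a vmcore is written to
          # /var/crash on the next crash. crashkernel size is an assumption.
          oc debug node/<node>                           # debug pod on the affected node
          chroot /host                                   # switch into the host filesystem
          rpm-ostree kargs --append='crashkernel=256M'   # reserve memory for the crash kernel
          systemctl enable kdump.service                 # arm kdump for the next boot
          systemctl reboot
          # If the node hangs again, a dump can be forced over the console via sysrq
          # (echo c > /proc/sysrq-trigger), assuming sysrq is enabled.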

      Context from the cloned BZ is provided below:

      — Additional comment from Ilya Dryomov on 2023-05-19 17:13:00 EDT —

      (In reply to Shyamsundar from comment #45)
      > Also, from the process failing to be killed, the stacks from /proc were from
      > when I looked at the system, which potentially leads to stuck IOs that
      > causes the process not to exit.

      I understand and I'm prepared to believe that the root cause is a system lockup caused by some stuck I/O.

      >
      > The next class of errors in dmesg occur at around
      > "2023-05-13T16:26:11,235898+00:00 TCP: request_sock_TCP: Possible SYN
      > flooding on port [::]:9283. Sending cookies."

      I'm struggling to come up with an explanation for how a pretty chatty node could suddenly go completely radio silent for over two days other than a system lockup. The usual approach to debugging these is taking at least a sosreport and often a kernel vmcore as well.

      To somewhat compensate for the lack of sosreports, we should make ODF must-gather slurp the contents of /sys/bus/rbd and /sys/kernel/debug/ceph directories, with all subdirectories, on all nodes where the kernel client can be running (== where CSI plugin pods are running). This would give us a peek into the I/O queue from the kernel client perspective. Yati, is that something you can take on?
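
      A minimal sketch of what that collection step could look like, assuming the standard openshift-storage namespace and the app=csi-rbdplugin / app=csi-cephfsplugin pod labels (both are assumptions for illustration, not a description of the actual ODF must-gather implementation):

          #!/bin/bash
          # Sketch: dump kernel RBD/CephFS client state from every node that runs a
          # CSI plugin pod. BASE_COLLECTION_PATH, namespace and labels are assumptions.
          BASE_COLLECTION_PATH="${BASE_COLLECTION_PATH:-/must-gather}"

          nodes=$(oc get pods -n openshift-storage \
                    -l 'app in (csi-rbdplugin,csi-cephfsplugin)' \
                    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u)

          for node in ${nodes}; do
              out_dir="${BASE_COLLECTION_PATH}/nodes/${node}/kernel-ceph"
              mkdir -p "${out_dir}"
              # sysfs/debugfs files report size 0, so cat each file rather than tar
              # the tree; missing directories (module not loaded) are skipped silently.
              oc debug "node/${node}" -- chroot /host sh -c '
                  for f in $(find /sys/bus/rbd /sys/kernel/debug/ceph -type f 2>/dev/null); do
                      echo "===== ${f} ====="
                      cat "${f}" 2>/dev/null
                  done' > "${out_dir}/rbd-ceph-debugfs.txt" 2> "${out_dir}/errors.log"
          done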

      >
      > So in the interim is the issue related to potential "vSphere datastore
      > filling up" causing the stuck IO? Would this show up in the OSD or other
      > Ceph logs?

      I don't see anything out of the ordinary in the OSD logs (the log level is low though). Still, the setup recovered after at least one of the monitors and one of the OSDs got replaced and the cluster pretty much restarted due to that. The restart could have cleared up some bad state or caused something to get resent – it's hard to tell at this point.

      — Additional comment from Ilya Dryomov on 2023-05-19 17:15:19 EDT —

      Hi Yati,

      We have a must-gather RFE, see the previous comment.

      — Additional comment from yati padia on 2023-05-21 23:57:10 EDT —

      Hey Ilya,

      Yeah, sure, that should be fine, but before that we need to verify whether it affects the must-gather runtime.
      We are already facing lots of issues in must-gather. Please feel free to open a bug/RFE against must-gather.
      I will take care of it.

      Thanks,
      Yati

      — Additional comment from Shyamsundar on 2023-05-23 08:08:12 EDT —

      SOS reports can be taken on a case-by-case basis using https://access.redhat.com/solutions/4387261
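
      For reference, the usual RHCOS flow from that article looks roughly like the sketch below (hedged; the exact sos invocation and output location can vary by release):

          # Hedged sketch of the RHCOS sosreport flow from the linked article.
          oc debug node/<node>      # start a debug pod on the node in question
          chroot /host              # switch into the host filesystem
          toolbox                   # launch the support-tools container
          sos report                # older releases use `sosreport`
          # The archive is typically written under /host/var/tmp on the node and can
          # then be copied off the node for attachment to the case.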

      It was applied here to generate a sosreport for a different BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2151493#c47

      I am forking the additional must-gather request into a separate BZ.

    • Assignee: Yati Padia (ypadia@redhat.com)
    • Reporter: Shyam Ranganathan (srangana@redhat.com)
    • Votes: 0
    • Watchers: 9