  1. OpenShift Bugs
  2. OCPBUGS-36412

On RHOCP 4.12.z, many "catatonit -P" processes remained and caused an OOM error


    • Bug
    • Resolution: Done
    • Major
    • 4.13.0
    • 4.12.z
    • RHCOS
    • Important
    • No
    • False

      Description of problem:

      Customers periodically collect the status of every node in the RHOCP 4.12.z cluster using "ssh core@$node", as follows:
      
      --- Example ---
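      while true
      do
        # Node names are separated by spaces; a trailing comma would become part of the name.
        for node in worker-0 worker-1 worker-2
        do
          date
          echo "== $node =="
          # Each ssh invocation opens and closes a separate login session on the node.
          ssh "core@$node" cat /proc/stat
          ssh "core@$node" free
          ssh "core@$node" df
        done
        sleep 60
      done > /var/tmp/node-status.log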
      ----
      
      On some nodes, "catatonit -P" processes were left behind (a new one roughly every few minutes), accumulated, and eventually caused an OOM error on the node. For example:
      ~~~
      0 S  1000 3431131       1 3431128 3431128  TS  19    -   251 -                    Fri Jun  7 09:55:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3435042       1 3435038 3435038  TS  19    -   251 -                    Fri Jun  7 09:59:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3438970       1 3438968 3438968  TS  19    -   251 -                    Fri Jun  7 10:03:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3442884       1 3442883 3442883  TS  19    -   251 -                    Fri Jun  7 10:07:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3446845       1 3446841 3446841  TS  19    -   251 -                    Fri Jun  7 10:11:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3450761       1 3450759 3450759  TS  19    -   251 -                    Fri Jun  7 10:15:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3454704       1 3454697 3454697  TS  19    -   251 -                    Fri Jun  7 10:19:24 2024 ?        00:00:00 catatonit -P
      0 S  1000 3457647       1 3457643 3457643  TS  19    -   251 -                    Fri Jun  7 10:22:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3458615       1 3458611 3458611  TS  19    -   251 -                    Fri Jun  7 10:23:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3461564       1 3461562 3461562  TS  19    -   251 -                    Fri Jun  7 10:26:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3462525       1 3462522 3462522  TS  19    -   251 -                    Fri Jun  7 10:27:22 2024 ?        00:00:00 catatonit -P
      0 S  1000 3465493       1 3465491 3465491  TS  19    -   251 -                    Fri Jun  7 10:30:24 2024 ?        00:00:00 catatonit -P
      0 S  1000 3466443       1 3466440 3466440  TS  19    -   251 -                    Fri Jun  7 10:31:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3469384       1 3469380 3469380  TS  19    -   251 -                    Fri Jun  7 10:34:23 2024 ?        00:00:00 catatonit -P
      0 S  1000 3470343       1 3470341 3470341  TS  19    -   251 -                    Fri Jun  7 10:35:23 2024 ?        00:00:00 catatonit -P
      ~~~
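To track how many of these processes have accumulated on a node, a small helper along the following lines may be useful. This is only a sketch: the function name `count_orphaned_catatonit` is hypothetical, and it assumes, as in the listing above, that the leftover processes have been reparented to PID 1 and appear with the command line `catatonit -P`.

```shell
# Count "catatonit -P" processes whose parent is PID 1.
# Expects lines in the format produced by `ps -eo pid,ppid,args` on stdin
# (columns: PID, PPID, command, arguments).
count_orphaned_catatonit() {
  awk '$2 == 1 && $3 ~ /(^|\/)catatonit$/ && $4 == "-P" { n++ }
       END { print n + 0 }'
}
```

On a node, it could be run as `ps -eo pid,ppid,args | count_orphaned_catatonit`; a count that keeps growing between monitoring cycles would match the behavior described here.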
      

      Version-Release number of selected component (if applicable):

          4.12.z (never happened on 4.13.z or later)

      How reproducible:

          See the description field

      Steps to Reproduce:

        See the description field
        However, reproduction appears to depend on timing.
          

      Actual results:

        After an ssh login, a "catatonit -P" process may remain.

      Expected results:

         After an ssh login, the "catatonit -P" process is killed.

      Additional info:

          

              Unassigned
              Hideshi Fukumoto (rhn-support-hfukumot)
              Michael Nguyen