-
Bug
-
Resolution: Done
-
Major
-
4.12.z
-
Important
-
No
-
False
-
Description of problem:
Customers periodically collect the status of every node in an RHOCP 4.12.z cluster over "ssh core@$node", as follows:

--- Example ---
while true
do
  for node in worker-0 worker-1 worker-2
  do
    date
    echo == $node ==
    ssh core@$node cat /proc/stat
    ssh core@$node free
    ssh core@$node df
  done
  sleep 60
done > /var/tmp/node-status.log
----

On some nodes, many "catatonit -P" processes were left behind (one every few minutes) and eventually caused an OOM error on the node. For example:

~~~
0 S 1000 3431131 1 3431128 3431128 TS 19 - 251 - Fri Jun 7 09:55:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3435042 1 3435038 3435038 TS 19 - 251 - Fri Jun 7 09:59:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3438970 1 3438968 3438968 TS 19 - 251 - Fri Jun 7 10:03:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3442884 1 3442883 3442883 TS 19 - 251 - Fri Jun 7 10:07:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3446845 1 3446841 3446841 TS 19 - 251 - Fri Jun 7 10:11:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3450761 1 3450759 3450759 TS 19 - 251 - Fri Jun 7 10:15:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3454704 1 3454697 3454697 TS 19 - 251 - Fri Jun 7 10:19:24 2024 ? 00:00:00 catatonit -P
0 S 1000 3457647 1 3457643 3457643 TS 19 - 251 - Fri Jun 7 10:22:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3458615 1 3458611 3458611 TS 19 - 251 - Fri Jun 7 10:23:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3461564 1 3461562 3461562 TS 19 - 251 - Fri Jun 7 10:26:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3462525 1 3462522 3462522 TS 19 - 251 - Fri Jun 7 10:27:22 2024 ? 00:00:00 catatonit -P
0 S 1000 3465493 1 3465491 3465491 TS 19 - 251 - Fri Jun 7 10:30:24 2024 ? 00:00:00 catatonit -P
0 S 1000 3466443 1 3466440 3466440 TS 19 - 251 - Fri Jun 7 10:31:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3469384 1 3469380 3469380 TS 19 - 251 - Fri Jun 7 10:34:23 2024 ? 00:00:00 catatonit -P
0 S 1000 3470343 1 3470341 3470341 TS 19 - 251 - Fri Jun 7 10:35:23 2024 ? 00:00:00 catatonit -P
~~~
Version-Release number of selected component (if applicable):
4.12.z (never happened on 4.13.z or later)
How reproducible:
See the description field
Steps to Reproduce:
See the description field. However, reproduction appears to be timing-dependent.
Actual results:
After an SSH login, a "catatonit -P" process may remain.
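
To check whether a node is affected, the leftover processes can be counted from a process listing. The following is only a minimal sketch: the helper name count_leftover_catatonit is hypothetical (not part of any product tooling), and it assumes the listing's last column is the command line, as in the output shown in the description.

```shell
# Hypothetical helper: reads a process listing on stdin and prints the
# number of lines whose command is exactly "catatonit -P" (last column).
count_leftover_catatonit() {
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so "|| true" keeps the helper from failing in that case.
  grep -c -- 'catatonit -P[[:space:]]*$' || true
}
```

A possible usage, assuming the same node names as in the example loop, is `ssh core@worker-0 ps -u core -o pid,lstart,args | count_leftover_catatonit`. A count that keeps growing between runs would be consistent with the behaviour reported here.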
Expected results:
After an SSH login, the "catatonit -P" process is killed.
Additional info: