Type: Bug
Resolution: Unresolved
We are finding cases where dlrn processes get stuck randomly (not very often, maybe every two weeks or so). These are the symptoms we found in one specific case (which does not mean it is always the same):
- Main dlrn process waiting and child processes defunct (see the sketch after this list):
centos9+ 1761607 1761424 0 01:25 ? 00:00:00 /bin/bash /usr/local/bin/run-dlrn.sh
centos9+ 1761634 1761607 0 01:25 ? 00:00:01 /home/centos9-antelope/.venv/bin/python /home/centos9-antelope/.venv/bin/dlrn
centos9+ 1761707 1761634 0 01:25 ? 00:00:00 /home/centos9-antelope/.venv/bin/python -c from multiprocessing.resource_track
centos9+ 1761711 1761634 0 01:25 ? 00:00:04 [python] <defunct>
centos9+ 1761713 1761634 0 01:25 ? 00:00:03 [python] <defunct>
centos9+ 1761715 1761634 0 01:25 ? 00:00:03 [python] <defunct>
centos9+ 1761716 1761634 0 01:25 ? 00:00:03 [python] <defunct>
- In this case there are log entries showing problems connecting to the database:
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '38.102.83.62' ([Errno 104] Connection reset by peer)")
- On the database server there are also MariaDB logs of aborted connections:
2024-08-08 1:26:16 80241468 [Warning] Aborted connection 80241468 to db: 'dlrn_centos9_antelope' user: 'centos9-antelope' host: '38.129.56.237' (Got an error reading communication packets)
2024-08-08 1:26:16 80241477 [Warning] Aborted connection 80241477 to db: 'dlrn_centos9_antelope' user: 'centos9-antelope' host: '38.129.56.237' (Got an error reading communication packets)
2024-08-08 1:26:16 80241438 [Warning] Aborted connection 80241438 to db: 'dlrn_centos9_antelope' user: 'centos9-antelope' host: '38.129.56.237' (Got an error reading communication packets)
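
For context on the first symptom, here is a minimal, hypothetical sketch (not dlrn's actual code) of how a multiprocessing parent can end up blocked forever with <defunct> children when the workers die before delivering their results:

import multiprocessing as mp

def worker(task, results):
    # Stand-in for the real build work; assume it hits a DB error and the
    # process dies without ever putting its result on the queue.
    raise RuntimeError("simulated: connection reset by peer")

if __name__ == "__main__":
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(i, results)) for i in range(4)]
    for p in procs:
        p.start()
    # The children exit right away and stay <defunct> because they are never
    # joined/reaped, while the parent blocks here forever waiting for results
    # that will never arrive, which matches the ps output above.
    collected = [results.get() for _ in procs]

If something similar is happening in dlrn, the parent would need to collect results with a timeout and join/reap its children when a worker dies instead of waiting indefinitely.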
In the Prometheus metrics I cannot find any abnormal metric at that time (the timezone of the logs is EDT).
There are two different fronts we can work on:
- Identify the root cause of the connection aborts; it may be something in the db-server, the trunk-builder, or even the network. I see there are many connections in TIME_WAIT, which could be optimized, but I doubt that is the real issue.
- Improve DLRN behavior to handle this kind of issue: DLRN processes should just report the error and die in this situation (see the sketch after this list).
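
On the second point, here is a hedged sketch of the fail-fast direction, assuming the SQLAlchemy/pymysql setup shown in the traceback above; the URL, helper name, and settings are illustrative, not dlrn's actual code:

import sys
import logging

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

log = logging.getLogger("dlrn")

# pool_pre_ping tests each pooled connection before use and transparently
# replaces ones the server has already dropped; pool_recycle avoids reusing
# connections older than the server-side timeout.
engine = create_engine(
    "mysql+pymysql://user:password@db-host/dlrn_centos9_antelope",  # placeholder URL
    pool_pre_ping=True,
    pool_recycle=3600,
)

def run_query_or_die(statement):
    """Report the DB error and exit instead of leaving the process hung."""
    try:
        with engine.connect() as conn:
            return conn.execute(text(statement)).fetchall()
    except OperationalError:
        log.exception("Lost the database connection, aborting this run")
        sys.exit(1)

pool_pre_ping and pool_recycle only mitigate stale pooled connections; the important part for the hang is that an OperationalError produces a logged error and a non-zero exit instead of workers dying silently.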
One more consideration: we are now monitoring this situation and can handle it manually by killing the dlrn processes.
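
Until the root cause is fixed, the manual handling could be semi-automated. A small, hypothetical watchdog sketch (names and behavior are assumptions, not existing dlrn tooling) that lists the defunct children of a given dlrn PID so a cron job or alert can act on it:

import subprocess

def defunct_children(parent_pid):
    """Return PIDs of zombie (<defunct>) children of parent_pid on Linux."""
    out = subprocess.run(
        ["ps", "-o", "pid=,stat=", "--ppid", str(parent_pid)],
        capture_output=True, text=True, check=False,
    ).stdout
    zombies = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        pid, stat = fields[0], fields[1]
        if stat.startswith("Z"):  # 'Z' marks a zombie/<defunct> process
            zombies.append(int(pid))
    return zombies

If all children of the main dlrn process show up as zombies for more than a few minutes, the whole process tree can be killed and the run retried.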