Type: Bug
Resolution: Unresolved
We are finding cases where dlrn processes get stuck randomly (not very often, maybe every two weeks or so). These are the symptoms we found in one specific case (which does not mean it is always the same):
- Main dlrn process waiting and child processes defunct (see the sketch after this list):
centos9+ 1761607 1761424 0 01:25 ? 00:00:00 /bin/bash /usr/local/bin/run-dlrn.sh
centos9+ 1761634 1761607 0 01:25 ? 00:00:01 /home/centos9-antelope/.venv/bin/python /home/centos9-antelope/.venv/bin/dlrn
centos9+ 1761707 1761634 0 01:25 ? 00:00:00 /home/centos9-antelope/.venv/bin/python -c from multiprocessing.resource_track
centos9+ 1761711 1761634 0 01:25 ? 00:00:04 [python] <defunct>
centos9+ 1761713 1761634 0 01:25 ? 00:00:03 [python] <defunct>
centos9+ 1761715 1761634 0 01:25 ? 00:00:03 [python] <defunct>
centos9+ 1761716 1761634 0 01:25 ? 00:00:03 [python] <defunct>
- In this case there are log entries showing problems connecting to the database:
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '38.102.83.62' ([Errno 104] Connection reset by peer)")
- On the database server there are also MariaDB logs of aborted connections:
2024-08-08 1:26:16 80241468 [Warning] Aborted connection 80241468 to db: 'dlrn_centos9_antelope' user: 'centos9-antelope' host: '38.129.56.237' (Got an error reading communication packets)
2024-08-08 1:26:16 80241477 [Warning] Aborted connection 80241477 to db: 'dlrn_centos9_antelope' user: 'centos9-antelope' host: '38.129.56.237' (Got an error reading communication packets)
2024-08-08 1:26:16 80241438 [Warning] Aborted connection 80241438 to db: 'dlrn_centos9_antelope' user: 'centos9-antelope' host: '38.129.56.237' (Got an error reading communication packets)
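
For context on the first symptom, here is a minimal, hypothetical sketch (not dlrn's actual code) of how a multiprocessing parent can end up blocked forever with <defunct> children when the workers die before delivering their results:

import multiprocessing as mp

def worker(task, results):
    # Stand-in for the real build work; assume it hits a DB error and the
    # process dies without ever putting its result on the queue.
    raise RuntimeError("simulated: connection reset by peer")

if __name__ == "__main__":
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(i, results)) for i in range(4)]
    for p in procs:
        p.start()
    # The children exit right away and stay <defunct> because they are never
    # joined/reaped, while the parent blocks here forever waiting for results
    # that will never arrive, which matches the ps output above.
    collected = [results.get() for _ in procs]

If something similar is happening in dlrn, the parent would need to collect results with a timeout and join/reap its children when a worker dies instead of waiting indefinitely.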
In the Prometheus metrics I cannot find any abnormal metric at that time (the timezone of the logs is EDT).
There are two different fronts we can work on:
- Identify the root cause of the connection aborts; it may be something in the db-server, the trunk-builder, or even the network. I see there are many connections in TIME_WAIT, which could be optimized, but I doubt that is the real issue.
- Improve DLRN behavior to handle this kind of issue: DLRN processes should just report the error and die in this situation (see the sketch after this list).
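
On the second point, here is a hedged sketch of the fail-fast direction, assuming the SQLAlchemy/pymysql setup shown in the traceback above; the URL, helper name, and settings are illustrative, not dlrn's actual code:

import sys
import logging

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

log = logging.getLogger("dlrn")

# pool_pre_ping tests each pooled connection before use and transparently
# replaces ones the server has already dropped; pool_recycle avoids reusing
# connections older than the server-side timeout.
engine = create_engine(
    "mysql+pymysql://user:password@db-host/dlrn_centos9_antelope",  # placeholder URL
    pool_pre_ping=True,
    pool_recycle=3600,
)

def run_query_or_die(statement):
    """Report the DB error and exit instead of leaving the process hung."""
    try:
        with engine.connect() as conn:
            return conn.execute(text(statement)).fetchall()
    except OperationalError:
        log.exception("Lost the database connection, aborting this run")
        sys.exit(1)

pool_pre_ping and pool_recycle only mitigate stale pooled connections; the important part for the hang is that an OperationalError produces a logged error and a non-zero exit instead of workers dying silently.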
One more consideration: we are now monitoring this situation and can handle it manually by killing the dlrn processes.
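
Until the root cause is fixed, the manual handling could be semi-automated. A small, hypothetical watchdog sketch (names and behavior are assumptions, not existing dlrn tooling) that lists the defunct children of a given dlrn PID so a cron job or alert can act on it:

import subprocess

def defunct_children(parent_pid):
    """Return PIDs of zombie (<defunct>) children of parent_pid on Linux."""
    out = subprocess.run(
        ["ps", "-o", "pid=,stat=", "--ppid", str(parent_pid)],
        capture_output=True, text=True, check=False,
    ).stdout
    zombies = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        pid, stat = fields[0], fields[1]
        if stat.startswith("Z"):  # 'Z' marks a zombie/<defunct> process
            zombies.append(int(pid))
    return zombies

If all children of the main dlrn process show up as zombies for more than a few minutes, the whole process tree can be killed and the run retried.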