-
Bug
-
Resolution: Done-Errata
-
Normal
-
None
-
389-ds-base-3.1.3-2.el10
-
No
-
Moderate
-
rhel-idm-ds
-
0
-
False
-
False
-
-
No
-
None
-
Pass
-
RegressionOnly
-
Release Note Not Required
-
Description of problem:
We had an RHDS hang while running an online backup task together with an automember_rebuild task.
Version-Release number of selected component (if applicable):
RHDS 10.4
How reproducible:
The scenario is an online backup task running while an automember_rebuild task is also running.
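Steps to Reproduce:
1. Start an online backup task.
2. While the backup is still running, run an automember_rebuild task.
Actual results:
The server hangs.
Expected results:
Both tasks complete and the server does not hang.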
Additional info:
A pstack taken during the hang shows these 2 threads:
Thread 2 is running a backup. It writes the CL RUV to the database, so it holds the CL RUV lock. The CL RUV in the database is on pages already locked by Thread 4.
#0 0x00007f18b21a7a35 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1 0x00007f18aac99903 in __db_hybrid_mutex_suspend () at /lib64/libdb-5.3.so
#2 0x00007f18aac98c50 in __db_tas_mutex_lock () at /lib64/libdb-5.3.so
#3 0x00007f18aad4334a in __lock_get_internal () at /lib64/libdb-5.3.so
#4 0x00007f18aad43e30 in __lock_get () at /lib64/libdb-5.3.so
...
#12 0x00007f18a5d80815 in _cl5CheckCSNinCL () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#13 0x00007f18a5db7705 in ruv_enumerate_elements () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#14 0x00007f18a5d80eb9 in _cl5WriteRUV () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
...
#19 0x00007f18b4a7f891 in task_backup_thread () at /usr/lib64/dirsrv/libslapd.so.0
Thread 4 is doing an automember rebuild. It triggers internal MODs, all of them under a transaction that holds many database locks. Finally, the MODs are logged into the replication changelog. During changelog logging, it updates the CL RUV, whose lock is held by Thread 2.
#0 0x00007f18b21a739e in pthread_rwlock_wrlock () at /lib64/libpthread.so.0
#1 0x00007f18a5db71c8 in ruv_set_csns () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#2 0x00007f18a5d8124e in _cl5UpdateRUV () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#3 0x00007f18a5d859cb in cl5WriteOperationTxn () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#4 0x00007f18a5da3be8 in write_changelog_and_ruv () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#5 0x00007f18a5da4f0d in multimaster_mmr_postop () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
...
#35 0x00007f18a6697b25 in ldbm_back_modify () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
...
#39 0x00007f18a95be15b in automember_rebuild_task_thread () at /usr/lib64/dirsrv/plugins/libautomember-plugin.so
This is a deadlock scenario: the 2 threads acquire the locks (DB pages / RUV lock) in opposite order. Since both tasks are doing updates, deadlock detection probably cannot help.
As a consequence, we see a lot of threads hung, waiting on the DB lock:
#0 0x00007f18b21a7a35 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1 0x00007f18b27fe483 in PR_EnterMonitor () at /lib64/libnspr4.so
#2 0x00007f18a665b0a6 in dblayer_txn_begin () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007f18a665b10d in dblayer_plugin_begin () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
#4 0x00007f18b49fb40e in slapi_back_transaction_begin () at /usr/lib64/dirsrv/libslapd.so.0
#5 0x00007f18a95be27a in automember_rebuild_task_thread () at /usr/lib64/dirsrv/plugins/libautomember-plugin.so
#6 0x00007f18b2803bfb in _pt_root () at /lib64/libnspr4.so
#7 0x00007f18b21a3ea5 in start_thread () at /lib64/libpthread.so.0
#8 0x00007f18b184f8cd in clone () at /lib64/libc.so.6
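The lock-order inversion described above can be illustrated with a minimal, self-contained sketch. This is not 389-ds-base code: `db_pages` and `ruv_lock` are hypothetical stand-ins for the DB page locks and the CL RUV lock, and the thread names only mirror the two tasks in this report. Timeouts replace the indefinite wait so the demonstration terminates. The second function shows the usual remedy of taking the locks in one agreed order; whether the actual fix takes that approach is not stated in this report.

```python
import threading


def demo_inversion(timeout=0.2):
    """Two threads take two locks in opposite order; neither gets its second lock."""
    db_pages = threading.Lock()   # hypothetical stand-in for the DB page locks
    ruv_lock = threading.Lock()   # hypothetical stand-in for the CL RUV lock
    barrier = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            barrier.wait()        # both threads now hold their first lock
            ok = second.acquire(timeout=timeout)
            if ok:
                second.release()
            got_second[name] = ok
            barrier.wait()        # keep the first lock until both attempts are over

    threads = [
        # "backup" holds the RUV lock, then needs the DB pages
        threading.Thread(target=worker, args=("backup", ruv_lock, db_pages)),
        # "rebuild" holds the DB pages, then needs the RUV lock
        threading.Thread(target=worker, args=("rebuild", db_pages, ruv_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second


def demo_ordered():
    """Both threads take the locks in the same order, so both can finish."""
    db_pages = threading.Lock()
    ruv_lock = threading.Lock()
    finished = []

    def worker(name):
        with db_pages:            # always first
            with ruv_lock:        # always second
                finished.append(name)

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("backup", "rebuild")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(finished)


if __name__ == "__main__":
    print(demo_inversion())
    print(demo_ordered())
```

In `demo_inversion`, each thread fails to obtain its second lock because the other thread still holds it, which is the same shape as Thread 2 and Thread 4 above. Without the timeout, both would wait forever.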
Links to: RHBA-2025:151590 (389-ds-base update)