RHEL / RHEL-80253

RHDS hangs when an online backup runs concurrently with an automember_rebuild task


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • rhel-10.1
    • None
    • 389-ds-base
    • 389-ds-base-3.1.3-2.el10
    • No
    • Moderate
    • rhel-idm-ds
    • 0
    • False
    • False
    • None
    • No
    • None
    • Release Note Not Required
    • Unspecified
    • Unspecified
    • Unspecified
    • None

      Description of problem:

      We had an RHDS hang while an online backup task was running at the same time as an automember_rebuild task.

      Version-Release number of selected component (if applicable):

      RHDS 10.4

      How reproducible:

      Run an online backup task while an automember_rebuild task is also running.

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:

      Expected results:

      Additional info:

      A pstack taken during the hang shows these two threads:

      Thread 2 is running the backup. It writes the changelog (CL) RUV to the database, so it holds the CL RUV lock. The database pages that store the CL RUV are already locked by Thread 4.

      #0 0x00007f18b21a7a35 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
      #1 0x00007f18aac99903 in __db_hybrid_mutex_suspend () at /lib64/libdb-5.3.so
      #2 0x00007f18aac98c50 in __db_tas_mutex_lock () at /lib64/libdb-5.3.so
      #3 0x00007f18aad4334a in __lock_get_internal () at /lib64/libdb-5.3.so
      #4 0x00007f18aad43e30 in __lock_get () at /lib64/libdb-5.3.so
      ...
      #12 0x00007f18a5d80815 in _cl5CheckCSNinCL () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
      #13 0x00007f18a5db7705 in ruv_enumerate_elements () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
      #14 0x00007f18a5d80eb9 in _cl5WriteRUV () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
      ...
      #19 0x00007f18b4a7f891 in task_backup_thread () at /usr/lib64/dirsrv/libslapd.so.0

      Thread 4 is running the automember rebuild. It triggers internal MODs, all performed under a single transaction that holds many database locks. The MODs are then logged into the replication changelog, and while logging them it must update the CL RUV, whose lock is held by Thread 2.

      #0 0x00007f18b21a739e in pthread_rwlock_wrlock () at /lib64/libpthread.so.0
      #1 0x00007f18a5db71c8 in ruv_set_csns () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
      #2 0x00007f18a5d8124e in _cl5UpdateRUV () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
      #3 0x00007f18a5d859cb in cl5WriteOperationTxn () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
      #4 0x00007f18a5da3be8 in write_changelog_and_ruv () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
      #5 0x00007f18a5da4f0d in multimaster_mmr_postop () at /usr/lib64/dirsrv/plugins/libreplication-plugin.so
      ...
      #35 0x00007f18a6697b25 in ldbm_back_modify () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
      ...
      #39 0x00007f18a95be15b in automember_rebuild_task_thread () at /usr/lib64/dirsrv/plugins/libautomember-plugin.so
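The two lock orders described for Thread 2 and Thread 4 can be reproduced in miniature with plain POSIX threads. This is a hypothetical model, not 389-ds-base code: `ruv_lock` stands in for the CL RUV rwlock taken in `ruv_set_csns()`, `db_page_lock` is an ordinary mutex standing in for the Berkeley DB page locks, and the backup thread is assumed to hold the RUV lock in read mode (either mode blocks the writer). Timed lock calls replace the real blocking waits so the program reports the deadlock instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-ins, not 389-ds-base code:
 *   ruv_lock     ~ the CL RUV rwlock (ruv_set_csns() takes it for writing)
 *   db_page_lock ~ the Berkeley DB page locks on the CL RUV pages */
static pthread_rwlock_t ruv_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t db_page_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t hold_first;   /* both threads hold their first lock */
static pthread_barrier_t tried_second; /* both timed attempts have finished */

/* Absolute CLOCK_REALTIME deadline `ms` milliseconds from now. */
static struct timespec deadline_ms(long ms)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec += 1;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}

/* Thread 2 order: RUV lock first, then the DB pages. */
static void *backup_thread(void *arg)
{
    int *timed_out = arg;
    pthread_rwlock_rdlock(&ruv_lock); /* read mode is an assumption */
    pthread_barrier_wait(&hold_first);
    struct timespec ts = deadline_ms(500);
    int rc = pthread_mutex_timedlock(&db_page_lock, &ts);
    *timed_out = (rc == ETIMEDOUT);
    pthread_barrier_wait(&tried_second); /* keep holding until both tried */
    if (rc == 0)
        pthread_mutex_unlock(&db_page_lock);
    pthread_rwlock_unlock(&ruv_lock);
    return NULL;
}

/* Thread 4 order: DB pages first (inside the txn), then the RUV lock. */
static void *rebuild_thread(void *arg)
{
    int *timed_out = arg;
    pthread_mutex_lock(&db_page_lock);
    pthread_barrier_wait(&hold_first);
    struct timespec ts = deadline_ms(500);
    int rc = pthread_rwlock_timedwrlock(&ruv_lock, &ts);
    *timed_out = (rc == ETIMEDOUT);
    pthread_barrier_wait(&tried_second); /* keep holding until both tried */
    if (rc == 0)
        pthread_rwlock_unlock(&ruv_lock);
    pthread_mutex_unlock(&db_page_lock);
    return NULL;
}

/* Returns 1 if both threads timed out waiting for the other's lock. */
int run_lock_order_inversion(void)
{
    pthread_t t2, t4;
    int t2_timed_out = 0, t4_timed_out = 0;
    pthread_barrier_init(&hold_first, NULL, 2);
    pthread_barrier_init(&tried_second, NULL, 2);
    int rc2 = pthread_create(&t2, NULL, backup_thread, &t2_timed_out);
    int rc4 = pthread_create(&t4, NULL, rebuild_thread, &t4_timed_out);
    assert(rc2 == 0 && rc4 == 0);
    pthread_join(t2, NULL);
    pthread_join(t4, NULL);
    pthread_barrier_destroy(&hold_first);
    pthread_barrier_destroy(&tried_second);
    return t2_timed_out && t4_timed_out;
}
```

Calling `run_lock_order_inversion()` should return 1 after roughly half a second: each thread keeps holding the lock the other needs until both timed attempts have expired, which is the same cycle the pstack shows. Compile with `-pthread`.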

      This is a deadlock: the two threads acquire the same two locks (DB pages and the CL RUV lock) in opposite order. Since both tasks are performing updates, deadlock detection probably cannot resolve it.
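One plausible reason deadlock detection cannot help is that the cycle spans two different lock mechanisms. Berkeley DB's deadlock detector only examines waits recorded in its own lock table, whereas Thread 4 is blocked in `pthread_rwlock_wrlock()` on the RUV lock, which Berkeley DB knows nothing about. The toy waits-for graph below is a hypothetical model, not Berkeley DB's detector; `struct wait_edge`, `has_deadlock()` and the `db_lock` flag are illustrative names. With every edge the graph contains the cycle 2 → 4 → 2, but restricted to DB-lock edges it has none, so a detector with only that view finds nothing to break.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_THREADS 8

/* One waits-for edge: `waiter` is blocked on a lock currently held by `holder`. */
struct wait_edge {
    int waiter;
    int holder;
    bool db_lock; /* true if the lock lives in the Berkeley DB lock table */
};

/* Depth-first search from `node`; state: 0 = unvisited, 1 = on path, 2 = done.
 * If db_only is true, only edges visible to the DB lock manager are followed. */
static bool visit(const struct wait_edge *e, int n, bool db_only,
                  int node, int *state)
{
    state[node] = 1;
    for (int i = 0; i < n; i++) {
        if (e[i].waiter != node || (db_only && !e[i].db_lock))
            continue;
        int next = e[i].holder;
        assert(next >= 0 && next < MAX_THREADS);
        if (state[next] == 1)
            return true; /* back edge: a waits-for cycle, i.e. a deadlock */
        if (state[next] == 0 && visit(e, n, db_only, next, state))
            return true;
    }
    state[node] = 2;
    return false;
}

/* Returns true if the waits-for graph (optionally DB edges only) has a cycle. */
bool has_deadlock(const struct wait_edge *e, int n, bool db_only)
{
    int state[MAX_THREADS] = {0};
    for (int t = 0; t < MAX_THREADS; t++)
        if (state[t] == 0 && visit(e, n, db_only, t, state))
            return true;
    return false;
}
```

With the two edges from the pstack (thread 2 waits on a DB page lock held by thread 4; thread 4 waits on the RUV lock held by thread 2), `has_deadlock(edges, 2, false)` is true while `has_deadlock(edges, 2, true)` is false.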

      As a consequence, many other threads hang while trying to begin a database transaction, for example:

      #0 0x00007f18b21a7a35 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
      #1 0x00007f18b27fe483 in PR_EnterMonitor () at /lib64/libnspr4.so
      #2 0x00007f18a665b0a6 in dblayer_txn_begin () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
      #3 0x00007f18a665b10d in dblayer_plugin_begin () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
      #4 0x00007f18b49fb40e in slapi_back_transaction_begin () at /usr/lib64/dirsrv/libslapd.so.0
      #5 0x00007f18a95be27a in automember_rebuild_task_thread () at /usr/lib64/dirsrv/plugins/libautomember-plugin.so
      #6 0x00007f18b2803bfb in _pt_root () at /lib64/libnspr4.so
      #7 0x00007f18b21a3ea5 in start_thread () at /lib64/libpthread.so.0
      #8 0x00007f18b184f8cd in clone () at /lib64/libc.so.6

              rhn-engineering-mareynol Mark Reynolds
              rhn-support-rmarigny Renaud Marigny (Inactive)
              IdM DS Dev
              IdM DS QE
              Evgenia Martyniuk
              Votes: 0
              Watchers: 6
