mod_cluster / MODCLUSTER-398

mod_cluster deadlock in a JBoss/Windows environment


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 1.2.9.Final, 1.3.1.Alpha1
    • Affects Version/s: 1.2.6.Final
    • None
    • None

      Under load, Apache stops serving pages, with all threads stuck in the "W : Sending reply" state. Using Windows Process Explorer, we captured a stack trace from one of the hanging threads. We don't have debug symbols, but it's easy enough to see what's happening:

      ntoskrnl.exe!KeWaitForMultipleObjects+0xc0a
      ntoskrnl.exe!KeAcquireSpinLockAtDpcLevel+0x732
      ntoskrnl.exe!KeWaitForMutexObject+0x19f
      ntoskrnl.exe!NtDeleteFile+0x3c4
      ntoskrnl.exe!PsDereferenceKernelStack+0x35358
      ntoskrnl.exe!KeSynchronizeExecution+0x3a23
      ntdll.dll!ZwLockFile+0xa
      KERNELBASE.dll!LockFileEx+0xb2
      kernel32.dll!LockFileEx+0x1b
      libapr-1.dll!apr_file_lock+0x69 <-- here
      mod_slotmem.so+0x1318 <-- here
      mod_manager.so+0x2a11 <-- here
      mod_proxy_cluster.so+0x679e
      mod_proxy.so!proxy_run_post_request+0x4e
      mod_proxy.so!proxy_run_request_status+0x924
      libhttpd.dll!ap_run_handler+0x35
      libhttpd.dll!ap_invoke_handler+0x114
      libhttpd.dll!ap_die+0x2ea
      libhttpd.dll!ap_psignature+0x1ae8
      libhttpd.dll!ap_run_process_connection+0x35
      libhttpd.dll!ap_process_connection+0x3b
      libhttpd.dll!ap_regkey_value_remove+0x136e
      msvcrt.dll!srand+0x93
      msvcrt.dll!ftime64_s+0x1dd
      kernel32.dll!BaseThreadInitThunk+0xd
      ntdll.dll!RtlUserThreadStart+0x21

      So mod_manager is requesting a file lock on one of the lock files in the MemManagerFile path. In this case it was the "manager.sessionid.sessionid.lock" file. Removing the lock file fixed the problem.
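
      To make the symptom concrete, here is a minimal sketch of how a lock file can be probed with a non-blocking lock instead of a blocking one. It is a POSIX analogue written in Python using fcntl.flock, not the APR/LockFileEx path shown in the stack trace, and the file name and the helper names (try_lock, unlock, demo) are hypothetical. A blocking request in place of the second try_lock call would wait for as long as the first holder keeps the lock.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive lock without blocking.

    Returns the open file descriptor on success, or None if some other
    open file description already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def unlock(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


def demo():
    # Hypothetical lock file; it only stands in for a file under MemManagerFile.
    path = os.path.join(tempfile.mkdtemp(), "example.lock")

    holder = try_lock(path)    # first holder obtains the lock
    blocked = try_lock(path)   # second request is refused instead of hanging
    unlock(holder)

    retry = try_lock(path)     # succeeds once the holder has released
    retry_ok = retry is not None
    if retry_ok:
        unlock(retry)
    return holder is not None, blocked is None, retry_ok


if __name__ == "__main__":
    print(demo())
```

      A probe like this only reports that the lock is held; it does not say which process or thread holds it.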

      When bisecting the mod_cluster code, I think commit "74eeb9c026380deb8d833be53b09b3d808e02d10 - Lock in insert-update" in version 1.2.2 is the culprit. This would also explain why mod_cluster 1.2.1 is the last known working version.

      What we don't know is which process already holds the lock when all Apache threads start blocking on it; we are trying to figure that out. There are no obviously wrong lock/unlock slotmem call pairs in the mod_manager module, and as far as we can see no lock is requested while another is held. Therefore our best guess is a deadlock involving a thread that already holds the globalmutex_lock in combination with the slotmem file locks, but that is only a guess until we can debug it.
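
      The kind of deadlock guessed at above can be illustrated with a small, self-contained sketch. It is not taken from mod_cluster and is not a diagnosis: two Python threading.Lock objects stand in for the global mutex and a slotmem file lock, and the two workers take them in opposite order. All names are hypothetical, and a timeout is used only so the demonstration terminates instead of hanging.

```python
import threading


def demo(timeout=0.5):
    global_mutex = threading.Lock()  # stands in for the global mutex
    file_lock = threading.Lock()     # stands in for a slotmem file lock
    barrier = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            # Wait until both threads hold their first lock.
            barrier.wait()
            # With a blocking acquire this call would never return.
            got = second.acquire(timeout=timeout)
            results[name] = got
            # Keep holding the first lock until both attempts have finished,
            # so the outcome does not depend on thread timing.
            barrier.wait()
            if got:
                second.release()

    a = threading.Thread(target=worker, args=("A", global_mutex, file_lock))
    b = threading.Thread(target=worker, args=("B", file_lock, global_mutex))
    a.start()
    b.start()
    a.join()
    b.join()
    return results


if __name__ == "__main__":
    print(demo())
```

      Neither worker obtains its second lock, because each is waiting for the lock the other one holds. Taking the locks in the same order in every code path avoids this particular pattern.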

      More context can be found here: https://bugzilla.redhat.com/show_bug.cgi?id=1080047

              Assignee: Jean-Frederic Clere (rhn-engineering-jclere)
              Reporter: Marc Maurer (uwog_jira, inactive)