Uploaded image for project: 'Fast Datapath Product'
  1. Fast Datapath Product
  2. FDP-1493 deadlock between time_init() and seq_wait() in test-barrier unit test
  3. FDP-1526

[RHEL-9 OVS-3.4] deadlock between time_init() and seq_wait() in test-barrier unit test

    • Icon: Sub-task Sub-task
    • Resolution: Unresolved
    • Icon: Undefined Undefined
    • None
    • None
    • openvswitch3.4
    • None
    • 0
    • False
    • Hide

      None

      Show
      None
    • False
    • openvswitch3.4-3.4.3-84.el9fdp
    • rhel-9
    • rhel-net-ovs-dpdk
    • ssg_networking

       Problem Description: Clearly explain the issue.

      There is an ABBA deadlock between time_init() and seq_wait() in OVS:

      1. Thread 1:
        1. poll_block()
        2. time_poll()
        3. time_init()
        4. pthread_once() <-- lock A
        5. do_time_init()
        6. seq_create()
        7. pthread_mutex_lock(seq_mutex) <-- lock B
      2. Thread 2:
        1. seq_wait(different seqno)
        2. pthread_mutex_lock(seq_mutex) <-- lock B
        3. poll_immediate_wake()
        4. poll_timer_wait()
        5. time_msec()
        6. time_init()
        7. pthread_once() <-- lock A

      I believe, this is the same deadlock upstream Intel CI saw a few years ago.

       Impact Assessment: Describe the severity and impact (e.g., network down,availability of a workaround, etc.).

      This deadlock should normally never happen in real ovs-vswitchd, since time module will be initialized long before other threads are created.  But it happens sometimes in the test-barrier unit test application when running on a high-core count system with high testsuite parallelism, e.g. make check TESTSUITEFLAGS='-j40'.

       Reproducibility: Confirm if the issue can be reproduced consistently. If not, describe how often it occurs.

      It's a race, hard to reproduce consistently.

       Reproduction Steps: Provide detailed steps or scripts to replicate the issue.

      run make check TESTSUITEFLAGS='-j40' on a high-core count system.

       Expected Behavior: Describe what should happen under normal circumstances.

      No deadlock.

       Observed Behavior: Explain what actually happens.

      The testsuite hangs as test-barrier application is stuck.

      Full stacktrace:

      Using host libthread_db library "/lib64/libthread_db.so.1".
      futex_wait (private=0, expected=2, futex_word=0x680940 <seq_mutex>) at ../sysdeps/nptl/futex-internal.h:146
      146       int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
      Missing separate debuginfos, use: dnf debuginfo-install elfutils-libelf-0.192-5.el9.x86_64 libatomic-11.5.0-5.el9_5.x86_64 libbpf-1.5.0-1.el9.x86_64 libcap-2.48-9.el9_2.x86_64 libgcc-11.5.0-5.el9_5.x86_64 libgcrypt-1.10.0-11.el9.x86_64 libgpg-error-1.42-5.el9.x86_64 libzstd-1.5.5-1.el9.x86_64 lz4-libs-1.9.3-5.el9.x86_64 numactl-libs-2.0.19-1.el9.x86_64 python3-libs-3.9.21-2.el9.x86_64 systemd-libs-252-51.el9.x86_64 unbound-libs-1.16.2-17.el9.x86_64 xz-libs-5.2.5-8.el9_0.x86_64
      (gdb) info threads
        Id   Target Id                                         Frame 
      * 1    Thread 0x7f6272216ac0 (LWP 309453) "ovstest"      futex_wait (private=0, expected=2, futex_word=0x680940 <seq_mutex>)
          at ../sysdeps/nptl/futex-internal.h:146
        2    Thread 0x7f627192e640 (LWP 309480) "ovs-barrier2" futex_wait (private=0, expected=1, futex_word=0x693c64 <once>)
          at ../sysdeps/nptl/futex-internal.h:146
        3    Thread 0x7f627112d640 (LWP 309481) "ovs-barrier0" futex_wait (private=0, expected=1, futex_word=0x693c64 <once>)
          at ../sysdeps/nptl/futex-internal.h:146
        4    Thread 0x7f627092c640 (LWP 309482) "ovs-barrier1" futex_wait (private=0, expected=1, futex_word=0x693c64 <once>)
          at ../sysdeps/nptl/futex-internal.h:146
        5    Thread 0x7f6267fff640 (LWP 309484) "ovs-barrier4" futex_wait (private=0, expected=1, futex_word=0x693c64 <once>)
          at ../sysdeps/nptl/futex-internal.h:146
      (gdb) bt
      #0  futex_wait (private=0, expected=2, futex_word=0x680940 <seq_mutex>) at ../sysdeps/nptl/futex-internal.h:146
      #1  __GI___lll_lock_wait (futex=futex@entry=0x680940 <seq_mutex>, private=0) at lowlevellock.c:50
      #2  0x00007f627208d4d2 in lll_mutex_lock_optimized (mutex=0x680940 <seq_mutex>) at pthread_mutex_lock.c:49
      #3  ___pthread_mutex_lock (mutex=mutex@entry=0x680940 <seq_mutex>) at pthread_mutex_lock.c:94
      #4  0x00000000004c5df9 in ovs_mutex_lock_at (l_=l_@entry=0x680940 <seq_mutex>, where=where@entry=0x5d576d "lib/seq.c:84")
          at lib/ovs-thread.c:77
      #5  0x00000000004d6a7c in seq_create () at lib/seq.c:84
      #6  0x00000000004e20a1 in do_init_time () at lib/timeval.c:132
      #7  0x00007f627208f2f8 in __pthread_once_slow (once_control=0x693c64 <once>, init_routine=0x4e2090 <do_init_time>)
          at pthread_once.c:117
      #8  0x00007f627208f3e5 in ___pthread_once (once_control=<optimized out>, init_routine=<optimized out>) at pthread_once.c:144
      #9  0x00000000004e24c3 in time_init () at lib/timeval.c:145
      #10 time_poll (pollfds=pollfds@entry=0x1b6f2cb0, n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
          elapsed=elapsed@entry=0x7ffd7a355e5c) at lib/timeval.c:296
      #11 0x00000000004cdbf8 in poll_block () at ./include/openvswitch/hmap.h:264
      #12 0x00000000004c6c56 in ovs_barrier_block (barrier=<optimized out>) at lib/ovs-thread.c:386
      #13 0x00000000004148bb in basic_block_check (expected=<optimized out>, n=<optimized out>, aux=<optimized out>)
          at tests/test-barrier.c:71
      #14 test_barrier_basic (n_threads=<optimized out>) at tests/test-barrier.c:113
      #15 test_barrier (argv=<optimized out>, argc=<optimized out>) at tests/test-barrier.c:260
      #16 ovstest_wrapper_test_barrier__ (ctx=<optimized out>) at tests/test-barrier.c:264
      #17 0x000000000043fb7e in ovs_cmdl_run_command__ (ctx=ctx@entry=0x7ffd7a355f50, commands=<optimized out>, 
          read_only=read_only@entry=false) at lib/command-line.c:247
      #18 0x00000000004401c7 in ovs_cmdl_run_command (ctx=ctx@entry=0x7ffd7a355f50, commands=<optimized out>) at lib/command-line.c:278
      #19 0x0000000000407d83 in main (argc=2, argv=0x7ffd7a3560a8) at tests/ovstest.c:133
      (gdb) thread 2
      [Switching to thread 2 (Thread 0x7f627192e640 (LWP 309480))]
      #0  futex_wait (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:146
      146       int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
      (gdb) bt
      #0  futex_wait (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:146
      #1  futex_wait_simple (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:177
      #2  __pthread_once_slow (once_control=0x693c64 <once>, init_routine=0x4e2090 <do_init_time>) at pthread_once.c:106
      #3  0x00007f627208f3e5 in ___pthread_once (once_control=<optimized out>, init_routine=<optimized out>) at pthread_once.c:144
      #4  0x00000000004e21a3 in time_init () at lib/timeval.c:145
      #5  time_timespec__ (c=0x693d20 <monotonic_clock>, ts=0x7f627192d9f0) at lib/timeval.c:153
      #6  0x00000000004e22d1 in time_msec__ (c=0x693d20 <monotonic_clock>) at lib/timeval.c:218
      #7  time_msec () at lib/timeval.c:226
      #8  0x00000000004cdb09 in poll_timer_wait_at (where=0x5d2a5e "lib/ovs-thread.c:385", msec=0) at lib/poll-loop.c:184
      #9  poll_immediate_wake_at (where=where@entry=0x5d2a5e "lib/ovs-thread.c:385") at lib/poll-loop.c:232
      #10 0x00000000004d6be6 in seq_wait__ (where=<optimized out>, value=<optimized out>, seq=<optimized out>) at lib/seq.c:179
      #11 seq_wait_at (seq_=0x1b6f2360, value=value@entry=1, where=where@entry=0x5d2a5e "lib/ovs-thread.c:385") at lib/seq.c:220
      #12 0x00000000004c6c51 in ovs_barrier_block (barrier=<optimized out>) at lib/ovs-thread.c:385
      #13 0x00000000004146ff in basic_blocker_main (aux_=0x1b6f2390) at tests/test-barrier.c:47
      #14 0x00000000004c5db5 in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:429
      #15 0x00007f627208a0ea in start_thread (arg=<optimized out>) at pthread_create.c:443
      #16 0x00007f627210f150 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
      (gdb) thread 3
      [Switching to thread 3 (Thread 0x7f627112d640 (LWP 309481))]
      #0  futex_wait (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:146
      146       int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
      (gdb) bt
      #0  futex_wait (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:146
      #1  futex_wait_simple (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:177
      #2  __pthread_once_slow (once_control=0x693c64 <once>, init_routine=0x4e2090 <do_init_time>) at pthread_once.c:106
      #3  0x00007f627208f3e5 in ___pthread_once (once_control=<optimized out>, init_routine=<optimized out>) at pthread_once.c:144
      #4  0x00000000004e24c3 in time_init () at lib/timeval.c:145
      #5  time_poll (pollfds=pollfds@entry=0x7f6268000b60, n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
          elapsed=elapsed@entry=0x7f627112ca2c) at lib/timeval.c:296
      #6  0x00000000004cdbf8 in poll_block () at ./include/openvswitch/hmap.h:264
      #7  0x00000000004c6c56 in ovs_barrier_block (barrier=<optimized out>) at lib/ovs-thread.c:386
      #8  0x00000000004146ff in basic_blocker_main (aux_=0x1b6f239c) at tests/test-barrier.c:47
      #9  0x00000000004c5db5 in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:429
      #10 0x00007f627208a0ea in start_thread (arg=<optimized out>) at pthread_create.c:443
      #11 0x00007f627210f150 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
      (gdb) thread 4
      [Switching to thread 4 (Thread 0x7f627092c640 (LWP 309482))]
      #0  futex_wait (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:146
      146       int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
      (gdb) bt
      #0  futex_wait (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:146
      #1  futex_wait_simple (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:177
      #2  __pthread_once_slow (once_control=0x693c64 <once>, init_routine=0x4e2090 <do_init_time>) at pthread_once.c:106
      #3  0x00007f627208f3e5 in ___pthread_once (once_control=<optimized out>, init_routine=<optimized out>) at pthread_once.c:144
      #4  0x00000000004e24c3 in time_init () at lib/timeval.c:145
      #5  time_poll (pollfds=pollfds@entry=0x7f626c000b60, n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
          elapsed=elapsed@entry=0x7f627092ba2c) at lib/timeval.c:296
      #6  0x00000000004cdbf8 in poll_block () at ./include/openvswitch/hmap.h:264
      #7  0x00000000004c6c56 in ovs_barrier_block (barrier=<optimized out>) at lib/ovs-thread.c:386
      #8  0x00000000004146ff in basic_blocker_main (aux_=0x1b6f23a8) at tests/test-barrier.c:47
      #9  0x00000000004c5db5 in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:429
      #10 0x00007f627208a0ea in start_thread (arg=<optimized out>) at pthread_create.c:443
      #11 0x00007f627210f150 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
      (gdb) thread 5
      [Switching to thread 5 (Thread 0x7f6267fff640 (LWP 309484))]
      #0  futex_wait (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:146
      146       int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
      (gdb) bt
      #0  futex_wait (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:146
      #1  futex_wait_simple (private=0, expected=1, futex_word=0x693c64 <once>) at ../sysdeps/nptl/futex-internal.h:177
      #2  __pthread_once_slow (once_control=0x693c64 <once>, init_routine=0x4e2090 <do_init_time>) at pthread_once.c:106
      #3  0x00007f627208f3e5 in ___pthread_once (once_control=<optimized out>, init_routine=<optimized out>) at pthread_once.c:144
      #4  0x00000000004e24c3 in time_init () at lib/timeval.c:145
      #5  time_poll (pollfds=pollfds@entry=0x7f6258000b60, n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
          elapsed=elapsed@entry=0x7f6267ffea2c) at lib/timeval.c:296
      #6  0x00000000004cdbf8 in poll_block () at ./include/openvswitch/hmap.h:264
      #7  0x00000000004c6c56 in ovs_barrier_block (barrier=<optimized out>) at lib/ovs-thread.c:386
      #8  0x000000000041470d in basic_blocker_main (aux_=0x1b6f23b4) at tests/test-barrier.c:49
      #9  0x00000000004c5db5 in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:429
      #10 0x00007f627208a0ea in start_thread (arg=<optimized out>) at pthread_create.c:443
      #11 0x00007f627210f150 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
      

              imaximet@redhat.com Ilya Maximets
              imaximet@redhat.com Ilya Maximets
              HOU MINXI HOU MINXI
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated: