RHEL-25560

gnome-keyring-daemon hangs and pins one CPU core if SSH agent fails to start

    • Bug
    • Resolution: Unresolved
    • rhel-9.5
    • gnome-keyring
    • sst_desktop
    • ssg_desktop
    • CentOS Stream, Red Hat Enterprise Linux
    • DESKTOP Cycle #1 10.beta phase, DESKTOP Cycle #2 10.beta phase

      Would it be possible to backport the following gnome-keyring patches to RHEL8 and/or 9?

      These prevent a deadlock that causes the SSH agent to spin and new OpenSSH client sessions to hang forever. The bug has been reported repeatedly, both to Red Hat (see RHEL-9302 against RHEL7) and upstream (https://gitlab.gnome.org/GNOME/gnome-keyring/-/issues/25, https://gitlab.gnome.org/GNOME/gnome-keyring/-/issues/70, https://bugzilla.gnome.org/show_bug.cgi?id=794848).

      Is this something Red Hat would be interested in doing? If provided, would patches against CentOS Stream 8 and/or 9 be accepted?

      I've included a copy of our internal bug ticket below, for reference.

      One of our users reported that git pushes to GitHub were hanging on their workstation. A look at the state of the workstation showed that the user's gnome-keyring-daemon (gnome-keyring-0:3.28.2-1.el8.x86_64 as shipped in RHEL8) was consuming 100% of a CPU core on that machine; as gnome-keyring-daemon provides SSH agent services, this causes git push operations to hang as follows:

      • The user attempts git push to an SSH destination (the most common choice - few users create the access tokens required to use git push to GitHub over HTTPS);
      • The git client launches OpenSSH to talk to the repository server;
      • OpenSSH attempts to query the running SSH agent to discover what SSH keys it has available;
      • gnome-keyring-daemon's SSH agent responder is stuck and never replies, causing OpenSSH (and git) to hang forever.
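
The hang described in the steps above can be detected without involving git. The following is a minimal sketch, not part of the original report: it sends the same "list identities" request that OpenSSH sends to the agent socket, but gives up after a timeout instead of waiting forever. The function name `probe_agent` and the return strings are hypothetical; the message type numbers (11 for the request, 12 for the answer) come from the SSH agent protocol.

```python
import socket
import struct

# Message types from the SSH agent protocol.
SSH_AGENTC_REQUEST_IDENTITIES = 11
SSH_AGENT_IDENTITIES_ANSWER = 12


def probe_agent(sock_path, timeout=5.0):
    """Return 'ok', 'hung', or 'error' for the agent listening at sock_path.

    probe_agent is a hypothetical helper for illustration only.
    """
    # Wire format: 4-byte big-endian length, then a 1-byte message type.
    request = struct.pack(">IB", 1, SSH_AGENTC_REQUEST_IDENTITIES)
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(sock_path)
            s.sendall(request)
            # Read the reply header: 4-byte length plus 1-byte message type.
            header = b""
            while len(header) < 5:
                chunk = s.recv(5 - len(header))
                if not chunk:
                    break
                header += chunk
    except socket.timeout:
        # The socket exists but nobody answered in time.
        return "hung"
    except OSError:
        # Missing socket, connection refused, and similar failures.
        return "error"
    if len(header) == 5 and header[4] == SSH_AGENT_IDENTITIES_ANSWER:
        return "ok"
    return "error"
```

Calling `probe_agent(os.environ["SSH_AUTH_SOCK"])` in the affected user's session would be expected to return "hung" when the agent responder is stuck as described, assuming the socket still accepts connections.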

      A look around other lab machines showed one other runaway gnome-keyring-daemon process belonging to another user, so this isn't an isolated incident.

      I listed the threads for a stuck gnome-keyring-daemon with top -H -p PID. I attached to the stuck thread (gdb -p TID run as root) and used the bt command to obtain a backtrace:

      (gdb) bt
      #0  0x00007fcc52ef5138 in g_mutex_unlock () from /lib64/libglib-2.0.so.0
      #1  0x00007fcc52eadccd in g_main_context_iterate.isra ()
         from /lib64/libglib-2.0.so.0
      #2  0x00007fcc52eadf40 in g_main_context_iteration ()
         from /lib64/libglib-2.0.so.0
      #3  0x000055f32b0971d9 in gkd_ssh_agent_process_connect (self=0x55f32c5ad400, 
          cancellable=0x55f32c5c0610, error=error@entry=0x7fcc4d52a4c8)
          at daemon/ssh-agent/gkd-ssh-agent-process.c:232
      #4  0x000055f32b095a78 in on_run (service=<optimized out>, 
          connection=0x55f32c5ef720, source_object=<optimized out>, 
          user_data=<optimized out>) at daemon/ssh-agent/gkd-ssh-agent-service.c:297
      #5  0x00007fcc515ff17e in ffi_call_unix64 () from /lib64/libffi.so.6
      #6  0x00007fcc515feb2f in ffi_call () from /lib64/libffi.so.6
      #7  0x00007fcc5318b386 in g_cclosure_marshal_generic_va ()
         from /lib64/libgobject-2.0.so.0
      #8  0x00007fcc5318a616 in _g_closure_invoke_va ()
         from /lib64/libgobject-2.0.so.0
      #9  0x00007fcc531a6525 in g_signal_emit_valist ()
         from /lib64/libgobject-2.0.so.0
      #10 0x00007fcc531a70e3 in g_signal_emit () from /lib64/libgobject-2.0.so.0
      #11 0x00007fcc53462ebc in g_threaded_socket_service_func ()
         from /lib64/libgio-2.0.so.0
      #12 0x00007fcc52ed6ef3 in g_thread_pool_thread_proxy ()
         from /lib64/libglib-2.0.so.0
      #13 0x00007fcc52ed64ea in g_thread_proxy () from /lib64/libglib-2.0.so.0
      #14 0x00007fcc51fd51ca in start_thread () from /lib64/libpthread.so.0
      #15 0x00007fcc51c41e73 in clone () from /lib64/libc.so.6
      

      Looking at other threads of the process revealed one that was doing the following:

      #0  0x00007fcc51c419bd in syscall () from /lib64/libc.so.6
      #1  0x00007fcc52ef487c in g_mutex_lock_slowpath () from /lib64/libglib-2.0.so.0
      #2  0x000055f32b096ec7 in on_child_watch (pid=161639, status=256, 
          user_data=<optimized out>) at daemon/ssh-agent/gkd-ssh-agent-process.c:133
      #3  0x00007fcc52eaa418 in g_child_watch_dispatch ()
         from /lib64/libglib-2.0.so.0
      #4  0x00007fcc52eadaed in g_main_context_dispatch ()
         from /lib64/libglib-2.0.so.0
      #5  0x00007fcc52eadea8 in g_main_context_iterate.isra ()
         from /lib64/libglib-2.0.so.0
      #6  0x00007fcc52eae1d2 in g_main_loop_run () from /lib64/libglib-2.0.so.0
      #7  0x000055f32b0719fa in main (argc=<optimized out>, argv=<optimized out>)
          at daemon/gkd-main.c:1165
      

      gkd_ssh_agent_process_connect() in the first thread is iterating the GLib main context while holding self->lock, waiting for on_output_watch() to set self->ready. However, if the SSH agent has already exited, on_child_watch() runs on a second thread (the main-loop thread in the second backtrace) and tries to take self->lock, which creates a deadlock. The timeout in gkd_ssh_agent_process_connect() seems to be ineffective because it is also delivered as an event, and the entire event-handling flow is stuck behind the deadlock.
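
The interaction described above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the gnome-keyring code: a queue stands in for the GLib main context, `connect` stands in for gkd_ssh_agent_process_connect(), and all names are hypothetical. Unlike the real daemon, the model adds an independent `give_up_after` deadline so that it terminates instead of spinning forever. The `hold_lock_while_waiting=False` variant only shows one way the stall disappears in this model; it is not a description of the upstream patches.

```python
import queue
import threading
import time


def simulate(hold_lock_while_waiting, give_up_after=0.5):
    """Return True if the waiting thread ever observes the timeout event."""
    lock = threading.Lock()          # stands in for self->lock
    events = queue.Queue()           # stands in for the GLib main context
    holding = threading.Event()
    timed_out = threading.Event()
    result = {}

    def on_child_watch():
        # Like the real callback, it needs the lock before updating state.
        with lock:
            pass

    def on_timeout():
        timed_out.set()

    def main_loop():
        # Single dispatcher: a blocked callback delays every later event.
        for _ in range(2):
            events.get()()

    def connect():
        deadline = time.monotonic() + give_up_after  # model-only escape hatch
        with lock:
            holding.set()
            while not timed_out.is_set() and time.monotonic() < deadline:
                if hold_lock_while_waiting:
                    time.sleep(0.001)   # waits with the lock held
                else:
                    lock.release()      # lets callbacks take the lock
                    time.sleep(0.001)
                    lock.acquire()
            # Recorded while still holding the lock, so the result is stable.
            result["saw_timeout"] = timed_out.is_set()

    worker = threading.Thread(target=connect)
    worker.start()
    holding.wait()
    events.put(on_child_watch)   # the agent exited early
    events.put(on_timeout)       # the timeout is queued behind it
    dispatcher = threading.Thread(target=main_loop)
    dispatcher.start()
    worker.join()
    dispatcher.join()
    return result["saw_timeout"]
```

With `simulate(True)`, the dispatcher blocks inside `on_child_watch`, so `on_timeout` is never delivered while the waiter holds the lock. That matches the observation that the timeout is ineffective.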

      In this case, it looks like the SSH agent is exiting early because two copies of gnome-keyring-daemon are running. Both try to spawn an SSH agent listening on the same socket path, /run/user/UIDNUM/keyring/.ssh, and the second SSH agent understandably refuses to start:

      $ ssh-agent -D -a /run/user/1000/keyring/.ssh
      unix_listener: cannot bind to path /run/user/1000/keyring/.ssh: Address already in use
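
The same failure can be reproduced without ssh-agent. The sketch below is an illustration only: it binds two Unix domain sockets to one path in a temporary directory (a placeholder for the real keyring socket path) and reports the error from the second bind. The function name `second_bind_errno` is hypothetical.

```python
import errno
import os
import socket
import tempfile


def second_bind_errno():
    """Bind two Unix sockets to the same path; return the second bind's errno."""
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder path; the real one is /run/user/UIDNUM/keyring/.ssh.
        path = os.path.join(tmp, ".ssh")
        first = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        second = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            first.bind(path)
            first.listen(1)
            try:
                second.bind(path)
            except OSError as exc:
                return exc.errno
            return None
        finally:
            first.close()
            second.close()


# On Linux this prints EADDRINUSE, the error behind "Address already in use".
print(errno.errorcode.get(second_bind_errno()))
```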
      

            dking@redhat.com David King
            steven676 Steven Luo
            David King David King
            Radek Duda Radek Duda