Uploaded image for project: 'RHEL'
  1. RHEL
  2. RHEL-621

fapolicyd can leak FDs and never answer request, causing target process to hang forever

Details

    • fapolicyd-1.3.2-100.el9
    • Major
    • sst_security_special_projects
    • 20
    • False
    • Hide

      None

      Show
      None
    • No
    • Bug Fix

    Description

      +++ This bug was initially created as a clone of Bug #2181467 +++

      Description of problem:

      We have a customer hitting hangs of his processes, due to the processes waiting for reply from fapolicyd which never comes.
      On the fapolicyd side, we could see some kind of "FD leak", but on fapolicyd side, the process was mostly idle, waiting for events.

      Digging into the code, we could find that there is a bug when the event cannot be queued, leading to both FD leak and target process hanging:
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
      323 static void enqueue_event(const struct fanotify_event_metadata *metadata)
      324

      { 325 if (q_append(q, metadata)) 326 msg(LOG_DEBUG, "enqueue error"); 327 else 328 set_ready(); 329 }

      341 void handle_events(void)
      342 {
      :
      374 if (metadata->fd >= 0) {
      375 if (metadata->mask & mask)

      { 376 if (metadata->pid == our_pid) 377 approve_event(metadata); 378 else 379 enqueue_event(metadata); 380 }

      381 // For now, prevent leaking descriptors
      382 // in the near future we should do processing
      383 // to update the cache.
      384 else

      { 385 close(metadata->fd); 386 goto out; 387 }

      388 }
      :

      106 /* add DATA to Q */
      107 int q_append(struct queue *q, const struct fanotify_event_metadata *data)
      108 {
      :
      112 if (q->queue_length == q->num_entries)

      { 113 errno = ENOSPC; 114 return -1; 115 }

      :
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

      Here above, when handle_events() enqueues an event (line 379), it's possible that the enqueuing fails, due to not having enough space or another issue (line 113 of q_append()).
      In such case enqueue_event() logs are DEBUG level and returns immediately (line 326), causing the FD to never be closed, which creates a FD leak.
      Additionally target process will hang forever since it never gets a reply, since event wasn't enqueued.

      I can reproduce this through simulation using a systemtap script.

      Version-Release number of selected component (if applicable):

      fapolicyd-1.1.3-8.el8_7.1.x86_64

      How reproducible:

      Always using systemtap script below

      Steps to Reproduce:
      1. Start "hack_sha256sum.stp" stap script in attachment

      The script will simulate a "queue full" when getting an event for "sha256sum" process

      1. stap -v -g ./hack_sha256sum.stp

      2. From a shell, execute "sha256sum" program

      1. sha256sum

      3. stap script shows fake "queue full" event

      q_append(): Event for PID 2437 (sha256sum), FD 11: queue_len=0, num_entries=640
      q_append(): returning -1
      q_append(): restoring queue_length from 640 to 0

      Actual results:

      • sha256sum hangs forever and cannot be killed (even with SIGKILL signal):
      1. cat /proc/2437/stack
        [<0>] fanotify_handle_event+0x306/0x360
        [<0>] fsnotify+0x253/0x580
        [<0>] do_dentry_open+0xce/0x340
        [<0>] path_openat+0x53e/0x14f0
        [<0>] do_filp_open+0x93/0x100
        [<0>] do_sys_open+0x184/0x220
        [<0>] do_syscall_64+0x5b/0x1a0
        [<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
      1. kill -9 2437
        --> no effect
      • fapolicyd leaks FD 11
      1. ll /proc/1792/fd
        total 0
        lr-x------. 1 root root 64 Mar 24 08:51 0 -> /dev/null
        lrwx------. 1 root root 64 Mar 24 08:51 1 -> 'socket:[29169]'
        lrwx------. 1 root root 64 Mar 24 08:51 10 -> 'anon_inode:[fanotify]'
        lr-x------. 1 root root 64 Mar 24 08:51 11 -> /etc/ld.so.cache <<<<<<<<<< FD of "target process 2437"
        lrwx------. 1 root root 64 Mar 24 08:51 2 -> 'socket:[29169]'
        lrwx------. 1 root root 64 Mar 24 08:51 3 -> /run/fapolicyd/fapolicyd.fifo
        lr-x------. 1 root root 64 Mar 24 08:51 4 -> /var/lib/sss/mc/passwd
        lrwx------. 1 root root 64 Mar 24 08:51 5 -> 'socket:[29173]'
        lr-x------. 1 root root 64 Mar 24 08:51 6 -> /var/lib/sss/mc/group
        lrwx------. 1 root root 64 Mar 24 08:51 7 -> /var/lib/fapolicyd/lock.mdb
        lrwx------. 1 root root 64 Mar 24 08:51 8 -> /var/lib/fapolicyd/data.mdb
        lr-x------. 1 root root 64 Mar 24 08:51 9 -> /proc/1792/mounts

      Expected results:

      • sha256sum continues execution (or gets denied, I don't know, needs to be evaluated, in Permissive it should continue for sure)
      • fapolicyd doesn't leak FD 11, fapolicyd "complains hard" and authorizes or denies target process execution (in Permissive it should authorize for sure)

      — Additional comment from Renaud Métrich on 2023-03-24 08:24:18 UTC —

      — Additional comment from Renaud Métrich on 2023-03-24 08:35:30 UTC —

      Note: the fact that SIGKILL doesn't have effect makes the system take a long time when trying to reboot it, because systemd itself tries to SIGKILL remaining processes.

      — Additional comment from Flos Qi Guo on 2023-03-24 08:49:18 UTC —

      Just some more ideas:

      1. We need to explicitly log the error message when fapolicyd fails to enqueue events.
      ```
      msg(LOG_DEBUG, "enqueue error");

      ```

      2. It can be dangerous if fapolicyd drops the events if the queue length reaches "q_size".

      I'm attaching the relevant data which was collected when the customer was experiencing the issue. The fapolicyd config files can also be found in the attachment.
      ATTACHMENT: sfdc03423820_20230215.tar.gz

      We also collected a vmcore at that time, please find it at:
      HOST: galvatron-x86.cee.ral3.lab.eng.rdu2.redhat.com
      by: $ retrace-server-interact 707154131 crash

      Best regards,
      Flos

      — Additional comment from Flos Qi Guo on 2023-03-24 08:53:09 UTC —

      — Additional comment from Radovan Sroka on 2023-03-24 15:34:39 UTC —

      Would you try to test my scratchbuild?

      https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=51560259
      https://brew-task-repos.engineering.redhat.com/repos/scratch/rsroka/fapolicyd/1.1.3/12.el8_8.queue_01/fapolicyd-1.1.3-12.el8_8.queue_01-scratch.repo

      I backported these 3 patches:
      https://github.com/linux-application-whitelisting/fapolicyd/commit/2b3116eac595bda24065c3a57b8f8cd9d4045ab5
      https://github.com/linux-application-whitelisting/fapolicyd/commit/cd287d722b82007f35d822f191673f5cd173acbd
      https://github.com/linux-application-whitelisting/fapolicyd/commit/aa12666204d57332053fdb2783b1e598699eff13

      — Additional comment from Flos Qi Guo on 2023-03-25 15:14:11 UTC —

      However, these patches did not show a better way to deal with the failure of enqueue events.

      Best regards,
      Flos

      — Additional comment from Renaud Métrich on 2023-03-27 06:21:36 UTC —

      Hello,

      I can confirm this is somehow better, but as Flos mentions, this is not fully complete.
      I pushed some comments in the commit https://github.com/linux-application-whitelisting/fapolicyd/commit/aa12666204d57332053fdb2783b1e598699eff13

      In particular:

      1. It should log at LOG_WARN at least for admin to know there is something wrong with queue settings
      2. It should allow the operation in Permissive, for now it's failing:

      Stap script output:

      [root@vm-fapolicy8 ~]# stap -v -g ./hack_sha256sum.stp
      ...
      q_append(): Event for PID 5918 (sha256sum), FD 12: queue_len=0, num_entries=640
      q_append(): returning -1
      q_append(): restoring queue_length from 640 to 0
      q_append(): Event for PID 5918 (sha256sum), FD 12: queue_len=0, num_entries=640
      q_append(): returning -1
      q_append(): restoring queue_length from 640 to 0

      Test program (which should not fail because I'm in permissive):

      [root@vm-fapolicy8 ~]# sha256sum
      sha256sum: error while loading shared libraries: libcrypto.so.1.1: cannot open shared object file: Operation not permitted

      — Additional comment from Renaud Métrich on 2023-03-27 06:37:58 UTC —

      — Additional comment from Radovan Sroka on 2023-03-27 12:30:48 UTC —

      I've created PR on upstream and new scratchbuild.

      https://github.com/linux-application-whitelisting/fapolicyd/pull/236
      https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=51598417

      Repo:
      https://brew-task-repos.engineering.redhat.com/repos/scratch/rsroka/fapolicyd/1.1.3/12.el8_8.queue_02/fapolicyd-1.1.3-12.el8_8.queue_02-scratch.repo

      As before there are this 3 patches + patch from PR above:
      https://github.com/linux-application-whitelisting/fapolicyd/commit/2b3116eac595bda24065c3a57b8f8cd9d4045ab5
      https://github.com/linux-application-whitelisting/fapolicyd/commit/cd287d722b82007f35d822f191673f5cd173acbd
      https://github.com/linux-application-whitelisting/fapolicyd/commit/aa12666204d57332053fdb2783b1e598699eff13

      I made one change which is switch message type to error not to info.

      What do you think Renaud|Flos?

      — Additional comment from Renaud Métrich on 2023-03-27 12:45:10 UTC —

      Works better, even in Permissive.

      Maybe "enqueue_event: cannot enqueue event" should be more explicit: "Failed to enqueue event for PID xxx: queue is full, please consider tuning q_size if issue happens often"

      Acceptance Criteria:

      • patch is included and is applied
      • patch review is done
      • sanity tests for the component passes

      Attachments

        Issue Links

          Activity

            People

              dapospis@redhat.com Dalibor Pospíšil
              rsroka@redhat.com Radovan Sroka
              Radovan Sroka Radovan Sroka
              Dalibor Pospíšil Dalibor Pospíšil
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: