Uploaded image for project: 'RHEL'
  1. RHEL
  2. RHEL-9096

auditd fails to start with "Error setting audit daemon pid (No buffer space available)"

    • Normal
    • sst_kernel_security
    • ssg_core_kernel
    • 2
    • False
    • Hide

      None

      Show
      None
    • CK-May-2024
    • If docs needed, set a value

      Description of problem:

      We have a customer constantly getting the auditd service failing to start at boot with the following message:
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
      ... auditd[987]: Error receiving audit netlink packet (No buffer space available)
      ... auditd[987]: Error setting audit daemon pid (No buffer space available)
      ... auditd[987]: Unable to set audit pid, exiting
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

      This happens even when setting audit_backlog_limit=16384 on the kernel command line.
      The service being configured to restart automatically, the next start works fine and auditd seems then to be stable.

      ENOBUFS is expected to happen when netlink socket gets unsynchronized, which should not happen here with such large backlog, knowing that no rules are loaded at all yet (we booted the initramfs and are now just after switchroot).

      A few seconds before auditd starts (and fail), we can see message suppressions:
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
      ... kernel: kauditd_printk_skb: 161 callbacks suppressed
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

      It looks like to me this is closely related: the customer always see suppressions, but doesn't reproduce the issue when there are less than 100 suppressions.

      ANALYSIS AT USERSPACE LEVEL:

      strace shows AUDIT_SET/AUDIT_STATUS_PID message is sent, then reply is ENOBUFS:
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
      08:14:30.025976 sendto(3<NETLINK:[AUDIT:1274]>, [

      {nlmsg_len=60, nlmsg_type=AUDIT_SET, nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK, nlmsg_seq=3, nlmsg_pid=0}

      ,
      "\x04\x00\x00...\x00\xfa\x04\x00..."], 60, 0,

      {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}

      , 12) = 60 <0.000110>
      ^^^^ ^^^^^^^^
      AUDIT_STATUS_PID PID of auditd
      :
      08:14:30.026106 recvfrom(3<NETLINK:[AUDIT:1274]>, 0x7ffc23e7cd90, 8988, MSG_PEEK|MSG_DONTWAIT, 0x7ffc23e7a9dc, [12]) = -1 ENOBUFS (No buffer space available) <0.000001>
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

      We have audit_set_pid() call audit_send() which in turn calls __audit_send(), the latter waiting for Acknowledge through calling check_ack():

      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
      435 int audit_set_pid(int fd, uint32_t pid, rep_wait_t wmode)
      436 {
      437 struct audit_status s;
      438 struct audit_reply rep;
      439 struct pollfd pfd[1];
      440 int rc;
      441
      442 memset(&s, 0, sizeof(s));
      443 s.mask = AUDIT_STATUS_PID;
      444 s.pid = pid;
      445 rc = audit_send(fd, AUDIT_SET, &s, sizeof(s));
      446 if (rc < 0)

      { 447 audit_msg(audit_priority(errno), 448 "Error setting audit daemon pid (%s)", 449 strerror(-rc)); 450 return rc; 451 }

      :

      255 int audit_send(int fd, int type, const void *data, unsigned int size)
      256 {
      :
      260 rc = __audit_send(fd, type, data, size, &seq);
      :

      204 int __audit_send(int fd, int type, const void *data, unsigned int size, int *seq)
      205 {
      :
      239 do

      { 240 retval = sendto(fd, &req, req.nlh.nlmsg_len, 0, 241 (struct sockaddr*)&addr, sizeof(addr)); 242 }

      while (retval < 0 && errno == EINTR);
      243 if (retval == (int)req.nlh.nlmsg_len)
      244 return check_ack(fd);
      :

      271 static int check_ack(int fd)
      272 {
      :
      288 rc = audit_get_reply(fd, &rep, GET_REPLY_NONBLOCKING, MSG_PEEK);
      :
      292 } else if (rc < 0)
      293 return rc;
      :
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

      We have check_ack() failing with ENOBUFS, which is returned by the kernel.
      On the auditd level, I'm convinced auditd is just a victim of kernel's behavior, assuming it's expected the reply to be first on the netlink socket receive buffer.

      (EARLY) ANALYSIS AT KERNEL LEVEL:

      We have the following function being entered to process the message:
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
      1204 static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
      1205 {
      :
      1243 case AUDIT_SET: {
      1244 struct audit_status s;
      :
      1258 if (s.mask & AUDIT_STATUS_PID) {
      :
      1293 if (new_pid)

      { 1294 /* register a new auditd connection */ 1295 err = auditd_set(req_pid, 1296 NETLINK_CB(skb).portid, 1297 sock_net(NETLINK_CB(skb).sk)); 1298 if (audit_enabled != AUDIT_OFF) 1299 audit_log_config_change("audit_pid", 1300 new_pid, 1301 auditd_pid, 1302 err ? 0 : 1); 1303 if (err) 1304 return err; 1305 1306 /* try to process any backlog */ 1307 wake_up_interruptible(&kauditd_wait); 1308 }

      else {
      :
      -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

      On line 1295, auditd_set() is called, then later on line 1307 the backlog thread is unblocked.
      It's unclear to me who sends the reply back, I don't see this in auditd_set() at all.
      audit_send_reply() doesn't seem to be called as well.

      I believe there is a race between waking up the backlog thread and sending the reply:
      because the backlog buffer would be full, ENOBUFS would be returned somehow, before the AUDIT_SET reply would have been set.

      It's just an hypothesis for now, I'm not able to find out how and by whom the reply is actually crafted and sent.

      Version-Release number of selected component (if applicable):

      RHEL9.2 kernels including latest 5.14.0-284.18.1.el9_2.x86_64

      How reproducible:

      Frequently on customer's system
      I can reproduce the issue on my QEMU/KVM quite reliably, with audit_backlog_limit=320 but definitely not with customer's settings

      Steps to Reproduce:
      1. Install a QEMU/KVM with minimal profile

      1 CPU / 2GB memory

      2. Disable automatic startup of auditd service and automatic restart and set some kernel command line parameters

      1. systemctl disable auditd
      2. mkdir -p /etc/systemd/system/auditd.service.d
      3. echo -e "[Service]\nRestart=no" > /etc/systemd/system/auditd.service.d/no_restart.conf
      4. grubby --update-kernel ALL --args="audit=1 audit_rate_limit=0 audit_backlog_limit=320 log_buf_len=15M ignore_loglevel printk.devkmsg=on"
      5. reboot

      3. Start auditd service by hand just after suppressions were seen on the console at login prompt (~10 seconds after login prompt appears)

      $ ssh vm-rhel9 systemctl start auditd
      Job for auditd.service failed because the control process exited with error code.
      See "systemctl status auditd.service" and "journalctl -xeu auditd.service" for details.

      $ ssh vm-rhel9 journalctl -b -u auditd
      Jun 29 09:52:20 vm-rhel9 systemd[1]: Starting Security Auditing Service...
      Jun 29 09:52:20 vm-rhel9 auditd[1147]: No plugins found, not dispatching events
      Jun 29 09:52:20 vm-rhel9 auditd[1147]: Error receiving audit netlink packet (No buffer space available)
      Jun 29 09:52:20 vm-rhel9 auditd[1147]: Error setting audit daemon pid (No buffer space available)
      Jun 29 09:52:20 vm-rhel9 auditd[1147]: Unable to set audit pid, exiting
      Jun 29 09:52:20 vm-rhel9 auditd[1146]: Cannot daemonize (Success)
      Jun 29 09:52:20 vm-rhel9 auditd[1146]: The audit daemon is exiting.
      Jun 29 09:52:20 vm-rhel9 systemd[1]: auditd.service: Control process exited, code=exited, status=1/FAILURE

      Actual results:

      See above

      Expected results:

      auditd started properly

            rhn-support-rbriggs Richard Guy Briggs
            rhn-support-rmetrich Renaud Metrich
            Richard Guy Briggs Richard Guy Briggs
            Dennis Li Dennis Li
            Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

              Created:
              Updated: