-
Bug
-
Resolution: Done-Errata
-
Major
-
rhel-9.2.0
-
kernel-5.14.0-499.el9
-
None
-
Moderate
-
4
-
rhel-sst-kernel-security
-
ssg_core_kernel
-
25
-
26
-
2
-
False
-
-
None
-
CK-May-2024, CK-June-2024, CK-July-2024, CK-August-2024
-
If docs needed, set a value
-
-
All
-
None
Description of problem:
We have a customer constantly getting the auditd service failing to start at boot with the following message:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
... auditd[987]: Error receiving audit netlink packet (No buffer space available)
... auditd[987]: Error setting audit daemon pid (No buffer space available)
... auditd[987]: Unable to set audit pid, exiting
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
This happens even when setting audit_backlog_limit=16384 on the kernel command line.
The service being configured to restart automatically, the next start works fine and auditd seems then to be stable.
ENOBUFS is expected to happen when netlink socket gets unsynchronized, which should not happen here with such large backlog, knowing that no rules are loaded at all yet (we booted the initramfs and are now just after switchroot).
A few seconds before auditd starts (and fail), we can see message suppressions:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
... kernel: kauditd_printk_skb: 161 callbacks suppressed
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
It looks like to me this is closely related: the customer always see suppressions, but doesn't reproduce the issue when there are less than 100 suppressions.
ANALYSIS AT USERSPACE LEVEL:
strace shows AUDIT_SET/AUDIT_STATUS_PID message is sent, then reply is ENOBUFS:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
08:14:30.025976 sendto(3<NETLINK:[AUDIT:1274]>, [
,
"\x04\x00\x00...\x00\xfa\x04\x00..."], 60, 0,
, 12) = 60 <0.000110>
^^^^ ^^^^^^^^
AUDIT_STATUS_PID PID of auditd
:
08:14:30.026106 recvfrom(3<NETLINK:[AUDIT:1274]>, 0x7ffc23e7cd90, 8988, MSG_PEEK|MSG_DONTWAIT, 0x7ffc23e7a9dc, [12]) = -1 ENOBUFS (No buffer space available) <0.000001>
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
We have audit_set_pid() call audit_send() which in turn calls __audit_send(), the latter waiting for Acknowledge through calling check_ack():
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
435 int audit_set_pid(int fd, uint32_t pid, rep_wait_t wmode)
436 {
437 struct audit_status s;
438 struct audit_reply rep;
439 struct pollfd pfd[1];
440 int rc;
441
442 memset(&s, 0, sizeof(s));
443 s.mask = AUDIT_STATUS_PID;
444 s.pid = pid;
445 rc = audit_send(fd, AUDIT_SET, &s, sizeof(s));
446 if (rc < 0)
:
255 int audit_send(int fd, int type, const void *data, unsigned int size)
256 {
:
260 rc = __audit_send(fd, type, data, size, &seq);
:
204 int __audit_send(int fd, int type, const void *data, unsigned int size, int *seq)
205 {
:
239 do
while (retval < 0 && errno == EINTR);
243 if (retval == (int)req.nlh.nlmsg_len)
244 return check_ack(fd);
:
271 static int check_ack(int fd)
272 {
:
288 rc = audit_get_reply(fd, &rep, GET_REPLY_NONBLOCKING, MSG_PEEK);
:
292 } else if (rc < 0)
293 return rc;
:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
We have check_ack() failing with ENOBUFS, which is returned by the kernel.
On the auditd level, I'm convinced auditd is just a victim of kernel's behavior, assuming it's expected the reply to be first on the netlink socket receive buffer.
(EARLY) ANALYSIS AT KERNEL LEVEL:
We have the following function being entered to process the message:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
1204 static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
1205 {
:
1243 case AUDIT_SET: {
1244 struct audit_status s;
:
1258 if (s.mask & AUDIT_STATUS_PID) {
:
1293 if (new_pid)
else {
:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
On line 1295, auditd_set() is called, then later on line 1307 the backlog thread is unblocked.
It's unclear to me who sends the reply back, I don't see this in auditd_set() at all.
audit_send_reply() doesn't seem to be called as well.
I believe there is a race between waking up the backlog thread and sending the reply:
because the backlog buffer would be full, ENOBUFS would be returned somehow, before the AUDIT_SET reply would have been set.
It's just an hypothesis for now, I'm not able to find out how and by whom the reply is actually crafted and sent.
Version-Release number of selected component (if applicable):
RHEL9.2 kernels including latest 5.14.0-284.18.1.el9_2.x86_64
How reproducible:
Frequently on customer's system
I can reproduce the issue on my QEMU/KVM quite reliably, with audit_backlog_limit=320 but definitely not with customer's settings
Steps to Reproduce:
1. Install a QEMU/KVM with minimal profile
1 CPU / 2GB memory
2. Disable automatic startup of auditd service and automatic restart and set some kernel command line parameters
- systemctl disable auditd
- mkdir -p /etc/systemd/system/auditd.service.d
- echo -e "[Service]\nRestart=no" > /etc/systemd/system/auditd.service.d/no_restart.conf
- grubby --update-kernel ALL --args="audit=1 audit_rate_limit=0 audit_backlog_limit=320 log_buf_len=15M ignore_loglevel printk.devkmsg=on"
- reboot
3. Start auditd service by hand just after suppressions were seen on the console at login prompt (~10 seconds after login prompt appears)
$ ssh vm-rhel9 systemctl start auditd
Job for auditd.service failed because the control process exited with error code.
See "systemctl status auditd.service" and "journalctl -xeu auditd.service" for details.
$ ssh vm-rhel9 journalctl -b -u auditd
Jun 29 09:52:20 vm-rhel9 systemd[1]: Starting Security Auditing Service...
Jun 29 09:52:20 vm-rhel9 auditd[1147]: No plugins found, not dispatching events
Jun 29 09:52:20 vm-rhel9 auditd[1147]: Error receiving audit netlink packet (No buffer space available)
Jun 29 09:52:20 vm-rhel9 auditd[1147]: Error setting audit daemon pid (No buffer space available)
Jun 29 09:52:20 vm-rhel9 auditd[1147]: Unable to set audit pid, exiting
Jun 29 09:52:20 vm-rhel9 auditd[1146]: Cannot daemonize (Success)
Jun 29 09:52:20 vm-rhel9 auditd[1146]: The audit daemon is exiting.
Jun 29 09:52:20 vm-rhel9 systemd[1]: auditd.service: Control process exited, code=exited, status=1/FAILURE
Actual results:
See above
Expected results:
auditd started properly
- is duplicated by
-
RHEL-9063 Sometimes, auditd: Error receiving audit netlink packet (No buffer space available) is reported
- Closed
- external trackers
- links to
-
RHSA-2024:128795 kernel bug fix and enhancement update