RHEL-76748

ns-slapd crashes with data directory ≥ 2 days old


    • Bug
    • Resolution: Done-Errata
    • Critical
    • rhel-9.6
    • rhel-9.6
    • 389-ds-base
    • None
    • 389-ds-base-2.6.1-2.el9
    • Yes
    • Important
    • rhel-idm-ds
    • ssg_idm
    • 26
    • 0
    • False
    • False
    • None
    • No
    • Red Hat Enterprise Linux
    • None
    • Automated tests should pass: dirsrvtests/tests/suites/logging/log_flush_rotation_test.py::test_log_flush_and_rotation_crash
    • Pass
    • Automated
    • Unspecified Release Note Type - Unknown
    • x86_64
    • None

      What were you trying to do that didn't work?

      For testing Cockpit, we run the quay.io/freeipa/freeipa-server:centos-9-stream container (on a Fedora CoreOS host, but that hopefully shouldn't matter). That VM gets refreshed every month.

      That most recently happened in https://github.com/cockpit-project/bots/pull/7342 4 days ago. At the time of the refresh, all the tests were fine. But two days later, all our FreeIPA-related tests started to fail and we reverted to the previous image. That PR contains quite a bit of debugging investigation. At that time I suspected a mis-build of the data directory volume, but couldn't reproduce it using a fresh container build.

      Then in https://github.com/cockpit-project/bots/pull/7351 I attempted another VM/container refresh, and once again it worked for two days, until this morning, when everything started to fail again.

      We get errors like "ipa: ERROR: Failed to authenticate to CA REST API" and the journal shows a crash of ns-slapd:

      kernel: ns-slapd[1813]: segfault at aaaaaac2 ip 00007fa36504f367 sp 00007fa3501f45a8 error 4 in libnspr4.so[f367,7fa36504c000+25000] likely on CPU 0 (core 0, socket 0)
      systemd-coredump[2282]: Process 1800 (ns-slapd) of user 389 dumped core.
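
Even without a usable stack trace, the kernel line carries enough to locate the faulting instruction inside libnspr4.so: subtracting the mapping base (`7fa36504c000`) from the instruction pointer (`00007fa36504f367`) gives the offset within the mapping. A minimal sketch of that arithmetic follows; `fault_offset` is a hypothetical helper name. The bracketed `f367` in the log is larger than this result, presumably because it also accounts for the file offset of the mapping; that reading is an assumption, not something stated in this report.

```shell
# Offset of the faulting instruction inside the mapped library:
# instruction pointer minus the base address of the mapping.
# Both arguments are hexadecimal strings without a 0x prefix,
# as the kernel prints them.
fault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

# Values taken from the kernel segfault line above.
fault_offset 00007fa36504f367 7fa36504c000
```

Resolving the offset to a symbol would additionally need the exact libnspr4.so build from the container plus its debug information.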
      

      Unfortunately the stack trace is useless, as the crash happens in a container.

      What is the impact of this issue to you?

      FreeIPA deployment stops working after about two days.

      Please provide the package NVR for which the bug is seen:

      https://github.com/cockpit-project/bots/pull/7350#issuecomment-2615093896 has a complete rpm -qa diff between the previously working and failing image. Given that it's ns-slapd that crashes, the biggest suspect is:

      -389-ds-base-2.5.2-2.el9.x86_64
      -389-ds-base-libs-2.5.2-2.el9.x86_64
      +389-ds-base-2.6.0-2.el9.x86_64
      +389-ds-base-libs-2.6.0-2.el9.x86_64
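
A package diff in this -/+ form can be produced from two `rpm -qa` listings. The following is a rough sketch, not necessarily how the linked comment was generated; `pkg_diff` and the file names are hypothetical.

```shell
# Print packages that appear in only one of two `rpm -qa` listings.
# $1: listing from the working image, $2: listing from the failing image.
# Lines only in $1 are prefixed with "-", lines only in $2 with "+".
pkg_diff() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # comm requires sorted input.
    sort "$1" > "$old_sorted"
    sort "$2" > "$new_sorted"
    comm -23 "$old_sorted" "$new_sorted" | sed 's/^/-/'
    comm -13 "$old_sorted" "$new_sorted" | sed 's/^/+/'
    rm -f "$old_sorted" "$new_sorted"
}

# Hypothetical usage; OLD_IMAGE stands for whatever reference the
# previously working image has, which this report does not name:
#   podman run --rm --entrypoint rpm "$OLD_IMAGE" -qa > old.txt
#   podman run --rm --entrypoint rpm quay.io/freeipa/freeipa-server:centos-9-stream -qa > new.txt
#   pkg_diff old.txt new.txt
```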
      

      How reproducible is this bug?

      Always

      Steps to reproduce

      This is what our infra does, minus two handfuls of port redirections (which are of course critical for actually using it, but not important for reproducing the bug). While building the VM image which hosts the FreeIPA container, it does this:

      mkdir -p /var/lib/ipa-data
      podman run -it --rm --name freeipa -h f0.cockpit.lan -v /sys/fs/cgroup:/sys/fs/cgroup:ro -v /var/lib/ipa-data:/data:Z -e IPA_SERVER_IP=10.111.112.100 quay.io/freeipa/freeipa-server:centos-9-stream -U -p foobarfoo -a foobarfoo -n cockpit.lan -r COCKPIT.LAN --setup-dns --no-forwarders --no-ntp
      

      Wait about 8 minutes until "Configure IPA server upon the first start" is done. Then do some more setup in

      podman exec -it freeipa bash
      

      and in the container, run:

      echo foobarfoo | kinit admin@COCKPIT.LAN
      ipa pwpolicy-mod --minlife=0 --maxlife=1000
      # Change password to apply new password policy
      printf "foobarfoo\nfoobarfoo\n" | ipa user-mod --password admin
      # Allow "admins" IPA group members to run sudo
      ipa-advise enable-admins-sudo | sh -ex
      ipa dnsconfig-mod --forwarder=8.8.8.8
      poweroff
      

      (I don't know how much of this is necessary to reproduce the bug).

      Now you can re-start the container using the same podman command. As the data dir/volume is initialized, it only takes some 10 to 20 seconds until "FreeIPA server started" appears, and the container works.
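
When scripting the reproducer, the wait for "FreeIPA server started" can be automated by polling the container output. This is a sketch under the assumption that the message is visible through `podman logs`; `wait_for_output` is a hypothetical helper.

```shell
# Re-run a command once per second until its output contains a fixed
# string, or give up after a timeout.
# $1: string to look for, $2: timeout in seconds, remaining args: command.
wait_for_output() {
    pattern=$1
    timeout=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@" 2>&1 | grep -qF -- "$pattern"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Hypothetical usage from a second terminal on the host:
#   wait_for_output "FreeIPA server started" 120 podman logs freeipa
```

The same helper with a longer timeout would cover the roughly 8 minute first-time initialization.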

      Now wait for two days (you can fast-forward the system clock by 3 days, see below). After that, starting the container will soon trigger the ns-slapd crash:

      # podman exec -it freeipa systemctl --failed
        UNIT                       LOAD   ACTIVE SUB    DESCRIPTION                      
      ● dirsrv@COCKPIT-LAN.service loaded failed failed 389 Directory Server COCKPIT-LAN.
      
      # podman exec -it freeipa systemctl status dirsrv@COCKPIT-LAN.service
      × dirsrv@COCKPIT-LAN.service - 389 Directory Server COCKPIT-LAN.
           Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service; enabled; preset: disabled)
          Drop-In: /usr/lib/systemd/system/dirsrv@.service.d
                   └─custom.conf
                   /data/etc/systemd/system/dirsrv@COCKPIT-LAN.service.d
                   └─ipa-env.conf
           Active: failed (Result: core-dump) since Wed 2025-01-29 07:49:28 UTC; 57s ago
         Duration: 31.966s
          Process: 148 ExecStartPre=/usr/libexec/dirsrv/ds_systemd_ask_password_acl /etc/dirsrv/slapd-COCKPIT-LAN/dse.ldif (code=exited, status=0/SUCCESS)
          Process: 153 ExecStartPre=/usr/libexec/dirsrv/ds_selinux_restorecon.sh /etc/dirsrv/slapd-COCKPIT-LAN/dse.ldif (code=exited, status=0/SUCCESS)
          Process: 158 ExecStart=/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-COCKPIT-LAN -i /run/dirsrv/slapd-COCKPIT-LAN.pid (code=dumped, signal=SEGV)
         Main PID: 158 (code=dumped, signal=SEGV)
           Status: "slapd started: Ready to process requests"
              CPU: 2.170s
      
      Jan 29 07:48:56 f0.cockpit.lan ns-slapd[158]: [29/Jan/2025:07:48:56.107940754 +0000] - INFO - slapd_daemon - slapd started.  Listening on All Interfaces port 389 for LDAP requests
      Jan 29 07:48:56 f0.cockpit.lan ns-slapd[158]: [29/Jan/2025:07:48:56.131173566 +0000] - INFO - slapd_daemon - Listening on All Interfaces port 636 for LDAPS requests
      Jan 29 07:48:56 f0.cockpit.lan ns-slapd[158]: [29/Jan/2025:07:48:56.135474608 +0000] - INFO - slapd_daemon - Listening on /run/slapd-COCKPIT-LAN.socket for LDAPI requests
      Jan 29 07:48:56 f0.cockpit.lan systemd[1]: Started 389 Directory Server COCKPIT-LAN..
      Jan 29 07:49:00 f0.cockpit.lan ns-slapd[158]: [29/Jan/2025:07:49:00.939879499 +0000] - ERR - schema-compat-plugin - warning: no entries set up under cn=ng, cn=compat,dc=cockpit,dc=lan
      Jan 29 07:49:00 f0.cockpit.lan ns-slapd[158]: [29/Jan/2025:07:49:00.947313349 +0000] - ERR - schema-compat-plugin - warning: no entries set up under cn=computers, cn=compat,dc=cockpit,dc=lan
      Jan 29 07:49:00 f0.cockpit.lan ns-slapd[158]: [29/Jan/2025:07:49:00.948034746 +0000] - ERR - schema-compat-plugin - Finished plugin initialization.
      Jan 29 07:49:28 f0.cockpit.lan systemd[1]: dirsrv@COCKPIT-LAN.service: Main process exited, code=dumped, status=11/SEGV
      Jan 29 07:49:28 f0.cockpit.lan systemd[1]: dirsrv@COCKPIT-LAN.service: Failed with result 'core-dump'.
      Jan 29 07:49:28 f0.cockpit.lan systemd[1]: dirsrv@COCKPIT-LAN.service: Consumed 2.170s CPU time.
      

      and the crash is visible in the journal. The timing varies; sometimes it takes a minute or three. It crashes either during container setup already, or when trying to talk to it:

      # podman exec -it freeipa sh -exc 'echo foobarfoo | kinit -f admin; ipa user-find'
      + echo foobarfoo
      + kinit -f admin
      kinit: Generic error (see e-text) while getting initial credentials
      

      To avoid having to wait for two days, or having to suffer through the 8 minutes of first-time initialization, I attached a tarball of /var/lib/ipa-data here. You can unpack it with

      tar -C /var/lib -xvf /tmp/ipa-data.tar.xz
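
For a freshly initialized data directory, the 3-day clock fast-forward mentioned in the steps could look roughly like this. It is a sketch that assumes GNU `date` and root on the host; `shift_days` is a hypothetical helper, and whether time synchronization has to be switched off first depends on the host setup.

```shell
# Epoch timestamp that lies a number of days after a given epoch timestamp.
# $1: base time in seconds since the epoch, $2: number of days to add.
shift_days() {
    echo $(( $1 + $2 * 86400 ))
}

# Hypothetical usage on the test host (as root), before starting the container:
#   timedatectl set-ntp false    # assumption: keeps the clock from being reset
#   date -u -s "@$(shift_days "$(date +%s)" 3)"
```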
      

              rhn-engineering-mareynol Mark Reynolds
              rhn-engineering-mpitt Martin Pitt
              IdM DS Dev
              Viktor Ashirov
              Evgenia Martyniuk