RHEL-79158

Disk cache failure with large db sizes


    • Bug
    • Resolution: Done-Errata
    • Major
    • rhel-10.0.z
    • rhel-9.4
    • sssd
    • sssd-2.10.2-3.el10_0.2
    • No
    • Important
    • 0day
    • rhel-idm-sssd
    • ssg_idm
    • 9
    • False
    • False
    • None
    • None
    • None
    • Unspecified
    • Unspecified
    • Unspecified
    • All
    • None

      What were you trying to do that didn't work?

      We run a relatively large AD deployment with provider = ldap and are severely affected by this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1886492

      What is the impact of this issue to you?

      We tried to mitigate the issue with:

      1. ignore_group_members = true
      2. lower values of entry_cache_timeout, entry_cache_user_timeout, entry_cache_group_timeout
      3. lower value of ldap_purge_cache_timeout in conjunction with 2.
      4. ldap_group_search_base filtering when possible
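
For readers who want to see the mitigations above as a concrete domain section, here is a minimal sketch that renders them with Python's configparser. Only the option names come from the report; the domain name, the timeout values, and the search base are hypothetical placeholders, not the reporter's actual settings.

```python
import configparser
import io

# Option names are taken from the report; every value is a hypothetical example.
MITIGATIONS = {
    "ignore_group_members": "true",
    "entry_cache_timeout": "1800",
    "entry_cache_user_timeout": "1800",
    "entry_cache_group_timeout": "1800",
    "ldap_purge_cache_timeout": "3600",
    "ldap_group_search_base": "ou=Groups,dc=example,dc=test",
}


def render_domain_section(domain, options):
    """Return an sssd.conf-style [domain/<name>] section as text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser[f"domain/{domain}"] = options
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # "example.test" is a placeholder domain name.
    print(render_domain_section("example.test", MITIGATIONS))
```

The output is only a fragment; a real sssd.conf would also need the provider settings and the rest of the domain configuration.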
      Despite these optimizations, certain hosts that see access from a higher number of users (NFS servers) still grow the database too quickly. Below is the current performance of a user lookup once the memcache entry expires; it gets progressively worse until sssd can no longer answer queries:

      id user, db 22M -> 7.0s
      id user, db 43M -> 14s
      id user, db 100M -> 30s
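
The three measurements above suggest that lookup time grows roughly linearly with database size. A small sketch of that arithmetic, using only the figures reported here (the function name is illustrative):

```python
# (db size in MB, lookup time in seconds), as reported above
samples = [(22, 7.0), (43, 14.0), (100, 30.0)]


def seconds_per_mb(samples):
    """Lookup cost per MB of cache for each measurement, rounded to 2 places."""
    return [round(seconds / size_mb, 2) for size_mb, seconds in samples]


if __name__ == "__main__":
    print(seconds_per_mb(samples))  # [0.32, 0.33, 0.3]
```

All three points land near 0.3 s per MB. Three samples are too few to establish a trend, but they are at least consistent with a full scan of the cache on each lookup.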

      Please provide the package NVR for which the bug is seen:

      yum list installed | grep sssd
      python3-sssdconfig.noarch 2.9.4-6.el9_4.1 @BaseOS
      sssd.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-ad.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-client.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-common.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-common-pac.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-dbus.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-ipa.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-kcm.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-krb5.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-krb5-common.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-ldap.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-nfs-idmap.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-proxy.x86_64 2.9.4-6.el9_4.1 @BaseOS
      sssd-tools.x86_64 2.9.4-6.el9_4.1 @BaseOS

      How reproducible is this bug?:

      Steps to reproduce

      We were counting on purging the disk cache frequently enough with ldap_purge_cache_timeout, but we found that once the db reaches a certain size, the ldap purge process is unable to complete (as if it times out; there does not seem to be any detailed information even with the highest debug level on the ldap backend). As a result the db does not shrink. The purge process is also a blocking operation that hangs queries while it runs, so running it frequently is less than ideal.

      Ultimately, as the db grows further, sssd becomes unresponsive, and the only way to recover is to delete the disk cache manually and restart the service.
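
Since recovery currently means deleting the cache by hand, watching the cache file sizes may give some warning before sssd stops responding. The following is a minimal sketch, assuming the cache lives in the default /var/lib/sss/db directory as *.ldb files; the 40 MiB threshold and the function names are hypothetical, and reading that directory normally requires root.

```python
from pathlib import Path

MIB = 1024 * 1024


def cache_sizes_mib(cache_dir="/var/lib/sss/db"):
    """Return {file name: size in MiB} for every *.ldb file in cache_dir."""
    return {
        path.name: path.stat().st_size / MIB
        for path in sorted(Path(cache_dir).glob("*.ldb"))
    }


def oversized(sizes, limit_mib):
    """Names of cache files larger than limit_mib."""
    return [name for name, size in sizes.items() if size > limit_mib]


if __name__ == "__main__":
    sizes = cache_sizes_mib()
    for name, size in sizes.items():
        print(f"{name}: {size:.1f} MiB")
    # 40 MiB is an arbitrary example threshold, not a documented limit.
    print("over threshold:", oversized(sizes, limit_mib=40))
```

This only reports sizes; it does not purge or delete anything.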

      We understand that the disk cache performance might be related to missing indexes, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1886492, but it is not clear why that bug was marked CLOSED WONTFIX or whether there is a plan to resolve it.

      Expected results:

      Would it be acceptable to have an option to disable the disk cache completely and rely exclusively on memcache, if that is not currently supported? Alternatively, resolving the cache purge timeout issue would also help.

              Alexey Tikhonov (atikhono@redhat.com)
              Shajith Arul Simon (rhn-support-shas)
              Alexey Tikhonov
              Shridhar Gadekar
              Louise McGarry
              Votes: 0
              Watchers: 17
