-
Bug
-
Resolution: Done-Errata
-
Major
-
rhel-9.4
-
sssd-2.10.2-3.el10_0.2
-
No
-
Important
-
0day
-
rhel-idm-sssd
-
ssg_idm
-
9
-
Pass
-
RegressionOnly
What were you trying to do that didn't work?
We run a relatively large AD deployment with provider = ldap and are severely affected by this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1886492
What is the impact of this issue to you?
We tried to mitigate the issue with:
1. ignore_group_members = true
2. lower values of entry_cache_timeout, entry_cache_user_timeout, entry_cache_group_timeout
3. a lower value of ldap_purge_cache_timeout, in conjunction with 2.
4. ldap_group_search_base filtering when possible
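For reference, the mitigations listed above would look roughly like the following sssd.conf fragment. This is only a sketch: the domain name, the timeout values, and the search base DN and filter are placeholders, not the values from our deployment. Only the option names come from the list above.

```ini
# Illustrative sketch; domain name and all values below are hypothetical.
[domain/example.com]
id_provider = ldap

# 1. Do not return group members on group lookups.
ignore_group_members = true

# 2. Shorter cache lifetimes, in seconds (placeholder values).
entry_cache_timeout = 1800
entry_cache_user_timeout = 1800
entry_cache_group_timeout = 1800

# 3. Purge expired entries more often, in seconds (placeholder value).
ldap_purge_cache_timeout = 3600

# 4. Restrict group searches to one subtree and filter (placeholder DN and filter).
ldap_group_search_base = ou=Groups,dc=example,dc=com?subtree?(cn=nfs-*)
```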
Despite this, certain hosts that are accessed by a larger number of users (NFS servers) still grow the database too quickly, even with the optimizations above. This is the current performance of a user lookup once the memcache entry expires; it gets progressively worse until sssd can no longer answer queries:
id user, db 22M -> 7.0s
id user, db 43M -> 14s
id user, db 100M -> 30s
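To make the trend in these numbers explicit, the short sketch below divides each lookup time by the cache size. It assumes that "M" means megabytes and uses only the three measurements quoted above; the function name is made up for illustration.

```python
# Measurements quoted in the report: (cache size in MB, seconds for `id user`).
# Assumption: the "M" suffix in the report means megabytes.
measurements = [(22, 7.0), (43, 14.0), (100, 30.0)]


def seconds_per_mb(size_mb, seconds):
    """Lookup time normalised by cache size (hypothetical helper)."""
    return seconds / size_mb


rates = [seconds_per_mb(size_mb, seconds) for size_mb, seconds in measurements]

for (size_mb, seconds), rate in zip(measurements, rates):
    print(f"{size_mb:>4} MB  {seconds:>5.1f} s  {rate:.2f} s/MB")
```

All three points land at roughly 0.3 seconds per MB of cache, so within this small sample the lookup time grows about linearly with the database size.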
Please provide the package NVR for which the bug is seen:
yum list installed | grep sssd
python3-sssdconfig.noarch 2.9.4-6.el9_4.1 @BaseOS
sssd.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-ad.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-client.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-common.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-common-pac.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-dbus.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-ipa.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-kcm.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-krb5.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-krb5-common.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-ldap.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-nfs-idmap.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-proxy.x86_64 2.9.4-6.el9_4.1 @BaseOS
sssd-tools.x86_64 2.9.4-6.el9_4.1 @BaseOS
How reproducible is this bug?:
Steps to reproduce
We were counting on purging the disk cache frequently enough with ldap_purge_cache_timeout, but we found that once the db reaches a certain size, the ldap purge process is unable to complete. It behaves as if it times out, and there does not seem to be any detailed information even with the highest debug level on the ldap backend. As a result the db does not shrink. The purge is also a blocking operation that hangs queries while it runs, so running it frequently is less than ideal.
Ultimately, as the db grows further, sssd becomes unresponsive and the only way to recover is to delete the disk cache manually and restart the service.
We understand that the disk cache performance might be related to missing indexes, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1886492, but it's not clear why that bug was marked as CLOSED WONTFIX or whether there is a plan to resolve it.
Expected results:
Would it be acceptable to have an option to disable the disk cache completely and rely exclusively on memcache, if that is not currently supported? Alternatively, resolving the cache purge timeout issue would also help.
Links to: RHBA-2025:148258 sssd update