Bug
Resolution: Done
Blocker
dirsrv-11.8
+++ This bug was initially created as a clone of Bug #2268177 +++
+++ This bug was initially created as a clone of Bug #2234613 +++
Description of problem:
Issue to investigate a RHDS-11 long etime situation with the error 'Retry count exceeded' on BIND/ADD/DEL/MOD, triggered from revert_cache (entry cache).
Version-Release number of selected component (if applicable):
RHDS-11.7 on RHEL-8.8
389-ds-base-1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
redhat-release-8.8-0.8.el8.x86_64
How reproducible:
N/A, high traffic, and other unknowns in environment.
Steps to Reproduce:
1. N/A
Actual results:
pattern event in errors log:
ERR - find_entry_internal_dn - Retry count exceeded (uid=
thread signature:
Contention on the backend lock while reverting a TXN failure.
Many (update) threads are stuck waiting for the backend lock => can contribute to worker starvation
Thread 36 (Thread 0x7f16839fe700 (LWP 1672443)):
#0 0x00007f191177ee92 in flush_hash () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#1 0x00007f191177f103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#2 0x00007f19117aea7c in ldbm_back_modify () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007f19204604d0 in op_shared_modify () at target:/usr/lib64/dirsrv/libslapd.so.0
#4 0x00007f192046112b in modify_internal_pb () at target:/usr/lib64/dirsrv/libslapd.so.0
#5 0x00007f1920486479 in pw_apply_mods () at target:/usr/lib64/dirsrv/libslapd.so.0
#6 0x00007f1920486686 in set_retry_cnt_and_time.constprop () at target:/usr/lib64/dirsrv/libslapd.so.0
#7 0x00007f19204867fb in update_pw_retry () at target:/usr/lib64/dirsrv/libslapd.so.0
#8 0x00007f192048c2cb in send_ldap_result_ext () at target:/usr/lib64/dirsrv/libslapd.so.0
#9 0x00007f192048c54f in send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
#10 0x00007f1920473267 in slapi_send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
#11 0x00007f191179c01b in ldbm_back_bind () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#12 0x0000562dfa8c4ed2 in pw_verify_be_dn ()
#13 0x0000562dfa8b1449 in do_bind ()
#14 0x0000562dfa8b64b5 in connection_threadmain ()
#15 0x00007f191ce97968 in _pt_root () at target:/lib64/libnspr4.so
#16 0x00007f191c8321cf in start_thread () at target:/lib64/libpthread.so.0
#17 0x00007f191eae5dd3 in clone () at target:/lib64/libc.so.6
Expected results:
No 'Retry count exceeded' errors in the errors log and no abnormally long etimes.
Additional info:
RHDS-11.6 related fix: bz 2051476 - high contention in find_entry_internal_dn on mixed load
https://bugzilla.redhat.com/2051476
https://access.redhat.com/errata/RHBA-2023:0186
"
Cause: The cache c_mutex type was changed from PR_Monitor to a pthread recursive mutex implementation. It brought a minor performance boost but also proved to be a less stable solution in its current form.
Additionally, another issue happens when updating the parent entry of a deleted entry (numsubordinates): if it fails to lock the parent, it does not return the parent entry.
Consequence: The "find_entry_internal_dn - Retry count exceeded" error appears in the error log under a high concurrent mixed-operations load on a flat tree.
And when the other issue happens, refcnt becomes invalid, which may lead to other cache locking issues.
Fix: Change the cache c_mutex type back to PR_Monitor.
In the case of a failure to lock the parent entry, the entry should still be returned.
Result: The "find_entry_internal_dn - Retry count exceeded" error doesn't appear, and the cache structure exists in the correct state with the correct refcnt.
"
So the error
ERR - find_entry_internal_dn - Retry count exceeded
can happen again, and there have been more reports:
https://bugzilla.redhat.com/show_bug.cgi?id=2051476#c45
— Additional comment from Marc Sauton on 2023-08-24 22:01:43 UTC —
Relaying an entry cache config change from ~40GB to ~5GB that provided a significant improvement in the long etimes as of 2023-08-24: from around 400+ seconds for some operations down to a maximum of 40 seconds:
1370-sosreport-dc1frhdsm4-2023-08-24-kapwqqa.tar.xz
etc/dirsrv/slapd-dc1frhdsm4/dse.ldif
dn: cn=userroot,cn=ldbm database,cn=plugins,cn=config
nsslapd-suffix: dc=gst,dc=gov,dc=in
nsslapd-cachesize: -1
nsslapd-cachememsize: 5000000000
nsslapd-dncachememsize: 13421772800
nsslapd-directory: /dirsrv/lib/dirsrv/slapd-dc1frhdsm4/db/userroot
relaying internal comment from salesforce case number 03549218
Comment: #449 (private)
Author: Bordaz, Thierry
Date: Thu Aug 24 05:14:48 EDT 2023
// Internal
- A hotfix containing the fix https://github.com/389ds/389-ds-base/issues/5906
was applied but did not solve the CU concern.
Indeed, the purpose of that hotfix is to prevent 'Retry count exceeded' during
BIND operations.
It appeared that 'Retry count exceeded' also occurred on ADD/DEL/MOD, which
triggers revert_cache and then high etime.
ATM it is not clear whether the hotfix (although failing to address all CU concerns)
improved the situation on BIND, i.e. no more 'Retry count exceeded' during BIND.
Need logs from host where it was applied
- Because of the severe production impact and the need for urgent relief,
the CU accepted to test another tentative fix (1.4.3.34-1_03549218_2):
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=54848419
The purpose of this fix is to allow configuration of the backend/suffix
so that, upon DB failure ('Retry count exceeded'), the entry cache of the suffix
is flushed only if the configuration allows it.
By default, upon DB failure the server flushes the entry cache. To disable this,
edit 'dse.ldif' and set:
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
...
enable-revert-cache: off
...
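For reference, a sketch of how that setting would be applied, assuming the attribute only exists in the hotfix build (dse.ldif must be edited while the server is stopped, otherwise the running server overwrites it):
# stop the instance first: a running server rewrites dse.ldif on shutdown
systemctl stop dirsrv@dc1frhdsm4
# add to the backend entry in /etc/dirsrv/slapd-dc1frhdsm4/dse.ldif:
#   dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
#   enable-revert-cache: off
systemctl start dirsrv@dc1frhdsm4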
In conclusion:
The RC of the CU concern ('Retry Count exceeded') is not identified yet
The CU accepted a temporary relief hotfix, in the hope it will
reduce symptoms (high wtime/etime/cpu, replication lag)
Next:
- Continue investigation on 'Retry Count exceeded'
- propose to the CU the hotfix https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=54848419
Comment: #407 (private)
Author: Bordaz, Thierry
Date: Mon Aug 21 12:16:24 EDT 2023
// Internal
- It looks like all the "Retry count exceeded" errors are related to BIND
1270-dc1frhdsm_logs.tar.gz
Aug 18 18:00:14 dc1frhdsm1 rhds_ac: [18/Aug/2023:17:59:46.664077434 +051800] conn=18485 op=1 BIND dn="uid=maddy_singh,ou=fo,dc=gst,dc=gov,dc=in" method=128 version=3
Aug 18 18:00:04 dc1frhdsm1 rhds_er: [18/Aug/2023:18:00:01.393392006 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=maddy_singh,ou=fo,dc=gst,dc=gov,dc=in)
Aug 18 18:00:04 dc1frhdsm1 rhds_er: [18/Aug/2023:18:00:02.808523263 +051800] - WARN - pw_apply_mods - Modify error 32 on entry 'uid=maddy_singh,ou=fo,dc=gst,dc=gov,dc=in'
Aug 18 18:00:14 dc1frhdsm1 rhds_ac: [18/Aug/2023:18:00:02.808770227 +051800] conn=18485 op=1 RESULT err=49 tag=97 nentries=0 wtime=9.138938420 optime=16.144697570 etime=25.283634330 - Invalid credentials
Aug 18 18:00:14 dc1frhdsm1 rhds_ac: [18/Aug/2023:17:59:48.024406848 +051800] conn=18505 op=1 BIND dn="uid=DIECASTING,ou=fo,dc=gst,dc=gov,dc=in" method=128 version=3
Aug 18 18:00:04 dc1frhdsm1 rhds_er: [18/Aug/2023:18:00:02.846230004 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=DIECASTING,ou=fo,dc=gst,dc=gov,dc=in)
Aug 18 18:00:14 dc1frhdsm1 rhds_er: [18/Aug/2023:18:00:04.297126996 +051800] - WARN - pw_apply_mods - Modify error 32 on entry 'uid=DIECASTING,ou=fo,dc=gst,dc=gov,dc=in'
Aug 18 18:00:14 dc1frhdsm1 rhds_ac: [18/Aug/2023:18:00:04.297419708 +051800] conn=18505 op=1 RESULT err=49 tag=97 nentries=0 wtime=10.498126923 optime=16.273025243 etime=26.771150546 - Invalid credentials
Aug 18 18:00:14 dc1frhdsm1 rhds_ac: [18/Aug/2023:17:59:48.108226770 +051800] conn=18526 op=1 BIND dn="uid=Staneysanthosh_,ou=fo,dc=gst,dc=gov,dc=in" method=128 version=3
Aug 18 18:00:14 dc1frhdsm1 rhds_er: [18/Aug/2023:18:00:04.329979528 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=Staneysanthosh_,ou=fo,dc=gst,dc=gov,dc=in)
Aug 18 18:00:14 dc1frhdsm1 rhds_er: [18/Aug/2023:18:00:05.785225927 +051800] - WARN - pw_apply_mods - Modify error 32 on entry 'uid=Staneysanthosh_,ou=fo,dc=gst,dc=gov,dc=in'
Aug 18 18:00:14 dc1frhdsm1 rhds_ac: [18/Aug/2023:18:00:05.785534732 +051800] conn=18526 op=1 RESULT err=49 tag=97 nentries=0 wtime=10.578120021 optime=17.677315442 etime=28.255433989 - Invalid credentials
- I found something suspicious in the way BIND updates retryCount upon failure.
The update of the retryCount is done by the backend BIND (ldbm_back_bind),
which initially retrieves the target entry from the entry cache,
but without locking it.
So if the backend code finally decides to update the target entry (retryCount),
there is a possibility that the entry has been removed from the entry cache.
In such a case it triggers a cache flush, which triggers the other symptoms.
If this is what triggers the failure, then it is a new bug.
Thread 59 (Thread 0x7f6e66bff700 (LWP 2233131)):
#0 0x00007f7138c4ad52 in flush_hash () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#2 0x00007f7138c7aa7c in ldbm_back_modify () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007f7146fb74d0 in op_shared_modify () at target:/usr/lib64/dirsrv/libslapd.so.0
#4 0x00007f7146fb812b in modify_internal_pb () at target:/usr/lib64/dirsrv/libslapd.so.0
#5 0x00007f7146fdd479 in pw_apply_mods () at target:/usr/lib64/dirsrv/libslapd.so.0
#6 0x00007f7146fdd686 in set_retry_cnt_and_time.constprop () at target:/usr/lib64/dirsrv/libslapd.so.0
#7 0x00007f7146fdd7fb in update_pw_retry () at target:/usr/lib64/dirsrv/libslapd.so.0
#8 0x00007f7146fe32cb in send_ldap_result_ext () at target:/usr/lib64/dirsrv/libslapd.so.0
#9 0x00007f7146fe354f in send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
#10 0x00007f7146fca267 in slapi_send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
#11 0x00007f7138c6801b in ldbm_back_bind () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#12 0x000055bafd1ffed2 in pw_verify_be_dn ()
#13 0x000055bafd1ec449 in do_bind ()
#14 0x000055bafd1f14b5 in connection_threadmain ()
#15 0x00007f71439ed968 in _pt_root () at target:/lib64/libnspr4.so
#16 0x00007f71433881ca in start_thread () at target:/lib64/libpthread.so.0
#17 0x00007f714563be73 in clone () at target:/lib64/libc.so.6
- The account lockout (passwordLockout) is required to hit this issue;
a workaround could be to disable passwordLockout (if acceptable)
- RH suggested using a large entry cache (140Gb)
In conclusion:
Failures like
[18/Aug/2023:15:19:03.797038047 +051800] - WARN - pw_apply_mods - Modify error 32 on entry 'uid=POWERTRONICS_20,ou=fo,dc=gst,dc=gov,dc=in'
[18/Aug/2023:15:37:03.904214436 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=vipulmalan_1981,ou=fo,dc=gst,dc=gov,dc=in)
are triggered during BIND if the password policy lockout (passwordLockout) is enabled.
A workaround could be to disable passwordLockout.
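A minimal sketch of that workaround, assuming the lockout comes from the global password policy in cn=config (a fine-grained policy entry would need the same change):
ldapmodify -H ldap://localhost -D "cn=Directory Manager" -W << EOF
dn: cn=config
changetype: modify
replace: passwordLockout
passwordLockout: off
EOF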
The RC is possibly identified and related to the fact that
the bind backend callback does not lock the target entry for modification.
This opens the possibility that the entry is removed from the cache by the time
the bind completes and sends its result.
Other approaches (below) that I proposed are on hold ATM, as my recent finding
is more promising:
- debug fix to prevent entry cache flushing
- reduce entrycache 140Gb->5Gb
Next steps
- Ask CU if disabling passwordLockout could be an acceptable temporary workaround
- Ask CU if they would accept to run a tentative fix on a server where
the problem is reproducible (need to open upstream ticket)
- Investigate how to reproduce 2051476.
- Could you ask the CU to try, on one of its production servers, to greatly reduce the entry cache
Comment: #403 (private)
Author: Bordaz, Thierry
Date: Mon Aug 21 06:29:02 EDT 2023
// Internal
- Could you confirm the entry cache setting (config and runtime)
sosreport-dc1frhdsm1-2023-08-18-cxqwjzn
Config (140Gb)
dn: cn=userroot,cn=ldbm database,cn=plugins,cn=config
nsslapd-cachememsize: 144686710784
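For the runtime side, the ldbm backend monitor entry reports the live cache figures; a sketch using the standard monitor attributes:
ldapsearch -H ldap://localhost -D "cn=Directory Manager" -W \
  -b "cn=monitor,cn=userroot,cn=ldbm database,cn=plugins,cn=config" -s base \
  currententrycachesize maxentrycachesize currententrycachecount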
- The 'cache flush' duration is related to the size of the cache
- If the cache size is confirmed (140Gb), it looks like even a huge entry
cache cannot prevent an entry from getting out of the cache during
the processing of that entry.
[18/Aug/2023:15:19:03.797038047 +051800] - WARN - pw_apply_mods - Modify error 32 on entry 'uid=POWERTRONICS_20,ou=fo,dc=gst,dc=gov,dc=in'
[18/Aug/2023:15:37:03.904214436 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=vipulmalan_1981,ou=fo,dc=gst,dc=gov,dc=in)
In conclusion:
A big entry cache cannot prevent 'Retry count exceeded' but
triggers an expensive cache flush.
Having a smaller entry cache will likely not prevent 'Retry count exceeded'
but will reduce the duration of the flushing.
At the moment I think it is worth testing with a smaller entry cache
to see if:
- operation response times are still good
- even if 'Retry count exceeded' occurs, it can prevent
high cpu, high etime, high wtime.
Next steps
- could you provide logs (access/errors/audit?) from dc1frhdsm1 that covers 18/Aug/2023:17:00 -> 18/Aug/2023:18:00
- Ask CU if they would accept to run a debug fix, that would allow to disable (on demand)
the hash flushing. This debug fix would be for a rapid workaround.
- Investigate how to reproduce 2051476.
- Could you ask the CU to try, on one of its production servers, to greatly reduce the entry cache
down to 5Gb (five) and monitor perf/lag/etime/wtime
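A sketch of that test; the 5000000000 value matches what the CU later deployed per the 2023-08-24 update above. The attribute can be set online with ldapmodify, though a restart may be needed for the new size to fully take effect:
ldapmodify -H ldap://localhost -D "cn=Directory Manager" -W << EOF
dn: cn=userroot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-cachememsize
nsslapd-cachememsize: 5000000000
EOF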
Comment: #399 (private)
Author: Bordaz, Thierry
Date: Mon Aug 21 04:27:03 EDT 2023
// Internal
To elaborate a little more about recent cache 'flush_hash' finding.
Flushing hash, because of its cost, could contribute to
- High CPU
- Many operations with high etime, because they are blocked
(access to the cache during the flush, backend locked (TBC))
- Worker starvation (wtime) in case all workers are blocked
by the flush
- Replication lag, because of worker starvation or high etime
- There is currently no way to disable the hash flushing
This mechanism is quite new and, although it is valid, it may create
issues if there are too many DB retries and hash flushes
- The important RC to identify is why the CU is seeing this so frequently;
this symptom was identified a few years ago (https://bugzilla.redhat.com/show_bug.cgi?id=2051476)
After long analysis and testing we believed we had fixed it (it is fixed in the
CU release, see Renaud's update Thu, Aug 17, 2023, 02:54:19 PM GMT+2)
[18/Aug/2023:15:19:03.797038047 +051800] - WARN - pw_apply_mods - Modify error 32 on entry 'uid=POWERTRONICS_20,ou=fo,dc=gst,dc=gov,dc=in'
[18/Aug/2023:15:37:03.904214436 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=vipulmalan_1981,ou=fo,dc=gst,dc=gov,dc=in)
In conclusion:
The case reported a large set of symptoms, and it is likely that the RC of some of them (above) is the 'flushing hash'.
The only immediate workaround would be to implement a debug fix to disable the hash flush.
The important RC to (re)fix is 2051476. However, it is complex and likely will not be found rapidly.
Next steps:
- could you provide logs (access/errors/audit?) from dc1frhdsm1 that covers 18/Aug/2023:17:00 -> 18/Aug/2023:18:00
- Ask CU if they would accept to run a debug fix that would allow disabling (on demand)
the hash flushing. This debug fix would be for a rapid workaround.
- Investigate how to reproduce 2051476.
Comment: #393 (private)
Author: Bordaz, Thierry
Date: Sat Aug 19 04:50:47 EDT 2023
// Internal
- CU faced high etime on dc1frhdsm1 at 18/Aug/2023:17:54
wtime shows that all workers were busy, which explains the high etime (agree with Teko)
[root@dc1frhdsm1 ~]# tail -100 /var/log/dirsrv/slapd-dc1frhdsm1/access | grep -i etime
[18/Aug/2023:17:54:34.828467423 +051800] conn=17321 op=1 RESULT err=0 tag=101 nentries=1 wtime=29.522406119 optime=0.000079053 etime=29.522483608
- At 17:54:20, because of a TXN failure, the entry cache was reverted
1240-dc1frhdsm1-pstack.tar.gz/var/tmp/debug_data/pstack.230818:175420
Thread 59 (Thread 0x7f6e66bff700 (LWP 2233131)):
#0 0x00007f7138c4ad52 in flush_hash () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#2 0x00007f7138c7aa7c in ldbm_back_modify () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007f7146fb74d0 in op_shared_modify () at target:/usr/lib64/dirsrv/libslapd.so.0
#4 0x00007f7146fb812b in modify_internal_pb () at target:/usr/lib64/dirsrv/libslapd.so.0
#5 0x00007f7146fdd479 in pw_apply_mods () at target:/usr/lib64/dirsrv/libslapd.so.0
#6 0x00007f7146fdd686 in set_retry_cnt_and_time.constprop () at target:/usr/lib64/dirsrv/libslapd.so.0
#7 0x00007f7146fdd7fb in update_pw_retry () at target:/usr/lib64/dirsrv/libslapd.so.0
#8 0x00007f7146fe32cb in send_ldap_result_ext () at target:/usr/lib64/dirsrv/libslapd.so.0
#9 0x00007f7146fe354f in send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
#10 0x00007f7146fca267 in slapi_send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
#11 0x00007f7138c6801b in ldbm_back_bind () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#12 0x000055bafd1ffed2 in pw_verify_be_dn ()
#13 0x000055bafd1ec449 in do_bind ()
#14 0x000055bafd1f14b5 in connection_threadmain ()
#15 0x00007f71439ed968 in _pt_root () at target:/lib64/libnspr4.so
#16 0x00007f71433881ca in start_thread () at target:/lib64/libpthread.so.0
#17 0x00007f714563be73 in clone () at target:/lib64/libc.so.6
- During the period 17:54->18:... we can see a lot of threads (different operations)
reverting the entry cache
grep revert_cache var/tmp/debug_data/pstack.230818*
var/tmp/debug_data/pstack.230818:175420:#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
var/tmp/debug_data/pstack.230818:175422:#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
var/tmp/debug_data/pstack.230818:175425:#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
var/tmp/debug_data/pstack.230818:175430:#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
var/tmp/debug_data/pstack.230818:175432:#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
var/tmp/debug_data/pstack.230818:175435:#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
var/tmp/debug_data/pstack.230818:175437:#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
var/tmp/debug_data/pstack.230818:175440:#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
var/tmp/debug_data/pstack.230818:175442:#1 0x00007f7138c4b103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
- CU is seeing a lot of messages like
[18/Aug/2023:15:19:03.797038047 +051800] - WARN - pw_apply_mods - Modify error 32 on entry 'uid=POWERTRONICS_20,ou=fo,dc=gst,dc=gov,dc=in'
[18/Aug/2023:15:37:03.904214436 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=vipulmalan_1981,ou=fo,dc=gst,dc=gov,dc=in)
In conclusion:
It is early analyze but I suspect that, for unknown reason (db_deadlock?),
many BIND/update_retry_count failed.
This trigger the revert of entry cache, while holding the database/suffix lock. Revertion is
expensive and long. So during that time many threads are blocked while trying to access the entry cache
or needing the database/suffix lock. When all workers are busy (revert and blocked) wtime increase
Next:
- could you provide logs (access/errors/audit?) from dc1frhdsm1 that covers 18/Aug/2023:17:00 -> 18/Aug/2023:18:00
- Investigate if we can prevent revert/update_pw_retry to work around the issue
Comment: #76 (private)
Author: Bordaz, Thierry
Date: Mon Jul 10 10:48:19 EDT 2023
// Internal
- dc1frhdsm1: contention on the entry cache while reverting a TXN failure.
Many threads are stuck waiting for entry cache access => can contribute to
worker starvation
0260-pstack.tar.gz/dc1frhdsm1/pstack4.out
0260-pstack.tar.gz/dc1frhdsm1/pstack.out
Thread 20 (Thread 0x7fd0bfbff700 (LWP 3314052)):
#0 0x00007fd3504beb67 in slapi_timespec_diff () at target:/usr/lib64/dirsrv/libslapd.so.0
#1 0x00007fd34177eeb0 in flush_hash () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#2 0x00007fd34177f103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007fd3417aea7c in ldbm_back_modify () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#4 0x00007fd3504734d0 in op_shared_modify () at target:/usr/lib64/dirsrv/libslapd.so.0
#5 0x00007fd350474be0 in do_modify () at target:/usr/lib64/dirsrv/libslapd.so.0
#6 0x00005621be8943db in connection_threadmain ()
#7 0x00007fd34ceaa968 in _pt_root () at target:/lib64/libnspr4.so
#8 0x00007fd34c8451cf in start_thread () at target:/lib64/libpthread.so.0
#9 0x00007fd34eaf8dd3 in clone () at target:/lib64/libc.so.6
- On dc1frhdsm1 I suspect those failures are related to:
[10/Jul/2023:15:51:36.753034761 +051800] conn=178356 op=80 MOD dn="uid=GST9703J,ou=fo,dc=gst,dc=gov,dc=in"
[10/Jul/2023:15:51:46.697301426 +051800] conn=178356 op=80 RESULT err=50 tag=103 nentries=0 wtime=0.049391536 optime=9.944267266 etime=9.993656836
[10/Jul/2023:15:51:36.752039167 +051800] conn=181291 op=12 MOD dn="uid=SAIHAPUR,ou=fo,dc=gst,dc=gov,dc=in"
[10/Jul/2023:15:51:41.774817987 +051800] conn=181291 op=12 RESULT err=50 tag=103 nentries=0 wtime=0.048401166 optime=5.022779441 etime=5.071178945
[10/Jul/2023:15:47:47.442062904 +051800] conn=179953 op=14 RESULT err=19 tag=103 nentries=0 wtime=0.160102581 optime=4.580975764 etime=4.741077024
[10/Jul/2023:15:47:47.442079643 +051800] conn=179953 op=14 MOD dn="uid=V_03AAFFP8534R1,ou=fo,dc=gst,dc=gov,dc=in", invalid password syntax
- dc1frhdsm4: contention on the backend lock while reverting a TXN failure.
Many (update) threads are stuck waiting for the backend lock => can contribute to
worker starvation
Thread 36 (Thread 0x7f16839fe700 (LWP 1672443)):
#0 0x00007f191177ee92 in flush_hash () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#1 0x00007f191177f103 in revert_cache () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#2 0x00007f19117aea7c in ldbm_back_modify () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007f19204604d0 in op_shared_modify () at target:/usr/lib64/dirsrv/libslapd.so.0
#4 0x00007f192046112b in modify_internal_pb () at target:/usr/lib64/dirsrv/libslapd.so.0
#5 0x00007f1920486479 in pw_apply_mods () at target:/usr/lib64/dirsrv/libslapd.so.0
#6 0x00007f1920486686 in set_retry_cnt_and_time.constprop () at target:/usr/lib64/dirsrv/libslapd.so.0
#7 0x00007f19204867fb in update_pw_retry () at target:/usr/lib64/dirsrv/libslapd.so.0
#8 0x00007f192048c2cb in send_ldap_result_ext () at target:/usr/lib64/dirsrv/libslapd.so.0
#9 0x00007f192048c54f in send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
#10 0x00007f1920473267 in slapi_send_ldap_result () at target:/usr/lib64/dirsrv/libslapd.so.0
#11 0x00007f191179c01b in ldbm_back_bind () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#12 0x0000562dfa8c4ed2 in pw_verify_be_dn ()
#13 0x0000562dfa8b1449 in do_bind ()
#14 0x0000562dfa8b64b5 in connection_threadmain ()
#15 0x00007f191ce97968 in _pt_root () at target:/lib64/libnspr4.so
#16 0x00007f191c8321cf in start_thread () at target:/lib64/libpthread.so.0
#17 0x00007f191eae5dd3 in clone () at target:/lib64/libc.so.6
- On dc1frhdsm4, a symptom of the revert is related to these weird messages.
Unfortunately I cannot find the access log in the sos-report.
[10/Jul/2023:15:56:49.695621687 +051800] - WARN - pw_apply_mods - Modify error 32 on entry 'uid=montymor,ou=fo,dc=gst,dc=gov,dc=in'
[10/Jul/2023:15:56:49.746466428 +051800] - ERR - find_entry_internal_dn - Retry count exceeded (uid=pitchaigst_,ou=fo,dc=gst,dc=gov,dc=in)
- DB policy does not give priority to writers
dn: cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config
nsslapd-db-deadlock-policy: 9
In conclusion:
It looks like, during a direct or internal update, DB access can fail (for an unknown reason) and trigger
a revert of the entry cache. The revert task creates contention that can contribute to worker
starvation.
It is very surprising that the revert could create such contention.
It is possible that switching the DB policy to prioritize writers
could mitigate the issue.
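For reference: nsslapd-db-deadlock-policy takes the numeric value of a libdb deadlock-resolution constant. 9 is DB_LOCK_YOUNGEST (abort the youngest locker, the default); if my reading of libdb's db.h is right, 6 is DB_LOCK_MINWRITE (abort the locker with the fewest write locks, i.e. favor writers). A hedged sketch of the change, to be verified before proposing it:
ldapmodify -H ldap://localhost -D "cn=Directory Manager" -W << EOF
dn: cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-db-deadlock-policy
nsslapd-db-deadlock-policy: 6
EOF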
Next steps:
- Need to spend more time to analyze the data and get a better understanding of
what is happening, why it happens only now, what changed...
- Propose to the CU to change the DB priority and monitor whether it is
beneficial.
— Additional comment from Kentaro Nakai on 2023-08-24 23:03:40 UTC —
///
- Enabled the customer escalation flag to YES according to the escalation filed as ACE EN-65113[1] and watch-list entry CAP-79[2]
[1]: https://gss--rh-escalimpl.vf.force.com/apex/Escalation_View?id=a3n6R00000Ceeii&sfdc.override=1
[2]: https://issues.redhat.com/browse/CAP-79
- Also confirmed the severity/priority set as urgent accordingly
///
— Additional comment from Marc Sauton on 2023-08-24 23:33:45 UTC —
Thierry or LDAP team, should we try an instrumented debug build to help us collect more detailed log events, to better correlate the many events between the errors log file and the access log and detect which exact operation is involved?
Like maybe adding a connection ID to the errors log event "ERR - find_entry_internal_dn - Retry count exceeded"? (may add some load by looking up a large connection table?)
Or mixing some of the error events into the access log?
— Additional comment from Marc Sauton on 2023-08-24 23:38:19 UTC —
Like trying to add, in the errors log events, a reference to the particular access log event; otherwise they are hard to correlate.
Or have a custom stap probe triggered on "ERR - find_entry_internal_dn - Retry count exceeded" to try to collect more connection and operation details, maybe stack info? A sketch follows.
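A minimal SystemTap sketch of that idea, assuming debuginfo is installed and the find_entry_internal_dn symbol is probeable in libback-ldbm.so; it fires on every call (not only on the retry-exceeded path), so correlation with the errors log would still be by timestamp (untested):
stap -e '
probe process("/usr/lib64/dirsrv/plugins/libback-ldbm.so").function("find_entry_internal_dn") {
    # timestamp + thread id, to line up with errors/access log entries
    printf("%s tid=%d\n", ctime(gettimeofday_s()), tid())
    # user-space stack of the calling thread
    print_usyms(ubacktrace())
}'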
— Additional comment from Marc Sauton on 2023-08-25 00:10:47 UTC —
Correction about the exact RHDS version, module and 389-ds-base package used in salesforce case number 03549218:
it looks like this is RHDS-11.7 on RHEL-8.8 with 389-ds-base-1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64,
although I seem to see 2 different versions available, for RHDS-11.7 and 11.6.
details:
1370-sosreport-dc1frhdsm4-2023-08-24-kapwqqa.tar.xz
sos_commands/dnf/dnf_-C_repolist
dirsrv-11-for-rhel-8-x86_64-rpms Red Hat Directory Server 11 for RHEL 8 x86_64 (RPMs)
dirsrv-11.7-for-rhel-8-x86_64-rpms Red Hat Directory Server 11.7 for RHEL 8 x86_64 (RPMs)
1370-sosreport-dc1frhdsm4-2023-08-24-kapwqqa.tar.xz
sos_commands/dnf/dnf_C_repolist_-verbose
Repo-id : dirsrv-11-for-rhel-8-x86_64-rpms
Repo-name : Red Hat Directory Server 11 for RHEL 8 x86_64 (RPMs)
Repo-revision : 1673945484
Repo-updated : Tue Jan 17 14:21:24 2023
Repo-pkgs : 91
Repo-available-pkgs: 91
Repo-size : 78 M
Repo-baseurl : https://dc1rhn.gst.gov.in/pulp/repos/GST/Library/content/dist/layered/rhel8/x86_64/dirsrv/11/os
Repo-expire : 1 second(s) (last: Thu Aug 24 14:27:57 2023)
Repo-filename : /etc/yum.repos.d/redhat.repo
Repo-id : dirsrv-11.7-for-rhel-8-x86_64-rpms
Repo-name : Red Hat Directory Server 11.7 for RHEL 8 x86_64 (RPMs)
Repo-revision : 1684834029
Repo-updated : Tue May 23 14:57:09 2023
Repo-pkgs : 7
Repo-available-pkgs: 7
Repo-size : 6.8 M
Repo-baseurl : https://dc1rhn.gst.gov.in/pulp/repos/GST/Library/content/dist/layered/rhel8/x86_64/dirsrv/11.7/os
Repo-expire : 1 second(s) (last: Thu Aug 24 14:27:57 2023)
Repo-filename : /etc/yum.repos.d/redhat.repo
1370-sosreport-dc1frhdsm4-2023-08-24-kapwqqa.tar.xz
sos_commands/dnf/dnf_module_info_redhat-ds
...snip...
Name : redhat-ds
Stream : 11 [d][e][a]
Version : 8080020230330140549
Context : f969626e
Architecture : x86_64
Profiles : default [d] [i], legacy, minimal
Default profiles : default
Repo : dirsrv-11.7-for-rhel-8-x86_64-rpms
Summary : Red Hat Directory Server 11.7
Description : The Red Hat Directory Server is an LDAPv3 compliant server.
Requires : platform:[el8]
Artifacts : 389-ds-base-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.src
: 389-ds-base-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: 389-ds-base-debuginfo-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: 389-ds-base-debugsource-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: 389-ds-base-devel-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: 389-ds-base-legacy-tools-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: 389-ds-base-legacy-tools-debuginfo-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: 389-ds-base-libs-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: 389-ds-base-libs-debuginfo-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: 389-ds-base-snmp-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: 389-ds-base-snmp-debuginfo-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.x86_64
: cockpit-389-ds-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.noarch
: python3-lib389-0:1.4.3.34-1.module+el8dsrv+18528+22f7779f.noarch
Hint: [d]efault, [e]nabled, [x]disabled, [i]nstalled, [a]ctive
1370-sosreport-dc1frhdsm4-2023-08-24-kapwqqa.tar.xz
sos_commands/dnf/dnf_history
Updating Subscription Management repositories.
ID | Command line | Date and time | Action(s) | Altered
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
16 | localupdate 389-ds-base-1.4.3.34-1_03549218_1.module+el8dsrv+18380+8350b80e.src.rpm 389-ds-base-1.4.3.34-1_035492 | 2023-08-24 09:29 | Upgrade | 5
15 | | 2023-08-14 21:26 | Install | 1 <
14 | update -y | 2023-07-11 00:30 | C, E, I, U | 700 >E
13 | remove leapp* | 2023-07-11 00:29 | Removed | 5 <
12 | update --nobest | 2023-04-21 17:52 | Upgrade | 4 >E
11 | remove abrt-console-notification-2.10.9-21.el8.x86_64 -y | 2023-04-18 17:06 | Removed | 1 <
10 | update --nobest | 2023-01-24 07:26 | C, E, I, U | 128 >E
9 | update --nobest | 2022-11-10 03:40 | C, E, I, O, U | 639 E<
8 | update -y | 2022-07-27 04:48 | Upgrade | 4 ><
7 | update | 2022-06-03 16:42 | E, I, O, U | 683 ><
6 | module install redhat-ds:11 -y | 2022-02-07 15:20 | Install | 23 >
5 | module remove redhat-ds | 2022-02-07 15:13 | Removed | 23 <
4 | update -y | 2022-01-10 19:24 | E, I, O, U | 963 >E
3 | install iotop | 2022-01-10 12:00 | Install | 1 <
2 | remove ds_agent-20.0.0-2971.el7.x86_64 | 2021-10-16 08:29 | Removed | 1 >E
1 | module install redhat-ds:11 | 2021-10-15 17:43 | Install | 19 EE
1370-sosreport-dc1frhdsm4-2023-08-24-kapwqqa.tar.xz
sos_commands/dnf/dnf_module_list
Updating Subscription Management repositories.
Red Hat Directory Server 11 for RHEL 8 x86_64 ( 17 kB/s | 4.5 kB 00:00
Red Hat Directory Server 11.7 for RHEL 8 x86_64 17 kB/s | 4.5 kB 00:00
Red Hat Directory Server 11 for RHEL 8 x86_64 (RPMs)
Name Stream Profiles Summary
redhat-ds 11 [d][e] default [d] [i], legacy, minimal Red Hat Directory Server 11.6
Red Hat Directory Server 11.7 for RHEL 8 x86_64 (RPMs)
Name Stream Profiles Summary
redhat-ds 11 [d][e] default [d] [i], legacy, minimal Red Hat Directory Server 11.7
Red Hat Enterprise Linux 8 for x86_64 - AppStream (RPMs)
Name Stream Profiles Summary
389-ds 1.4 389 Directory Server (base)
...snip...
1370-sosreport-dc1frhdsm4-2023-08-24-kapwqqa.tar.xz
sos_commands/dnf/dnf_module_list_--installed
Updating Subscription Management repositories.
Red Hat Directory Server 11 for RHEL 8 x86_64 ( 14 kB/s | 4.5 kB 00:00
Red Hat Directory Server 11.7 for RHEL 8 x86_64 15 kB/s | 4.5 kB 00:00
Red Hat Directory Server 11 for RHEL 8 x86_64 (RPMs)
Name Stream Profiles Summary
redhat-ds 11 [d][e] default [d] [i], legacy, minimal Red Hat Directory Server 11.6
Red Hat Directory Server 11.7 for RHEL 8 x86_64 (RPMs)
Name Stream Profiles Summary
redhat-ds 11 [d][e] default [d] [i], legacy, minimal Red Hat Directory Server 11.7
Hint: [d]efault, [e]nabled, [x]disabled, [i]nstalled
— Additional comment from Marc Sauton on 2023-08-25 19:41:33 UTC —
internal notes
there is something odd with the experimental fix 1420-389-ds-base-1.4.3.34-1_03549218_3.tar.gz (and 2): there are only 7 threads and the other connections are not processed; we need to back this out.
sample:
salesforce case number 03549218
1430-dc1frhdsm3-pstack_2.tar.gz
var/tmp/debug_data/pstack.230825:223347
Thread 7 (Thread 0x7fbe843ff700 (LWP 154535)):
#0 0x00007fbf854a545c in pthread_cond_wait@@GLIBC_2.3.2 () at target:/lib64/libpthread.so.0
#1 0x00007fbf8910ca5d in slapi_wait_condvar_pt () at target:/usr/lib64/dirsrv/libslapd.so.0
#2 0x00007fbf7b151659 in cos_cache_wait_on_change () at target:/usr/lib64/dirsrv/plugins/libcos-plugin.so
#3 0x00007fbf85b04968 in _pt_root () at target:/lib64/libnspr4.so
#4 0x00007fbf8549f1ca in start_thread () at target:/lib64/libpthread.so.0
#5 0x00007fbf87749e73 in clone () at target:/lib64/libc.so.6
Thread 6 (Thread 0x7fbe865fe700 (LWP 154534)):
#0 0x00007fbf8783667f in select () at target:/lib64/libc.so.6
#1 0x00007fbf891218d4 in DS_Sleep () at target:/usr/lib64/dirsrv/libslapd.so.0
#2 0x00007fbf7a334b7c in locks_monitoring_threadmain () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007fbf85b04968 in _pt_root () at target:/lib64/libnspr4.so
#4 0x00007fbf8549f1ca in start_thread () at target:/lib64/libpthread.so.0
#5 0x00007fbf87749e73 in clone () at target:/lib64/libc.so.6
Thread 5 (Thread 0x7fbe86dff700 (LWP 154533)):
#0 0x00007fbf8783667f in select () at target:/lib64/libc.so.6
#1 0x00007fbf891218d4 in DS_Sleep () at target:/usr/lib64/dirsrv/libslapd.so.0
#2 0x00007fbf7a334a14 in perf_threadmain () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007fbf85b04968 in _pt_root () at target:/lib64/libnspr4.so
#4 0x00007fbf8549f1ca in start_thread () at target:/lib64/libpthread.so.0
#5 0x00007fbf87749e73 in clone () at target:/lib64/libc.so.6
Thread 4 (Thread 0x7fbe8796b700 (LWP 154532)):
#0 0x00007fbf7d95ffee in __memp_purge_dead_and_count () at target:/lib64/libdb-5.3.so
#1 0x00007fbf7d96030b in __memp_trickle_pp () at target:/lib64/libdb-5.3.so
#2 0x00007fbf7a33493f in trickle_threadmain () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007fbf85b04968 in _pt_root () at target:/lib64/libnspr4.so
#4 0x00007fbf8549f1ca in start_thread () at target:/lib64/libpthread.so.0
#5 0x00007fbf87749e73 in clone () at target:/lib64/libc.so.6
Thread 3 (Thread 0x7fbe8816c700 (LWP 154531)):
#0 0x00007fbf8783667f in select () at target:/lib64/libc.so.6
#1 0x00007fbf891218d4 in DS_Sleep () at target:/usr/lib64/dirsrv/libslapd.so.0
#2 0x00007fbf7a336bc3 in checkpoint_threadmain () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007fbf85b04968 in _pt_root () at target:/lib64/libnspr4.so
#4 0x00007fbf8549f1ca in start_thread () at target:/lib64/libpthread.so.0
#5 0x00007fbf87749e73 in clone () at target:/lib64/libc.so.6
Thread 2 (Thread 0x7fbe8896d700 (LWP 154530)):
#0 0x00007fbf8783667f in select () at target:/lib64/libc.so.6
#1 0x00007fbf891218d4 in DS_Sleep () at target:/usr/lib64/dirsrv/libslapd.so.0
#2 0x00007fbf7a334367 in deadlock_threadmain () at target:/usr/lib64/dirsrv/plugins/libback-ldbm.so
#3 0x00007fbf85b04968 in _pt_root () at target:/lib64/libnspr4.so
#4 0x00007fbf8549f1ca in start_thread () at target:/lib64/libpthread.so.0
#5 0x00007fbf87749e73 in clone () at target:/lib64/libc.so.6
Thread 1 (Thread 0x7fbf89ae64c0 (LWP 154524)):
#0 0x00007fbf7d8f9505 in __dbc_iget () at target:/lib64/libdb-5.3.so
#1 0x00007fbf7d908369 in __dbc_get_pp () at target:/lib64/libdb-5.3.so
#2 0x00007fbf79a0fa8b in _cl5GetNextEntry () at target:/usr/lib64/dirsrv/plugins/libreplication-plugin.so
#3 0x00007fbf79a1010c in _cl5ReadRUV () at target:/usr/lib64/dirsrv/plugins/libreplication-plugin.so
#4 0x00007fbf79a103bc in _cl5DBOpenFileByReplicaName () at target:/usr/lib64/dirsrv/plugins/libreplication-plugin.so
#5 0x00007fbf79a11304 in _cl5DBOpen () at target:/usr/lib64/dirsrv/plugins/libreplication-plugin.so
#6 0x00007fbf79a116fd in _cl5Open () at target:/usr/lib64/dirsrv/plugins/libreplication-plugin.so
#7 0x00007fbf79a11950 in cl5Open () at target:/usr/lib64/dirsrv/plugins/libreplication-plugin.so
#8 0x00007fbf79a189f7 in changelog5_init () at target:/usr/lib64/dirsrv/plugins/libreplication-plugin.so
#9 0x00007fbf79a2f7e2 in multimaster_start () at target:/usr/lib64/dirsrv/plugins/libreplication-plugin.so
#10 0x00007fbf890e5f97 in plugin_call_func () at target:/usr/lib64/dirsrv/libslapd.so.0
#11 0x00007fbf890e9129 in plugin_startall () at target:/usr/lib64/dirsrv/libslapd.so.0
#12 0x000055988394ec2f in main ()
and the corresponding top output:
1430-dc1frhdsm3-pstack_2.tar.gz
var/tmp/debug_data/top.230825:223347
top - 22:33:47 up 8:34, 2 users, load average: 1.20, 1.36, 1.35
Threads: 7 total, 1 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.3 us, 0.8 sy, 0.0 ni, 96.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 515372.6 total, 42851.6 free, 4684.5 used, 467836.5 buff/cache
MiB Swap: 65536.0 total, 65498.5 free, 37.5 used. 507188.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
154524 dirsrv 20 0 5697896 4.4g 3.1g R 99.9 0.9 14:48.28 ns-slapd
154530 dirsrv 20 0 5697896 4.4g 3.1g S 0.0 0.9 0:00.10 ns-slapd
154531 dirsrv 20 0 5697896 4.4g 3.1g S 0.0 0.9 0:00.00 ns-slapd
154532 dirsrv 20 0 5697896 4.4g 3.1g S 0.0 0.9 3:31.99 ns-slapd
154533 dirsrv 20 0 5697896 4.4g 3.1g S 0.0 0.9 0:00.00 ns-slapd
154534 dirsrv 20 0 5697896 4.4g 3.1g S 0.0 0.9 0:02.76 ns-slapd
154535 dirsrv 20 0 5697896 4.4g 3.1g S 0.0 0.9 0:00.00 ns-slapd
— Additional comment from Marc Sauton on 2023-08-25 20:57:47 UTC —
So maybe I misunderstood the newer situation; relaying an internal post from an internal channel:
"
Final update for today:
Concluded the call with DC1FRHDSM2 patched. Now Infosys team will just monitor the etimes (For M1, M2 and M4) over the weekend.
No config changes planned over the weekend.
Action items:
We will need to analyse the DC1FRHDSM3 situation. It is currently out of production, but we should identify the issue behind the huge time taken to start and stop the dirsrv service.
Collected pstacks, sosreport and dirsrv logs tarball. Paras should upload it on the case.
Collaboration will be needed over the weekend or till Monday if we have SME bandwidth. Expectations have been set accordingly.
"
the experimental hotfix 3 installation fails on an NSS dependency, because the customer's system is not fully up to date with RHEL-8.8:
- nothing provides nss >= 3.90.0 needed by 389-ds-base-1.4.3.34-1_03549218_3.module+el8dsrv+18528+22f7779f.x86_64
->
nss-3.79 was released in RHEL-8.8.0, and this is what the customer's system has.
But Red Hat published 2 errata, for RHEL-8.8.0.1 and RHEL-8.8.0.2, with nss-3.90.0-2 and nss-3.90.0-3, which have been released, and this is what the hotfix is built with in its requirements and dependencies; so the customer's system needs to be updated to get hotfix version 3.
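A quick way to compare what the system has against what a hotfix rpm requires, using standard rpm/dnf commands:
rpm -q nss                        # installed version (here: 3.79)
dnf --showduplicates list nss     # versions visible in the enabled repos
rpm -qp --requires 389-ds-base-1.4.3.34-1_03549218_3.*.x86_64.rpm | grep -i nss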
DC1FRHDSM2 has fix 2 installed, but the behavior is unknown, possibly working fine, so the experimental hotfix may be good; we will know more on Monday.
DC1FRHDSM3 has fix 2 installed and does not work well. I initially got confused by the pstacks showing a small number of LDAP threads, but I suspect this is because DC1FRHDSM3 was not in PROD when the pstacks were collected.
What is happening is that during the stop order there is still a running changelog trimming going on:
[25/Aug/2023:22:22:37.291846928 +051800] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
and this changelog is quite large at ~340GB :
1450-sosreport-dc1frhdsm3-2023-08-25-vkegfui.tar.xz
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ns-slapd 154524 389 24u REG 253,16 343828422656 201328322 /dirsrv/lib/dirsrv/slapd-dc1frhdsm3/changelogdb/1c245583-880511ec-92adedea-f8dfd7d7_584026fe000000010000.db
strangely, there is also a trace of a non graceful termination before that:
[25/Aug/2023:22:18:45.874038925 +051800] - NOTICE - bdb_start - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
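If most of that ~340GB changelog is old data, trimming can be bounded via the changelog configuration entry (cn=changelog5,cn=config on this 1.4.3-based release); a sketch, assuming a 7-day retention would be acceptable for their replication topology:
ldapmodify -H ldap://localhost -D "cn=Directory Manager" -W << EOF
dn: cn=changelog5,cn=config
changetype: modify
replace: nsslapd-changelogmaxage
nsslapd-changelogmaxage: 7d
EOF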
— Additional comment from Marc Sauton on 2023-08-25 22:35:29 UTC —
actually, this RHEL-8.8.x nss dependency is very confusing:
this is what I can see in the public site / customer portal using my credentials:
Downloads | Red Hat Enterprise Linux Packages | nss
https://access.redhat.com/downloads/content/rhel---8/x86_64/7441/nss/3.79.0-11.el8_7/x86_64/fd431d51/package
which is showing the most recent rpm as: nss-3.79.0-11.el8_7 [latest]
available from
Product (Variant, Version, Architecture) Repo Label
Red Hat Enterprise Linux for x86_64 8 x86_64 rhel-8-for-x86_64-appstream-rpms <<<<<<<<<
and there is also a
Downloads | Red Hat Packages | nss
https://access.redhat.com/downloads/content/nss/3.90.0-3.el8_8/x86_64/fd431d51/package
which is showing the most recent RHEL-8 rpm as: nss-3.90.0-3.el8_8
->
but available from a BETA channel:
Product (Variant, Version, Architecture) Repo Label
Red Hat Enterprise Linux for x86_64 Beta 8 x86_64 rhel-8-for-x86_64-appstream-beta-rpms <<<<<<<<<< !!! ??
— Additional comment from Marc Sauton on 2023-08-28 18:54:37 UTC —
internal notes
relaying details from internal discussions
the initial scratch build and rpm files for RHDS-11.6 without module
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=54848419
rhel-8.7.0-candidate, 389-ds-base-1.4.3.34-1_03549218_2.el8.src.rpm
This scratch build is now deployed on 3 out of 4 of the production servers (11.6 and 11.7).
To be safe, we should create a module hotfix for 11.6 and 11.7.
If support/CU agree, the plan is to substitute the scratch build with the module build.
Viktor Ashirov prepared an updated fix built against nss-3.79 (thanks);
it can be delivered: it was successfully installed on RHEL-8.8 and passed basic sanity tests
https://file.emea.redhat.com/~vashirov/share/389-ds-base-1.4.3.34-1_03549218_3_nss-3.79.tar.gz
edited notes from Viktor Ashirov comments:
- on why the previous build has a higher NSS 3.90 version dependency:
There is a new release of NSS, not yet released: https://errata.devel.redhat.com/advisory/116242
BUT, it was pushed to buildroot already, so a brew build is taking nss-3.90.0-3.el8_8
- on a possible solution in the build process for hotfixes:
One possible solution that I see is to have brew tags that correspond to the exact batch update release,
i.e. instead of rhel-8.8.0-z-candidate, which is essentially a rolling tag, there should be rhel-8.8.2-z-candidate (where 2 is the batch update number).
It should contain only released packages, so it would match the customer's environment.
- on RHDS-11 RHEL-8 modular builds:
the module hotfix situation is complicated.
AFAIK, there is no official way to make them, local module scratch builds do not work.
And the only way forward I see is to demodularize RHDS and have 389-ds-base in RHEL only (as it was in RHEL 6 and 7).
Martin Kosek:
modularity can certainly add a lot of complexity - this is why IdM itself demodularized in RHEL-9+.
As for the redhat-ds module in RHEL-{8,9}, we will need to come up with some modular way, unfortunately.
— Additional comment from Marc Sauton on 2023-08-28 19:07:11 UTC —
relaying positive customer feedback about the hotfix from salesforce case number 03549218 comment 638
I believe this is with 389-ds-base-1.4.3.34-1_03549218_2 on both DC1FRHDSM3
and DC1FRHDSM2 since last Friday the 25th (DC1M3 is still not in the prod pool),
and this looks like just very few long LDAP BIND operations and 1 password error.
"
As discussed, we have not seen high etime greater than 1 in last 2 days except below 2 entries.
[root@dc1frhdsm1 slapd-dc1frhdsm1]# cat access* | grep -i etime=[1-9]
[27/Aug/2023:10:02:29.445001327 +051800] conn=136145 op=1 RESULT err=0 tag=101 nentries=24116224 wtime=0.000057881 optime=3747.770989481 etime=3747.771042993 notes=A details="Fully Unindexed Filter"
[28/Aug/2023:10:07:09.954192840 +051800] conn=200029 op=1 RESULT err=0 tag=101 nentries=24117007 wtime=0.000101056 optime=4028.825119415 etime=4028.825218332 notes=A details="Fully Unindexed Filter"
[root@dc1frhdsm2 ~]# cd /var/log/dirsrv/slapd-dc1frhdsm2/
[root@dc1frhdsm2 slapd-dc1frhdsm2]# cat access* | grep -i etime=[1-9]
[27/Aug/2023:10:02:35.265837765 +051800] conn=102080 op=1 RESULT err=0 tag=101 nentries=24116219 wtime=0.000113834 optime=3753.758404790 etime=3753.758515626 notes=A details="Fully Unindexed Filter"
[28/Aug/2023:10:03:35.595212981 +051800] conn=165911 op=1 RESULT err=0 tag=101 nentries=24117007 wtime=0.000088722 optime=3813.648223684 etime=3813.648308824 notes=A details="Fully Unindexed Filter"
[root@dc1frhdsm4 slapd-dc1frhdsm4]# cat access* | grep -i etime=[1-9]
[27/Aug/2023:10:58:21.865971221 +051800] conn=175671 op=1 RESULT err=0 tag=101 nentries=24115986 wtime=0.000108583 optime=7100.669851341 etime=7100.669955788 notes=U details="Partially Unindexed Filter"
[27/Aug/2023:11:53:57.055386357 +051800] conn=183963 op=34 RESULT err=49 tag=97 nentries=0 wtime=0.000078770 optime=1.273718759 etime=1.273796500 - Invalid credentials
[28/Aug/2023:11:07:33.850593500 +051800] conn=239491 op=1 RESULT err=0 tag=101 nentries=24116768 wtime=0.000121899 optime=7652.314505473 etime=7652.314623653 notes=U details="Partially Unindexed Filter"
[root@dc1frhdsm4 slapd-dc1frhdsm4]#
"
— Additional comment from thierry bordaz on 2023-08-29 13:48:57 UTC —
(In reply to Marc Sauton from comment #10)
> relaying positive customer feedback about the hotfix from salesforce case
> number 03549218 comment 638
> [...]
Just a comment about those long searches: they scan the whole list of users, which returns 24M of them (unindexed search, notes=A). I believe such a search is done intentionally, but I see no point in that kind of search (returning 24M entries with all their attributes); it is a stress for the entry cache, the network, and possibly the DB, as it goes through all DB pages.
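A quick way to spot those searches and trace them back to a client, using the notes= flags from the RESULT lines quoted above:
# unindexed (notes=A) and partially unindexed (notes=U) operations
grep -E 'notes=(A|U)' /var/log/dirsrv/slapd-dc1frhdsm1/access*
# then look up the matching SRCH line to get the filter and bind DN, e.g.:
grep 'conn=136145 ' /var/log/dirsrv/slapd-dc1frhdsm1/access*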
— Additional comment from Colum Gaynor on 2024-03-05 14:22:11 UTC —
@tbordaz@redhat.com ---> VMO2 will need a backport of this for RHDS 11.7 if possible, as they have
just upgraded their infra and this issue is affecting the production system ...
Colum Gaynor - Senior Customer Success, VMO2 Account
— Additional comment from thierry bordaz on 2024-03-05 14:39:23 UTC —
@cgaynor@redhat.com , @renaud,
The bug around flush_cache shows several symptoms (LDAP unresponsive, slow operations, weird log messages, ...); what is the main concern from VMO2?
I am asking because there are other related fixes: #5939, #5944 and possibly #5921.
IIRC the backport should not be difficult.
— Additional comment from Renaud Marigny on 2024-03-05 16:11:00 UTC —
Hi Thierry,
They are concerned by the increasing etimes they observe at certain periods of time (which are, AFAICS, correlated with DELs incoming on non-existing entries).
etimes go up to 10 seconds; then we observe ABANDON requests, probably due to LDAP clients having a timeout configured at 10s.
A snippet of the log shows this:
[29/Feb/2024:21:52:23.864252486 +0000] conn=1827 op=1 DEL dn="uid=user447749872340,ou=mobility,dc=o2,dc=com"
[29/Feb/2024:21:52:23.866235484 +0000] conn=1832 op=2 DEL dn="uid=user447835921088,ou=mobility,dc=o2,dc=com"
[29/Feb/2024:21:52:23.866834597 +0000] conn=1830 op=1 DEL dn="uid=user447849281028,ou=mobility,dc=o2,dc=com"
[29/Feb/2024:21:52:23.867152805 +0000] conn=1829 op=1 DEL dn="uid=user447885270900,ou=mobility,dc=o2,dc=com"
[29/Feb/2024:21:52:24.417070766 +0000] conn=1824 op=1 RESULT err=32 tag=107 nentries=0 wtime=0.000290873 optime=0.555942189 etime=0.556226777
[29/Feb/2024:21:52:25.698128782 +0000] conn=1831 op=1 RESULT err=32 tag=107 nentries=0 wtime=0.000088774 optime=1.836856017 etime=1.836941288
[29/Feb/2024:21:52:26.912513825 +0000] conn=1825 op=1 RESULT err=32 tag=107 nentries=0 wtime=0.000250418 optime=3.051244711 etime=3.051488280
[29/Feb/2024:21:52:28.151985741 +0000] conn=1826 op=1 RESULT err=32 tag=107 nentries=0 wtime=0.000149207 optime=4.290248985 etime=4.290392365
[29/Feb/2024:21:52:29.461197640 +0000] conn=1828 op=1 RESULT err=32 tag=107 nentries=0 wtime=0.000307273 optime=5.598000608 etime=5.598301525
[29/Feb/2024:21:52:30.899805349 +0000] conn=1827 op=1 RESULT err=32 tag=107 nentries=0 wtime=0.000145119 optime=7.035538570 etime=7.035675570
Talking about the impact for the customer:
Currently the impact is high but not critical, because we seem to be lucky.
Customer said: It is fortunate that most of the slow response times seem to be in the quiet hours (after 18:00 or before 08:00), otherwise we could see larger queues building.
Hope it answers your question
Thanks,
Renaud
— Additional comment from thierry bordaz on 2024-03-05 17:22:45 UTC —
For another CU (Nokia) we will need to release 11.7z. That could be 11.7.6 (GA March 19th [1]) or 11.7.7 (GA April 30th).
Else it will be via a hotfix.
[1] https://pp.engineering.redhat.com/rhel-8-8.0.z/schedule/tasks/#maintenance_phase.batch_update_6_eus
— Additional comment from Colum Gaynor on 2024-03-06 04:39:25 UTC —
@tbordaz@redhat.com + @rmarigny@redhat.com
Thierry - It's exactly as Renaud explained in his previous comment.
They are disturbed by the elongated etimes, which are seen to occur when there is a spate of attempted deletions of non-existent records.
This is showing up on their telemetry/monitoring systems, and they are afraid that when subscriber addition events come (e.g. special commercial offers
or new phone model launches can bring a marked increase in subscriber addition events on their LDAP server), the end customers of the
system (it is internally called SRM) will complain if they see the usual excellent performance times of the LDAP server degrade.
The HOTFIX as an interim measure would be preferred, and I could ask them if they could wait until 11.7.7 GA for the formal fix.
If they refuse then I guess we could target March 19th 2024.
I would appreciate an estimate of when a HOTFIX could be arranged for them, as I suspect that, to calm the nerves of the end users
of their LDAP Directory Server, they will ask me if Red Hat can provide that HOTFIX asap.
Best Regards
Colum Gaynor - Senior Customer Success Manager - VMO2 Account
— Additional comment from Colum Gaynor on 2024-03-06 09:15:00 UTC —
@tbordaz@redhat.com + @rmarigny@redhat.com
Seems that the TESCO system is now getting unwanted queue build-up due to the issue, so the
pressure for a HOTFIX just became more URGENT.
They are also asking if restarting the server once per night would help
---> they noticed that after a restart the cache was more or less empty and response time was initially good.
Colum Gaynor - Senior Customer Success Manager, VMO2 Account
— Additional comment from thierry bordaz on 2024-03-06 09:24:02 UTC —
The bug is that, for an erroneous reason, the server believes that the entry cache contains invalid entries and trims it. The trimming creates the performance issue.
Indeed a restart, which flushes the entry cache, helps.
I suggest, Renaud, that they test a reduction of their entry cache size, which may also reduce the performance hit.
— Additional comment from Colum Gaynor on 2024-03-06 10:09:35 UTC —
@tbordaz@redhat.com
Thanks for the last comment...
Colum
— Additional comment from thierry bordaz on 2024-03-06 10:12:58 UTC —
Sorry for the noisy updates but just to keep @colum, @renaud and myself on the same page.
We DO need to respin an official 11.7.6 for Nokia case. We will fix in that release (11.7.6) the bugs that VMO2 is likely hitting. 11.7.6 GA is March 19th.
Regarding the hotfix, we are looking into whether it is possible to give you early access to it.
— Additional comment from Thorsten Scherf on 2024-03-06 11:50:42 UTC —
Thierry, Viktor, would it be feasible to hand out the QE build to Nokia already early next week?
— Additional comment from Colum Gaynor on 2024-03-06 11:54:29 UTC —
@tbordaz@redhat.com ---> No worries Thierry, understand your situation fully.
Sorry to be asking requests for both of my customers at the same time!
Any early access to the HOTFIX is greatly appreciated.
Colum
— Additional comment from thierry bordaz on 2024-03-06 13:48:24 UTC —
This bug is fixed in forthcoming RHEL 8.10 and 9.4. As a consequence it is fixed in RHDS 11.9 and 12.4.
— Additional comment from thierry bordaz on 2024-03-06 13:55:30 UTC —
This bug is a tracker for RHDS 11.9. The bug is integrated in that version (not released yet) since it is fixed in the 8.10 build 1.4.3.39-2. So any 11.9 build can move it to MODIFIED.