OpenShift Bugs / OCPBUGS-59875

LDAP group synchronization sporadically failing

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.18.z
    • Component/s: oauth-apiserver
    • Quality / Stability / Reliability

      Description of problem:
      Some LDAP group synchronization jobs fail sporadically. When a job fails, it produces errors like the following:

      2025-06-23T13:15:01.199024510Z I0623 13:15:01.198855       7 query.go:220] LDAP search: base dn="cn=X446377,cn=GenoIDs,cn=Berechtigungen,o=rudiren" and scope 0 for (objectClass=*) requesting [cn dn]
      2025-06-23T13:15:01.199669324Z Error determining LDAP group membership for "OCPGRRDR": membership lookup for user "cn=X446377,cn=GenoIDs,cn=Berechtigungen,o=rudiren" in group "OCPGRRDR" failed because of "unable to read LDAP response packet: read tcp 10.156.9.210:35744->172.22.7.12:8629: read: connection reset by peer".
      2025-06-23T13:15:01.199695147Z 2025/06/23 13:15:01 tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.156.9.210:35744->172.22.7.12:8629: write: broken pipe
      2025-06-23T13:15:01.199720989Z I0623 13:15:01.199700       7 groupsyncer.go:67] Checking LDAP group OCPTCSAP
      2025-06-23T13:15:01.199733426Z I0623 13:15:01.199722       7 query.go:220] LDAP search: base dn="cn=Gruppen,cn=Berechtigungen,o=rudiren" and scope 2 for (&((objectclass=groupofuniquenames))(cn=OCPTCSAP)) requesting [cn uniquemember]
      2025-06-23T13:15:01.199752213Z Error determining LDAP group membership for "OCPTCSAP": LDAP Result Code 200 "Network Error": ldap: connection closed.

      Most of the time, the sync completes successfully:

      $ oc get pods | grep plattform-ldap
      plattform-ldapgroup-sync-29177280-6n64n     0/1     Error       0          14h
      plattform-ldapgroup-sync-29177295-smcs7     0/1     Error       0          13h
      plattform-ldapgroup-sync-29178075-2ctsj     0/1     Error       0          46m
      plattform-ldapgroup-sync-29178090-rpdks     0/1     Completed   0          31m
      plattform-ldapgroup-sync-29178105-7mxgk     0/1     Completed   0          16m
      plattform-ldapgroup-sync-29178120-5ksqd     0/1     Completed   0          1m 

      The customer also noticed that OpenShift, unlike other applications, does not end its LDAP sessions with an unbind request but instead simply disconnects. Following that lead, there are two sync jobs whose schedules run close to each other in time, and both use the same LDAPSyncConfig (a hedged example of re-running such a sync manually is sketched after the listing below):

      $ oc get po | awk '$5 ~ "m"' | grep Error
      app-ldapgroup-sync-29178080-7z9wp           0/1     Error       0          41m
      plattform-ldapgroup-sync-29178075-2ctsj     0/1     Error       0          46m

      $ oc get cronjobs
      NAME                       SCHEDULE       TIMEZONE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE
      app-ldapgroup-sync         5/15 * * * *   <none>          False     0        10d             4y
      etcd-backup-cronjob        30 1 * * *     <none>          False     0        10d             4y
      plattform-ldapgroup-sync   0/15 * * * *   <none>          False     0        10d             2y
      resourcequotas-cronjob     0 0 1 * *      Europe/Berlin   False     0        <none>          6d 
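
      The exact Job command and the LDAPSyncConfig itself are not included in this report. Group synchronization in OpenShift is normally driven by "oc adm groups sync", so a hypothetical sketch that could be used to re-run a sync manually and reproduce the failure outside the CronJob schedule is shown below; the URL, CA path and bind settings are placeholders, while the base DNs, filter and attributes are inferred from the queries in the job logs above:

      $ cat <<'EOF' > ldap-sync.yaml
      # Hypothetical sync config: url/ca are placeholders, bindDN/bindPassword omitted;
      # base DNs and attributes follow the LDAP searches seen in the pod logs.
      kind: LDAPSyncConfig
      apiVersion: v1
      url: ldaps://ldap.example.com:8629
      insecure: false
      ca: /path/to/ldap-ca.crt
      rfc2307:
        groupsQuery:
          baseDN: "cn=Gruppen,cn=Berechtigungen,o=rudiren"
          scope: sub
          derefAliases: never
          filter: (objectclass=groupofuniquenames)
        groupUIDAttribute: cn
        groupNameAttributes: [ cn ]
        groupMembershipAttributes: [ uniquemember ]
        usersQuery:
          baseDN: "cn=GenoIDs,cn=Berechtigungen,o=rudiren"
          scope: sub
          derefAliases: never
        userUIDAttribute: dn
        userNameAttributes: [ cn ]
      EOF
      $ oc adm groups sync --sync-config=ldap-sync.yaml --confirm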

      Packet Captures
      Packet captures were taken on both successful and faulty pods, and the following behavior was observed (all tcpdump commands used to capture the packets were run from the nodes).
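
      The exact capture commands are not part of this report; a capture along the following lines is assumed, run on the node hosting the sync pod and filtered on the LDAP server address and port (the interface and output file name are placeholders):

      $ tcpdump -i any -nn -s 0 -w /tmp/ldap-sync.pcap host 172.22.7.12 and port 8629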

      On a faulty pod, I see the connection being dropped midway, with the last frames being:

      $ tshark -r 0090-au40v032.sha.vrkw.de_04_07_2025-13_03_14-UTC-error.pcap -Y "ip.src == 10.156.8.65" -T fields -e frame.number -e frame.time -e ip.src -e ip.dst -e _ws.col.Info | tail -n10
      412    Jul  4, 2025 13:15:00.978584000 UTC    10.156.8.65    172.22.7.12    [TCP Dup ACK 411#1] 57374 → 8629 [ACK] Seq=6041 Ack=14542 Win=87168 Len=0 TSval=150327893 TSecr=2326016640
      413    Jul  4, 2025 13:15:00.978681000 UTC    10.156.8.65    172.22.7.12    Application Data
      414    Jul  4, 2025 13:15:00.978688000 UTC    10.156.8.65    172.22.7.12    [TCP Retransmission] 57374 → 8629 [PSH, ACK] Seq=6041 Ack=14542 Win=87168 Len=125 TSval=150327894 TSecr=2326016640
      415    Jul  4, 2025 13:15:00.979545000 UTC    10.156.8.65    172.22.7.12    57374 → 8629 [ACK] Seq=6166 Ack=14691 Win=87168 Len=0 TSval=150327894 TSecr=2326016641
      416    Jul  4, 2025 13:15:00.979559000 UTC    10.156.8.65    172.22.7.12    [TCP Dup ACK 415#1] 57374 → 8629 [ACK] Seq=6166 Ack=14691 Win=87168 Len=0 TSval=150327894 TSecr=2326016641
      417    Jul  4, 2025 13:15:00.979635000 UTC    10.156.8.65    172.22.7.12    Application Data
      418    Jul  4, 2025 13:15:00.979651000 UTC    10.156.8.65    172.22.7.12    [TCP Retransmission] 57374 → 8629 [PSH, ACK] Seq=6166 Ack=14691 Win=87168 Len=125 TSval=150327895 TSecr=2326016641
      419    Jul  4, 2025 13:15:00.980080000 UTC    10.156.8.65    172.22.7.12    57374 → 8629 [ACK] Seq=6291 Ack=14840 Win=87168 Len=0 TSval=150327895 TSecr=2326016641
      420    Jul  4, 2025 13:15:00.980093000 UTC    10.156.8.65    172.22.7.12    [TCP Dup ACK 419#1] 57374 → 8629 [ACK] Seq=6291 Ack=14840 Win=87168 Len=0 TSval=150327895 TSecr=2326016641
      421    Jul  4, 2025 13:15:00.990183000 UTC    10.156.8.65    172.22.7.12    Application Data 

      The faulty pod's logs confirm that the connection is dropped midway, most likely because the server sent an RST:

      Gruppen Synchronisieren
      group/OCPCLADM
      group/OCPCLRDR
      group/OCPGRRDR
      group/OCPTCSAP
      group/OCPTCSMG
      Error determining LDAP group membership for "OCP31ADM": unable to read LDAP response packet: read tcp 10.156.8.65:57374->172.22.7.12:8629: read: connection reset by peer.
      unable to read LDAP response packet: read tcp 10.156.8.65:57374->172.22.7.12:8629: read: connection reset by peer
      2025/07/04 13:15:00 tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.156.8.65:57374->172.22.7.12:8629: write: broken pipe
      error: unable to read LDAP response packet: read tcp 10.156.8.65:57374->172.22.7.12:8629: read: connection reset by peer 

      Now, when looking at the packet captures of the successful pod, the pod sends the TLS close_notify ("Encrypted Alert") followed by a FIN, ACK, retransmits them because the server does not acknowledge them, and finally tears the connection down by sending RSTs to the server since it apparently never responded (notice this is the .pcap from the successful pod; a filter to isolate the teardown is sketched after the capture):

      $ tshark -r 0100-au40v032.sha.vrkw.de_04_07_2025-12_51_13-UTC-success.pcap  -T fields -e frame.number -e frame.time -e ip.src -e ip.dst -e _ws.col.Info | tail -n10
      217    Jul  4, 2025 13:00:01.937405000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [ACK] Seq=6957 Ack=16482 Win=87168 Len=0 TSval=113796993 TSecr=2325138636
      218    Jul  4, 2025 13:00:01.937419000 UTC    10.156.8.64    172.22.7.12    [TCP Dup ACK 217#1] 39820 → 8629 [ACK] Seq=6957 Ack=16482 Win=87168 Len=0 TSval=113796993 TSecr=2325138636
      219    Jul  4, 2025 13:00:02.068097000 UTC    10.156.8.64    172.22.7.12    Encrypted Alert
      220    Jul  4, 2025 13:00:02.068130000 UTC    10.156.8.64    172.22.7.12    [TCP Retransmission] 39820 → 8629 [PSH, ACK] Seq=6957 Ack=16482 Win=87168 Len=31 TSval=113797124 TSecr=2325138636
      221    Jul  4, 2025 13:00:02.068185000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=113797124 TSecr=2325138636
      222    Jul  4, 2025 13:00:02.068191000 UTC    10.156.8.64    172.22.7.12    [TCP Retransmission] 39820 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=113797124 TSecr=2325138636
      223    Jul  4, 2025 13:00:02.078606000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [RST] Seq=6989 Win=0 Len=0
      224    Jul  4, 2025 13:00:02.078620000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [RST] Seq=6989 Win=0 Len=0
      225    Jul  4, 2025 13:00:02.079931000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [RST] Seq=6989 Win=0 Len=0
      226    Jul  4, 2025 13:00:02.079950000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [RST] Seq=6989 Win=0 Len=0 
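
      To compare the teardown phase of the successful and faulty captures side by side, the output can be reduced to FIN and RST segments only; a minimal sketch using the same capture file and field layout as above:

      $ tshark -r 0100-au40v032.sha.vrkw.de_04_07_2025-12_51_13-UTC-success.pcap \
          -Y "tcp.flags.fin == 1 or tcp.flags.reset == 1" \
          -T fields -e frame.number -e ip.src -e ip.dst -e tcp.flags.str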

      The captures above were filtered to show only packets where the LDAP server IP address is the destination. Below are some observations from a bidirectional capture.
      In the last frames of the packet capture from a faulty pod, specifically frames 471 and 472, the same data is sent from both the pod IP (10.156.8.10) and the IP of the node on which the pod is scheduled (172.22.242.136), with the same source port (41350), and the LDAP server then abruptly sends an RST to both. I would consider this normal, but no other frame shows a packet being sent from the node IP to the LDAP server (nor do the packet captures of successful jobs), and I also don't see that connection ever being established from the node IP (a filter to verify this is sketched after the capture):

      $ tshark -r 0110-au40v032.sha.vrkw.de_07_07_2025-11_33_32-UTC-error.pcap -T fields -e frame.number -e frame.time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e tcp.len -e _ws.col.Info| tail -n10
      469    Jul  7, 2025 12:00:01.223364000 UTC    172.22.7.12    8629    10.156.8.10    41350    43    Application Data
      470    Jul  7, 2025 12:00:01.223412000 UTC    172.22.7.12    8629    10.156.8.10    41350    43    [TCP Retransmission] 8629 → 41350 [PSH, ACK] Seq=5905 Ack=461 Win=2097024 Len=43 TSval=2574747331 TSecr=3030647547
      471    Jul  7, 2025 12:00:01.223578000 UTC    10.156.8.10    41350    172.22.7.12    8629    166    Application Data
      472    Jul  7, 2025 12:00:01.223889000 UTC    172.22.242.136    41350    172.22.7.12    8629    166    Application Data
      473    Jul  7, 2025 12:00:01.224112000 UTC    172.22.7.12    8629    172.22.242.136    41350    0    8629 → 41350 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
      474    Jul  7, 2025 12:00:01.224555000 UTC    172.22.7.12    8629    10.156.8.10    41350    0    8629 → 41350 [RST, ACK] Seq=5948 Ack=461 Win=0 Len=0
      475    Jul  7, 2025 12:00:01.806657000 UTC    172.22.7.12    8629    10.156.8.10    41350    43    [TCP Spurious Retransmission] 8629 → 41350 [PSH, ACK] Seq=5905 Ack=461 Win=2097024 Len=43 TSval=2574747900 TSecr=3030647547
      476    Jul  7, 2025 12:00:01.806718000 UTC    172.22.7.12    8629    10.156.8.10    41350    43    [TCP Spurious Retransmission] 8629 → 41350 [PSH, ACK] Seq=5905 Ack=461 Win=2097024 Len=43 TSval=2574747900 TSecr=3030647547
      477    Jul  7, 2025 12:00:01.806746000 UTC    10.156.8.10    41350    172.22.7.12    8629    0    41350 → 8629 [RST] Seq=461 Win=0 Len=0
      478    Jul  7, 2025 12:00:01.806759000 UTC    10.156.8.10    41350    172.22.7.12    8629    0    41350 → 8629 [RST] Seq=461 Win=0 Len=0 
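
      To confirm that frame 472 is the only packet sourced from the node IP towards the LDAP server, the bidirectional capture can be filtered on that address pair; a minimal sketch using the same capture file as above (it should return only frame 472 if the observation holds):

      $ tshark -r 0110-au40v032.sha.vrkw.de_07_07_2025-11_33_32-UTC-error.pcap \
          -Y "ip.src == 172.22.242.136 and ip.dst == 172.22.7.12" \
          -T fields -e frame.number -e frame.time -e tcp.srcport -e tcp.dstport -e _ws.col.Info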

      On the successful pod, I see the connection being terminated gracefully, and no packets being sent from the node IP itself whatsoever:

      $ tshark -r 0120-au40v032.sha.vrkw.de_07_07_2025-08_03_06-UTC-success.pcap -T fields -e frame.number -e frame.time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e tcp.len -e _ws.col.Info | tail -n10
      433    Jul  7, 2025 08:15:01.517860000 UTC    10.156.8.226    46668    172.22.7.12    8629    31    Encrypted Alert
      434    Jul  7, 2025 08:15:01.517880000 UTC    10.156.8.226    46668    172.22.7.12    8629    31    [TCP Retransmission] 46668 → 8629 [PSH, ACK] Seq=6957 Ack=16482 Win=87168 Len=31 TSval=3831432777 TSecr=2561563997
      435    Jul  7, 2025 08:15:01.517945000 UTC    10.156.8.226    46668    172.22.7.12    8629    0    46668 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=3831432777 TSecr=2561563997
      436    Jul  7, 2025 08:15:01.517952000 UTC    10.156.8.226    46668    172.22.7.12    8629    0    [TCP Retransmission] 46668 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=3831432777 TSecr=2561563997
      437    Jul  7, 2025 08:15:01.518400000 UTC    172.22.7.12    8629    10.156.8.226    46668    0    8629 → 46668 [PSH, ACK] Seq=16482 Ack=6989 Win=2097120 Len=0 TSval=2561564029 TSecr=3831432777
      438    Jul  7, 2025 08:15:01.518430000 UTC    172.22.7.12    8629    10.156.8.226    46668    0    [TCP Dup ACK 437#1] 8629 → 46668 [PSH, ACK] Seq=16482 Ack=6989 Win=2097120 Len=0 TSval=2561564029 TSecr=3831432777
      439    Jul  7, 2025 08:15:01.518713000 UTC    172.22.7.12    8629    10.156.8.226    46668    31    Encrypted Alert
      440    Jul  7, 2025 08:15:01.518725000 UTC    172.22.7.12    8629    10.156.8.226    46668    31    [TCP Retransmission] 8629 → 46668 [PSH, ACK] Seq=16482 Ack=6989 Win=2097120 Len=31 TSval=2561564030 TSecr=3831432777
      441    Jul  7, 2025 08:15:01.518739000 UTC    10.156.8.226    46668    172.22.7.12    8629    0    46668 → 8629 [RST] Seq=6989 Win=0 Len=0
      442    Jul  7, 2025 08:15:01.518748000 UTC    10.156.8.226    46668    172.22.7.12    8629    0    46668 → 8629 [RST] Seq=6989 Win=0 Len=0 

      The customer uses no proxy or firewall and was not able to identify any other network issues. Also, I find it suspicious that the LDAP server resets the connection right when the node IP sends a duplicate of the pod's packet to the LDAP server.

      Version-Release number of selected component (if applicable):

      4.18.16

              Assignee: Unassigned
              Reporter: Patrick Almeida (rhn-support-palmeida)
              QA Contact: Xingxing Xia
              Votes: 0
              Watchers: 3