Bug
Resolution: Unresolved
4.18.z
Quality / Stability / Reliability
Description of problem:
Some LDAP group synchronization jobs fail sporadically with the following error:
2025-06-23T13:15:01.199024510Z I0623 13:15:01.198855 7 query.go:220] LDAP search: base dn="cn=X446377,cn=GenoIDs,cn=Berechtigungen,o=rudiren" and scope 0 for (objectClass=*) requesting [cn dn]
2025-06-23T13:15:01.199669324Z Error determining LDAP group membership for "OCPGRRDR": membership lookup for user "cn=X446377,cn=GenoIDs,cn=Berechtigungen,o=rudiren" in group "OCPGRRDR" failed because of "unable to read LDAP response packet: read tcp 10.156.9.210:35744->172.22.7.12:8629: read: connection reset by peer".
2025-06-23T13:15:01.199695147Z 2025/06/23 13:15:01 tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.156.9.210:35744->172.22.7.12:8629: write: broken pipe
2025-06-23T13:15:01.199720989Z I0623 13:15:01.199700 7 groupsyncer.go:67] Checking LDAP group OCPTCSAP
2025-06-23T13:15:01.199733426Z I0623 13:15:01.199722 7 query.go:220] LDAP search: base dn="cn=Gruppen,cn=Berechtigungen,o=rudiren" and scope 2 for (&((objectclass=groupofuniquenames))(cn=OCPTCSAP)) requesting [cn uniquemember]
2025-06-23T13:15:01.199752213Z Error determining LDAP group membership for "OCPTCSAP": LDAP Result Code 200 "Network Error": ldap: connection closed.
Most of the time, the jobs complete successfully:
$ oc get pods | grep plattform-ldap
plattform-ldapgroup-sync-29177280-6n64n 0/1 Error 0 14h
plattform-ldapgroup-sync-29177295-smcs7 0/1 Error 0 13h
plattform-ldapgroup-sync-29178075-2ctsj 0/1 Error 0 46m
plattform-ldapgroup-sync-29178090-rpdks 0/1 Completed 0 31m
plattform-ldapgroup-sync-29178105-7mxgk 0/1 Completed 0 16m
plattform-ldapgroup-sync-29178120-5ksqd 0/1 Completed 0 1m
The customer also noticed that OpenShift, unlike other applications, does not end its LDAP sessions with an unbind operation; it simply disconnects. Following that lead, there are two jobs that run close to each other in time, and both use the same LDAPSyncConfig sync configuration (a sketch of what such a configuration typically looks like follows the listing below):
$ oc get po | awk '$5 ~ "m"' | grep Error
app-ldapgroup-sync-29178080-7z9wp 0/1 Error 0 41m
plattform-ldapgroup-sync-29178075-2ctsj 0/1 Error 0 46m

$ oc get cronjobs
NAME SCHEDULE TIMEZONE SUSPEND ACTIVE LAST SCHEDULE AGE
app-ldapgroup-sync 5/15 * * * * <none> False 0 10d 4y
etcd-backup-cronjob 30 1 * * * <none> False 0 10d 4y
plattform-ldapgroup-sync 0/15 * * * * <none> False 0 10d 2y
resourcequotas-cronjob 0 0 1 * * Europe/Berlin False 0 <none> 6d
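
For context, these CronJobs run `oc adm groups sync` against a sync configuration file. The customer's actual configuration was not attached to this report; the following is a minimal sketch of what an rfc2307-style LDAPSyncConfig for this directory might look like. The groups base DN, object class, and member attribute are inferred from the LDAP searches visible in the pod logs above; the URL, bind settings, file paths, and whitelist are purely illustrative placeholders.

$ cat > ldap-sync.yaml <<'EOF'
kind: LDAPSyncConfig
apiVersion: v1
url: ldaps://ldap.example.com:8629   # placeholder host; port taken from the captures
insecure: false
ca: /etc/ldap-ca/ca.crt              # placeholder CA bundle path
bindDN: "cn=sync-user,o=rudiren"     # placeholder bind DN
bindPassword:
  file: /etc/secrets/bindPassword
rfc2307:
  groupsQuery:
    baseDN: "cn=Gruppen,cn=Berechtigungen,o=rudiren"
    scope: sub
    derefAliases: never
    filter: "(objectclass=groupofuniquenames)"
  groupUIDAttribute: cn
  groupNameAttributes: [ cn ]
  groupMembershipAttributes: [ uniquemember ]
  usersQuery:
    baseDN: "o=rudiren"
    scope: sub
    derefAliases: never
  userUIDAttribute: dn
  userNameAttributes: [ cn ]
EOF
$ oc adm groups sync --sync-config=ldap-sync.yaml --whitelist=groups.txt --confirm

The whitelist file would simply list the group names seen in the logs (OCPGRRDR, OCPTCSAP, and so on), one per line.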
Packet Captures
Packet captures were taken for both successful and faulty pods, and the following behavior was observed (all tcpdump commands used to capture the packets were run from the nodes).
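For reference, a capture of this kind can be reproduced roughly as follows; the node name and output path are placeholders, and this assumes the node debug image ships tcpdump (the exact commands the customer ran were not recorded here):

$ oc debug node/<node-name>
sh-5.1# tcpdump -nn -i any -w /host/var/tmp/ldap-sync.pcap host 172.22.7.12 and port 8629   # written onto the node's /var/tmp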
On a faulty pod, I see the connection being dropped midway, with the last frames being:
$ tshark -r 0090-au40v032.sha.vrkw.de_04_07_2025-13_03_14-UTC-error.pcap -Y "ip.src == 10.156.8.65" -T fields -e frame.number -e frame.time -e ip.src -e ip.dst -e _ws.col.Info | tail -n10
412 Jul 4, 2025 13:15:00.978584000 UTC 10.156.8.65 172.22.7.12 [TCP Dup ACK 411#1] 57374 → 8629 [ACK] Seq=6041 Ack=14542 Win=87168 Len=0 TSval=150327893 TSecr=2326016640
413 Jul 4, 2025 13:15:00.978681000 UTC 10.156.8.65 172.22.7.12 Application Data
414 Jul 4, 2025 13:15:00.978688000 UTC 10.156.8.65 172.22.7.12 [TCP Retransmission] 57374 → 8629 [PSH, ACK] Seq=6041 Ack=14542 Win=87168 Len=125 TSval=150327894 TSecr=2326016640
415 Jul 4, 2025 13:15:00.979545000 UTC 10.156.8.65 172.22.7.12 57374 → 8629 [ACK] Seq=6166 Ack=14691 Win=87168 Len=0 TSval=150327894 TSecr=2326016641
416 Jul 4, 2025 13:15:00.979559000 UTC 10.156.8.65 172.22.7.12 [TCP Dup ACK 415#1] 57374 → 8629 [ACK] Seq=6166 Ack=14691 Win=87168 Len=0 TSval=150327894 TSecr=2326016641
417 Jul 4, 2025 13:15:00.979635000 UTC 10.156.8.65 172.22.7.12 Application Data
418 Jul 4, 2025 13:15:00.979651000 UTC 10.156.8.65 172.22.7.12 [TCP Retransmission] 57374 → 8629 [PSH, ACK] Seq=6166 Ack=14691 Win=87168 Len=125 TSval=150327895 TSecr=2326016641
419 Jul 4, 2025 13:15:00.980080000 UTC 10.156.8.65 172.22.7.12 57374 → 8629 [ACK] Seq=6291 Ack=14840 Win=87168 Len=0 TSval=150327895 TSecr=2326016641
420 Jul 4, 2025 13:15:00.980093000 UTC 10.156.8.65 172.22.7.12 [TCP Dup ACK 419#1] 57374 → 8629 [ACK] Seq=6291 Ack=14840 Win=87168 Len=0 TSval=150327895 TSecr=2326016641
421 Jul 4, 2025 13:15:00.990183000 UTC 10.156.8.65 172.22.7.12 Application Data
The faulty pod's logs confirm that the connection is dropped midway, most likely because the server sent an RST (a filter to verify this against a bidirectional capture is sketched after the log):
Gruppen Synchronisieren
group/OCPCLADM
group/OCPCLRDR
group/OCPGRRDR
group/OCPTCSAP
group/OCPTCSMG
Error determining LDAP group membership for "OCP31ADM": unable to read LDAP response packet: read tcp 10.156.8.65:57374->172.22.7.12:8629: read: connection reset by peer.
unable to read LDAP response packet: read tcp 10.156.8.65:57374->172.22.7.12:8629: read: connection reset by peer
2025/07/04 13:15:00 tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.156.8.65:57374->172.22.7.12:8629: write: broken pipe
error: unable to read LDAP response packet: read tcp 10.156.8.65:57374->172.22.7.12:8629: read: connection reset by peer
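
To confirm the server-side RST directly, a capture that includes the server-to-pod direction (such as the bidirectional error capture shown further below) can be filtered on the TCP reset flag, for example:

$ tshark -r 0110-au40v032.sha.vrkw.de_07_07_2025-11_33_32-UTC-error.pcap -Y "tcp.flags.reset == 1 && ip.src == 172.22.7.12" -T fields -e frame.number -e ip.src -e ip.dst -e _ws.col.Info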
Now, looking at the successful pod's packet capture, I see that it sends the closeNotify alert (shown as an Encrypted Alert) followed by a FIN, ACK, retransmits the FIN because of a missing acknowledgment from the server, and finishes by sending RSTs to the server because it apparently did not respond (note this is the .pcap from the successful pod; a filter to isolate these teardown frames is sketched after the output):
$ tshark -r 0100-au40v032.sha.vrkw.de_04_07_2025-12_51_13-UTC-success.pcap -T fields -e frame.number -e frame.time -e ip.src -e ip.dst -e _ws.col.Info | tail -n10
217 Jul 4, 2025 13:00:01.937405000 UTC 10.156.8.64 172.22.7.12 39820 → 8629 [ACK] Seq=6957 Ack=16482 Win=87168 Len=0 TSval=113796993 TSecr=2325138636
218 Jul 4, 2025 13:00:01.937419000 UTC 10.156.8.64 172.22.7.12 [TCP Dup ACK 217#1] 39820 → 8629 [ACK] Seq=6957 Ack=16482 Win=87168 Len=0 TSval=113796993 TSecr=2325138636
219 Jul 4, 2025 13:00:02.068097000 UTC 10.156.8.64 172.22.7.12 Encrypted Alert
220 Jul 4, 2025 13:00:02.068130000 UTC 10.156.8.64 172.22.7.12 [TCP Retransmission] 39820 → 8629 [PSH, ACK] Seq=6957 Ack=16482 Win=87168 Len=31 TSval=113797124 TSecr=2325138636
221 Jul 4, 2025 13:00:02.068185000 UTC 10.156.8.64 172.22.7.12 39820 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=113797124 TSecr=2325138636
222 Jul 4, 2025 13:00:02.068191000 UTC 10.156.8.64 172.22.7.12 [TCP Retransmission] 39820 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=113797124 TSecr=2325138636
223 Jul 4, 2025 13:00:02.078606000 UTC 10.156.8.64 172.22.7.12 39820 → 8629 [RST] Seq=6989 Win=0 Len=0
224 Jul 4, 2025 13:00:02.078620000 UTC 10.156.8.64 172.22.7.12 39820 → 8629 [RST] Seq=6989 Win=0 Len=0
225 Jul 4, 2025 13:00:02.079931000 UTC 10.156.8.64 172.22.7.12 39820 → 8629 [RST] Seq=6989 Win=0 Len=0
226 Jul 4, 2025 13:00:02.079950000 UTC 10.156.8.64 172.22.7.12 39820 → 8629 [RST] Seq=6989 Win=0 Len=0
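
If needed, the teardown frames above (the TLS alert records plus the FIN and RST segments) can be isolated with a display filter along these lines; the field names assume a reasonably recent tshark:

$ tshark -r 0100-au40v032.sha.vrkw.de_04_07_2025-12_51_13-UTC-success.pcap -Y "tls.record.content_type == 21 || tcp.flags.fin == 1 || tcp.flags.reset == 1" -T fields -e frame.number -e ip.src -e ip.dst -e _ws.col.Info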
The captures above were filtered to include only packets whose destination is the LDAP server IP address. Below are some observations from a bidirectional capture:
In the last frames of the packet capture from a faulty pod, specifically frames 471 and 472, I see the same data being sent from both the pod IP (10.156.8.10) and the IP of the node on which this pod is scheduled (172.22.242.136), with the same source port (41350); the LDAP server then abruptly sends an RST to both. I would consider this normal, but no other frame shows a packet being sent from the node IP to the LDAP server (nor do the successful jobs' packet captures), and I also don't see that connection being established from the node IP (a filter to double-check this is sketched after the capture):
$ tshark -r 0110-au40v032.sha.vrkw.de_07_07_2025-11_33_32-UTC-error.pcap -T fields -e frame.number -e frame.time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e tcp.len -e _ws.col.Info | tail -n10
469 Jul 7, 2025 12:00:01.223364000 UTC 172.22.7.12 8629 10.156.8.10 41350 43 Application Data
470 Jul 7, 2025 12:00:01.223412000 UTC 172.22.7.12 8629 10.156.8.10 41350 43 [TCP Retransmission] 8629 → 41350 [PSH, ACK] Seq=5905 Ack=461 Win=2097024 Len=43 TSval=2574747331 TSecr=3030647547
471 Jul 7, 2025 12:00:01.223578000 UTC 10.156.8.10 41350 172.22.7.12 8629 166 Application Data
472 Jul 7, 2025 12:00:01.223889000 UTC 172.22.242.136 41350 172.22.7.12 8629 166 Application Data
473 Jul 7, 2025 12:00:01.224112000 UTC 172.22.7.12 8629 172.22.242.136 41350 0 8629 → 41350 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
474 Jul 7, 2025 12:00:01.224555000 UTC 172.22.7.12 8629 10.156.8.10 41350 0 8629 → 41350 [RST, ACK] Seq=5948 Ack=461 Win=0 Len=0
475 Jul 7, 2025 12:00:01.806657000 UTC 172.22.7.12 8629 10.156.8.10 41350 43 [TCP Spurious Retransmission] 8629 → 41350 [PSH, ACK] Seq=5905 Ack=461 Win=2097024 Len=43 TSval=2574747900 TSecr=3030647547
476 Jul 7, 2025 12:00:01.806718000 UTC 172.22.7.12 8629 10.156.8.10 41350 43 [TCP Spurious Retransmission] 8629 → 41350 [PSH, ACK] Seq=5905 Ack=461 Win=2097024 Len=43 TSval=2574747900 TSecr=3030647547
477 Jul 7, 2025 12:00:01.806746000 UTC 10.156.8.10 41350 172.22.7.12 8629 0 41350 → 8629 [RST] Seq=461 Win=0 Len=0
478 Jul 7, 2025 12:00:01.806759000 UTC 10.156.8.10 41350 172.22.7.12 8629 0 41350 → 8629 [RST] Seq=461 Win=0 Len=0
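
A quick way to double-check that the node IP appears in no other frame of this conversation is to filter the same capture on the node and LDAP server addresses; per the observation above, frames 472 and 473 should be the only matches:

$ tshark -r 0110-au40v032.sha.vrkw.de_07_07_2025-11_33_32-UTC-error.pcap -Y "ip.addr == 172.22.242.136 && ip.addr == 172.22.7.12" -T fields -e frame.number -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e _ws.col.Info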
On the successful pod, I see the connection being terminated gracefully, and no packets being sent from the node IP itself whatsoever:
$ tshark -r 0120-au40v032.sha.vrkw.de_07_07_2025-08_03_06-UTC-success.pcap -T fields -e frame.number -e frame.time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e tcp.len -e _ws.col.Info | tail -n10
433 Jul 7, 2025 08:15:01.517860000 UTC 10.156.8.226 46668 172.22.7.12 8629 31 Encrypted Alert
434 Jul 7, 2025 08:15:01.517880000 UTC 10.156.8.226 46668 172.22.7.12 8629 31 [TCP Retransmission] 46668 → 8629 [PSH, ACK] Seq=6957 Ack=16482 Win=87168 Len=31 TSval=3831432777 TSecr=2561563997
435 Jul 7, 2025 08:15:01.517945000 UTC 10.156.8.226 46668 172.22.7.12 8629 0 46668 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=3831432777 TSecr=2561563997
436 Jul 7, 2025 08:15:01.517952000 UTC 10.156.8.226 46668 172.22.7.12 8629 0 [TCP Retransmission] 46668 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=3831432777 TSecr=2561563997
437 Jul 7, 2025 08:15:01.518400000 UTC 172.22.7.12 8629 10.156.8.226 46668 0 8629 → 46668 [PSH, ACK] Seq=16482 Ack=6989 Win=2097120 Len=0 TSval=2561564029 TSecr=3831432777
438 Jul 7, 2025 08:15:01.518430000 UTC 172.22.7.12 8629 10.156.8.226 46668 0 [TCP Dup ACK 437#1] 8629 → 46668 [PSH, ACK] Seq=16482 Ack=6989 Win=2097120 Len=0 TSval=2561564029 TSecr=3831432777
439 Jul 7, 2025 08:15:01.518713000 UTC 172.22.7.12 8629 10.156.8.226 46668 31 Encrypted Alert
440 Jul 7, 2025 08:15:01.518725000 UTC 172.22.7.12 8629 10.156.8.226 46668 31 [TCP Retransmission] 8629 → 46668 [PSH, ACK] Seq=16482 Ack=6989 Win=2097120 Len=31 TSval=2561564030 TSecr=3831432777
441 Jul 7, 2025 08:15:01.518739000 UTC 10.156.8.226 46668 172.22.7.12 8629 0 46668 → 8629 [RST] Seq=6989 Win=0 Len=0
442 Jul 7, 2025 08:15:01.518748000 UTC 10.156.8.226 46668 172.22.7.12 8629 0 46668 → 8629 [RST] Seq=6989 Win=0 Len=0
The customer uses no proxy or firewall and was not able to identify any other network issues. I also find it suspicious that the LDAP server resets the connection right after the node IP sends a duplicate of the pod's packet to it.
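The duplicate frame carrying the node IP with the pod's source port could indicate that the connection is being source-NATed on its way out of the node. If that needs to be ruled in or out, one option is to inspect the node's conntrack table for the LDAP server while a sync job is running (illustrative; assumes the conntrack tool is available on the node):

$ oc debug node/<node-name> -- chroot /host conntrack -L -d 172.22.7.12 -p tcp --dport 8629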
Version-Release number of selected component (if applicable):
4.18.16