OpenShift Bugs / OCPBUGS-59875

LDAP group synchronization sporadically failing

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.18.z
    • Component/s: oauth-apiserver
    • Quality / Stability / Reliability

      Description of problem:
      Some LDAP group synchronization jobs fail sporadically. When a job fails, it produces errors like the following:

      2025-06-23T13:15:01.199024510Z I0623 13:15:01.198855       7 query.go:220] LDAP search: base dn="cn=X446377,cn=GenoIDs,cn=Berechtigungen,o=rudiren" and scope 0 for (objectClass=*) requesting [cn dn]
      2025-06-23T13:15:01.199669324Z Error determining LDAP group membership for "OCPGRRDR": membership lookup for user "cn=X446377,cn=GenoIDs,cn=Berechtigungen,o=rudiren" in group "OCPGRRDR" failed because of "unable to read LDAP response packet: read tcp 10.156.9.210:35744->172.22.7.12:8629: read: connection reset by peer".
      2025-06-23T13:15:01.199695147Z 2025/06/23 13:15:01 tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.156.9.210:35744->172.22.7.12:8629: write: broken pipe
      2025-06-23T13:15:01.199720989Z I0623 13:15:01.199700       7 groupsyncer.go:67] Checking LDAP group OCPTCSAP
      2025-06-23T13:15:01.199733426Z I0623 13:15:01.199722       7 query.go:220] LDAP search: base dn="cn=Gruppen,cn=Berechtigungen,o=rudiren" and scope 2 for (&((objectclass=groupofuniquenames))(cn=OCPTCSAP)) requesting [cn uniquemember]
      2025-06-23T13:15:01.199752213Z Error determining LDAP group membership for "OCPTCSAP": LDAP Result Code 200 "Network Error": ldap: connection closed.

      Most of the time, the sync completes successfully:

      $ oc get pods | grep plattform-ldap
      plattform-ldapgroup-sync-29177280-6n64n     0/1     Error       0          14h
      plattform-ldapgroup-sync-29177295-smcs7     0/1     Error       0          13h
      plattform-ldapgroup-sync-29178075-2ctsj     0/1     Error       0          46m
      plattform-ldapgroup-sync-29178090-rpdks     0/1     Completed   0          31m
      plattform-ldapgroup-sync-29178105-7mxgk     0/1     Completed   0          16m
      plattform-ldapgroup-sync-29178120-5ksqd     0/1     Completed   0          1m 

      The customer also noticed that OpenShift, unlike other applications, does not end its LDAP sessions with an unbind request but instead simply disconnects. Following that lead, there are two sync jobs whose schedules run close to each other in time, and both use the same LDAPSyncConfig (a hedged example of re-running such a sync manually is sketched after the listing below):

      $ oc get po | awk '$5 ~ "m"' | grep Error
      app-ldapgroup-sync-29178080-7z9wp           0/1     Error       0          41m
      plattform-ldapgroup-sync-29178075-2ctsj     0/1     Error       0          46m

      $ oc get cronjobs
      NAME                       SCHEDULE       TIMEZONE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE
      app-ldapgroup-sync         5/15 * * * *   <none>          False     0        10d             4y
      etcd-backup-cronjob        30 1 * * *     <none>          False     0        10d             4y
      plattform-ldapgroup-sync   0/15 * * * *   <none>          False     0        10d             2y
      resourcequotas-cronjob     0 0 1 * *      Europe/Berlin   False     0        <none>          6d 
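
      The exact Job command and the LDAPSyncConfig itself are not included in this report. Group synchronization in OpenShift is normally driven by "oc adm groups sync", so a hypothetical sketch that could be used to re-run a sync manually and reproduce the failure outside the CronJob schedule is shown below; the URL, CA path and bind settings are placeholders, while the base DNs, filter and attributes are inferred from the queries in the job logs above:

      $ cat <<'EOF' > ldap-sync.yaml
      # Hypothetical sync config: url/ca are placeholders, bindDN/bindPassword omitted;
      # base DNs and attributes follow the LDAP searches seen in the pod logs.
      kind: LDAPSyncConfig
      apiVersion: v1
      url: ldaps://ldap.example.com:8629
      insecure: false
      ca: /path/to/ldap-ca.crt
      rfc2307:
        groupsQuery:
          baseDN: "cn=Gruppen,cn=Berechtigungen,o=rudiren"
          scope: sub
          derefAliases: never
          filter: (objectclass=groupofuniquenames)
        groupUIDAttribute: cn
        groupNameAttributes: [ cn ]
        groupMembershipAttributes: [ uniquemember ]
        usersQuery:
          baseDN: "cn=GenoIDs,cn=Berechtigungen,o=rudiren"
          scope: sub
          derefAliases: never
        userUIDAttribute: dn
        userNameAttributes: [ cn ]
      EOF
      $ oc adm groups sync --sync-config=ldap-sync.yaml --confirm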

      Packet Captures
      Packet captures were taken on both successful and faulty pods, and the following behavior was observed (all tcpdump commands used to capture the packets were run from the nodes).
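
      The exact capture commands are not part of this report; a capture along the following lines is assumed, run on the node hosting the sync pod and filtered on the LDAP server address and port (the interface and output file name are placeholders):

      $ tcpdump -i any -nn -s 0 -w /tmp/ldap-sync.pcap host 172.22.7.12 and port 8629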

      On a faulty pod, I see the connection being dropped midway, with the last frames being:

      $ tshark -r 0090-au40v032.sha.vrkw.de_04_07_2025-13_03_14-UTC-error.pcap -Y "ip.src == 10.156.8.65" -T fields -e frame.number -e frame.time -e ip.src -e ip.dst -e _ws.col.Info | tail -n10
      412    Jul  4, 2025 13:15:00.978584000 UTC    10.156.8.65    172.22.7.12    [TCP Dup ACK 411#1] 57374 → 8629 [ACK] Seq=6041 Ack=14542 Win=87168 Len=0 TSval=150327893 TSecr=2326016640
      413    Jul  4, 2025 13:15:00.978681000 UTC    10.156.8.65    172.22.7.12    Application Data
      414    Jul  4, 2025 13:15:00.978688000 UTC    10.156.8.65    172.22.7.12    [TCP Retransmission] 57374 → 8629 [PSH, ACK] Seq=6041 Ack=14542 Win=87168 Len=125 TSval=150327894 TSecr=2326016640
      415    Jul  4, 2025 13:15:00.979545000 UTC    10.156.8.65    172.22.7.12    57374 → 8629 [ACK] Seq=6166 Ack=14691 Win=87168 Len=0 TSval=150327894 TSecr=2326016641
      416    Jul  4, 2025 13:15:00.979559000 UTC    10.156.8.65    172.22.7.12    [TCP Dup ACK 415#1] 57374 → 8629 [ACK] Seq=6166 Ack=14691 Win=87168 Len=0 TSval=150327894 TSecr=2326016641
      417    Jul  4, 2025 13:15:00.979635000 UTC    10.156.8.65    172.22.7.12    Application Data
      418    Jul  4, 2025 13:15:00.979651000 UTC    10.156.8.65    172.22.7.12    [TCP Retransmission] 57374 → 8629 [PSH, ACK] Seq=6166 Ack=14691 Win=87168 Len=125 TSval=150327895 TSecr=2326016641
      419    Jul  4, 2025 13:15:00.980080000 UTC    10.156.8.65    172.22.7.12    57374 → 8629 [ACK] Seq=6291 Ack=14840 Win=87168 Len=0 TSval=150327895 TSecr=2326016641
      420    Jul  4, 2025 13:15:00.980093000 UTC    10.156.8.65    172.22.7.12    [TCP Dup ACK 419#1] 57374 → 8629 [ACK] Seq=6291 Ack=14840 Win=87168 Len=0 TSval=150327895 TSecr=2326016641
      421    Jul  4, 2025 13:15:00.990183000 UTC    10.156.8.65    172.22.7.12    Application Data 

      The faulty pod's logs confirm that the connection is dropped midway, most likely because the server sent an RST:

      Gruppen Synchronisieren
      group/OCPCLADM
      group/OCPCLRDR
      group/OCPGRRDR
      group/OCPTCSAP
      group/OCPTCSMG
      Error determining LDAP group membership for "OCP31ADM": unable to read LDAP response packet: read tcp 10.156.8.65:57374->172.22.7.12:8629: read: connection reset by peer.
      unable to read LDAP response packet: read tcp 10.156.8.65:57374->172.22.7.12:8629: read: connection reset by peer
      2025/07/04 13:15:00 tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.156.8.65:57374->172.22.7.12:8629: write: broken pipe
      error: unable to read LDAP response packet: read tcp 10.156.8.65:57374->172.22.7.12:8629: read: connection reset by peer 

      Now, when looking at the packet captures of the successful pod, the pod sends the TLS close_notify ("Encrypted Alert") followed by a FIN, ACK, retransmits them because the server does not acknowledge them, and finally tears the connection down by sending RSTs to the server since it apparently never responded (notice this is the .pcap from the successful pod; a filter to isolate the teardown is sketched after the capture):

      $ tshark -r 0100-au40v032.sha.vrkw.de_04_07_2025-12_51_13-UTC-success.pcap  -T fields -e frame.number -e frame.time -e ip.src -e ip.dst -e _ws.col.Info | tail -n10
      217    Jul  4, 2025 13:00:01.937405000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [ACK] Seq=6957 Ack=16482 Win=87168 Len=0 TSval=113796993 TSecr=2325138636
      218    Jul  4, 2025 13:00:01.937419000 UTC    10.156.8.64    172.22.7.12    [TCP Dup ACK 217#1] 39820 → 8629 [ACK] Seq=6957 Ack=16482 Win=87168 Len=0 TSval=113796993 TSecr=2325138636
      219    Jul  4, 2025 13:00:02.068097000 UTC    10.156.8.64    172.22.7.12    Encrypted Alert
      220    Jul  4, 2025 13:00:02.068130000 UTC    10.156.8.64    172.22.7.12    [TCP Retransmission] 39820 → 8629 [PSH, ACK] Seq=6957 Ack=16482 Win=87168 Len=31 TSval=113797124 TSecr=2325138636
      221    Jul  4, 2025 13:00:02.068185000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=113797124 TSecr=2325138636
      222    Jul  4, 2025 13:00:02.068191000 UTC    10.156.8.64    172.22.7.12    [TCP Retransmission] 39820 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=113797124 TSecr=2325138636
      223    Jul  4, 2025 13:00:02.078606000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [RST] Seq=6989 Win=0 Len=0
      224    Jul  4, 2025 13:00:02.078620000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [RST] Seq=6989 Win=0 Len=0
      225    Jul  4, 2025 13:00:02.079931000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [RST] Seq=6989 Win=0 Len=0
      226    Jul  4, 2025 13:00:02.079950000 UTC    10.156.8.64    172.22.7.12    39820 → 8629 [RST] Seq=6989 Win=0 Len=0 
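
      To compare the teardown phase of the successful and faulty captures side by side, the output can be reduced to FIN and RST segments only; a minimal sketch using the same capture file and field layout as above:

      $ tshark -r 0100-au40v032.sha.vrkw.de_04_07_2025-12_51_13-UTC-success.pcap \
          -Y "tcp.flags.fin == 1 or tcp.flags.reset == 1" \
          -T fields -e frame.number -e ip.src -e ip.dst -e tcp.flags.str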

      The captures above were filtered to show only packets where the LDAP server IP address is the destination. Below are some observations from a bidirectional capture.
      In the last frames of the packet capture from a faulty pod, specifically frames 471 and 472, the same data is sent from both the pod IP (10.156.8.10) and the IP of the node on which the pod is scheduled (172.22.242.136), with the same source port (41350), and the LDAP server then abruptly sends an RST to both. I would consider this normal, but no other frame shows a packet being sent from the node IP to the LDAP server (nor do the packet captures of successful jobs), and I also don't see that connection ever being established from the node IP (a filter to verify this is sketched after the capture):

      $ tshark -r 0110-au40v032.sha.vrkw.de_07_07_2025-11_33_32-UTC-error.pcap -T fields -e frame.number -e frame.time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e tcp.len -e _ws.col.Info| tail -n10
      469    Jul  7, 2025 12:00:01.223364000 UTC    172.22.7.12    8629    10.156.8.10    41350    43    Application Data
      470    Jul  7, 2025 12:00:01.223412000 UTC    172.22.7.12    8629    10.156.8.10    41350    43    [TCP Retransmission] 8629 → 41350 [PSH, ACK] Seq=5905 Ack=461 Win=2097024 Len=43 TSval=2574747331 TSecr=3030647547
      471    Jul  7, 2025 12:00:01.223578000 UTC    10.156.8.10    41350    172.22.7.12    8629    166    Application Data
      472    Jul  7, 2025 12:00:01.223889000 UTC    172.22.242.136    41350    172.22.7.12    8629    166    Application Data
      473    Jul  7, 2025 12:00:01.224112000 UTC    172.22.7.12    8629    172.22.242.136    41350    0    8629 → 41350 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
      474    Jul  7, 2025 12:00:01.224555000 UTC    172.22.7.12    8629    10.156.8.10    41350    0    8629 → 41350 [RST, ACK] Seq=5948 Ack=461 Win=0 Len=0
      475    Jul  7, 2025 12:00:01.806657000 UTC    172.22.7.12    8629    10.156.8.10    41350    43    [TCP Spurious Retransmission] 8629 → 41350 [PSH, ACK] Seq=5905 Ack=461 Win=2097024 Len=43 TSval=2574747900 TSecr=3030647547
      476    Jul  7, 2025 12:00:01.806718000 UTC    172.22.7.12    8629    10.156.8.10    41350    43    [TCP Spurious Retransmission] 8629 → 41350 [PSH, ACK] Seq=5905 Ack=461 Win=2097024 Len=43 TSval=2574747900 TSecr=3030647547
      477    Jul  7, 2025 12:00:01.806746000 UTC    10.156.8.10    41350    172.22.7.12    8629    0    41350 → 8629 [RST] Seq=461 Win=0 Len=0
      478    Jul  7, 2025 12:00:01.806759000 UTC    10.156.8.10    41350    172.22.7.12    8629    0    41350 → 8629 [RST] Seq=461 Win=0 Len=0 
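
      To confirm that frame 472 is the only packet sourced from the node IP towards the LDAP server, the bidirectional capture can be filtered on that address pair; a minimal sketch using the same capture file as above (it should return only frame 472 if the observation holds):

      $ tshark -r 0110-au40v032.sha.vrkw.de_07_07_2025-11_33_32-UTC-error.pcap \
          -Y "ip.src == 172.22.242.136 and ip.dst == 172.22.7.12" \
          -T fields -e frame.number -e frame.time -e tcp.srcport -e tcp.dstport -e _ws.col.Info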

      On the successful pod, I see the connection being terminated gracefully, and no packets being sent from the node IP itself whatsoever:

      $ tshark -r 0120-au40v032.sha.vrkw.de_07_07_2025-08_03_06-UTC-success.pcap -T fields -e frame.number -e frame.time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e tcp.len -e _ws.col.Info | tail -n10
      433    Jul  7, 2025 08:15:01.517860000 UTC    10.156.8.226    46668    172.22.7.12    8629    31    Encrypted Alert
      434    Jul  7, 2025 08:15:01.517880000 UTC    10.156.8.226    46668    172.22.7.12    8629    31    [TCP Retransmission] 46668 → 8629 [PSH, ACK] Seq=6957 Ack=16482 Win=87168 Len=31 TSval=3831432777 TSecr=2561563997
      435    Jul  7, 2025 08:15:01.517945000 UTC    10.156.8.226    46668    172.22.7.12    8629    0    46668 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=3831432777 TSecr=2561563997
      436    Jul  7, 2025 08:15:01.517952000 UTC    10.156.8.226    46668    172.22.7.12    8629    0    [TCP Retransmission] 46668 → 8629 [FIN, ACK] Seq=6988 Ack=16482 Win=87168 Len=0 TSval=3831432777 TSecr=2561563997
      437    Jul  7, 2025 08:15:01.518400000 UTC    172.22.7.12    8629    10.156.8.226    46668    0    8629 → 46668 [PSH, ACK] Seq=16482 Ack=6989 Win=2097120 Len=0 TSval=2561564029 TSecr=3831432777
      438    Jul  7, 2025 08:15:01.518430000 UTC    172.22.7.12    8629    10.156.8.226    46668    0    [TCP Dup ACK 437#1] 8629 → 46668 [PSH, ACK] Seq=16482 Ack=6989 Win=2097120 Len=0 TSval=2561564029 TSecr=3831432777
      439    Jul  7, 2025 08:15:01.518713000 UTC    172.22.7.12    8629    10.156.8.226    46668    31    Encrypted Alert
      440    Jul  7, 2025 08:15:01.518725000 UTC    172.22.7.12    8629    10.156.8.226    46668    31    [TCP Retransmission] 8629 → 46668 [PSH, ACK] Seq=16482 Ack=6989 Win=2097120 Len=31 TSval=2561564030 TSecr=3831432777
      441    Jul  7, 2025 08:15:01.518739000 UTC    10.156.8.226    46668    172.22.7.12    8629    0    46668 → 8629 [RST] Seq=6989 Win=0 Len=0
      442    Jul  7, 2025 08:15:01.518748000 UTC    10.156.8.226    46668    172.22.7.12    8629    0    46668 → 8629 [RST] Seq=6989 Win=0 Len=0 

      The customer uses no proxy or firewall and was not able to identify any other network issues. Also, I find it suspicious that the LDAP server resets the connection right when the node IP sends a duplicate of the pod's packet to the LDAP server.

      Version-Release number of selected component (if applicable):

      4.18.16

              Assignee: Unassigned
              Reporter: Patrick Almeida (rhn-support-palmeida)
              QA Contact: Xingxing Xia
              Votes: 0
              Watchers: 3