RHEL-77762

BZ#2298210 galera socat cluster monitoring is flaky


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • Component: socat
    • Severity: Moderate
    • rhel-stacks-services-scripting
    • ssg_core_services

      galera socat cluster monitoring is flaky. Here is a complete run-down of what the customer found and how:

      We saw the DOWNs in haproxy.log but did not understand where they were coming from, since everything, including mysql, was stable.
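
      For reference, haproxy reports these transitions with lines of the form "Server <backend>/<server> is DOWN", so they are easy to pull out with a simple grep; the log path below is only an assumption, it differs between deployments:

      grep -i 'is DOWN' /var/log/containers/haproxy/haproxy.log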

      We started hammering the endpoints using

      time while true; do if curl -s http://REMOTE_HOST_GOES_HERE:9200/ |egrep -q 'is sync'; then sleep 0.01s; else echo failed; break; fi; done
      and tcpdumped on both SRC and DST.
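
      The captures were taken roughly along these lines (a sketch; the interface name and output file are assumptions, not the exact invocation used):

      tcpdump -nn -i any -w /tmp/clustercheck.pcap 'tcp port 9200'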

      The tcpdumps we got indicated that the DST/"Server" was sending "ACK", "PSH,ACK" and "RST,ACK" in quick succession, without anything in between from the SRC/"Client".
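
      A quick way to confirm this pattern in a capture is to filter for resets on the clustercheck port; a minimal sketch, with the capture file name assumed:

      tshark -r dst_capture.pcap -Y 'tcp.flags.reset == 1 and tcp.port == 9200'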

      The construct that's "serving" clustercheck on RHOSP17 looks like this:

      bash-5.1$ cat /run_command
      bash -c $* -- eval source /etc/sysconfig/clustercheck; exec socat "$TRIPLEO_SOCAT_BIND" system:"/usr/bin/clustercheck; sleep '${TRIPLEO_POST_STATUS_WAIT:-0}'",nofork
      with the variables as follows:

      bash-5.1$ cat /etc/sysconfig/clustercheck
      MYSQL_USERNAME=clustercheck

      MYSQL_PASSWORD='PASSWORD_GOES_HERE'

      MYSQL_HOST=localhost

      TRIPLEO_SOCAT_BIND='tcp4-listen:9200,bind="LOCAL_BIND_IP_GOES_HERE",reuseaddr,fork'

      TRIPLEO_POST_STATUS_WAIT=0
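
      For clarity, with those variables substituted the container effectively ends up running something along these lines (illustrative only, placeholder kept): socat listens on port 9200, forks a child per incoming connection, and each child runs clustercheck and writes its output back to the client.

      socat 'tcp4-listen:9200,bind="LOCAL_BIND_IP_GOES_HERE",reuseaddr,fork' system:"/usr/bin/clustercheck; sleep '0'",nofork
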
      To eliminate as many possible causes as we could, we removed podman/kolla and mysql from the equation by creating a runner script:

      [root@controller03 cctest]# cat runner.sh
      #!/bin/bash
      TRIPLEO_SOCAT_BIND='tcp6-listen:9200,bind="[IP_GOES_HERE]",reuseaddr,fork'
      TRIPLEO_POST_STATUS_WAIT=0
      socat "$TRIPLEO_SOCAT_BIND" system:"./responder.sh; sleep 0",nofork
      and a responder-script:

      [root@controller03 cctest]# cat responder.sh
      #!/bin/bash
      echo -en "HTTP/1.1 200 OK\r\n"
      echo -en "Content-Type: text/plain\r\n"
      echo -en "Connection: close\r\n"
      echo -en "Content-Length: 32\r\n"
      echo -en "\r\n"
      echo -en "Galera cluster node is synced.\r\n"
      sleep 0.1
      exit 0
      To get as many network components as possible out of the way, we moved to IPv6 on a network that was not part of the OVN stack but instead a plain network directly on the NICs.
      Since our networks are all bonds across two devices, we also shut down one device per bond and tcpdumped on the en* interface directly.
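
      Roughly like this; the interface names here are placeholders rather than the real devices from the environment:

      ip link set dev eno2 down
      tcpdump -nn -i eno1 -w /tmp/bond_member.pcap 'tcp port 9200'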

      We were still able to reproduce the issue in a reliable fashion.

      Finally, to figure out whether we were missing anything here, we mocked up a tiny python server that does exactly the same thing socat/clustercheck are doing:

      import socket
      import threading

      # Define the response headers and content
      response_headers = """\
      HTTP/1.1 200 OK
      Content-Type: text/plain
      Connection: close
      Content-Length: 32

      """
      response_content = "Galera cluster node is synced.\r\n"

      # Define the server socket
      server_socket = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
      server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      server_socket.bind(('aaaa:aaaa:aaaa:aaa::', 9200))
      server_socket.listen(5)
      print("Server listening on port 9200...")

      def handle_client(client_socket):
          """
          Handle incoming client connections and send the response.
          """
          request = client_socket.recv(1024)
          if request.startswith(b'GET'):
              response = response_headers.encode() + response_content.encode()
              client_socket.sendall(response)
          client_socket.close()

      while True:
          client_socket, client_address = server_socket.accept()
          client_thread = threading.Thread(target=handle_client, args=(client_socket,))
          client_thread.start()
      and subjected that to the same hammering we threw against the socat solution.
      While socat reliably failed in under 30 minutes, the python solution has been running for 20 hours without a single failure.
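
      Because the test now targets an IPv6 literal, the hammering loop needs curl's -g/--globoff option so the brackets are not treated as URL globbing; a sketch of the adapted loop, using the placeholder address from the python server above:

      time while true; do if curl -s -g 'http://[aaaa:aaaa:aaaa:aaa::]:9200/' | egrep -q 'is sync'; then sleep 0.01s; else echo failed; break; fi; done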

      I will attach two tcpdumps from the DST/"Server".
      One will contain two tcp streams, with tcp.stream eq 0 being a "working" conversation between client and server and tcp.stream eq 1 being a broken conversation that the server aborted.
      The other one contains a tcp stream from the python solution for reference; as far as we can tell it is not only working reliably but also a lot cleaner, since there is no "RST,ACK" in there at all. We assume those are caused by socat being forcefully closed when the bash script terminates.
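
      To follow an individual conversation in those captures, the stream index can be used as a display filter; a sketch, with the attachment file name assumed:

      tshark -r dst_server.pcap -Y 'tcp.stream == 1'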

      Our assumption here is that socat is not really suited for this kind of application. We suspect it is some kind of threading issue, but we lack the means to debug or prove that.
