Bug
Resolution: Not a Bug
Normal
Moderate
rhel-stacks-services-scripting
ssg_core_services
Galera socat cluster monitoring is flaky; here's a complete run-down of what the customer found and how:
We saw the DOWNs in haproxy.log but did not understand where they were coming from, since everything, including mysql, was stable.
We started hammering the endpoints using
time while true; do if curl -s http://REMOTE_HOST_GOES_HERE:9200/ |egrep -q 'is sync'; then sleep 0.01s; else echo failed; break; fi; done
and tcpdumped on both SRC and DST.
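For reference, a capture invocation along these lines is enough to catch the failing conversations (the interface name is a placeholder; this is an illustration, not necessarily the exact command used):
tcpdump -i INTERFACE_GOES_HERE -w clustercheck.pcap 'tcp port 9200'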
The tcpdumps we got indicated that the DST / "Server" was sending "ACK", "PSH ACK" and "RST ACK" in short succession without anything in between from the SRC/"Client".
The construct that's "serving" clustercheck on RHOSP17 looks like this:
bash-5.1$ cat /run_command
bash -c $* -- eval source /etc/sysconfig/clustercheck; exec socat "$TRIPLEO_SOCAT_BIND" system:"/usr/bin/clustercheck; sleep '${TRIPLEO_POST_STATUS_WAIT:-0}'",nofork
with the variables as follows:
bash-5.1$ cat /etc/sysconfig/clustercheck
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD='PASSWORD_GOES_HERE'
MYSQL_HOST=localhost
TRIPLEO_SOCAT_BIND='tcp4-listen:9200,bind="LOCAL_BIND_IP_GOES_HERE",reuseaddr,fork'
TRIPLEO_POST_STATUS_WAIT=0
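With those values substituted, the socat invocation effectively becomes something like (password and bind address elided as above):
socat 'tcp4-listen:9200,bind="LOCAL_BIND_IP_GOES_HERE",reuseaddr,fork' system:"/usr/bin/clustercheck; sleep '0'",nofork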
To eliminate as many possible causes as we could, we removed podman/kolla and mysql from the equation by creating a runner-script:
[root@controller03 cctest]# cat runner.sh
#!/bin/bash
TRIPLEO_SOCAT_BIND='tcp6-listen:9200,bind="[IP_GOES_HERE]",reuseaddr,fork'
TRIPLEO_POST_STATUS_WAIT=0
socat "$TRIPLEO_SOCAT_BIND" system:"./responder.sh; sleep 0",nofork
and a responder-script:
[root@controller03 cctest]# cat responder.sh
#!/bin/bash
echo -en "HTTP/1.1 200 OK\r\n"
echo -en "Content-Type: text/plain\r\n"
echo -en "Connection: close\r\n"
echo -en "Content-Length: 32\r\n"
echo -en "\r\n"
echo -en "Galera cluster node is synced.\r\n"
sleep 0.1
exit 0
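A quick manual check of this pair, matching the same grep used in the hammering loop, would look roughly like this (IP_GOES_HERE is the IPv6 bind address from runner.sh; the brackets and curl's -g flag are needed for an IPv6 literal):
curl -s -g 'http://[IP_GOES_HERE]:9200/' | egrep 'is sync'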
To take as many network components as possible out of the way, we moved to IPv6 on a network that was not part of the OVN stack but instead a plain network directly on the NICs.
Since our networks are all bonds across two devices, we also shut down one device per bond and tcpdumped on the en* interface directly.
We were still able to reproduce the issue in a reliable fashion.
Finally - to figure out if we were missing anything here - we mocked up a tiny python-server that does exactly the same thing socat/clustercheck are doing:
import socket
import threading
# Define the response headers and content
response_headers = """\
HTTP/1.1 200 OK
Content-Type: text/plain
Connection: close
Content-Length: 32
"""
response_content = "Galera cluster node is synced.\r\n"
# Define the server socket
server_socket = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server_socket.bind(('aaaa:aaaa:aaaa:aaa::', 9200))
server_socket.listen(5)
print("Server listening on port 9200...")
def handle_client(client_socket):
    """
    Handle incoming client connections and send the response.
    """
    request = client_socket.recv(1024)
    if request.startswith(b'GET'):
        response = response_headers.encode() + response_content.encode()
        client_socket.sendall(response)
    client_socket.close()

while True:
    client_socket, client_address = server_socket.accept()
    client_thread = threading.Thread(target=handle_client, args=(client_socket,))
    client_thread.start()
and subjected that to the same hammering we threw against the socat-solution.
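Against the IPv6 literal the python server binds to, the hammering loop looks like this (same loop as before, only the URL changed to the bracketed address plus curl's -g flag):
time while true; do if curl -s -g 'http://[aaaa:aaaa:aaaa:aaa::]:9200/' |egrep -q 'is sync'; then sleep 0.01s; else echo failed; break; fi; done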
While socat reliably failed in under 30 minutes, the python-solution has been running for 20 hours without a single failure.
I will attach two tcpdumps from the DST/"Server".
One will contain two tcp-streams, with tcp.stream eq 0 being a "working" conversation between client and server, and tcp.stream eq 1 being a broken conversation that the server aborted.
The other one contains a tcp-stream from the python-solution for reference. As far as we can tell, that one is not only reliably working but also a lot cleaner, since there is no "RST,ACK" in there at all - we assume those are caused by socat being forcefully closed when the bash-script terminates.
Our assumption here is that socat is not really suited for this kind of application. We suspect some kind of threading/timing issue, but we lack the means to debug or prove that.