Bug
Resolution: Not a Bug
Normal
Moderate
rhel-stacks-services-scripting
ssg_core_services
Galera socat cluster monitoring is flaky; here's a complete run-down of what the customer found and how:
We saw the DOWNs in haproxy.log but did not understand where they were coming from, since everything, including mysql, was stable.
We started hammering the endpoints using
time while true; do if curl -s http://REMOTE_HOST_GOES_HERE:9200/ |egrep -q 'is sync'; then sleep 0.01s; else echo failed; break; fi; done
and tcpdumped on both SRC and DST.
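For reference, a capture invocation along these lines is enough to catch the failing conversations (the interface name is a placeholder; this is an illustration, not necessarily the exact command used):
tcpdump -i INTERFACE_GOES_HERE -w clustercheck.pcap 'tcp port 9200'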
The tcpdumps we got indicated that the DST / "Server" was sending "ACK", "PSH ACK" and "RST ACK" in short succession without anything in between from the SRC/"Client".
The construct that's "serving" clustercheck on RHOSP17 looks like this:
bash-5.1$ cat /run_command
bash -c $* -- eval source /etc/sysconfig/clustercheck; exec socat "$TRIPLEO_SOCAT_BIND" system:"/usr/bin/clustercheck; sleep '${TRIPLEO_POST_STATUS_WAIT:-0}'",nofork
with the variables as follows:
bash-5.1$ cat /etc/sysconfig/clustercheck
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD='PASSWORD_GOES_HERE'
MYSQL_HOST=localhost
TRIPLEO_SOCAT_BIND='tcp4-listen:9200,bind="LOCAL_BIND_IP_GOES_HERE",reuseaddr,fork'
TRIPLEO_POST_STATUS_WAIT=0
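With those values substituted, the socat invocation effectively becomes something like (password and bind address elided as above):
socat 'tcp4-listen:9200,bind="LOCAL_BIND_IP_GOES_HERE",reuseaddr,fork' system:"/usr/bin/clustercheck; sleep '0'",nofork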
To eliminate as many possible causes as we could, we removed podman/kolla and mysql from the equation by creating a runner-script:
[root@controller03 cctest]# cat runner.sh
#!/bin/bash
TRIPLEO_SOCAT_BIND='tcp6-listen:9200,bind="[IP_GOES_HERE]",reuseaddr,fork'
TRIPLEO_POST_STATUS_WAIT=0
socat "$TRIPLEO_SOCAT_BIND" system:"./responder.sh; sleep 0",nofork
and a responder-script:
[root@controller03 cctest]# cat responder.sh
#!/bin/bash
echo -en "HTTP/1.1 200 OK\r\n"
echo -en "Content-Type: text/plain\r\n"
echo -en "Connection: close\r\n"
echo -en "Content-Length: 32\r\n"
echo -en "\r\n"
echo -en "Galera cluster node is synced.\r\n"
sleep 0.1
exit 0
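A quick manual check of this pair, matching the same grep used in the hammering loop, would look roughly like this (IP_GOES_HERE is the IPv6 bind address from runner.sh; the brackets and curl's -g flag are needed for an IPv6 literal):
curl -s -g 'http://[IP_GOES_HERE]:9200/' | egrep 'is sync'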
To take as many network components as possible out of the way, we moved to IPv6 on a network that was not part of the OVN stack but instead a plain network directly on the NICs.
Since our networks are all bonds across two devices, we also shut down one device per bond and tcpdumped on the en* interface directly.
We were still able to reproduce the issue in a reliable fashion.
Finally - to figure out if we were missing anything here - we mocked up a tiny python-server that does exactly the same thing socat/clustercheck are doing:
import socket
import threading
# Define the response headers and content
response_headers = """\
HTTP/1.1 200 OK
Content-Type: text/plain
Connection: close
Content-Length: 32
"""
response_content = "Galera cluster node is synced.\r\n"
# Define the server socket
server_socket = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server_socket.bind(('aaaa:aaaa:aaaa:aaa::', 9200))
server_socket.listen(5)
print("Server listening on port 9200...")
def handle_client(client_socket):
    """
    Handle incoming client connections and send the response.
    """
    request = client_socket.recv(1024)
    if request.startswith(b'GET'):
        response = response_headers.encode() + response_content.encode()
        client_socket.sendall(response)
    client_socket.close()

while True:
    client_socket, client_address = server_socket.accept()
    client_thread = threading.Thread(target=handle_client, args=(client_socket,))
    client_thread.start()
and subjected that to the same hammering we threw against the socat-solution.
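Against the IPv6 literal the python server binds to, the hammering loop looks like this (same loop as before, only the URL changed to the bracketed address plus curl's -g flag):
time while true; do if curl -s -g 'http://[aaaa:aaaa:aaaa:aaa::]:9200/' |egrep -q 'is sync'; then sleep 0.01s; else echo failed; break; fi; done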
While socat reliably failed in under 30 minutes, the python-solution has been running for 20 hours without a single failure.
I will attach two tcpdumps from the DST/"Server".
One will contain two tcp-streams, with tcp.stream eq 0 being a "working" conversation between client and server, and tcp.stream eq 1 being a broken conversation that the server aborted.
The other one contains a tcp-stream from the python-solution for reference. As far as we can tell, that one is not only reliably working but also a lot cleaner, since there is no "RST,ACK" in there at all - we assume those are caused by socat being forcefully closed when the bash-script terminates.
Our assumption here is that socat is not really suited for this kind of application. We suspect some kind of threading/timing issue, but we lack the means to debug or prove that.