-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
4.14.z, 4.15.z, 4.17.z, 4.16.z, 4.18.z, 4.19.z, 4.20.z, 4.21.0
-
Quality / Stability / Reliability
-
True
-
-
None
-
Important
-
None
-
None
-
None
-
None
-
contract-priority
-
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Description of problem:
Summary:
The haproxy pod became unable to query DNS (UDP communication failure) after the dns-default-xxx pods were deleted and recreated.
The issue is resolved by deleting the conntrack record on the worker node where haproxy is running; this record holds the IP address of an old (already deleted) dns-default pod.
Details:
The dns pod was deleted:
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-86409e992e7e8323656a6a25ad2aead3cd82beea55680af1fd65c00a73a6abc4/host_service_logs/masters/kubelet_service.log:Jun 02 05:29:56.984844 etcd-1.paas.tmg.local kubenswrapper[2396]: I0602 05:29:56.984780 2396 kubelet.go:2449] "SyncLoop DELETE" source="api" pods=[openshift-dns/dns-default-jvbrs]
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-86409e992e7e8323656a6a25ad2aead3cd82beea55680af1fd65c00a73a6abc4/host_service_logs/masters/kubelet_service.log:Jun 02 05:29:56.988325 etcd-1.paas.tmg.local kubenswrapper[2396]: I0602 05:29:56.988269 2396 kubelet.go:2443] "SyncLoop REMOVE" source="api" pods=[openshift-dns/dns-default-jvbrs]
However, after that, the following conntrack record still existed:
udp 17 119 src=10.130.2.168 dst=172.30.0.10 sport=44882 dport=53 src=10.129.0.106 dst=10.130.2.168 sport=5353 dport=44882 [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 zone=20 use=1
- 10.130.2.168 is the IP address of the haproxy-0 pod
- 172.30.0.10 is the DNS Service IP
- 10.129.0.106 is the IP address of the old (already deleted) DNS pod (dns-default-jvbrs)
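For reference, a way to check on the affected worker node whether such a stale entry is still present might look like the following (illustrative only; the node name is a placeholder, the IPs are the ones from the record above, and it assumes the conntrack tool can be run on the host, for example via oc debug node):

# list UDP conntrack entries toward the DNS Service IP on the node running haproxy-0
oc debug node/<worker-node> -- chroot /host conntrack -L -p udp --dst 172.30.0.10 --dport 53
# list only entries whose reply source is the old (deleted) dns-default pod IP
oc debug node/<worker-node> -- chroot /host conntrack -L -p udp --reply-src 10.129.0.106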
It seems that a stale UDP conntrack entry exists and is causing the DNS query failures. The issue is resolved after deleting the old conntrack record.
In my understanding, an old conntrack record should be removed once 180s have passed without any active communication, but in this case the old conntrack record keeps being retained.
My assumption is that, because haproxy sends DNS queries frequently, packets keep matching the old entry, so the conntrack retention time keeps being refreshed.
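As a hedged sketch of the workaround and of checking the kernel timeouts that would normally expire such an entry (the node name is a placeholder; the sysctl names are the standard netfilter ones):

# delete the stale entry that still points at the old dns-default pod IP
oc debug node/<worker-node> -- chroot /host conntrack -D -p udp --reply-src 10.129.0.106
# check the UDP conntrack timeouts; an assured entry is refreshed by every
# matching packet, which would explain why it never expires under constant queries
oc debug node/<worker-node> -- chroot /host sysctl net.netfilter.nf_conntrack_udp_timeout net.netfilter.nf_conntrack_udp_timeout_stream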
I found this Jira [1]; it seems to describe the same issue, but that Jira is for the SDN component, not for ovn-kubernetes.
In that Jira [1] it was said that "It seems the issue does not exist with OVN (OpenShift 4.16.13)".
However, with OVN it appears to occur less frequently than with SDN and is difficult to reproduce, but it does occur in reality.
The UDP conntrack cleanup logic seems to be the same in 4.14 [2] and 4.16 [3].
(It also appears to be the same in newer versions such as 4.18. If my confirmation point is not correct, I apologize.)
[1] https://issues.redhat.com/browse/OCPBUGS-42203
Version-Release number of selected component (if applicable):
OpenShift 4.14.51
openshift4/ose-ovn-kubernetes@sha256:9c1407542398da5dda6c7c335c36221ba7c78df70c3d90c182b7f8e2eb4e0c91
How reproducible:
Not always, but sometimes (about 25% of attempts when executing the reproduction steps in the customer environment).
Steps to Reproduce:
1. Create haproxy pods in advance.
2. Delete the dns-default pods and wait for them to be recreated.
3. Execute an operation that triggers DNS queries via the haproxy pods (see the verification sketch below).
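For step 3, one possible way to trigger and verify DNS resolution from the haproxy pod, and to compare the conntrack reply address with the current dns-default pod IPs, might be (pod/namespace names, the node name, and the lookup target are placeholders; nslookup may not be available in every image):

# current dns-default pod IPs after recreation
oc -n openshift-dns get pods -o wide
# trigger a DNS query from the haproxy pod against the DNS Service IP
oc -n <haproxy-namespace> exec haproxy-0 -- nslookup kubernetes.default.svc.cluster.local 172.30.0.10
# on the worker node, check whether the reply address still points at a deleted dns pod
oc debug node/<worker-node> -- chroot /host conntrack -L -p udp --dst 172.30.0.10 --dport 53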
Actual results:
The old conntrack record still existed, and DNS resolution failures occurred.
Expected results:
DNS resolution should continue to work after the dns-default pods are recreated.
Affected Platforms:
None