OpenShift Bugs / OCPBUGS-57053

[OCP4.14][OVN-Kubernetes] stale conntrack UDP entry when deleting openshift-dns pod



      Description of problem:

      Summary:
      After the dns-default-xxx pods were deleted and recreated, the haproxy pod could no longer query DNS (UDP communication failure).
      The issue was resolved by deleting a conntrack record on the worker node where haproxy is running; that record still referenced the IP address of an old (already deleted) dns-default pod.

      Details:
      The DNS pod was deleted:

      quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-86409e992e7e8323656a6a25ad2aead3cd82beea55680af1fd65c00a73a6abc4/host_service_logs/masters/kubelet_service.log:Jun 02 05:29:56.984844 etcd-1.paas.tmg.local kubenswrapper[2396]: I0602 05:29:56.984780    2396 kubelet.go:2449] "SyncLoop DELETE" source="api" pods=[openshift-dns/dns-default-jvbrs]
      quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-86409e992e7e8323656a6a25ad2aead3cd82beea55680af1fd65c00a73a6abc4/host_service_logs/masters/kubelet_service.log:Jun 02 05:29:56.988325 etcd-1.paas.tmg.local kubenswrapper[2396]: I0602 05:29:56.988269    2396 kubelet.go:2443] "SyncLoop REMOVE" source="api" pods=[openshift-dns/dns-default-jvbrs]

       

      However, after that, the following conntrack record still existed:

      udp      17 119 src=10.130.2.168 dst=172.30.0.10 sport=44882 dport=53 src=10.129.0.106 dst=10.130.2.168 sport=5353 dport=44882 [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 zone=20 use=1
      • 10.130.2.168 is the IP address of the haproxy-0 pod
      • 172.30.0.10 is the DNS Service IP
      • 10.129.0.106 is the IP address of the old (already deleted) DNS pod (dns-default-jvbrs)
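
To make the reading of the record above explicit, here is a minimal Python sketch that splits the conntrack line into its original and reply tuples and checks whether the reply source (the DNAT target) is still a live backend. The function names and the live backend IP `10.129.0.200` are hypothetical and exist only for illustration; only the conntrack line itself comes from this report.

```python
import re

def parse_conntrack_udp(line):
    """Split a conntrack listing line into its original and reply tuples."""
    # Each direction contributes one src=/dst=/sport=/dport= group, in order.
    pattern = re.compile(r"src=(\S+) dst=(\S+) sport=(\d+) dport=(\d+)")
    tuples = [
        {"src": s, "dst": d, "sport": int(sp), "dport": int(dp)}
        for s, d, sp, dp in pattern.findall(line)
    ]
    if len(tuples) != 2:
        raise ValueError("expected an original and a reply tuple")
    return {"original": tuples[0], "reply": tuples[1]}

def is_stale(entry, live_backend_ips):
    """Stale means the reply source (the DNAT target) is not a live backend."""
    return entry["reply"]["src"] not in live_backend_ips

line = (
    "udp      17 119 src=10.130.2.168 dst=172.30.0.10 sport=44882 dport=53 "
    "src=10.129.0.106 dst=10.130.2.168 sport=5353 dport=44882 [ASSURED] "
    "mark=2 secctx=system_u:object_r:unlabeled_t:s0 zone=20 use=1"
)
entry = parse_conntrack_udp(line)

# Hypothetical set of DNS pod IPs that exist after the pods were recreated.
live_backends = {"10.129.0.200"}
print(entry["original"]["dst"], "->", entry["reply"]["src"],
      "stale" if is_stale(entry, live_backends) else "ok")
```

The original tuple shows the client addressing the Service IP on port 53, while the reply tuple shows which pod IP and port the traffic was actually translated to.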

      It seems that a stale conntrack UDP entry exists and is causing the DNS query failures. The issue is resolved after deleting the old conntrack records.
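
The report does not record the exact command used to delete the record. As a hedged sketch, the following Python helper (a hypothetical name) builds one plausible `conntrack` invocation for that workaround, assuming conntrack-tools is available on the worker node and the command is run there with sufficient privileges. It only prints the command; it does not execute it.

```python
import shlex

def conntrack_delete_argv(service_ip, stale_pod_ip, proto="udp"):
    """Build a conntrack command that deletes entries DNATed to a stale backend.

    -D            delete matching entries
    -p            restrict the match to one protocol
    --orig-dst    destination in the original direction (the Service IP)
    --reply-src   source in the reply direction (the stale pod IP)
    """
    return [
        "conntrack", "-D",
        "-p", proto,
        "--orig-dst", service_ip,
        "--reply-src", stale_pod_ip,
    ]

# Values taken from the conntrack record in this report.
argv = conntrack_delete_argv("172.30.0.10", "10.129.0.106")
print(shlex.join(argv))
```

Matching on both the Service IP and the stale pod IP keeps the deletion narrow, so entries that already point at the recreated DNS pods are left alone.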

      As I understand it, an old conntrack record is deleted after 180 s if there is no active communication. In this case, however, the old conntrack record persists.

      My assumption is that haproxy issues DNS queries frequently, so packets keep matching the old entry and its conntrack timeout keeps being refreshed.
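
The assumption above can be illustrated with a small Python model of an idle timeout that is reset by every matching packet. This is a simplified sketch, not kernel code: the 180 s value is the figure quoted in this report (the effective value depends on the node's conntrack sysctl settings), the 30 s query interval is hypothetical, and the model assumes the client keeps sending from the same source port so that each query matches the same entry.

```python
def entry_expiry(timeout_s, packet_times):
    """Return when an entry created at t=0 expires, given matching packet times.

    Every packet that arrives before the entry has expired resets the idle timer.
    """
    expires_at = timeout_s
    for t in sorted(packet_times):
        if t >= expires_at:
            break  # entry already gone; this packet would create a fresh entry
        expires_at = t + timeout_s
    return expires_at

# No traffic: the entry ages out after the idle timeout.
print(entry_expiry(180, []))

# One query every 30 s for 10 minutes: the entry outlives the last packet.
print(entry_expiry(180, range(30, 601, 30)))
```

As long as the gap between queries is shorter than the timeout, the entry never ages out on its own, which matches the behavior described here.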

      I found this Jira [1], which appears to describe the same issue, but it is filed against the SDN component, not ovn-kubernetes.

      That Jira [1] states: "It seems the issue does not exist with OVN (Openshift 4.16.13)".
      However, it appears that with OVN the issue merely occurs less frequently than with SDN and is difficult to reproduce; it does occur in practice.

      The UDP conntrack cleanup logic appears to be the same in 4.14 [2] and 4.16 [3].
      (It appears to be the same in newer versions such as 4.18 as well. Apologies if I have checked the wrong code.)

      [1] https://issues.redhat.com/browse/OCPBUGS-42203

      [2] https://github.com/openshift/ovn-kubernetes/blob/release-4.14/go-controller/pkg/node/default_node_network_controller.go#L1221-L1258

      [3] https://github.com/openshift/ovn-kubernetes/blob/release-4.16/go-controller/pkg/node/default_node_network_controller.go#L1179-L1216

       


      Version-Release number of selected component (if applicable):
      OpenShift 4.14.51
      openshift4/ose-ovn-kubernetes@sha256:9c1407542398da5dda6c7c335c36221ba7c78df70c3d90c182b7f8e2eb4e0c91

      How reproducible:
      Intermittent: about 25% of the time when the reproduction steps are run in the customer environment.

      Steps to Reproduce:

      1. Create haproxy pods in advance.

      2. Delete the DNS pods and wait for them to be recreated.

      3. Perform an operation that triggers DNS queries from the haproxy pods.

      Actual results:
      The old conntrack record remained, and DNS resolution failed.

      Expected results:
      DNS resolution should continue to work after the DNS pods are recreated.

      Affected Platforms:
      None

              pliurh Peng Liu
              rhn-support-tsaito Takeshi Saito
              Anurag Saxena