
TCP DNS Local Preference is not working for OpenShift SDN

    • Bug
    • Resolution: Done-Errata
    • Major
    • 4.14.0
    • 4.13, 4.12, 4.11
    • None
    • Critical
    • No
    • SDN Sprint 233, SDN Sprint 234, SDN Sprint 235
    • 3
    • Rejected
    • False
    • Previously, transmission control protocol (TCP) connections were load balanced across all DNS. With this update, TCP connections are enabled to prefer local DNS endpoints. (link:https://issues.redhat.com/browse/OCPBUGS-9985[*OCPBUGS-9985*])
    • Bug Fix
    • Done

      Description of problem:

      DNS local endpoint preference is not working for TCP DNS requests in OpenShift SDN.
      
      Reference code: https://github.com/openshift/sdn/blob/b58a257b896d774e0a092612be250fb9414af5ca/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L999-L1012
      
      This is where the DNS request is short-circuited to the local DNS endpoint, if one exists. Local preference matters because it protects against another outstanding bug, in which daemonset pods go stale for a few seconds upon node shutdown (see https://issues.redhat.com/browse/OCPNODE-549 for the graceful node shutdown fix). The missing TCP preference appears to be contributing to DNS issues in our internal CI clusters: https://lookerstudio.google.com/reporting/3a9d4e62-620a-47b9-a724-a5ebefc06658/page/MQwFD?s=kPTlddLa2AQ shows a large number of "dns_tcp_lookup" failures, which I attribute to this bug.
      
      UDP DNS local preference works fine in OpenShift SDN, and both UDP and TCP local preference work fine in OVN. Only TCP DNS local preference is not working in OpenShift SDN.
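
      The local-preference short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the code at the linked proxier.go lines: the Endpoint type, the selectEndpoints function, and the service port names in dnsServicePorts are all assumptions made for the example. The idea is that a DNS service port with an endpoint on the same node is routed only to that endpoint, and the preference applies only to ports that are explicitly listed, so a TCP port missing from the list would fall back to ordinary load balancing.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified stand-in for a proxy endpoint.
type Endpoint struct {
	IP      string
	IsLocal bool // true when the backing pod runs on this node
}

// dnsServicePorts lists the service ports that should prefer a local
// endpoint. The names are assumptions for illustration; if the TCP entry
// were absent, TCP queries would be load balanced across all endpoints.
var dnsServicePorts = map[string]bool{
	"openshift-dns/dns-default:dns":     true, // UDP port (assumed name)
	"openshift-dns/dns-default:dns-tcp": true, // TCP port (assumed name)
}

// selectEndpoints returns only the local endpoints for a listed DNS
// service port when at least one exists; otherwise it returns all
// endpoints unchanged.
func selectEndpoints(svcPortName string, all []Endpoint) []Endpoint {
	if !dnsServicePorts[svcPortName] {
		return all
	}
	var local []Endpoint
	for _, ep := range all {
		if ep.IsLocal {
			local = append(local, ep)
		}
	}
	if len(local) == 0 {
		// No DNS pod on this node: fall back to every endpoint.
		return all
	}
	return local
}

func main() {
	eps := []Endpoint{
		{IP: "10.128.2.5", IsLocal: true},
		{IP: "10.129.0.7", IsLocal: false},
		{IP: "10.130.0.9", IsLocal: false},
	}
	for _, ep := range selectEndpoints("openshift-dns/dns-default:dns-tcp", eps) {
		fmt.Println(ep.IP)
	}
}
```

      With a local endpoint present, repeated TCP queries from a pod on that node would all reach the same DNS pod, which is the behavior shown under "Expected results".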

      Version-Release number of selected component (if applicable):

      4.13, 4.12, 4.11

      How reproducible:

      100%

      Steps to Reproduce:

      1. oc debug -n openshift-dns
      2. dig +short +tcp +vc +noall +answer CH TXT hostname.bind
      # Retry multiple times, and you should always get the same local DNS pod.

      Actual results:

      [gspence@gspence origin]$ oc debug -n openshift-dns
      Starting pod/image-debug ...
      Pod IP: 10.128.2.10
      If you don't see a command prompt, try pressing enter.
      sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
      "dns-default-glgr8"
      sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
      "dns-default-gzlhm"
      sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
      "dns-default-dnbsp"
      sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
      "dns-default-gzlhm"
      

      Expected results:

      [gspence@gspence origin]$ oc debug -n openshift-dns
      Starting pod/image-debug ...
      Pod IP: 10.128.2.10
      If you don't see a command prompt, try pressing enter.
      sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
      "dns-default-glgr8"
      sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
      "dns-default-glgr8"
      sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
      "dns-default-glgr8"
      sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
      "dns-default-glgr8" 

      Additional info:

      https://issues.redhat.com/browse/OCPBUGS-488 is the previous bug I opened for UDP DNS local preference not working.
      
      iptables-save output from a vanilla 4.13 cluster-bot cluster (AWS, SDN): https://drive.google.com/file/d/1jY8_f64nDWi5SYT45lFMthE0vhioYIfe/view?usp=sharing

              mkennell@redhat.com Martin Kennelly
              gspence@redhat.com Grant Spence
              Zhanqi Zhao Zhanqi Zhao