OpenShift Bugs / OCPBUGS-488

OpenShift SDN preference for local DNS endpoint not working


Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • 4.11, 4.10
    • None
    • -
    • Critical
    • SDN Sprint 223, SDN Sprint 224, SDN Sprint 225
    • 3
    • Proposed
    • False
    • Customer Escalated

    Description

      Description of problem:

      The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1919737 appears to have been undone in 4.10.7. DNS requests should prefer the local endpoint when it is ready. However, during the investigation in Slack #incident-build-farm-dns-timeouts, we found that CI clusters build01 (v4.10.28) and build03 (v4.11.0) load balance DNS requests among the DNS pods, which means the local DNS endpoint is not being preferred.

      This bug is exposing another, unrelated issue discussed in #incident-build-farm-dns-timeouts, and I am concerned that https://bugzilla.redhat.com/show_bug.cgi?id=1919737 has been reintroduced now that its fix has regressed.

      Version-Release number of selected component (if applicable):

      The issue first appears in 4.10.7. 4.11 also appears to be affected. My preliminary testing suggests that 4.12 is not affected, but please double-check my work.

      How reproducible:

      100% of the time

      Steps to Reproduce:
      1. Apply the following YAML:

      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        labels:
          app: dns-distribution
        name: dns-distribution
      spec:
        selector:
          matchLabels:
            app: dns-distribution
        template:
          metadata:
            labels:
              app: dns-distribution
          spec:
            containers:
            - command:
              - "/bin/bash"
              - "-c"
              - |
                set -euo pipefail
                while : ; do
                  echo "Collecting tcpdump for 30 seconds...please wait"
                  tcpdump -i any "udp port 53 or tcp port 53 or udp port 5353 or tcp port 5353" -W 1 -G 30 -w "/tmp/tcpdump.pcap" &> /dev/null
                  tshark -r /tmp/tcpdump.pcap -n -Y 'mdns and dns.flags.response == 0 and not dns.retransmission' -t ud | awk '{print $6}' | uniq -c 2> /dev/null
                done
              # oc adm release info --image-for=tools
              image: quay.io/gspence/tshark
              name: tcpdump
              securityContext:
                privileged: true
            - command:
              - "/bin/bash"
              - "-c"
              - |
                set -uo pipefail
                echo "Starting"
                while : ; do
                  dig +retry=0 +timeout=60 +tries=1 "docs.ci.openshift.org"
                  sleep 0.5
                done
              image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3e630fcf3b3a8c3b78e6766eb1e71db69a9ccdae9014e32464390806e74eaca9
              name: dig
              securityContext:
                privileged: true
            terminationGracePeriodSeconds: 30
            hostNetwork: true
            nodeSelector:
              "kubernetes.io/os": "linux"
            tolerations:
            - operator: Exists

      2. Get the logs from any DNS test pod:

      oc logs dns-distribution-<ID> tcpdump 

      Actual results:

       

      Output on a broken cluster shows queries spread across a variety of endpoints:

       

      2022-08-24T01:30:52.933757215Z Collecting tcpdump for 30 seconds...please wait
      2022-08-24T01:31:22.670034160Z Running as user "root" and group "root". This could be dangerous.
      2022-08-24T01:31:22.784552757Z       1 10.128.0.11
      2022-08-24T01:31:22.784552757Z       1 10.129.2.4
      2022-08-24T01:31:22.784552757Z       6 10.131.0.4
      2022-08-24T01:31:22.784552757Z       1 10.128.2.3
      2022-08-24T01:31:22.784552757Z       1 10.129.0.33
      2022-08-24T01:31:22.784552757Z       1 10.128.0.11
      2022-08-24T01:31:22.784552757Z       1 10.128.2.3
      2022-08-24T01:31:22.784552757Z       2 10.131.0.4
      2022-08-24T01:31:22.784552757Z       1 10.129.2.4
      2022-08-24T01:31:22.784552757Z       1 10.129.0.33
      2022-08-24T01:31:22.784552757Z       1 10.130.0.13
      2022-08-24T01:31:22.784552757Z       1 10.129.0.33
      2022-08-24T01:31:22.784552757Z       1 10.128.2.3
      2022-08-24T01:31:22.784552757Z       1 10.130.0.13
      2022-08-24T01:31:22.784552757Z       1 10.128.2.3
      2022-08-24T01:31:22.784552757Z       1 10.129.0.33
      2022-08-24T01:31:22.784552757Z       1 10.129.2.4
      2022-08-24T01:31:22.784552757Z       1 10.131.0.4
      2022-08-24T01:31:22.784552757Z       1 10.128.0.11
      2022-08-24T01:31:22.784552757Z       1 10.128.2.3
      2022-08-24T01:31:22.784552757Z       1 10.129.2.4
      2022-08-24T01:31:22.784552757Z       1 10.128.2.3
      2022-08-24T01:31:22.784552757Z       1 10.130.0.13
      2022-08-24T01:31:22.784552757Z       3 10.131.0.4
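
      Because the capture pipeline runs `uniq -c` without sorting first, the same endpoint shows up on several rows. The following is a minimal sketch, not part of the original reproducer, that totals the rows per endpoint so the spread is easier to read. The function name `summarize_endpoints` is hypothetical, and the sketch assumes each data row ends in "<count> <IPv4>" as in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the per-row counts from `uniq -c` into one
# total per endpoint IP. Reads log lines on stdin, for example:
#   2022-08-24T01:31:22.784552757Z       6 10.131.0.4
# Lines that do not end in "<count> <IPv4>" (banners, warnings) are skipped.
summarize_endpoints() {
  awk '
    NF >= 2 && $(NF-1) ~ /^[0-9]+$/ && $NF ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
      total[$NF] += $(NF-1)
    }
    END {
      for (ip in total) print total[ip], ip
    }
  ' | sort -k1,1nr -k2,2   # busiest endpoint first, ties ordered by IP
}
```

      Usage, assuming the log is fetched with timestamps: `oc logs --timestamps dns-distribution-<ID> tcpdump | summarize_endpoints`. Applied to the broken-cluster capture above, this should report six distinct endpoints, with 10.131.0.4 receiving 12 of the 32 queries.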

       

      Expected results:

      On a working cluster, all queries go to a single endpoint:

       

      2022-08-24T02:07:12.085116749Z Collecting tcpdump for 30 seconds...please wait
      2022-08-24T02:07:42.872453203Z Running as user "root" and group "root". This could be dangerous.
      2022-08-24T02:07:42.994361420Z      43 10.129.0.33
      2022-08-24T02:07:42.999296355Z Collecting tcpdump for 30 seconds...please wait
      2022-08-24T02:08:13.777908559Z Running as user "root" and group "root". This could be dangerous.
      2022-08-24T02:08:13.901421167Z      48 10.129.0.33 
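
      The pass/fail criterion above can be sketched as a small check. This is an illustrative helper, not part of the original reproducer: the name `single_endpoint` is hypothetical, and it assumes data rows end in "<count> <IPv4>" as in the logs shown. Note that a single endpoint is necessary but not sufficient. To confirm that it is the local one, the IP still has to be compared with the DNS pod running on the same node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 only if every "<count> <IPv4>" row on stdin
# names the same endpoint; print how many distinct endpoints were seen.
single_endpoint() {
  awk '
    NF >= 2 && $(NF-1) ~ /^[0-9]+$/ && $NF ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
      if (!($NF in seen)) { seen[$NF] = 1; distinct++ }
    }
    END {
      printf "%d distinct endpoint(s)\n", distinct
      exit (distinct == 1 ? 0 : 1)
    }
  '
}
```

      Usage: `oc logs dns-distribution-<ID> tcpdump | single_endpoint && echo "single endpoint in use"`.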

       

       

      Additional info:

      Slack thread

      Please reach out for any more details. I will do further bisecting and post the results here.

      Linked to test: [sig-trt] no DNS lookup errors should be encountered in disruption samplers
       

            People

              mkennell@redhat.com Martin Kennelly
              gspence@redhat.com Grant Spence
              Anurag Saxena