Bug
Resolution: Duplicate
Critical
None
4.11, 4.10
None
Critical
None
SDN Sprint 223, SDN Sprint 224, SDN Sprint 225
3
Proposed
False
Customer Escalated
Description of problem:
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1919737 appears to have been undone in 4.10.7. DNS should prefer the local endpoint when that endpoint is ready. However, during the investigation in Slack #incident-build-farm-dns-timeouts, we found that CI clusters build01 (v4.10.28) and build03 (v4.11.0) load balance DNS requests across the DNS pods, which means the local DNS endpoint is not being preferred.
This bug is exposing another, unrelated issue in #incident-build-farm-dns-timeouts, and I am concerned that https://bugzilla.redhat.com/show_bug.cgi?id=1919737 has been reintroduced since the solution has regressed.
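The expected behavior described above can be sketched as a small selection function. This is only an illustration of the intended policy (prefer a ready endpoint on the client's node, otherwise fall back to any ready endpoint); it is not the actual SDN or DNS operator code. The `Endpoint` type, the `pick_dns_endpoint` name, and the node names in the usage example are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical model of one DNS pod endpoint."""
    ip: str
    node: str
    ready: bool


def pick_dns_endpoint(
    endpoints: Sequence[Endpoint],
    client_node: str,
    rng: Optional[random.Random] = None,
) -> Optional[Endpoint]:
    """Return the endpoint a DNS request from client_node should reach.

    A ready endpoint on the client's own node always wins. Only when no
    local endpoint is ready is the request spread across the remaining
    ready endpoints.
    """
    ready = [ep for ep in endpoints if ep.ready]
    if not ready:
        return None  # nothing can serve the request
    local = [ep for ep in ready if ep.node == client_node]
    if local:
        return local[0]
    # Fallback: load balance across whatever is ready elsewhere.
    return (rng or random).choice(ready)
```

For example, with `Endpoint("10.129.0.33", "node-a", True)` and `Endpoint("10.131.0.4", "node-b", True)`, a client on `node-a` would always get `10.129.0.33`. The regression reported here looks as if the fallback branch were taken even though the local endpoint is ready.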
Version-Release number of selected component (if applicable):
4.10.7 is where I found the issue first emerged. 4.11 appears to be affected as well. My preliminary testing shows that 4.12 is not affected, but please double-check my work.
How reproducible:
100% of the time
Steps to Reproduce:
1. Apply the following YAML:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: dns-distribution
  name: dns-distribution
spec:
  selector:
    matchLabels:
      app: dns-distribution
  template:
    metadata:
      labels:
        app: dns-distribution
    spec:
      containers:
        - command:
            - "/bin/bash"
            - "-c"
            - |
              set -euo pipefail
              while : ; do
                echo "Collecting tcpdump for 30 seconds...please wait"
                tcpdump -i any "udp port 53 or tcp port 53 or udp port 5353 or tcp port 5353" -W 1 -G 30 -w "/tmp/tcpdump.pcap" &> /dev/null
                tshark -r /tmp/tcpdump.pcap -n -Y 'mdns and dns.flags.response == 0 and not dns.retransmission' -t ud | awk '{print $6}' | uniq -c 2> /dev/null
              done
          # oc adm release info --image-for=tools
          image: quay.io/gspence/tshark
          name: tcpdump
          securityContext:
            privileged: true
        - command:
            - "/bin/bash"
            - "-c"
            - |
              set -uo pipefail
              echo "Starting"
              while : ; do
                dig +retry=0 +timeout=60 +tries=1 "https://docs.ci.openshift.org"
                sleep 0.5
              done
          image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3e630fcf3b3a8c3b78e6766eb1e71db69a9ccdae9014e32464390806e74eaca9
          name: dig
          securityContext:
            privileged: true
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      nodeSelector:
        "kubernetes.io/os": "linux"
      privileged: true
      tolerations:
        - operator: Exists
2. Get logs from any dns test pod:
oc logs dns-distribution-<ID> tcpdump
Actual results:
Output on a broken cluster shows DNS requests going to a variety of endpoints:
2022-08-24T01:30:52.933757215Z Collecting tcpdump for 30 seconds...please wait
2022-08-24T01:31:22.670034160Z Running as user "root" and group "root". This could be dangerous.
2022-08-24T01:31:22.784552757Z 1 10.128.0.11
2022-08-24T01:31:22.784552757Z 1 10.129.2.4
2022-08-24T01:31:22.784552757Z 6 10.131.0.4
2022-08-24T01:31:22.784552757Z 1 10.128.2.3
2022-08-24T01:31:22.784552757Z 1 10.129.0.33
2022-08-24T01:31:22.784552757Z 1 10.128.0.11
2022-08-24T01:31:22.784552757Z 1 10.128.2.3
2022-08-24T01:31:22.784552757Z 2 10.131.0.4
2022-08-24T01:31:22.784552757Z 1 10.129.2.4
2022-08-24T01:31:22.784552757Z 1 10.129.0.33
2022-08-24T01:31:22.784552757Z 1 10.130.0.13
2022-08-24T01:31:22.784552757Z 1 10.129.0.33
2022-08-24T01:31:22.784552757Z 1 10.128.2.3
2022-08-24T01:31:22.784552757Z 1 10.130.0.13
2022-08-24T01:31:22.784552757Z 1 10.128.2.3
2022-08-24T01:31:22.784552757Z 1 10.129.0.33
2022-08-24T01:31:22.784552757Z 1 10.129.2.4
2022-08-24T01:31:22.784552757Z 1 10.131.0.4
2022-08-24T01:31:22.784552757Z 1 10.128.0.11
2022-08-24T01:31:22.784552757Z 1 10.128.2.3
2022-08-24T01:31:22.784552757Z 1 10.129.2.4
2022-08-24T01:31:22.784552757Z 1 10.128.2.3
2022-08-24T01:31:22.784552757Z 1 10.130.0.13
2022-08-24T01:31:22.784552757Z 3 10.131.0.4
Expected results:
Clusters that work don't have multiple endpoints:
2022-08-24T02:07:12.085116749Z Collecting tcpdump for 30 seconds...please wait
2022-08-24T02:07:42.872453203Z Running as user "root" and group "root". This could be dangerous.
2022-08-24T02:07:42.994361420Z 43 10.129.0.33
2022-08-24T02:07:42.999296355Z Collecting tcpdump for 30 seconds...please wait
2022-08-24T02:08:13.777908559Z Running as user "root" and group "root". This could be dangerous.
2022-08-24T02:08:13.901421167Z 48 10.129.0.33
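Because `uniq -c` only counts consecutive identical lines, the same endpoint can appear several times in one capture window, which makes the broken output harder to read at a glance. A minimal sketch for summing the counts per endpoint is shown here. It assumes the log format of the samples above (timestamp, count, IPv4 address); the function names `tally_endpoints` and `single_endpoint` are hypothetical helpers, not part of any existing tool.

```python
import re
from collections import Counter

# Matches "<timestamp> <count> <IPv4>" lines as produced by the tcpdump
# container above; banner lines such as "Collecting tcpdump..." do not match.
_COUNT_LINE = re.compile(r"^\S+\s+(\d+)\s+(\d{1,3}(?:\.\d{1,3}){3})\s*$")


def tally_endpoints(log_text: str) -> Counter:
    """Sum the per-run counts emitted by `uniq -c` for each endpoint IP."""
    totals: Counter = Counter()
    for line in log_text.splitlines():
        match = _COUNT_LINE.match(line.strip())
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


def single_endpoint(log_text: str) -> bool:
    """True when every counted DNS request went to exactly one endpoint."""
    return len(tally_endpoints(log_text)) == 1
```

The saved output of `oc logs dns-distribution-<ID> tcpdump` can be passed in as a string. Note that this sums across every 30-second window in the log, so a result of `False` may also reflect a DNS pod being rescheduled during the capture rather than load balancing.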
Additional info:
Please reach out for any more details. If I do further bisecting, I will post the information here.
Linked to test: [sig-trt] no DNS lookup errors should be encountered in disruption samplers
causes:
- NE-1067 [Tech Debt] Create E2E testing for DNS local endpoint preference (Closed)
duplicates:
- OCPBUGS-668 Prefer local dns does not work expectedly on OCPv4.11 (Closed)
- OCPBUGS-670 Prefer local dns does not work expectedly on OCPv4.12 (Closed)
is related to:
- OCPBUGS-9985 TCP DNS Local Preference is not working for Openshift SDN (Closed)