-
Bug
-
Resolution: Done-Errata
-
Major
-
4.14.z
Description of problem:
Some application pods receive SERVFAIL errors from CoreDNS for hosts whose answer payload is larger than 512 bytes, which is the default buffer size of CoreDNS with MicroShift 4.14.RC.2 (at least).
Such payloads can be observed with hosts like `login.microsoftonline.com`, which resolves to multiple IPv4 and IPv6 addresses.
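For context, a DNS client that sends a plain UDP query implicitly accepts at most 512 bytes in the reply, while a client that appends an EDNS0 OPT record advertises a larger buffer. The sketch below builds both query forms with the Python standard library only. The function names and the 1232-byte buffer size are illustrative assumptions and do not come from this report or from the CoreDNS code.

```python
import struct
from typing import Optional


def encode_name(name: str) -> bytes:
    """Encode a hostname as length-prefixed DNS labels ending with a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = 1,
                edns_bufsize: Optional[int] = None,
                txid: int = 0x1234) -> bytes:
    """Build a DNS query; add an EDNS0 OPT record when edns_bufsize is given."""
    arcount = 1 if edns_bufsize is not None else 0
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    msg = header + question
    if edns_bufsize is not None:
        # OPT pseudo-record: root name, TYPE=41, CLASS=advertised UDP size,
        # TTL=0 (extended rcode and flags), RDLENGTH=0
        msg += b"\x00" + struct.pack("!HHIH", 41, edns_bufsize, 0, 0)
    return msg


if __name__ == "__main__":
    plain = build_query("login.microsoftonline.com", qtype=28)  # AAAA
    edns = build_query("login.microsoftonline.com", qtype=28, edns_bufsize=1232)
    print(len(plain), len(edns))
```

Older resolver clients, such as the one in busybox:1.30.1 mentioned in this report, may send the plain form. Whether that is what triggers the failure here is an assumption that the pcaps would need to confirm.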
In the namespaces/openshift-dns/pods/dns-default-* POD we do observe the following error:
$ grep -i "overflowing header size" sosreport-host0-masked-2023-10-18-id.tar.xz/sosreport-host0-masked-2023-10-18-id/sos_commands/microshift/namespaces/openshift-dns/pods/dns-default-idddd/dns/dns/logs/current.log|sort|head -2
2023-10-18T<timestamp> [ERROR] plugin/errors: 2 <host.example.com>. A: dns: overflowing header size
2023-10-18T<timestamp> [ERROR] plugin/errors: 2 <host.example.com>. AAAA: dns: overflowing header size
In the application pods, while trying to resolve hosts with large DNS answer responses, we see the following output:
<Application POD> $ nslookup <host.example.com>
Server: 10.43.0.10
Address: 10.43.0.10:53

Non-authoritative answer:
login.microsoftonline.com canonical name = <host2.example.com>
<host2.example.com> canonical name = <host3.example.com>
<host3.example.com> canonical name = <host4.example.com>
Name: <host4.example.com>
Address: 192.168.1.122
Name: <host4.example.com>
Address: 192.168.1.121
Name: <host4.example.com>
Address: 40.126.29.7
(..)
Address: 40.126.29.9
Name: <host4.example.com>
Address: 192.168.1.123
Name: <host4.example.com>
Address: 192.168.1.124
*** Can't find <host.example.com>: No answer
In the pcaps, collected at host level with `registry.redhat.io/rhel8/support-tools` image and `tcpdump -i any port 53 -vvv -w <pcap file>`, we do observe the following errors:
$ tshark -r $pcap -Y 'dns.flags.rcode != 0 and dns.qry.name == login.microsoftonline.com' -T fields -e dns.qry.name -e frame.number -e frame.time -e ip.src -e ip.dst -e _ws.col.Info
login.microsoftonline.com 90956 Oct 18, 2023 13:15:00.139190000 UTC 10.43.0.10 10.42.0.28 Standard query response 0xc904 Server failure AAAA login.microsoftonline.com
login.microsoftonline.com 145804 Oct 18, 2023 14:08:02.641767000 UTC 10.43.0.10 10.42.0.28 Standard query response 0xc904 Server failure AAAA login.microsoftonline.com
login.microsoftonline.com 185732 Oct 18, 2023 14:08:05.139739000 UTC 10.43.0.10 10.42.0.28 Standard query response 0xc904 Server failure AAAA login.microsoftonline.com
Zooming in on that specific frame range, we can see the `Server failure` errors coming from CoreDNS `10.43.0.10`, and `NS <Root>` queries from the dns Service IP `10.42.0.9` to the related upstream DNS server `192.168.122.1`:
$ tshark -r $pcap -Y 'frame.number >= 90956 and frame.number <= 91056' -T fields -e dns.qry.name -e frame.number -e frame.time -e ip.src -e ip.dst -e _ws.col.Info
login.microsoftonline.com 90956 Oct 18, 2023 13:15:00.139190000 UTC 10.43.0.10 10.42.0.28 Standard query response 0xc904 Server failure AAAA login.microsoftonline.com
login.microsoftonline.com 90957 Oct 18, 2023 13:15:00.139285000 UTC 192.168.122.1 172.17.115.7 Standard query response 0x1c36 AAAA login.microsoftonline.com CNAME login.mso.msidentity.com CNAME ak.privatelink.msidentity.com CNAME www.tm.ak.prd.aadg.trafficmanager.net AAAA 2603:1036:3000:60::10 AAAA 2603:1037:1:60::2 AAAA 2603:1036:3000:60::17 AAAA 2603:1036:3000:60::e AAAA 2603:1036:3000:60::d AAAA 2603:1036:3000:60::14 AAAA 2603:1036:3000:60::12 AAAA 2603:1037:1:60::
login.microsoftonline.com 90958 Oct 18, 2023 13:15:00.139292000 UTC 192.168.122.1 10.42.0.9 Standard query response 0x1c36 AAAA login.microsoftonline.com CNAME login.mso.msidentity.com CNAME ak.privatelink.msidentity.com CNAME www.tm.ak.prd.aadg.trafficmanager.net AAAA 2603:1036:3000:60::10 AAAA 2603:1037:1:60::2 AAAA 2603:1036:3000:60::17 AAAA 2603:1036:3000:60::e AAAA 2603:1036:3000:60::d AAAA 2603:1036:3000:60::14 AAAA 2603:1036:3000:60::12 AAAA 2603:1037:1:60::
login.microsoftonline.com 90959 Oct 18, 2023 13:15:00.139297000 UTC 192.168.122.1 10.42.0.9 Standard query response 0x1c36 AAAA login.microsoftonline.com CNAME login.mso.msidentity.com CNAME ak.privatelink.msidentity.com CNAME www.tm.ak.prd.aadg.trafficmanager.net AAAA 2603:1036:3000:60::10 AAAA 2603:1037:1:60::2 AAAA 2603:1036:3000:60::17 AAAA 2603:1036:3000:60::e AAAA 2603:1036:3000:60::d AAAA 2603:1036:3000:60::14 AAAA 2603:1036:3000:60::12 AAAA 2603:1037:1:60::
login.microsoftonline.com 90960 Oct 18, 2023 13:15:00.139305000 UTC 10.42.0.28 10.43.0.10 Standard query 0xc904 AAAA login.microsoftonline.com
login.microsoftonline.com 90961 Oct 18, 2023 13:15:00.139354000 UTC 10.42.0.9 192.168.122.1 Standard query 0xaae0 AAAA login.microsoftonline.com
login.microsoftonline.com 90962 Oct 18, 2023 13:15:00.139358000 UTC 10.42.0.9 192.168.122.1 Standard query 0xaae0 AAAA login.microsoftonline.com
login.microsoftonline.com 90963 Oct 18, 2023 13:15:00.139364000 UTC 172.17.115.7 192.168.122.1 Standard query 0xaae0 AAAA login.microsoftonline.com
<Root> 90964 Oct 18, 2023 13:15:00.139409000 UTC 10.42.0.9 192.168.122.1 Standard query 0x24b7 NS <Root>
<Root> 90965 Oct 18, 2023 13:15:00.139413000 UTC 10.42.0.9 192.168.122.1 Standard query 0x24b7 NS <Root>
<Root> 90966 Oct 18, 2023 13:15:00.139418000 UTC 172.17.115.7 192.168.122.1 Standard query 0x24b7 NS <Root>
login.microsoftonline.com 90967 Oct 18, 2023 13:15:00.139549000 UTC 10.42.0.9 192.168.122.1 Standard query 0xb878 AAAA login.microsoftonline.com
login.microsoftonline.com 90968 Oct 18, 2023 13:15:00.139553000 UTC 10.42.0.9 192.168.122.1 Standard query 0xb878 AAAA login.microsoftonline.com
login.microsoftonline.com 90969 Oct 18, 2023 13:15:00.139558000 UTC 172.17.115.7 192.168.122.1 Standard query 0xb878 AAAA login.microsoftonline.com
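To summarize captures like the ones above, the tshark output can be post-processed. The following is a minimal sketch that assumes the six fields requested with `-e` are tab-separated, which is the default for `tshark -T fields`. The helper names `parse_line` and `servfail_counts` are hypothetical.

```python
from collections import Counter

# Field order matches the -e options used in the tshark commands above.
FIELDS = ("qname", "frame", "time", "src", "dst", "info")


def parse_line(line):
    """Split one tab-separated tshark line into a dict, or return None."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        return None  # skip lines that do not have exactly six fields
    return dict(zip(FIELDS, parts))


def servfail_counts(lines):
    """Count 'Server failure' responses per (server, client, query name)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and "Server failure" in rec["info"]:
            counts[(rec["src"], rec["dst"], rec["qname"])] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Example: tshark -r $pcap -T fields -e ... | python3 this_script.py
    for key, n in servfail_counts(sys.stdin).items():
        print(n, *key)
```

Grouping by source and destination makes it easier to see whether the failures come only from the CoreDNS address and only toward specific client pods.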
Version-Release number of selected component (if applicable):
MicroShift 4.14
How reproducible:
The issue seems to happen with DNS answers larger than 512 bytes, but it is not reproducible with all application pods on the same node. In Red Hat labs, it has been observed with old community images like busybox:1.30.1.
Steps to Reproduce:
1. Install MicroShift 4.14 (RC or GA)
2. Deploy a pod using an old community image like busybox:1.30.1
3. nslookup login.microsoftonline.com
Actual results:
dns.flags.rcode != 0
Expected results:
dns.flags.rcode == 0
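The `dns.flags.rcode` value checked above is the low four bits of the flags word in the 12-byte DNS header. As a rough illustration, the sketch below decodes that header from raw response bytes, for example bytes read from a UDP socket. The function name `decode_header` and the shape of the returned dictionary are assumptions made for this example.

```python
import struct

# Standard DNS response codes (RFC 1035).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def decode_header(msg: bytes) -> dict:
    """Decode the fixed 12-byte DNS header of a message."""
    if len(msg) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    txid, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", msg[:12])
    rcode = flags & 0x000F
    return {
        "id": txid,
        "is_response": bool(flags & 0x8000),  # QR bit
        "truncated": bool(flags & 0x0200),    # TC bit
        "rcode": rcode,
        "rcode_name": RCODES.get(rcode, "RCODE%d" % rcode),
        "answers": an,
    }


if __name__ == "__main__":
    # Hand-built header: ID 0xc904, QR=1, RD=1, RA=1, RCODE=2, one question.
    failing = struct.pack("!HHHHHH", 0xC904, 0x8182, 1, 0, 0, 0)
    print(decode_header(failing))
```

With this decoding, the actual result corresponds to `rcode == 2` (SERVFAIL) and the expected result to `rcode == 0` (NOERROR).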
Additional info:
Similar issues have been observed with CoreDNS for OpenShift:
- https://bugzilla.redhat.com/show_bug.cgi?id=1949361
- https://github.com/coredns/coredns/issues/3941
- https://github.com/coredns/coredns/issues/5953
For OpenShift, there is a KB article, https://access.redhat.com/solutions/5984291, that describes how to override some configuration settings.
- blocks
-
OCPBUGS-27855 [MicroShift] SERVFAIL due to "[ERROR] plugin/errors: dns: overflowing header size"
- Closed
- is cloned by
-
OCPBUGS-27855 [MicroShift] SERVFAIL due to "[ERROR] plugin/errors: dns: overflowing header size"
- Closed
- links to
-
RHEA-2024:0043 Red Hat build of MicroShift 4.16.z bug fix and enhancement update