  1. OpenShift Bugs
  2. OCPBUGS-29372

[release-4.14] [MicroShift] SERVFAIL due to "[ERROR] plugin/errors: dns: overflowing header size"


    • Important
    • No
    • 1
    • uShift Sprint 249
    • 1
    • False
    • Previously, the CoreDNS bufsize setting was configured as 512 bytes. With this release, the maximum size of the buffer for MicroShift CoreDNS is 1232 bytes. This modification enhances DNS performance by reducing the occurrence of DNS truncations and retries.
    • Enhancement
    • In Progress

      This is a clone of issue OCPBUGS-21901. The following is the description of the original issue:

      Description of problem:
      Some application pods are getting SERVFAIL errors from CoreDNS for hosts whose answer payload is larger than 512 bytes, which is the default buffer size of CoreDNS in MicroShift 4.14.RC.2 (at least).

      Such payloads can be observed with hosts like `login.microsoftonline.com`, whose answers include multiple IPv4 and IPv6 addresses.
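
To see how quickly an answer of this kind approaches the 512-byte limit, a rough size estimate can help. The following is a minimal sketch based on the RFC 1035 wire format. It assumes that every record owner name is compressed to a 2-byte pointer and that CNAME targets are written out in full. The function names are hypothetical, and the record count in the usage example is illustrative only (the capture in this report is truncated, so the real count is not known).

```python
HEADER_SIZE = 12               # fixed DNS header (RFC 1035, section 4.1.1)
RR_FIXED = 2 + 2 + 2 + 4 + 2   # name pointer, type, class, TTL, rdlength


def encoded_name_len(name: str) -> int:
    """Length of an uncompressed domain name on the wire."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    # One length byte per label, the label bytes, and the terminating root byte.
    return sum(1 + len(label) for label in labels) + 1


def estimate_response_size(qname, cname_targets, a_count, aaaa_count):
    """Rough estimate only: assumes owner names compress to 2 bytes and
    CNAME targets are stored uncompressed (an assumption, not a measurement)."""
    size = HEADER_SIZE
    size += encoded_name_len(qname) + 4        # question: name, type, class
    for target in cname_targets:
        size += RR_FIXED + encoded_name_len(target)
    size += a_count * (RR_FIXED + 4)           # A rdata is 4 bytes
    size += aaaa_count * (RR_FIXED + 16)       # AAAA rdata is 16 bytes
    return size


if __name__ == "__main__":
    # CNAME chain as seen in the capture; the AAAA count is a made-up example.
    chain = [
        "login.mso.msidentity.com",
        "ak.privatelink.msidentity.com",
        "www.tm.ak.prd.aadg.trafficmanager.net",
    ]
    print(estimate_response_size("login.microsoftonline.com", chain, 0, 12))
```

Under these assumptions each AAAA record costs 28 bytes and each A record 16 bytes, so a CNAME chain plus a long address list can exceed 512 bytes.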

      In the namespaces/openshift-dns/pods/dns-default-* pod, we observe the following error:

       

      $ grep -i "overflowing header size" sosreport-host0-masked-2023-10-18-id.tar.xz/sosreport-host0-masked-2023-10-18-id/sos_commands/microshift/namespaces/openshift-dns/pods/dns-default-idddd/dns/dns/logs/current.log|sort|head -2
      2023-10-18T<timestamp> [ERROR] plugin/errors: 2 <host.example.com>. A: dns: overflowing header size
      2023-10-18T<timestamp> [ERROR] plugin/errors: 2 <host.example.com>. AAAA: dns: overflowing header size 
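
For larger log files, the same search can be summarized per query name and record type. This is a minimal sketch that assumes the log line layout shown above. The leading number in each error line appears to be the DNS response code (2 is SERVFAIL), which is treated as an assumption here. `count_overflows` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   <timestamp> [ERROR] plugin/errors: 2 host.example.com. A: dns: overflowing header size
OVERFLOW_RE = re.compile(
    r"\[ERROR\] plugin/errors: (?P<code>\d+) (?P<name>\S+) (?P<qtype>[A-Z0-9]+): "
    r"dns: overflowing header size"
)


def count_overflows(lines):
    """Count 'overflowing header size' errors per (query name, query type)."""
    counts = Counter()
    for line in lines:
        match = OVERFLOW_RE.search(line)
        if match:
            counts[(match.group("name"), match.group("qtype"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_overflows.py < current.log
    for (name, qtype), total in count_overflows(sys.stdin).most_common():
        print(f"{total}\t{qtype}\t{name}")
```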

      In the application pods, when trying to resolve hosts with large DNS answers, we see the following output:

      <Application POD> $ nslookup <host.example.com>
      Server:         10.43.0.10 
      Address:        10.43.0.10:53

      Non-authoritative answer:
      login.microsoftonline.com       canonical name = <host2.example.com>
      <host2.example.com>        canonical name = <host3.example.com>
      <host3.example.com>   canonical name = <host4.example.com>
      Name:   <host4.example.com>
      Address: 192.168.1.122
      Name:   <host4.example.com>
      Address: 192.168.1.121
      Name:   <host4.example.com>
      Address: 40.126.29.7
      (..)
      Address: 40.126.29.9
      Name:   <host4.example.com>
      Address: 192.168.1.123
      Name:   <host4.example.com>
      Address: 192.168.1.124

      *** Can't find <host.example.com>: No answer

      In the pcaps, collected at host level with the `registry.redhat.io/rhel8/support-tools` image and `tcpdump -i any port 53 -vvv -w <pcap file>`, we observe the following errors:

       

      $ tshark -r $pcap -Y 'dns.flags.rcode != 0 and dns.qry.name == login.microsoftonline.com' -T fields -e dns.qry.name -e frame.number -e frame.time -e ip.src -e ip.dst -e  _ws.col.Info
      login.microsoftonline.com    90956    Oct 18, 2023 13:15:00.139190000 UTC    10.43.0.10    10.42.0.28    Standard query response 0xc904 Server failure AAAA login.microsoftonline.com
      login.microsoftonline.com    145804    Oct 18, 2023 14:08:02.641767000 UTC    10.43.0.10    10.42.0.28    Standard query response 0xc904 Server failure AAAA login.microsoftonline.com
      login.microsoftonline.com    185732    Oct 18, 2023 14:08:05.139739000 UTC    10.43.0.10    10.42.0.28    Standard query response 0xc904 Server failure AAAA login.microsoftonline.com 

      Zooming in on that specific frame range, we can see the `Server failure` errors coming from CoreDNS `10.43.0.10`, and the `NS <Root>` queries from the DNS service IP `10.42.0.9` and the related upstream DNS server `192.168.122.1`:

      $ tshark -r $pcap -Y 'frame.number >= 90956 and frame.number <= 91056' -T fields -e dns.qry.name -e frame.number -e frame.time -e ip.src -e ip.dst -e _ws.col.Info
      login.microsoftonline.com    90956    Oct 18, 2023 13:15:00.139190000 UTC    10.43.0.10    10.42.0.28    Standard query response 0xc904 Server failure AAAA login.microsoftonline.com
      login.microsoftonline.com    90957    Oct 18, 2023 13:15:00.139285000 UTC    192.168.122.1    172.17.115.7    Standard query response 0x1c36 AAAA login.microsoftonline.com CNAME login.mso.msidentity.com CNAME ak.privatelink.msidentity.com CNAME www.tm.ak.prd.aadg.trafficmanager.net AAAA 2603:1036:3000:60::10 AAAA 2603:1037:1:60::2 AAAA 2603:1036:3000:60::17 AAAA 2603:1036:3000:60::e AAAA 2603:1036:3000:60::d AAAA 2603:1036:3000:60::14 AAAA 2603:1036:3000:60::12 AAAA 2603:1037:1:60::
      login.microsoftonline.com    90958    Oct 18, 2023 13:15:00.139292000 UTC    192.168.122.1    10.42.0.9    Standard query response 0x1c36 AAAA login.microsoftonline.com CNAME login.mso.msidentity.com CNAME ak.privatelink.msidentity.com CNAME www.tm.ak.prd.aadg.trafficmanager.net AAAA 2603:1036:3000:60::10 AAAA 2603:1037:1:60::2 AAAA 2603:1036:3000:60::17 AAAA 2603:1036:3000:60::e AAAA 2603:1036:3000:60::d AAAA 2603:1036:3000:60::14 AAAA 2603:1036:3000:60::12 AAAA 2603:1037:1:60::
      login.microsoftonline.com    90959    Oct 18, 2023 13:15:00.139297000 UTC    192.168.122.1    10.42.0.9    Standard query response 0x1c36 AAAA login.microsoftonline.com CNAME login.mso.msidentity.com CNAME ak.privatelink.msidentity.com CNAME www.tm.ak.prd.aadg.trafficmanager.net AAAA 2603:1036:3000:60::10 AAAA 2603:1037:1:60::2 AAAA 2603:1036:3000:60::17 AAAA 2603:1036:3000:60::e AAAA 2603:1036:3000:60::d AAAA 2603:1036:3000:60::14 AAAA 2603:1036:3000:60::12 AAAA 2603:1037:1:60::
      login.microsoftonline.com    90960    Oct 18, 2023 13:15:00.139305000 UTC    10.42.0.28    10.43.0.10    Standard query 0xc904 AAAA login.microsoftonline.com
      login.microsoftonline.com    90961    Oct 18, 2023 13:15:00.139354000 UTC    10.42.0.9    192.168.122.1    Standard query 0xaae0 AAAA login.microsoftonline.com
      login.microsoftonline.com    90962    Oct 18, 2023 13:15:00.139358000 UTC    10.42.0.9    192.168.122.1    Standard query 0xaae0 AAAA login.microsoftonline.com
      login.microsoftonline.com    90963    Oct 18, 2023 13:15:00.139364000 UTC    172.17.115.7    192.168.122.1    Standard query 0xaae0 AAAA login.microsoftonline.com
      <Root>    90964    Oct 18, 2023 13:15:00.139409000 UTC    10.42.0.9    192.168.122.1    Standard query 0x24b7 NS <Root>
      <Root>    90965    Oct 18, 2023 13:15:00.139413000 UTC    10.42.0.9    192.168.122.1    Standard query 0x24b7 NS <Root>
      <Root>    90966    Oct 18, 2023 13:15:00.139418000 UTC    172.17.115.7    192.168.122.1    Standard query 0x24b7 NS <Root>
      login.microsoftonline.com    90967    Oct 18, 2023 13:15:00.139549000 UTC    10.42.0.9    192.168.122.1    Standard query 0xb878 AAAA login.microsoftonline.com
      login.microsoftonline.com    90968    Oct 18, 2023 13:15:00.139553000 UTC    10.42.0.9    192.168.122.1    Standard query 0xb878 AAAA login.microsoftonline.com
      login.microsoftonline.com    90969    Oct 18, 2023 13:15:00.139558000 UTC    172.17.115.7    192.168.122.1    Standard query 0xb878 AAAA login.microsoftonline.com 
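
The field output above can also be summarized by script, for example to count how many `Server failure` responses each client received. This is a minimal sketch that assumes the tab-separated output that `tshark -T fields` produces by default, with columns in the same order as the `-e` options used above. The helper names are hypothetical.

```python
from collections import Counter

# Column order follows the -e options of the tshark commands above.
FIELDS = ("qname", "frame", "time", "src", "dst", "info")


def parse_tshark_line(line: str) -> dict:
    """Split one tab-separated `tshark -T fields` line into named columns."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))


def servfail_by_client(lines):
    """Count 'Server failure' responses per (server, client) address pair."""
    counts = Counter()
    for line in lines:
        row = parse_tshark_line(line)
        if "Server failure" in row.get("info", ""):
            counts[(row["src"], row["dst"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: tshark -r <pcap file> ... -T fields ... | python servfail_by_client.py
    for (server, client), total in servfail_by_client(sys.stdin).items():
        print(f"{server} -> {client}: {total} SERVFAIL response(s)")
```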

       

      Version-Release number of selected component (if applicable):

      MicroShift 4.14

      How reproducible:

      The issue seems to happen with DNS answers larger than 512 bytes, but it is not reproducible with all application pods on the same node. In Red Hat labs, it has been observed with old community images such as busybox:1.30.1.

      Steps to Reproduce:

      1. Install MicroShift 4.14 (RC or GA)
      2. Deploy a pod using an old community image such as busybox:1.30.1
      3. Run `nslookup login.microsoftonline.com` from that pod

      Actual results:

      dns.flags.rcode != 0

      Expected results:

      dns.flags.rcode == 0
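
The actual and expected results can also be checked from inside a pod without a packet capture. This is a minimal sketch using only the Python standard library: it sends a plain UDP query without an EDNS0 OPT record and reads the RCODE from the reply header. Whether the affected busybox image sends queries this way is an assumption, not something confirmed in this report. The function names are hypothetical, and `10.43.0.10` is the CoreDNS address taken from the output above.

```python
import socket
import struct

QTYPE_A = 1
QTYPE_AAAA = 28


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels (no compression)."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int, txid: int = 0x1234) -> bytes:
    """Plain DNS query with the RD flag set and no EDNS0 OPT record."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack("!HH", qtype, 1)


def rcode_of(response: bytes) -> int:
    """Extract the 4-bit RCODE from the DNS header flags (0 = NOERROR, 2 = SERVFAIL)."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def resolve_rcode(server: str, name: str, qtype: int, timeout: float = 2.0) -> int:
    """Send one UDP query to `server` and return the RCODE of the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server, 53))
        response, _ = sock.recvfrom(4096)
    return rcode_of(response)


if __name__ == "__main__":
    for label, qtype in (("A", QTYPE_A), ("AAAA", QTYPE_AAAA)):
        code = resolve_rcode("10.43.0.10", "login.microsoftonline.com", qtype)
        print(f"{label}: rcode={code}")
```

An RCODE of 2 for either record type corresponds to the `Server failure` responses seen in the capture.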

      Additional info:

      Similar issues have been observed with CoreDNS for OpenShift:
      - https://bugzilla.redhat.com/show_bug.cgi?id=1949361
      - https://github.com/coredns/coredns/issues/3941
      - https://github.com/coredns/coredns/issues/5953
      For OpenShift, there is a KB article, https://access.redhat.com/solutions/5984291, that describes how to override some configurations.

              eslutsky Evgeny Slutsky
              openshift-crt-jira-prow OpenShift Prow Bot
              Douglas Hensel Douglas Hensel
              Shauna Diaz Shauna Diaz