Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-28200

SRV lookup is failing after OpenShift Container Platform 4.13 update because of CoreDNS version 1.10.1

    XMLWordPrintable

Details

    • Critical
    • No
    • 2
    • Sprint 248
    • 1
    • Rejected
    • False
    • Hide

      None

      Show
      None
    • Hide
      * Previously, upgrading {product-title} can lead to DNS quieries failing due to upstream returning a payload larger than 512 bytes for non-EDNS queries using CoreDNS 1.10.1. With this release, clusters with a non-compliant upstream will retry with TCP upon overflow errors which will prevent disruption of function when upgrading. (link:https://issues.redhat.com/browse/OCPBUGS-28200[*OCPBUGS-28200*])

      //original bug text below.
      *Cause*: The Azure upstream DNS doesn't comply with non-EDNS DNS queries as it returns a payload larger than 512 bytes. Additionally, CoreDNS 1.10.1 no longer always uses EDNS for upstream queries; it only uses EDNS when the original client query used EDNS. This combination results in an overflow error and SERVFAIL if upstream returns a payload larger than 512 bytes for non-EDNS queries using CoreDNS 1.10.1.
      *Consequence*: Upgrading from OCP 4.12 to 4.13 can lead to some DNS queries failing, which previously worked, as CoreDNS 1.10.1 can now query upstream without EDNS, exposing the non-compliant scenario in Azure upstream.
      *Fix*: The CoreDNS upstream has a fix for this. Instead of throwing an overflow error and returning SERVFAIL upon overflow, it now truncates the response, indicating to the client to try again in TCP. The solution is to backport this fix into our current versions of CoreDNS.
      *Result*: Clusters with a non-compliant upstream will now retry with TCP upon overflow errors, preventing any disruption of functionality between OCP 4.12 and 4.13.
      Show
      * Previously, upgrading {product-title} can lead to DNS quieries failing due to upstream returning a payload larger than 512 bytes for non-EDNS queries using CoreDNS 1.10.1. With this release, clusters with a non-compliant upstream will retry with TCP upon overflow errors which will prevent disruption of function when upgrading. (link: https://issues.redhat.com/browse/OCPBUGS-28200 [* OCPBUGS-28200 *]) //original bug text below. *Cause*: The Azure upstream DNS doesn't comply with non-EDNS DNS queries as it returns a payload larger than 512 bytes. Additionally, CoreDNS 1.10.1 no longer always uses EDNS for upstream queries; it only uses EDNS when the original client query used EDNS. This combination results in an overflow error and SERVFAIL if upstream returns a payload larger than 512 bytes for non-EDNS queries using CoreDNS 1.10.1. *Consequence*: Upgrading from OCP 4.12 to 4.13 can lead to some DNS queries failing, which previously worked, as CoreDNS 1.10.1 can now query upstream without EDNS, exposing the non-compliant scenario in Azure upstream. *Fix*: The CoreDNS upstream has a fix for this. Instead of throwing an overflow error and returning SERVFAIL upon overflow, it now truncates the response, indicating to the client to try again in TCP. The solution is to backport this fix into our current versions of CoreDNS. *Result*: Clusters with a non-compliant upstream will now retry with TCP upon overflow errors, preventing any disruption of functionality between OCP 4.12 and 4.13.
    • Bug Fix
    • Proposed

    Description

       This is a clone of issue OCPBUGS-27904 for the 4.14.0 backport. The following is the description of the original issue:

      Description of problem:

      After the update to OpenShift Container Platform 4.13, it was reported that the SRV query for _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net is failing. The query is sent to CoreDNS is not matching any configured forwardPlugin and therefore the default is applied. When revering the dns-default pod Image back to OpenShift Container Platform 4.12 it works and this is also the workaround that has been put in place as production application were affected. Testing shows that the problem is available in OpenShift Container Platform 4.13, 4.14 and even 4.15. Forcing TCP on pod level does not change the behavior and the query will still fail. But when configuring a specific forwardPlugin for the Domain and enforcing DNS over TCP it also works again.
      
       - Adjusting bufsize did/does not help as the result was still the same (suspecting this because of https://issues.redhat.com/browse/OCPBUGS-21901 - but again, as no effect)
       - Only way to make it work, is to force_tcp either in default ". /etc/resolv.conf" section or by configure a forwardPlugin and forcing TCP
      
      Checking upstream, I found https://github.com/coredns/coredns/issues/5953 respectively https://github.com/coredns/coredns/pull/6277 which I suspect being related. When building from master CoreDNS branch it indeed starts to work again and resolving the SRV entry is possible again.
      
      ---
      
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.13.27   True        False         24h     Cluster version is 4.13.27
      
      $ oc get pod -o wide
      NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
      dns-default-626td     2/2     Running   0          3m15s   10.128.2.49    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      dns-default-74nnw     2/2     Running   0          87s     10.131.0.47    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      dns-default-8mggz     2/2     Running   0          2m31s   10.128.1.121   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      dns-default-clgkg     2/2     Running   0          109s    10.129.2.187   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      dns-default-htdw2     2/2     Running   0          2m10s   10.129.0.43    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      dns-default-wprln     2/2     Running   0          2m52s   10.130.1.70    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      node-resolver-4dmgj   1/1     Running   0          17h     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      node-resolver-5c6tj   1/1     Running   0          17h     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      node-resolver-chfr6   1/1     Running   0          17h     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      node-resolver-mnhsp   1/1     Running   0          17h     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      node-resolver-snxsb   1/1     Running   0          17h     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      node-resolver-sp7h8   1/1     Running   0          17h     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      
      $ oc get pod -o wide -n project-100
      NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
      tools-54f4d6844b-lr6z9   1/1     Running   0          17h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      
      $ oc get dns.operator default -o yaml
      apiVersion: operator.openshift.io/v1
      kind: DNS
      metadata:
        creationTimestamp: "2024-01-11T09:14:03Z"
        finalizers:
        - dns.operator.openshift.io/dns-controller
        generation: 4
        name: default
        resourceVersion: "4216641"
        uid: c8f5c627-2010-4c4a-a5fe-ed87f320e427
      spec:
        logLevel: Normal
        nodePlacement: {}
        operatorLogLevel: Normal
        servers:
        - forwardPlugin:
            policy: Random
            protocolStrategy: ""
            upstreams:
            - 10.0.0.9
          name: example
          zones:
          - example.xyz
        upstreamResolvers:
          policy: Sequential
          transportConfig: {}
          upstreams:
          - port: 53
            type: SystemResolvConf
      status:
        clusterDomain: cluster.local
        clusterIP: 172.30.0.10
        conditions:
        - lastTransitionTime: "2024-01-19T07:54:18Z"
          message: Enough DNS pods are available, and the DNS service has a cluster IP address.
          reason: AsExpected
          status: "False"
          type: Degraded
        - lastTransitionTime: "2024-01-19T07:55:02Z"
          message: All DNS and node-resolver pods are available, and the DNS service has
            a cluster IP address.
          reason: AsExpected
          status: "False"
          type: Progressing
        - lastTransitionTime: "2024-01-18T13:29:59Z"
          message: The DNS daemonset has available pods, and the DNS service has a cluster
            IP address.
          reason: AsExpected
          status: "True"
          type: Available
        - lastTransitionTime: "2024-01-11T09:14:04Z"
          message: DNS Operator can be upgraded
          reason: AsExpected
          status: "True"
          type: Upgradeable
      
      $ oc rsh -n project-100 tools-54f4d6844b-lr6z9
      sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
      Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL)
      
      $ oc logs dns-default-74nnw
      Defaulted container "dns" out of: dns, kube-rbac-proxy
      .:5353
      hostname.bind.:5353
      example.xyz.:5353
      [INFO] plugin/reload: Running configuration SHA512 = 88c7c194d29d0a23b322aeee1eaa654ef385e6bd1affae3715028aba1d33cc8340e33184ba183f87e6c66a2014261c3e02edaea8e42ad01ec6a7c5edb34dfc6a
      CoreDNS-1.10.1
      linux/amd64, go1.19.13 X:strictfipsruntime, 
      [INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.001868103s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      [INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.003223099s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      
      ---
      
      https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/4.12.47/release.txt - using quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c0de49c0e76f2ee23a107fc9397f2fd32e7a6a8a458906afd6df04ff5bb0f7b
      
      $ oc get pod -o wide
      NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
      dns-default-8vrwd     2/2     Running   0          6m22s   10.129.0.45    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      dns-default-fm59d     2/2     Running   0          7m4s    10.129.2.190   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      dns-default-grtqs     2/2     Running   0          7m48s   10.130.1.73    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      dns-default-l8mp2     2/2     Running   0          6m43s   10.131.0.49    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      dns-default-slc4n     2/2     Running   0          8m11s   10.128.1.126   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      dns-default-xgr7c     2/2     Running   0          7m25s   10.128.2.51    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      node-resolver-2nmpx   1/1     Running   0          10m     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      node-resolver-689j7   1/1     Running   0          10m     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      node-resolver-8qhls   1/1     Running   0          10m     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      node-resolver-nv8mq   1/1     Running   0          10m     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      node-resolver-r52v7   1/1     Running   0          10m     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      node-resolver-z8d4n   1/1     Running   0          10m     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      
      $ oc get pod -n project-100 -o wide
      NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
      tools-54f4d6844b-lr6z9   1/1     Running   0          18h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      
      $ oc rsh -n project-100 tools-54f4d6844b-lr6z9
      sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1032 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1039 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1043 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1048 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1049 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1050 x1-9-foobar.bla.example.net.
      
      ---
      
      https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/4.15.0-rc.2/release.txt - using quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9e8ffba7854f3f02e8940ddcb2636ceb4773db77872ff639a447c4bab3a69ecc
      
      $ oc get pod -o wide
      NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
      dns-default-gcs7s     2/2     Running   0          5m      10.128.2.52    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      dns-default-mnbh4     2/2     Running   0          4m37s   10.129.0.46    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      dns-default-p2s6v     2/2     Running   0          3m55s   10.130.1.77    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      dns-default-svccn     2/2     Running   0          3m13s   10.128.1.128   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      dns-default-tgktg     2/2     Running   0          3m34s   10.131.0.50    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      dns-default-xd5vq     2/2     Running   0          4m16s   10.129.2.191   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      node-resolver-2nmpx   1/1     Running   0          18m     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      node-resolver-689j7   1/1     Running   0          18m     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      node-resolver-8qhls   1/1     Running   0          18m     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      node-resolver-nv8mq   1/1     Running   0          18m     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      node-resolver-r52v7   1/1     Running   0          18m     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      node-resolver-z8d4n   1/1     Running   0          18m     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      
      $ oc get pod -n project-100 -o wide
      NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
      tools-54f4d6844b-lr6z9   1/1     Running   0          18h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      
      $ oc rsh -n project-100 tools-54f4d6844b-lr6z9
      sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
      Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL)
      
      $ oc logs dns-default-tgktg
      Defaulted container "dns" out of: dns, kube-rbac-proxy
      .:5353
      hostname.bind.:5353
      example.net.:5353
      [INFO] plugin/reload: Running configuration SHA512 = 8efa6675505d17551d17ca1e2ca45506a731dbab1f53dd687d37cb98dbaf4987a90622b6b030fe1643ba2cd17198a813ba9302b84ad729de4848f8998e768605
      CoreDNS-1.11.1
      linux/amd64, go1.20.10 X:strictfipsruntime, 
      [INFO] 10.131.0.40:35246 - 61734 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.003577431s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      [INFO] 10.131.0.40:35246 - 61734 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.000969251s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      
      ---
      
      quay.io/rhn_support_sreber/coredns:latest - based on https://github.com/coredns/coredns master branch build on January 19th 2024 (suspecting https://github.com/coredns/coredns/pull/6277 to be the fix)
      
      $ oc get pod -o wide
      NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
      dns-default-bpjpn     2/2     Running   0          2m22s   10.130.1.78    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      dns-default-c7wcz     2/2     Running   0          99s     10.131.0.51    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      dns-default-d7qjz     2/2     Running   0          3m6s    10.129.2.193   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      dns-default-dkvtp     2/2     Running   0          78s     10.128.1.131   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      dns-default-t6sv7     2/2     Running   0          2m44s   10.129.0.47    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      dns-default-vf9f6     2/2     Running   0          2m      10.128.2.53    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      node-resolver-2nmpx   1/1     Running   0          24m     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      node-resolver-689j7   1/1     Running   0          24m     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      node-resolver-8qhls   1/1     Running   0          24m     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      node-resolver-nv8mq   1/1     Running   0          24m     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      node-resolver-r52v7   1/1     Running   0          24m     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      node-resolver-z8d4n   1/1     Running   0          24m     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      
      $ oc get pod -n project-100 -o wide
      NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
      tools-54f4d6844b-lr6z9   1/1     Running   0          18h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      
      $ oc rsh -n project-100 tools-54f4d6844b-lr6z9
      sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1032 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1039 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1043 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1048 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1049 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1050 x1-9-foobar.bla.example.net.
      
      ---
      
      Back wth OpenShift Container Platform 4.13.27 but adjusting `CoreDNS` configuration. Defining specific forwardPlugin and enforcing TCP
      
      $ oc get dns.operator default -o yaml
      apiVersion: operator.openshift.io/v1
      kind: DNS
      metadata:
        creationTimestamp: "2024-01-11T09:14:03Z"
        finalizers:
        - dns.operator.openshift.io/dns-controller
        generation: 7
        name: default
        resourceVersion: "4230436"
        uid: c8f5c627-2010-4c4a-a5fe-ed87f320e427
      spec:
        logLevel: Normal
        nodePlacement: {}
        operatorLogLevel: Normal
        servers:
        - forwardPlugin:
            policy: Random
            protocolStrategy: TCP
            upstreams:
            - 10.0.0.9
          name: example
          zones:
          - example.net
        upstreamResolvers:
          policy: Sequential
          transportConfig: {}
          upstreams:
          - port: 53
            type: SystemResolvConf
      status:
        clusterDomain: cluster.local
        clusterIP: 172.30.0.10
        conditions:
        - lastTransitionTime: "2024-01-19T08:27:21Z"
          message: Enough DNS pods are available, and the DNS service has a cluster IP address.
          reason: AsExpected
          status: "False"
          type: Degraded
        - lastTransitionTime: "2024-01-19T08:28:03Z"
          message: All DNS and node-resolver pods are available, and the DNS service has
            a cluster IP address.
          reason: AsExpected
          status: "False"
          type: Progressing
        - lastTransitionTime: "2024-01-19T08:00:02Z"
          message: The DNS daemonset has available pods, and the DNS service has a cluster
            IP address.
          reason: AsExpected
          status: "True"
          type: Available
        - lastTransitionTime: "2024-01-11T09:14:04Z"
          message: DNS Operator can be upgraded
          reason: AsExpected
          status: "True"
          type: Upgradeable
      
      $ oc get pod -o wide
      NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
      dns-default-frdkm     2/2     Running   0          3m5s    10.131.0.52    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      dns-default-jsfkb     2/2     Running   0          99s     10.129.0.49    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      dns-default-jzzqc     2/2     Running   0          2m21s   10.128.2.54    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      dns-default-sgf4h     2/2     Running   0          2m      10.130.1.79    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      dns-default-t8nn7     2/2     Running   0          2m44s   10.129.2.194   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      dns-default-xmvqg     2/2     Running   0          3m27s   10.128.1.133   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      node-resolver-2nmpx   1/1     Running   0          29m     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      node-resolver-689j7   1/1     Running   0          29m     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      node-resolver-8qhls   1/1     Running   0          29m     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      node-resolver-nv8mq   1/1     Running   0          29m     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      node-resolver-r52v7   1/1     Running   0          29m     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      node-resolver-z8d4n   1/1     Running   0          29m     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      
      $ oc get pod -n project-100 -o wide
      NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
      tools-54f4d6844b-lr6z9   1/1     Running   0          18h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      
      $ oc rsh -n project-100 tools-54f4d6844b-lr6z9
      sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1032 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1039 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1043 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1048 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1049 x1-9-foobar.bla.example.net.
      _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1050 x1-9-foobar.bla.example.net.
      
      ---
      
      Back wth OpenShift Container Platform 4.13.27 but now, forcing TCP on pod level
      
      $ oc get deployment tools -n project-100 -o yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        annotations:
          alpha.image.policy.openshift.io/resolve-names: '*'
          app.openshift.io/route-disabled: "false"
          deployment.kubernetes.io/revision: "5"
          image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"tools:latest","namespace":"project-100"},"fieldPath":"spec.template.spec.containers[?(@.name==\"tools\")].image","pause":"false"}]'
          openshift.io/generated-by: OpenShiftWebConsole
        creationTimestamp: "2024-01-17T11:22:05Z"
        generation: 5
        labels:
          app: tools
          app.kubernetes.io/component: tools
          app.kubernetes.io/instance: tools
          app.kubernetes.io/name: tools
          app.kubernetes.io/part-of: tools
          app.openshift.io/runtime: other-linux
          app.openshift.io/runtime-namespace: project-100
        name: tools
        namespace: project-100
        resourceVersion: "4232839"
        uid: a8157243-71e1-4597-9aa5-497afed5f722
      spec:
        progressDeadlineSeconds: 600
        replicas: 1
        revisionHistoryLimit: 10
        selector:
          matchLabels:
            app: tools
        strategy:
          rollingUpdate:
            maxSurge: 25%
            maxUnavailable: 25%
          type: RollingUpdate
        template:
          metadata:
            annotations:
              openshift.io/generated-by: OpenShiftWebConsole
            creationTimestamp: null
            labels:
              app: tools
              deployment: tools
          spec:
            containers:
            - command:
              - /bin/bash
              - -c
              - while true; do sleep 1;done
              image: image-registry.openshift-image-registry.svc:5000/project-100/tools@sha256:fba289d2ff20df2bfe38aa58fa3e491bbecf09e90e96b3c9b8c38f786dc2efb8
              imagePullPolicy: Always
              name: tools
              ports:
              - containerPort: 8080
                protocol: TCP
              resources: {}
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
            dnsConfig:
              options:
              - name: use-vc
            dnsPolicy: ClusterFirst
            restartPolicy: Always
            schedulerName: default-scheduler
            securityContext: {}
            terminationGracePeriodSeconds: 30
      status:
        availableReplicas: 1
        conditions:
        - lastTransitionTime: "2024-01-17T11:23:56Z"
          lastUpdateTime: "2024-01-17T11:23:56Z"
          message: Deployment has minimum availability.
          reason: MinimumReplicasAvailable
          status: "True"
          type: Available
        - lastTransitionTime: "2024-01-17T11:22:05Z"
          lastUpdateTime: "2024-01-19T08:33:28Z"
          message: ReplicaSet "tools-6749b4cf47" has successfully progressed.
          reason: NewReplicaSetAvailable
          status: "True"
          type: Progressing
        observedGeneration: 5
        readyReplicas: 1
        replicas: 1
        updatedReplicas: 1
      
      $ oc get pod -o wide
      NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
      dns-default-7kfzh     2/2     Running   0          2m25s   10.129.2.196   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      dns-default-g4mtd     2/2     Running   0          2m25s   10.128.2.55    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      dns-default-l4xkg     2/2     Running   0          2m26s   10.129.0.50    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      dns-default-l7rq8     2/2     Running   0          2m25s   10.128.1.135   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      dns-default-lt6zx     2/2     Running   0          2m26s   10.131.0.53    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      dns-default-t6bzl     2/2     Running   0          2m25s   10.130.1.82    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      node-resolver-279mf   1/1     Running   0          2m24s   10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
      node-resolver-2bzfc   1/1     Running   0          2m24s   10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
      node-resolver-bdz4m   1/1     Running   0          2m24s   10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
      node-resolver-jrv2w   1/1     Running   0          2m24s   10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>
      node-resolver-lbfg5   1/1     Running   0          2m23s   10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
      node-resolver-qnm92   1/1     Running   0          2m24s   10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      
      $ oc get pod -n project-100 -o wide
      NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
      tools-6749b4cf47-gmw9v   1/1     Running   0          50s   10.131.0.54   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
      
      $ oc rsh -n project-100 tools-6749b4cf47-gmw9v
      sh-4.4$ cat /etc/resolv.conf 
      search project-100.svc.cluster.local svc.cluster.local cluster.local khrmlwa2zp4e1oisi1qjtoxwrc.bx.internal.cloudapp.net
      nameserver 172.30.0.10
      options ndots:5 use-vc
      
      sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
      Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL)
      
      $ oc logs dns-default-lt6zx
      Defaulted container "dns" out of: dns, kube-rbac-proxy
      .:5353
      hostname.bind.:5353
      example.xyz.:5353
      [INFO] plugin/reload: Running configuration SHA512 = 79d17b9fc0f61d2c6db13a0f7f3d0a873c4d86ab5cba90c3819a5b57a48fac2ef0fb644b55e959984cd51377bff0db04f399a341a584c466e540a0d7501340f7
      CoreDNS-1.10.1
      linux/amd64, go1.19.13 X:strictfipsruntime, 
      [INFO] 10.131.0.40:51367 - 22867 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.00024781s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      [INFO] 10.131.0.40:51367 - 22867 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.00096551s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      [INFO] 10.131.0.54:44935 - 3087 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.000619524s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      [INFO] 10.131.0.54:44935 - 3087 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.000369584s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      
      

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.13, 4.14, 4.15
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Run "host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net" inside a pod
      

      Actual results:

      dns-default pod is reporting below error when running the query.
      
      [INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.001868103s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      [INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.003223099s
      [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
      
      And the command "host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net" will fail.
      
      sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
      Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL)
      
      

      Expected results:

      No error reported in dns-default pod and query to actually return expected result
      
      

      Additional info:

      I suspect https://github.com/coredns/coredns/issues/5953 respectively https://github.com/coredns/coredns/pull/6277 being related. Hence built CoreDNS from master branch and created quay.io/rhn_support_sreber/coredns:latest. When running that Image in dns-default pod resolving the host query works again.
      

      Attachments

        Issue Links

          Activity

            People

              gspence@redhat.com Grant Spence
              rhn-support-sreber Simon Reber
              Melvin Joseph Melvin Joseph
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: