[OCPBUGS-28200] SRV lookup is failing after OpenShift Container Platform 4.13 update because of CoreDNS version 1.10.1 - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.14.z
Affects Version/s: 4.13, 4.14, 4.15
Component/s: Networking / DNS
Labels:
- ne-triaged
- pre-merge-verify

Test Coverage:

Severity:
Critical
Regression:
No
Story Points:
2
Sprint:
Sprint 248
sprint_count:
1
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None

Release Note Text:

* Previously, upgrading {product-title} could lead to DNS queries failing due to upstream returning a payload larger than 512 bytes for non-EDNS queries using CoreDNS 1.10.1. With this release, clusters with a non-compliant upstream retry with TCP upon overflow errors, which prevents disruption of function when upgrading. (link:https://issues.redhat.com/browse/OCPBUGS-28200[*OCPBUGS-28200*])

//original bug text below.
*Cause*: The Azure upstream DNS doesn't comply with non-EDNS DNS queries as it returns a payload larger than 512 bytes. Additionally, CoreDNS 1.10.1 no longer always uses EDNS for upstream queries; it only uses EDNS when the original client query used EDNS. This combination results in an overflow error and SERVFAIL if upstream returns a payload larger than 512 bytes for non-EDNS queries using CoreDNS 1.10.1.
*Consequence*: Upgrading from OCP 4.12 to 4.13 can lead to some DNS queries failing, which previously worked, as CoreDNS 1.10.1 can now query upstream without EDNS, exposing the non-compliant scenario in Azure upstream.
*Fix*: Upstream CoreDNS has a fix for this. Instead of throwing an overflow error and returning SERVFAIL upon overflow, it now truncates the response, signaling the client to retry over TCP. The solution is to backport this fix into our current versions of CoreDNS.
*Result*: Clusters with a non-compliant upstream will now retry with TCP upon overflow errors, preventing any disruption of functionality between OCP 4.12 and 4.13.

Release Note Type:
Bug Fix
Release Note Status:
Proposed
Target Version:

4.14.z
Target Backport Versions:

4.13.z, 4.14.z, 4.15.z
Escape Reason:
Escape Impact:
Corrective Measures:
SDLC stage when should've been found:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

This is a clone of issue ~~OCPBUGS-27904~~ for the 4.14.0 backport. The following is the description of the original issue:

Description of problem:

After the update to OpenShift Container Platform 4.13, it was reported that the SRV query for _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net is failing. The query sent to CoreDNS does not match any configured forwardPlugin, so the default upstream resolver is applied. Reverting the dns-default pod image to the one from OpenShift Container Platform 4.12 makes the query work again; this is also the workaround that was put in place because production applications were affected. Testing shows that the problem is present in OpenShift Container Platform 4.13, 4.14 and even 4.15. Forcing TCP at the pod level does not change the behavior, and the query still fails. However, configuring a specific forwardPlugin for the domain and enforcing DNS over TCP makes it work again.

 - Adjusting bufsize did not help; the result was the same (this was suspected because of https://issues.redhat.com/browse/OCPBUGS-21901, but it had no effect)
 - The only way to make it work is to set force_tcp, either in the default ". /etc/resolv.conf" section or by configuring a forwardPlugin and forcing TCP

Checking upstream, I found https://github.com/coredns/coredns/issues/5953 and https://github.com/coredns/coredns/pull/6277, which I suspect are related. When CoreDNS is built from the master branch, it indeed works again and the SRV entry can be resolved.
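
The release note for this issue attributes the failure to upstream responses larger than 512 bytes for queries sent without EDNS. As a minimal sketch of that difference, the following Python builds the same SRV query with and without an EDNS(0) OPT record and reads back the UDP payload size each one advertises. The helper names and the 1232-byte buffer size are illustrative assumptions and are not taken from CoreDNS.

```python
import struct
from typing import Optional

SRV = 33  # QTYPE for SRV records
OPT = 41  # pseudo-RR type that carries EDNS(0)


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += struct.pack("!B", len(raw)) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = SRV,
                edns_udp_size: Optional[int] = None,
                msg_id: int = 0x1234) -> bytes:
    """Build a DNS query; add an OPT record only when edns_udp_size is given."""
    arcount = 1 if edns_udp_size is not None else 0
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    header = struct.pack("!HHHHHH", msg_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    msg = header + question
    if edns_udp_size is not None:
        # OPT RR: root name, TYPE=OPT, CLASS=UDP payload size, TTL=0, RDLEN=0
        msg += b"\x00" + struct.pack("!HHIH", OPT, edns_udp_size, 0, 0)
    return msg


def advertised_udp_size(query: bytes) -> int:
    """Return the UDP payload size a query advertises (512 without EDNS)."""
    arcount = struct.unpack("!H", query[10:12])[0]
    if arcount == 0:
        return 512
    # In this sketch the OPT RR is always the last 11 bytes of the message.
    rr_type, size = struct.unpack("!HH", query[-10:-6])
    return size if rr_type == OPT else 512


if __name__ == "__main__":
    name = "_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net"
    for query in (build_query(name), build_query(name, edns_udp_size=1232)):
        print(len(query), "bytes, advertises", advertised_udp_size(query))
```

A responder that honors the advertised size must keep its UDP answer within 512 bytes for the first query; only the second query permits a larger UDP answer.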

---

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.27   True        False         24h     Cluster version is 4.13.27

$ oc get pod -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
dns-default-626td     2/2     Running   0          3m15s   10.128.2.49    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
dns-default-74nnw     2/2     Running   0          87s     10.131.0.47    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
dns-default-8mggz     2/2     Running   0          2m31s   10.128.1.121   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
dns-default-clgkg     2/2     Running   0          109s    10.129.2.187   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
dns-default-htdw2     2/2     Running   0          2m10s   10.129.0.43    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
dns-default-wprln     2/2     Running   0          2m52s   10.130.1.70    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
node-resolver-4dmgj   1/1     Running   0          17h     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
node-resolver-5c6tj   1/1     Running   0          17h     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
node-resolver-chfr6   1/1     Running   0          17h     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
node-resolver-mnhsp   1/1     Running   0          17h     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
node-resolver-snxsb   1/1     Running   0          17h     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>
node-resolver-sp7h8   1/1     Running   0          17h     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>

$ oc get pod -o wide -n project-100
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
tools-54f4d6844b-lr6z9   1/1     Running   0          17h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>

$ oc get dns.operator default -o yaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  creationTimestamp: "2024-01-11T09:14:03Z"
  finalizers:
  - dns.operator.openshift.io/dns-controller
  generation: 4
  name: default
  resourceVersion: "4216641"
  uid: c8f5c627-2010-4c4a-a5fe-ed87f320e427
spec:
  logLevel: Normal
  nodePlacement: {}
  operatorLogLevel: Normal
  servers:
  - forwardPlugin:
      policy: Random
      protocolStrategy: ""
      upstreams:
      - 10.0.0.9
    name: example
    zones:
    - example.xyz
  upstreamResolvers:
    policy: Sequential
    transportConfig: {}
    upstreams:
    - port: 53
      type: SystemResolvConf
status:
  clusterDomain: cluster.local
  clusterIP: 172.30.0.10
  conditions:
  - lastTransitionTime: "2024-01-19T07:54:18Z"
    message: Enough DNS pods are available, and the DNS service has a cluster IP address.
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-01-19T07:55:02Z"
    message: All DNS and node-resolver pods are available, and the DNS service has
      a cluster IP address.
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2024-01-18T13:29:59Z"
    message: The DNS daemonset has available pods, and the DNS service has a cluster
      IP address.
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2024-01-11T09:14:04Z"
    message: DNS Operator can be upgraded
    reason: AsExpected
    status: "True"
    type: Upgradeable

$ oc rsh -n project-100 tools-54f4d6844b-lr6z9
sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL)

$ oc logs dns-default-74nnw
Defaulted container "dns" out of: dns, kube-rbac-proxy
.:5353
hostname.bind.:5353
example.xyz.:5353
[INFO] plugin/reload: Running configuration SHA512 = 88c7c194d29d0a23b322aeee1eaa654ef385e6bd1affae3715028aba1d33cc8340e33184ba183f87e6c66a2014261c3e02edaea8e42ad01ec6a7c5edb34dfc6a
CoreDNS-1.10.1
linux/amd64, go1.19.13 X:strictfipsruntime, 
[INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.001868103s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
[INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.003223099s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
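
The [INFO] lines above can be read field by field. The following Python is a minimal sketch of a parser, assuming the lines follow the common format of the CoreDNS log plugin, in which the quoted part ends with protocol, request size, DO bit and advertised buffer size. The class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class QueryLog(NamedTuple):
    remote: str
    qtype: str
    name: str
    proto: str
    size: int        # request size in bytes
    do: bool         # DNSSEC OK bit
    bufsize: int     # UDP buffer size reported for the request
    rcode: str       # "-" when no response code was recorded
    rsize: int       # response size in bytes
    duration_s: float


# Assumed layout: remote:port - id "type class name proto size do bufsize" rcode rflags rsize duration
_LINE = re.compile(
    r'^\[INFO\] (?P<remote>\S+):(?P<port>\d+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) '
    r'(?P<size>\d+) (?P<do>true|false) (?P<bufsize>\d+)" '
    r'(?P<rcode>\S+) (?P<rflags>\S+) (?P<rsize>\d+) '
    r'(?P<duration>\d+(?:\.\d+)?)s$'
)


def parse_query_log(line: str) -> Optional[QueryLog]:
    """Parse one [INFO] query line; return None for any other line."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    return QueryLog(
        remote=m["remote"],
        qtype=m["qtype"],
        name=m["name"],
        proto=m["proto"],
        size=int(m["size"]),
        do=m["do"] == "true",
        bufsize=int(m["bufsize"]),
        rcode=m["rcode"],
        rsize=int(m["rsize"]),
        duration_s=float(m["duration"]),
    )


def looks_like_plain_udp(entry: QueryLog) -> bool:
    """True when the entry is consistent with a UDP query limited to 512 bytes."""
    return entry.proto == "udp" and entry.bufsize <= 512 and not entry.do
```

Applied to the lines above, the parser reports a UDP request with a 512-byte buffer, no response code, a response size of 0 and a duration of about five seconds, which matches the SERVFAIL seen by the client.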

---

https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/4.12.47/release.txt - using quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c0de49c0e76f2ee23a107fc9397f2fd32e7a6a8a458906afd6df04ff5bb0f7b

$ oc get pod -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
dns-default-8vrwd     2/2     Running   0          6m22s   10.129.0.45    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
dns-default-fm59d     2/2     Running   0          7m4s    10.129.2.190   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
dns-default-grtqs     2/2     Running   0          7m48s   10.130.1.73    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
dns-default-l8mp2     2/2     Running   0          6m43s   10.131.0.49    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
dns-default-slc4n     2/2     Running   0          8m11s   10.128.1.126   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
dns-default-xgr7c     2/2     Running   0          7m25s   10.128.2.51    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
node-resolver-2nmpx   1/1     Running   0          10m     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
node-resolver-689j7   1/1     Running   0          10m     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
node-resolver-8qhls   1/1     Running   0          10m     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
node-resolver-nv8mq   1/1     Running   0          10m     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
node-resolver-r52v7   1/1     Running   0          10m     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
node-resolver-z8d4n   1/1     Running   0          10m     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>

$ oc get pod -n project-100 -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
tools-54f4d6844b-lr6z9   1/1     Running   0          18h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>

$ oc rsh -n project-100 tools-54f4d6844b-lr6z9
sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1032 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1039 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1043 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1048 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1049 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1050 x1-9-foobar.bla.example.net.

---

https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/4.15.0-rc.2/release.txt - using quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9e8ffba7854f3f02e8940ddcb2636ceb4773db77872ff639a447c4bab3a69ecc

$ oc get pod -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
dns-default-gcs7s     2/2     Running   0          5m      10.128.2.52    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
dns-default-mnbh4     2/2     Running   0          4m37s   10.129.0.46    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
dns-default-p2s6v     2/2     Running   0          3m55s   10.130.1.77    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
dns-default-svccn     2/2     Running   0          3m13s   10.128.1.128   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
dns-default-tgktg     2/2     Running   0          3m34s   10.131.0.50    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
dns-default-xd5vq     2/2     Running   0          4m16s   10.129.2.191   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
node-resolver-2nmpx   1/1     Running   0          18m     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
node-resolver-689j7   1/1     Running   0          18m     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
node-resolver-8qhls   1/1     Running   0          18m     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
node-resolver-nv8mq   1/1     Running   0          18m     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
node-resolver-r52v7   1/1     Running   0          18m     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
node-resolver-z8d4n   1/1     Running   0          18m     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>

$ oc get pod -n project-100 -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
tools-54f4d6844b-lr6z9   1/1     Running   0          18h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>

$ oc rsh -n project-100 tools-54f4d6844b-lr6z9
sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL)

$ oc logs dns-default-tgktg
Defaulted container "dns" out of: dns, kube-rbac-proxy
.:5353
hostname.bind.:5353
example.net.:5353
[INFO] plugin/reload: Running configuration SHA512 = 8efa6675505d17551d17ca1e2ca45506a731dbab1f53dd687d37cb98dbaf4987a90622b6b030fe1643ba2cd17198a813ba9302b84ad729de4848f8998e768605
CoreDNS-1.11.1
linux/amd64, go1.20.10 X:strictfipsruntime, 
[INFO] 10.131.0.40:35246 - 61734 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.003577431s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
[INFO] 10.131.0.40:35246 - 61734 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.000969251s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size

---

quay.io/rhn_support_sreber/coredns:latest - based on https://github.com/coredns/coredns master branch, built on January 19th 2024 (suspecting https://github.com/coredns/coredns/pull/6277 to be the fix)

$ oc get pod -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
dns-default-bpjpn     2/2     Running   0          2m22s   10.130.1.78    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
dns-default-c7wcz     2/2     Running   0          99s     10.131.0.51    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
dns-default-d7qjz     2/2     Running   0          3m6s    10.129.2.193   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
dns-default-dkvtp     2/2     Running   0          78s     10.128.1.131   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
dns-default-t6sv7     2/2     Running   0          2m44s   10.129.0.47    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
dns-default-vf9f6     2/2     Running   0          2m      10.128.2.53    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
node-resolver-2nmpx   1/1     Running   0          24m     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
node-resolver-689j7   1/1     Running   0          24m     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
node-resolver-8qhls   1/1     Running   0          24m     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
node-resolver-nv8mq   1/1     Running   0          24m     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
node-resolver-r52v7   1/1     Running   0          24m     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
node-resolver-z8d4n   1/1     Running   0          24m     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>

$ oc get pod -n project-100 -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
tools-54f4d6844b-lr6z9   1/1     Running   0          18h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>

$ oc rsh -n project-100 tools-54f4d6844b-lr6z9
sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1032 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1039 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1043 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1048 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1049 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1050 x1-9-foobar.bla.example.net.

---

Back with OpenShift Container Platform 4.13.27, but adjusting the `CoreDNS` configuration: defining a specific forwardPlugin and enforcing TCP

$ oc get dns.operator default -o yaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  creationTimestamp: "2024-01-11T09:14:03Z"
  finalizers:
  - dns.operator.openshift.io/dns-controller
  generation: 7
  name: default
  resourceVersion: "4230436"
  uid: c8f5c627-2010-4c4a-a5fe-ed87f320e427
spec:
  logLevel: Normal
  nodePlacement: {}
  operatorLogLevel: Normal
  servers:
  - forwardPlugin:
      policy: Random
      protocolStrategy: TCP
      upstreams:
      - 10.0.0.9
    name: example
    zones:
    - example.net
  upstreamResolvers:
    policy: Sequential
    transportConfig: {}
    upstreams:
    - port: 53
      type: SystemResolvConf
status:
  clusterDomain: cluster.local
  clusterIP: 172.30.0.10
  conditions:
  - lastTransitionTime: "2024-01-19T08:27:21Z"
    message: Enough DNS pods are available, and the DNS service has a cluster IP address.
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-01-19T08:28:03Z"
    message: All DNS and node-resolver pods are available, and the DNS service has
      a cluster IP address.
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2024-01-19T08:00:02Z"
    message: The DNS daemonset has available pods, and the DNS service has a cluster
      IP address.
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2024-01-11T09:14:04Z"
    message: DNS Operator can be upgraded
    reason: AsExpected
    status: "True"
    type: Upgradeable

$ oc get pod -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
dns-default-frdkm     2/2     Running   0          3m5s    10.131.0.52    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
dns-default-jsfkb     2/2     Running   0          99s     10.129.0.49    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
dns-default-jzzqc     2/2     Running   0          2m21s   10.128.2.54    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
dns-default-sgf4h     2/2     Running   0          2m      10.130.1.79    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
dns-default-t8nn7     2/2     Running   0          2m44s   10.129.2.194   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
dns-default-xmvqg     2/2     Running   0          3m27s   10.128.1.133   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
node-resolver-2nmpx   1/1     Running   0          29m     10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
node-resolver-689j7   1/1     Running   0          29m     10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
node-resolver-8qhls   1/1     Running   0          29m     10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
node-resolver-nv8mq   1/1     Running   0          29m     10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
node-resolver-r52v7   1/1     Running   0          29m     10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
node-resolver-z8d4n   1/1     Running   0          29m     10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>

$ oc get pod -n project-100 -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
tools-54f4d6844b-lr6z9   1/1     Running   0          18h   10.131.0.40   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>

$ oc rsh -n project-100 tools-54f4d6844b-lr6z9
sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1032 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1039 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1043 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1048 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1049 x1-9-foobar.bla.example.net.
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1050 x1-9-foobar.bla.example.net.

---

Back with OpenShift Container Platform 4.13.27, but now forcing TCP at the pod level

$ oc get deployment tools -n project-100 -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    alpha.image.policy.openshift.io/resolve-names: '*'
    app.openshift.io/route-disabled: "false"
    deployment.kubernetes.io/revision: "5"
    image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"tools:latest","namespace":"project-100"},"fieldPath":"spec.template.spec.containers[?(@.name==\"tools\")].image","pause":"false"}]'
    openshift.io/generated-by: OpenShiftWebConsole
  creationTimestamp: "2024-01-17T11:22:05Z"
  generation: 5
  labels:
    app: tools
    app.kubernetes.io/component: tools
    app.kubernetes.io/instance: tools
    app.kubernetes.io/name: tools
    app.kubernetes.io/part-of: tools
    app.openshift.io/runtime: other-linux
    app.openshift.io/runtime-namespace: project-100
  name: tools
  namespace: project-100
  resourceVersion: "4232839"
  uid: a8157243-71e1-4597-9aa5-497afed5f722
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: tools
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        openshift.io/generated-by: OpenShiftWebConsole
      creationTimestamp: null
      labels:
        app: tools
        deployment: tools
    spec:
      containers:
      - command:
        - /bin/bash
        - -c
        - while true; do sleep 1;done
        image: image-registry.openshift-image-registry.svc:5000/project-100/tools@sha256:fba289d2ff20df2bfe38aa58fa3e491bbecf09e90e96b3c9b8c38f786dc2efb8
        imagePullPolicy: Always
        name: tools
        ports:
        - containerPort: 8080
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsConfig:
        options:
        - name: use-vc
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-01-17T11:23:56Z"
    lastUpdateTime: "2024-01-17T11:23:56Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-01-17T11:22:05Z"
    lastUpdateTime: "2024-01-19T08:33:28Z"
    message: ReplicaSet "tools-6749b4cf47" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 5
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

$ oc get pod -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
dns-default-7kfzh     2/2     Running   0          2m25s   10.129.2.196   aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
dns-default-g4mtd     2/2     Running   0          2m25s   10.128.2.55    aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
dns-default-l4xkg     2/2     Running   0          2m26s   10.129.0.50    aro-cluster-h78zv-h94mh-master-2               <none>           <none>
dns-default-l7rq8     2/2     Running   0          2m25s   10.128.1.135   aro-cluster-h78zv-h94mh-master-0               <none>           <none>
dns-default-lt6zx     2/2     Running   0          2m26s   10.131.0.53    aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>
dns-default-t6bzl     2/2     Running   0          2m25s   10.130.1.82    aro-cluster-h78zv-h94mh-master-1               <none>           <none>
node-resolver-279mf   1/1     Running   0          2m24s   10.0.2.6       aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh   <none>           <none>
node-resolver-2bzfc   1/1     Running   0          2m24s   10.0.2.4       aro-cluster-h78zv-h94mh-worker-eastus3-jhvff   <none>           <none>
node-resolver-bdz4m   1/1     Running   0          2m24s   10.0.0.7       aro-cluster-h78zv-h94mh-master-2               <none>           <none>
node-resolver-jrv2w   1/1     Running   0          2m24s   10.0.0.9       aro-cluster-h78zv-h94mh-master-1               <none>           <none>
node-resolver-lbfg5   1/1     Running   0          2m23s   10.0.0.10      aro-cluster-h78zv-h94mh-master-0               <none>           <none>
node-resolver-qnm92   1/1     Running   0          2m24s   10.0.2.5       aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>

$ oc get pod -n project-100 -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
tools-6749b4cf47-gmw9v   1/1     Running   0          50s   10.131.0.54   aro-cluster-h78zv-h94mh-worker-eastus1-99l7n   <none>           <none>

$ oc rsh -n project-100 tools-6749b4cf47-gmw9v
sh-4.4$ cat /etc/resolv.conf 
search project-100.svc.cluster.local svc.cluster.local cluster.local khrmlwa2zp4e1oisi1qjtoxwrc.bx.internal.cloudapp.net
nameserver 172.30.0.10
options ndots:5 use-vc

sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL)

$ oc logs dns-default-lt6zx
Defaulted container "dns" out of: dns, kube-rbac-proxy
.:5353
hostname.bind.:5353
example.xyz.:5353
[INFO] plugin/reload: Running configuration SHA512 = 79d17b9fc0f61d2c6db13a0f7f3d0a873c4d86ab5cba90c3819a5b57a48fac2ef0fb644b55e959984cd51377bff0db04f399a341a584c466e540a0d7501340f7
CoreDNS-1.10.1
linux/amd64, go1.19.13 X:strictfipsruntime, 
[INFO] 10.131.0.40:51367 - 22867 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.00024781s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
[INFO] 10.131.0.40:51367 - 22867 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.00096551s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
[INFO] 10.131.0.54:44935 - 3087 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.000619524s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
[INFO] 10.131.0.54:44935 - 3087 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.000369584s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.13, 4.14, 4.15

How reproducible:

Always

Steps to Reproduce:

1. Run "host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net" inside a pod

Actual results:

The dns-default pod reports the following error when the query is run.

[INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.001868103s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
[INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.003223099s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size

The command "host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net" also fails.

sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL)

Expected results:

No error is reported in the dns-default pod, and the query returns the expected result.

Additional info:

I suspect that https://github.com/coredns/coredns/issues/5953 and https://github.com/coredns/coredns/pull/6277 are related. I therefore built CoreDNS from the master branch and created quay.io/rhn_support_sreber/coredns:latest. When that image is run in the dns-default pod, the host query resolves again.
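
The release note for this issue describes the fix as truncating the response instead of returning SERVFAIL, so that the client retries over TCP. The following Python is a simplified model of that difference from the client's point of view. It is not the CoreDNS code: the function names, the 600-byte response size and the retry logic are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

NOERROR = 0
SERVFAIL = 2
TCP_LIMIT = 65535  # DNS over TCP uses a 2-byte length prefix


@dataclass
class Response:
    rcode: int = NOERROR
    truncated: bool = False  # models the TC bit
    answers: List[str] = field(default_factory=list)


def forward_before_fix(upstream_size: int, limit: int,
                       answers: List[str]) -> Response:
    """Model of the old behavior: an oversized upstream reply becomes SERVFAIL."""
    if upstream_size > limit:
        return Response(rcode=SERVFAIL)
    return Response(answers=list(answers))


def forward_after_fix(upstream_size: int, limit: int,
                      answers: List[str]) -> Response:
    """Model of the fixed behavior: an oversized reply is returned truncated."""
    if upstream_size > limit:
        return Response(truncated=True)
    return Response(answers=list(answers))


Handler = Callable[[int, int, List[str]], Response]


def client_resolve(handler: Handler, upstream_size: int,
                   answers: List[str], udp_limit: int = 512) -> Response:
    """Stub-resolver model: ask over UDP first, retry over TCP when TC is set."""
    resp = handler(upstream_size, udp_limit, answers)
    if resp.truncated:
        # Over TCP the 512-byte UDP limit no longer applies.
        resp = handler(upstream_size, TCP_LIMIT, answers)
    return resp


if __name__ == "__main__":
    records = ["0 0 1032 x1-9-foobar.bla.example.net."]
    for handler in (forward_before_fix, forward_after_fix):
        result = client_resolve(handler, 600, records)
        print(handler.__name__, result.rcode, result.answers)
```

In this model the client never gets a chance to retry before the fix, because SERVFAIL carries no hint that TCP would succeed; with the truncated reply it retries and receives the records.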

blocks

OCPBUGS-28205 SRV lookup is failing after OpenShift Container Platform 4.13 update because of CoreDNS version 1.10.1

Closed

clones

OCPBUGS-27904 SRV lookup is failing after OpenShift Container Platform 4.13 update because of CoreDNS version 1.10.1

Closed

is blocked by

OCPBUGS-27904 SRV lookup is failing after OpenShift Container Platform 4.13 update because of CoreDNS version 1.10.1

Closed

is cloned by

OCPBUGS-28205 SRV lookup is failing after OpenShift Container Platform 4.13 update because of CoreDNS version 1.10.1

Closed

links to

openshift/coredns#114: [release-4.14] OCPBUGS-28200: UPSTREAM: 6277: openshift: Fix OCPBUGS-28200

RHSA-2024:0642 OpenShift Container Platform 4.14.11 bug fix and security update


Assignee: Grant Spence (Inactive)

Reporter: Simon Reber

QA Contact: Melvin Joseph

Votes: 0

Watchers: 7

Created: 2024/01/25 2:57 PM

Updated: 2024/08/27 1:49 PM

Resolved: 2024/02/07 5:37 PM
