OCPBUGS-4269

The cluster-autoscaler.kubernetes.io/enable-ds-eviction annotation is missing from DNS pods and can cause disruption for tainted nodes during node scale-down



      Description of problem:

      The DNS operator does not set the cluster-autoscaler.kubernetes.io/enable-ds-eviction annotation on DNS pods, and so the cluster autoscaler does not evict the DNS pod from a node before removing the node.

      Generally, every node in a cluster has a local DNS pod, and pods on that node use that local DNS pod for DNS lookups. However, if a node is tainted, it may not have a local DNS pod, and then lookups on that node may go to the DNS pod on a node that the cluster autoscaler is in the process of shutting down.

      If DNS queries go to a node that the cluster autoscaler is in the process of shutting down, then those DNS queries may be dropped, causing intermittent DNS lookup failures.
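
      For reference, the annotation in question would need to appear in each DNS pod's metadata. The following is an illustrative sketch only (the pod name is a placeholder), showing roughly what the cluster autoscaler looks for:

      # Illustrative sketch: the annotation the cluster autoscaler checks before
      # evicting DaemonSet pods during scale-down; the pod name is a placeholder.
      apiVersion: v1
      kind: Pod
      metadata:
        name: dns-default-xxxxx
        namespace: openshift-dns
        annotations:
          cluster-autoscaler.kubernetes.io/enable-ds-eviction: "true"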

      Version-Release number of selected component (if applicable):

      4.8, 4.9, 4.10, and 4.11. OpenShift 4.7 configures DNS pods with a toleration for all taints, meaning that every node has a local DNS pod, which generally prevents unavailability of a DNS pod on one node from affecting DNS lookups on another node.
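
      For comparison, a toleration that matches every taint (as the 4.7 DNS pods effectively had) looks roughly like the following sketch; the exact 4.7 manifest may differ:

      # Sketch of a catch-all toleration: with operator Exists and no key,
      # the pod tolerates all taints, so every node keeps a local DNS pod.
      tolerations:
      - operator: Exists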

      How reproducible:

      Always.

      Steps to Reproduce:

      The simplest way to reproduce the bug is to check whether the "dns-default" pods have the cluster-autoscaler.kubernetes.io/enable-ds-eviction annotation:
      1. Check the annotations on the "dns-default" pods: oc -n openshift-dns get pods -l dns.operator.openshift.io/daemonset-dns=default -o 'jsonpath={.items..metadata.annotations.cluster-autoscaler\.kubernetes\.io/enable-ds-eviction}'

      Reproducing the DNS lookup failures is more involved; I believe the following steps should suffice:
      1. Taint some node: oc adm taint nodes example-node foo:NoExecute
      2. Verify that there is no "dns-default" pod on the tainted node (if the pod is still there, delete it and verify that it isn't recreated): oc -n openshift-dns get pods -l dns.operator.openshift.io/daemonset-dns=default -o wide | grep -e example-node
      3. In an arbitrary pod that uses the default DNS configuration and runs on the tainted node (see the example pod manifest after these steps), run DNS queries in a loop: while :; do dig kubernetes.default.svc.cluster.local. A; sleep 1; done
      4. Watch the "dns-default" endpoints object: oc -n openshift-dns get endpoints/dns-default -o yaml -w
      5. Trigger scale-down of an untainted node.
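
      Note for step 3: the test pod itself must tolerate the taint from step 1 in order to run on the tainted node. A minimal sketch of such a pod follows; the pod name and image are placeholders (any image that provides dig will do), and the taint key and node name match the examples above:

      # Hypothetical test pod for step 3: tolerates the "foo" NoExecute taint,
      # is pinned to the tainted node, and runs the dig loop with the default
      # (ClusterFirst) DNS policy.
      apiVersion: v1
      kind: Pod
      metadata:
        name: dns-test
      spec:
        nodeName: example-node
        tolerations:
        - key: foo
          operator: Exists
          effect: NoExecute
        containers:
        - name: dig
          image: registry.example.com/tools:latest    # placeholder; any image with dig
          command: ["sh", "-c", "while :; do dig kubernetes.default.svc.cluster.local. A; sleep 1; done"]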

      Actual results:

      The "dns-default" pods don't have the cluster-autoscaler.kubernetes.io/enable-ds-eviction annotation, and the oc command prints no output.

      When the cluster autoscaler scales down a node, the "dns-default" endpoints object isn't immediately updated to reflect that the DNS pod on that node is unavailable, and DNS lookups in the pod on the tainted node may fail intermittently as the lookups round-robin through the endpoints for DNS pods on other nodes, including the node that is shutting down.
      Expected results:

      The DNS operator should set the cluster-autoscaler.kubernetes.io/enable-ds-eviction annotation on "dns-default" pods, and the oc command should print "true" for each "dns-default" pod:
      
      

      % oc -n openshift-dns get pods -l dns.operator.openshift.io/daemonset-dns=default -o 'jsonpath={.items..metadata.annotations.cluster-autoscaler\.kubernetes\.io/enable-ds-eviction}'
      true true true true true true
      %

      
      

      The "dns-default" endpoints should be updated to reflect that the DNS pod on the node being shut down is unavailable before the pod becomes unresponsive. DNS lookups on the tainted node should succeed consistently.

      Additional info:

      This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2061244 to facilitate backports.

      Note that the original bug fix caused OCPBUGS-753, so backports of the fix for this bug (openshift/cluster-dns-operator#320) should incorporate the fix for that bug (openshift/cluster-dns-operator#340).
