OpenShift Bugs / OCPBUGS-2602

Intermittent DNS Resolution Errors After Upgrading to 4.10.25


      While exercising cluster autoscaling, we continuously queried a URL from a pod and, at the same time, ran custom watches on the nodes and on the dns-default endpoints.

      When nodes scale down, a DNS query can fail in the few seconds before an event marks the node as Not Ready and before the node's dns-default address is removed from the endpoints list.

      We wrote a simple watcher (see below for details) to log this and captured the following events:

      DNS lookup failure:

      Tue Oct 18 12:33:23 UTC 2022 - Lookup success
      Tue Oct 18 12:33:28 UTC 2022 - DNS failure
      Tue Oct 18 12:33:41 UTC 2022 - Lookup success
      

      The node was not yet NotReady and the endpoint was still in the endpoints list at that time (ntrdy indicates a not-ready address):

      2022-10-18 12:33:21.712180649 +0000 UTC m=+1047.610174444 - ip-10-0-137-100.ec2.internal -  MemoryPressure - False, DiskPressure - False, PIDPressure - False, Ready - True,
      2022-10-18 12:33:39.11806612 +0000 UTC m=+1065.016059955 - ip-10-0-129-193.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:33:39.525574893 +0000 UTC m=+1065.423568712 - dns-default rdy: 10.128.0.2 rdy: 10.128.10.4 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.130.8.4 rdy: 10.131.0.3 ntrdy: 10.131.8.4
      2022-10-18 12:33:39.526424974 +0000 UTC m=+1065.424418833 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.130.8.4 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.131.8.4
      2022-10-18 12:33:39.528532869 +0000 UTC m=+1065.426526744 - ip-10-0-129-193.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:33:39.729859144 +0000 UTC m=+1065.627852917 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:33:39.936928994 +0000 UTC m=+1065.834922825 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:33:44.749587947 +0000 UTC m=+1070.647581767 - ip-10-0-188-175.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:33:44.952196646 +0000 UTC m=+1070.850190469 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.130.8.4 ntrdy: 10.131.8.4
      2022-10-18 12:33:44.954865089 +0000 UTC m=+1070.852858965 - ip-10-0-188-175.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:33:45.159460169 +0000 UTC m=+1071.057454007 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:33:48.641412229 +0000 UTC m=+1074.539406059 - ip-10-0-188-175.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:33:48.846438064 +0000 UTC m=+1074.744431900 - ip-10-0-129-193.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:33:54.068542745 +0000 UTC m=+1079.966536563 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
      2022-10-18 12:34:31.752294563 +0000 UTC m=+1117.650288381 - ip-10-0-253-198.ec2.internal -  MemoryPressure - False, DiskPressure - False, PIDPressure - False, Ready - True,
      2022-10-18 12:34:39.531848219 +0000 UTC m=+1125.429842032 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.131.8.4
      2022-10-18 12:34:39.736866622 +0000 UTC m=+1125.634860439 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4
      2022-10-18 12:34:39.941934912 +0000 UTC m=+1125.839928742 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3
      

      So we can observe that the node goes into 'Unknown' at 12:33:39 and its dns-default address is marked not ready shortly afterwards, yet the failed lookup happened at 12:33:28, roughly eleven seconds earlier, while the address was still listed as ready.
      We are not sure whether this is a problem in the logic that drains a node or an issue with the autoscaler at this point, but the failure clears itself at the next lookup five seconds later.

      Detailed breakdown of how this was reproduced:
      1. A cluster with autoscaling enabled is required.
      2. Deploy a daemonset that attempts DNS/HTTP lookups in a loop; e.g. the following DaemonSet was used to test this (a rough Go equivalent of the probe is shown after the manifest):

      
      apiVersion: apps/v1
      kind: DaemonSet
      metadata: 
        name: dns-tester
        labels: 
          app: dns-tester
      spec: 
        selector: 
          matchLabels: 
            app: dns-tester
        template: 
          metadata: 
            labels: 
              app: dns-tester
          spec: 
            containers: 
            - name: dns-tester
              image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:72f2f7e906c321da6d6a00ce610780e8766e8432f7c553c5d03492f65fe5416c
              command: ["/bin/sh", "-c"]
              args: ['while true; do CURL=$(curl redhat.com 2>&1); if [[ "$CURL" == *"not resolve"* ]]; then echo `date` - "DNS failure"; else echo `date` - "Lookup success"; fi;  sleep 5; done']
              resources: 
                limits: 
                  cpu: 100m
                  memory: 200Mi 
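      
      For reference, here is a rough Go equivalent of the curl loop above. This is a sketch, not part of the original reproducer; the 2-second timeout and the use of the pure-Go resolver are choices made for the sketch.
      
      package main
      
      import (
      	"context"
      	"fmt"
      	"net"
      	"time"
      )
      
      func main() {
      	// Use the pure-Go resolver so lookups read /etc/resolv.conf and
      	// respect the per-query context deadline.
      	resolver := &net.Resolver{PreferGo: true}
      	for {
      		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
      		_, err := resolver.LookupHost(ctx, "redhat.com")
      		cancel()
      		if err != nil {
      			fmt.Printf("%s - DNS failure: %v\n", time.Now().UTC().Format(time.UnixDate), err)
      		} else {
      			fmt.Printf("%s - Lookup success\n", time.Now().UTC().Format(time.UnixDate))
      		}
      		time.Sleep(5 * time.Second)
      	}
      }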
      

      3. Run the following Go program against the same cluster (this is what watches the node events and the dns-default endpoints events shown above):

      package main
      
      import (
      	"context"
      	"fmt"
      	"path/filepath"
      	"sync"
      	"time"
      
      	corev1 "k8s.io/api/core/v1"
      	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      	"k8s.io/client-go/kubernetes"
      	_ "k8s.io/client-go/plugin/pkg/client/auth"
      	"k8s.io/client-go/tools/clientcmd"
      	"k8s.io/client-go/util/homedir"
      )
      
      const dnsNamespace = "openshift-dns"
      const dnsEndpoint = "dns-default"
      
      // nodeWatch logs every node event together with the node's conditions.
      func nodeWatch(clientset *kubernetes.Clientset, waitGroup *sync.WaitGroup) {
      	ctx := context.Background()
      	defer waitGroup.Done()
      
      	var nodes = clientset.CoreV1().Nodes()
      	watcher, err := nodes.Watch(ctx, metav1.ListOptions{})
      	if err != nil {
      		panic(err.Error())
      	}
      	ch := watcher.ResultChan()
      
      	for {
      		event := <-ch
      		node, ok := event.Object.(*corev1.Node)
      		if !ok {
      			fmt.Printf("%v", event)
      			panic("Could not cast to nodes")
      		}
      		fmt.Printf("%v - %s - ", time.Now(), node.Name)
      		for _, condition := range node.Status.Conditions {
      			fmt.Printf(" %v - %v,", condition.Type, condition.Status)
      		}
      		fmt.Println()
      	}
      }
      
      // dnsWatch logs the ready (rdy) and not-ready (ntrdy) addresses of every
      // Endpoints object in the openshift-dns namespace whenever they change.
      func dnsWatch(clientset *kubernetes.Clientset, waitGroup *sync.WaitGroup) {
      	ctx := context.Background()
      	defer waitGroup.Done()
      	var api = clientset.CoreV1().Endpoints(dnsNamespace)
      	watcher, err := api.Watch(ctx, metav1.ListOptions{})
      	if err != nil {
      		panic(err.Error())
      	}
      	ch := watcher.ResultChan()
      	for {
      		event := <-ch
      		endpoints, ok := event.Object.(*corev1.Endpoints)
      		if !ok {
      			fmt.Printf("%v", event)
      			panic("Could not cast to Endpoint")
      		}
      		fmt.Printf("%v - %v", time.Now(), endpoints.ObjectMeta.Name)
      		for _, endpoint := range endpoints.Subsets {
      			for _, address := range endpoint.Addresses {
      				fmt.Printf(" rdy: %v", address.IP)
      			}
      			for _, address := range endpoint.NotReadyAddresses {
      				fmt.Printf(" ntrdy: %v", address.IP)
      			}
      		}
      		fmt.Println()
      	}
      }
      
      func main() {
      	// AUTHENTICATE
      	var home = homedir.HomeDir()
      	var kubeconfig = filepath.Join(home, ".kube", "config")
      	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
      	if err != nil {
      		panic(err.Error())
      	}
      	clientset, err := kubernetes.NewForConfig(config)
      	if err != nil {
      		panic(err.Error())
      	}
      	wg := sync.WaitGroup{}
      	wg.Add(2)
      	go dnsWatch(clientset, &wg)
      	go nodeWatch(clientset, &wg)
      	wg.Wait()
      }
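      
      One possible refinement of the watcher above, sketched here and not part of the program we actually ran: the endpoints watch can be narrowed to the dns-default object itself with a field selector, so updates to unrelated Endpoints objects in openshift-dns are never delivered. watchDNSDefault is a hypothetical helper name; it reuses the dnsNamespace and dnsEndpoint constants and additionally needs the k8s.io/apimachinery/pkg/fields and k8s.io/apimachinery/pkg/watch imports.
      
      // Sketch only: watch just the dns-default Endpoints object.
      func watchDNSDefault(ctx context.Context, clientset *kubernetes.Clientset) (watch.Interface, error) {
      	return clientset.CoreV1().Endpoints(dnsNamespace).Watch(ctx, metav1.ListOptions{
      		FieldSelector: fields.OneTermEqualSelector("metadata.name", dnsEndpoint).String(),
      	})
      }
      
      The event loop from dnsWatch can consume the channel from the returned watcher unchanged.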
      

      4. Create simulated pressure on the nodes to force a scale-up. E.g. use the following Deployment; the pod anti-affinity keeps the replicas on separate nodes so that the large resource requests cannot be satisfied by the existing capacity:

      apiVersion: apps/v1
      kind: Deployment
      metadata: 
        name: resource-eater
      spec: 
        replicas: 4
        selector: 
          matchLabels: 
            app: resource-eater
        template: 
          metadata: 
            labels: 
              app: resource-eater
          spec: 
            containers: 
            - name: resource-eater
              image: busybox:latest
              command: ["/bin/sh", "-c"]
              args: ["sleep 3600"]
              resources: 
                requests: 
                  memory: "8Gi"
                  cpu: "1000m"
            affinity: 
              podAntiAffinity: 
                requiredDuringSchedulingIgnoredDuringExecution: 
                - labelSelector: 
                    matchExpressions: 
                    - key: app
                      operator: In
                      values: 
                      - resource-eater
                  topologyKey: "kubernetes.io/hostname" 
      

      5. Wait for the scale-up to happen.
      6. Delete the deployment that created the node pressure so that the scale-down can happen (this can easily take 15 minutes).
      7. Observe the events in the watcher program and the logs of the DaemonSet; they should show the same behavior as detailed above.

      We found a few previously filed bugs that seem related to this issue, affecting clusters on 4.8 through 4.10:

      https://issues.redhat.com/browse/OCPBUGS-647
      https://issues.redhat.com/browse/OCPBUGS-488
      https://bugzilla.redhat.com/show_bug.cgi?id=2061244
      

      Using the above steps, we have been able to reliably reproduce DNS failures during autoscaler scale-down on 4.10 clusters.
