OpenShift Bugs / OCPBUGS-18654

kube-apiserver was flooded by localhost DNS lookup connection error


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version: 4.12.z
    • Component: Networking / DNS
    • Severity: Important
    • Sprint: Sprint 244, Sprint 245, Sprint 246

      Description of problem:

      All the static pods (kube-apiserver/KCM/etcd) in the cluster are getting restarted frequently.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Not reproduced yet

      Steps to Reproduce:

      Below are the observations.
      
      1. The etcd pods are restarting frequently due to liveness probe failures:
      
      ~~~
      Sep 04 23:12:00 jprocpuatmst03.ocpcorpuat.icicibankltd.com kubenswrapper[2140]: I0904 23:12:00.536323    2140 patch_prober.go:29] interesting pod/xxxx container/etcd namespace/openshift-etcd: Liveness probe status=failure output="HTTP probe failed with statuscode: 503" start-of-body=failed to establish etcd client: giving up getting a cached client after 3 tries
      Sep 04 23:12:00 xxxx kubenswrapper[2140]: I0904 23:12:00.536391    2140 prober.go:114] "Probe failed" probeType="Liveness" pod="openshift-etcd/xxxx" podUID=8e435d267e850090d4a8e69c9f51b48e containerName="etcd" probeResult=failure output="HTTP probe failed with statuscode: 503"
      Sep 04 23:12:32 xxxx kubenswrapper[2140]: I0904 23:12:32.734253    2140 patch_prober.go:29] interesting pod/etcd-xxxx container/etcd namespace/openshift-etcd: Liveness probe status=failure output="HTTP probe failed with statuscode: 503" start-of-body=failed to establish etcd client: giving up getting a cached client after 3 tries
      Sep 04 23:12:32 xxxx kubenswrapper[2140]: I0904 23:12:32.734339    2140 prober.go:114] "Probe failed" probeType="Liveness" pod="openshift-etcd/xxxx" podUID=8e435d267e850090d4a8e69c9f51b48e containerName="etcd" probeResult=failure output="HTTP probe failed with statuscode: 503"
      ~~~
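      
      A quick way to confirm the probe failures against current etcd member health (a sketch only, assuming cluster-admin access; the pod name is a placeholder for one of the affected masters):
      
      ~~~
      # List the etcd static pods and their restart counts
      oc get pods -n openshift-etcd -l app=etcd -o wide
      
      # Check member health from inside one of the etcd pods
      oc rsh -n openshift-etcd -c etcdctl etcd-<master-node-name> \
          etcdctl endpoint health --cluster -w table
      ~~~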
      
      2. The kube-apiserver logs are flooded with the following error:
      
      ~~~
      }. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup localhost on 10.51.1.57:53: server misbehaving"
      W0905 08:18:39.138994      18 logging.go:59] [core] [Channel #1300 SubChannel #1304] grpc: addrConn.createTransport failed to connect to {
        "Addr": "localhost:2379",
        "ServerName": "localhost",
        "Attributes": null,
        "BalancerAttributes": null,
        "Type": 0,
        "Metadata": null
      ~~~
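      
      To confirm where the lookup for "localhost" is actually going, the hosts file and resolver behaviour can be checked on each master; a minimal sketch, assuming `oc debug` access to the nodes (`<master-node-name>` is a placeholder):
      
      ~~~
      # Run on each master: print any localhost entries in /etc/hosts
      # and show how the node's resolver currently answers for "localhost"
      oc debug node/<master-node-name> -- chroot /host sh -c '
          grep -n "localhost" /etc/hosts || echo "no localhost entry in /etc/hosts";
          getent hosts localhost
      '
      ~~~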
      
      3. Since the localhost lookup was going to the DNS server, we checked the /etc/hosts file on all three master nodes and found that the localhost entry was missing on two of them.
      
      4. We manually added the localhost entry below on the affected master nodes and did a forceful redeployment of kube-apiserver and etcd. This stabilized the cluster: the static pods are no longer restarting and the DNS lookup errors are gone from the kube-apiserver logs.
      
      ~~~
      127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
      ~~~ 
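      
      For reference, the workaround roughly corresponds to the following (a sketch, assuming the entry is restored on the affected node itself and that the operators' force-redeployment fields are used; the reason string is arbitrary but must be unique per run):
      
      ~~~
      # On each affected master (via SSH or `oc debug node/... -- chroot /host`):
      echo '127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4' >> /etc/hosts
      
      # Then force new static pod revisions for etcd and kube-apiserver:
      oc patch etcd cluster --type=merge \
          -p '{"spec":{"forceRedeploymentReason":"restore-localhost-entry-1"}}'
      oc patch kubeapiserver cluster --type=merge \
          -p '{"spec":{"forceRedeploymentReason":"restore-localhost-entry-1"}}'
      ~~~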
      
      As the cluster stabilized after the localhost entry was added, we suspect this missing entry caused the issue.
      
      We need help with the following points:
      
      1. How did the localhost entry disappear from two master nodes?
      2. Is this kind of issue expected when the localhost entry is missing?
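      
      For point 1, a few checks that might help narrow down when and how the entry was lost (a sketch only; it assumes RHEL/RHCOS-based nodes, where /etc/hosts is owned by the "setup" package, and that no custom automation is known to manage the file):
      
      ~~~
      # On an affected node (via SSH or `oc debug node/... -- chroot /host`):
      rpm -V setup       # lists /etc/hosts if it differs from the package default
      stat /etc/hosts    # shows when the file was last modified/changed
      
      # From a cluster-admin session: look for MachineConfigs or other
      # automation that render /etc/hosts
      oc get machineconfig -o yaml | grep -n -i "etc/hosts"
      ~~~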

      Actual results:

      The localhost entry is missing from /etc/hosts on two of the master nodes.

      Expected results:

      The localhost entry should be present in /etc/hosts on every node.

      Additional info:

       

            Assignee: Miheer Salunke (rhn-support-misalunk)
            Reporter: MUHAMMED ASLAM V K (rhn-support-amuhamme)
            Votes: 1
            Watchers: 11
