- Bug
- Resolution: Done
- Critical
- None
- 4.12.z
- None
- Important
- No
- 1
- Sprint 244, Sprint 245, Sprint 246
- 3
- Rejected
- False
Description of problem:
All the static pods (kube-apiserver, KCM, and etcd) in the cluster are getting restarted frequently.
Version-Release number of selected component (if applicable):
How reproducible:
Not reproduced yet
Steps to Reproduce:
Below are the observations.

1. The etcd pods are getting restarted frequently due to liveness probe failures:
~~~
Sep 04 23:12:00 jprocpuatmst03.ocpcorpuat.icicibankltd.com kubenswrapper[2140]: I0904 23:12:00.536323 2140 patch_prober.go:29] interesting pod/xxxx container/etcd namespace/openshift-etcd: Liveness probe status=failure output="HTTP probe failed with statuscode: 503" start-of-body=failed to establish etcd client: giving up getting a cached client after 3 tries
Sep 04 23:12:00 xxxx kubenswrapper[2140]: I0904 23:12:00.536391 2140 prober.go:114] "Probe failed" probeType="Liveness" pod="openshift-etcd/xxxx" podUID=8e435d267e850090d4a8e69c9f51b48e containerName="etcd" probeResult=failure output="HTTP probe failed with statuscode: 503"
Sep 04 23:12:32 xxxx kubenswrapper[2140]: I0904 23:12:32.734253 2140 patch_prober.go:29] interesting pod/etcd-xxxx container/etcd namespace/openshift-etcd: Liveness probe status=failure output="HTTP probe failed with statuscode: 503" start-of-body=failed to establish etcd client: giving up getting a cached client after 3 tries
Sep 04 23:12:32 xxxx kubenswrapper[2140]: I0904 23:12:32.734339 2140 prober.go:114] "Probe failed" probeType="Liveness" pod="openshift-etcd/xxxx" podUID=8e435d267e850090d4a8e69c9f51b48e containerName="etcd" probeResult=failure output="HTTP probe failed with statuscode: 503"
~~~

2. The kube-apiserver logs are flooded with the error below:
~~~
}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup localhost on 10.51.1.57:53: server misbehaving"
W0905 08:18:39.138994 18 logging.go:59] [core] [Channel #1300 SubChannel #1304] grpc: addrConn.createTransport failed to connect to {
  "Addr": "localhost:2379",
  "ServerName": "localhost",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
~~~

3. Since the localhost lookup was going through the DNS server, we verified the /etc/hosts file on all three master nodes and found the localhost entry missing on two of them.

4. We manually added the localhost entry below to the affected master nodes and performed a forced redeployment of kube-apiserver and etcd. This stabilized the cluster: the static pods stopped restarting, and the DNS lookup errors disappeared from the kube-apiserver logs.
~~~
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
~~~

Since the cluster stabilized after the localhost entry was added, we suspect the issue was caused by this missing entry. We need help with the following points:
1. How did the localhost entry disappear from two master nodes?
2. Should issues of this kind be expected when the localhost entry is missing?
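The workaround described in point 4 can be sketched as a small shell helper. This is an illustrative sketch, not tooling from the report: the function names `check_localhost_entry` and `fix_localhost_entry` are hypothetical, and the file argument defaults to `/etc/hosts` as on the affected nodes.

```shell
# Hypothetical helper: succeed only if the given hosts file maps
# 127.0.0.1 to "localhost" (defaults to /etc/hosts).
check_localhost_entry() {
  grep -q '^127\.0\.0\.1[[:space:]].*localhost' "${1:-/etc/hosts}"
}

# Append the canonical localhost line only when the check fails,
# mirroring the manual fix applied to the two master nodes.
fix_localhost_entry() {
  check_localhost_entry "$1" && return 0
  echo '127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4' >> "$1"
}
```

After restoring the entry on each affected master, the static pods still need to be rolled; the report did this with a forced redeployment of etcd and kube-apiserver (for example via the documented `forceRedeploymentReason` patch on the `etcd` and `kubeapiserver` cluster resources).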
Actual results:
The localhost entry is missing from /etc/hosts on two master nodes.
Expected results:
The localhost entry should be present in /etc/hosts.
Additional info:
- is caused by: OCPBUGS-19933 "cluster-dns-operator corrupts /etc/hosts when fs full" (Closed)