OpenShift Bugs / OCPBUGS-18654

kube-apiserver was flooded by localhost DNS lookup connection error


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version: 4.12.z
    • Component: Networking / DNS
    • Severity: Important
    • Sprint: Sprint 244, Sprint 245, Sprint 246

      Description of problem:

      All the static pods (kube-apiserver/KCM/etcd) in the cluster are getting restarted frequently.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Not reproduced yet

      Steps to Reproduce:

      Below are the observations.
      
      1. The etcd pods are restarting frequently due to liveness probe failures:
      
      ~~~
      Sep 04 23:12:00 jprocpuatmst03.ocpcorpuat.icicibankltd.com kubenswrapper[2140]: I0904 23:12:00.536323    2140 patch_prober.go:29] interesting pod/xxxx container/etcd namespace/openshift-etcd: Liveness probe status=failure output="HTTP probe failed with statuscode: 503" start-of-body=failed to establish etcd client: giving up getting a cached client after 3 tries
      Sep 04 23:12:00 xxxx kubenswrapper[2140]: I0904 23:12:00.536391    2140 prober.go:114] "Probe failed" probeType="Liveness" pod="openshift-etcd/xxxx" podUID=8e435d267e850090d4a8e69c9f51b48e containerName="etcd" probeResult=failure output="HTTP probe failed with statuscode: 503"
      Sep 04 23:12:32 xxxx kubenswrapper[2140]: I0904 23:12:32.734253    2140 patch_prober.go:29] interesting pod/etcd-xxxx container/etcd namespace/openshift-etcd: Liveness probe status=failure output="HTTP probe failed with statuscode: 503" start-of-body=failed to establish etcd client: giving up getting a cached client after 3 tries
      Sep 04 23:12:32 xxxx kubenswrapper[2140]: I0904 23:12:32.734339    2140 prober.go:114] "Probe failed" probeType="Liveness" pod="openshift-etcd/xxxx" podUID=8e435d267e850090d4a8e69c9f51b48e containerName="etcd" probeResult=failure output="HTTP probe failed with statuscode: 503"
      ~~~
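      
      A quick way to confirm the probe failures against current etcd member health (a sketch only, assuming cluster-admin access; the pod name is a placeholder for one of the affected masters):
      
      ~~~
      # List the etcd static pods and their restart counts
      oc get pods -n openshift-etcd -l app=etcd -o wide
      
      # Check member health from inside one of the etcd pods
      oc rsh -n openshift-etcd -c etcdctl etcd-<master-node-name> \
          etcdctl endpoint health --cluster -w table
      ~~~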
      
      2. The kube-apiserver logs are flooded with the following error:
      
      ~~~
      }. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup localhost on 10.51.1.57:53: server misbehaving"
      W0905 08:18:39.138994      18 logging.go:59] [core] [Channel #1300 SubChannel #1304] grpc: addrConn.createTransport failed to connect to {
        "Addr": "localhost:2379",
        "ServerName": "localhost",
        "Attributes": null,
        "BalancerAttributes": null,
        "Type": 0,
        "Metadata": null
      ~~~
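      
      To confirm where the lookup for "localhost" is actually going, the hosts file and resolver behaviour can be checked on each master; a minimal sketch, assuming `oc debug` access to the nodes (`<master-node-name>` is a placeholder):
      
      ~~~
      # Run on each master: print any localhost entries in /etc/hosts
      # and show how the node's resolver currently answers for "localhost"
      oc debug node/<master-node-name> -- chroot /host sh -c '
          grep -n "localhost" /etc/hosts || echo "no localhost entry in /etc/hosts";
          getent hosts localhost
      '
      ~~~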
      
      3. Since the localhost lookup was going to the DNS server, we checked the /etc/hosts file on all three master nodes and found that the localhost entry was missing on two of them.
      
      4. We manually added the localhost entry below on the affected master nodes and did a forceful redeployment of kube-apiserver and etcd. This stabilized the cluster: the static pods are no longer restarting and the DNS lookup errors are gone from the kube-apiserver logs.
      
      ~~~
      127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
      ~~~ 
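      
      For reference, the workaround roughly corresponds to the following (a sketch, assuming the entry is restored on the affected node itself and that the operators' force-redeployment fields are used; the reason string is arbitrary but must be unique per run):
      
      ~~~
      # On each affected master (via SSH or `oc debug node/... -- chroot /host`):
      echo '127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4' >> /etc/hosts
      
      # Then force new static pod revisions for etcd and kube-apiserver:
      oc patch etcd cluster --type=merge \
          -p '{"spec":{"forceRedeploymentReason":"restore-localhost-entry-1"}}'
      oc patch kubeapiserver cluster --type=merge \
          -p '{"spec":{"forceRedeploymentReason":"restore-localhost-entry-1"}}'
      ~~~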
      
      As the cluster stabilized after the localhost entry was added, we suspect this missing entry caused the issue.
      
      We need help with the following points:
      
      1. How did the localhost entry disappear from two master nodes?
      2. Is this kind of issue expected when the localhost entry is missing?
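      
      For point 1, a few checks that might help narrow down when and how the entry was lost (a sketch only; it assumes RHEL/RHCOS-based nodes, where /etc/hosts is owned by the "setup" package, and that no custom automation is known to manage the file):
      
      ~~~
      # On an affected node (via SSH or `oc debug node/... -- chroot /host`):
      rpm -V setup       # lists /etc/hosts if it differs from the package default
      stat /etc/hosts    # shows when the file was last modified/changed
      
      # From a cluster-admin session: look for MachineConfigs or other
      # automation that render /etc/hosts
      oc get machineconfig -o yaml | grep -n -i "etc/hosts"
      ~~~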

      Actual results:

      The localhost entry is missing from /etc/hosts on two of the master nodes.

      Expected results:

      The localhost entry should be present in /etc/hosts on every node.

      Additional info:

       

            Assignee: Miheer Salunke (rhn-support-misalunk)
            Reporter: MUHAMMED ASLAM V K (rhn-support-amuhamme)
            Votes: 1
            Watchers: 11
