-
Bug
-
Resolution: Unresolved
-
4.20.z
Description of problem:
On 4.20 it was discovered that kubelet does not allow enough time for NSS DNS lookups to time out and fall back to myhostname. The code at issue is here: https://github.com/openshift/kubernetes/blame/release-4.20/pkg/volume/csi/csi_plugin.go#L384-L404
Although the code says it will wait up to 140s, that is not the case: the function runs the first step without any delay, so there are only 5 delays, adding up to about 23s. A demonstration of this is at https://go.dev/play/p/NEtWKYdF4x-
Additional context: https://redhat-internal.slack.com/archives/C09SCTRBK7Z/p1764239834038539?thread_ts=1764167382.927139&cid=C09SCTRBK7Z
Version-Release number of selected component (if applicable):
4.20.z
How reproducible:
Always
Steps to Reproduce:
1. Create a new worker node whose DNS server will time out
2. Observe that kubelet will never create the Node object
Actual results:
No Node object is created; instead, kubelet continually errors out and is restarted.
Expected results:
Kubelet creates the Node object and the worker joins the cluster successfully.
Additional info:
This is a result of the investigation into OCPBUGS-64883. Currently ARO has declared this an upgrade risk for 4.20.
See also incident Slack channel #itn-2025-00296 https://redhat.enterprise.slack.com/archives/C09SCTRBK7Z
- relates to: OCPBUGS-64883 Workers don't create their Node objects on ARO (New)