- Bug
- Resolution: Duplicate
- Major
- None
- 4.13.0
- Important
- No
- Rejected
- False
I'm running a large-scale setup with 130 nodes.
After applying a KubeletConfig (which triggered reboots), some nodes came back with their FQDN as the node name. In oc get node, instead of:
worker009
the node came back as:
worker009.test495.myocp4.com
These nodes now fail to rejoin the cluster, for example:
worker052 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker055 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker076 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker080 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker087 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker088 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker100 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker103 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker109 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker125 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
worker128 NotReady,SchedulingDisabled worker 21h v1.26.5+0001a21
If we log in as core to worker128, for example, and run:
journalctl -u kubelet.service --no-pager
we can see the reason:
Jun 07 10:26:18 worker128 kubenswrapper[7122]: E0607 10:26:18.247745 7122 kubelet_node_status.go:94] "Unable to register node with API server" err="nodes \"worker128.test495.myocp4.com\" is forbidden: node \"worker128\" is not allowed to modify node \"worker128.test495.myocp4.com\"" node="worker128.test495.myocp4.com"
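The error above names two identities: the name the kubelet is authenticated as (worker128) and the name it is trying to register (worker128.test495.myocp4.com). A minimal sketch for pulling that pair out of kubelet journal lines, so the mismatch can be checked on other affected nodes. The function name extract_name_mismatch is hypothetical, and the pattern assumes the message format is exactly the one shown in the log line above:

```shell
# Hypothetical helper: reads kubelet journal lines on stdin and prints
# "<authenticated name> <requested name>" for every line containing the
# "is not allowed to modify node" error in the format shown above.
extract_name_mismatch() {
  # \\" matches the literal backslash-quote sequence present in the log text.
  sed -n 's/.*node \\"\([A-Za-z0-9.-]*\)\\" is not allowed to modify node \\"\([A-Za-z0-9.-]*\)\\".*/\1 \2/p'
}
```

On a node, it could be fed from journalctl -u kubelet.service --no-pager through a pipe. Lines that do not contain the error produce no output.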
Worse, some nodes did register under the FQDN, so I now have duplicate entries that refer to the same machines:
worker029 NotReady,SchedulingDisabled worker 17h v1.26.5+0001a21
worker029.test495.myocp4.com Ready worker 120m v1.26.5+0001a21
worker030 NotReady,SchedulingDisabled worker 17h v1.26.5+0001a21
worker030.test495.myocp4.com Ready worker 121m v1.26.5+0001a21
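With 130 nodes, spotting these pairs by eye is error-prone. A rough sketch that lists every short name that also appears with a domain suffix, assuming input in the same column layout as the listing above (node name in the first column). The function name find_duplicate_nodes is hypothetical:

```shell
# Hypothetical helper: reads node listing lines on stdin (node name in the
# first column) and prints "<short name> <fqdn>" for every short name that
# is also registered under a name with a domain suffix.
find_duplicate_nodes() {
  awk '
    { name = $1; short = name; sub(/\..*/, "", short) }  # strip from first dot
    name == short { seen_short[short] = 1 }              # registered as short name
    name != short { seen_fqdn[short] = name }            # registered as FQDN
    END {
      for (s in seen_fqdn)
        if (s in seen_short) print s, seen_fqdn[s]
    }
  ' | sort
}
```

It could be fed from oc get nodes --no-headers through a pipe. Nodes that appear under only one name are not printed.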
So there are two issues here:
1. Why did the nodes return from reboot with an FQDN as the node name?
2. Is this allowed or not? If it is, why does it create duplicate nodes instead of updating the existing node name?
I collected must-gather logs which are available at:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/rebooted_nodes_return_with_fqdns.tar.gz
Note that I have done this multiple times on 4.12 and never saw this issue, though I can't say for sure whether this is a regression.
- relates to
- OCPBUGS-14918 nodes showing duplicate with oc get nodes
- Closed