Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: 4.13.0
Component/s: Networking / On-Prem Host Networking
Labels:
- blue
- triaged

Severity: Important
Regression: No
Release Blocker: Rejected
Blocked: False
Blocked Reason: None
Target Version: 4.14.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

I'm running a large-scale setup with 130 nodes.
After applying a KubeletConfig (which triggered reboots), some nodes came back with their FQDN as the node name. In the output of oc get node, instead of:

worker009

the node came back as:

worker009.test495.myocp4.com

These nodes now cannot rejoin the cluster. For example:

worker052                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker055                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker076                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker080                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker087                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker088                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker100                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker103                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker109                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker125                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
worker128                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21

If we log in as core to worker128, for example, and run:

journalctl -u kubelet.service --no-pager

we can see the reason:

Jun 07 10:26:18 worker128 kubenswrapper[7122]: E0607 10:26:18.247745    7122 kubelet_node_status.go:94] "Unable to register node with API server" err="nodes \"worker128.test495.myocp4.com\" is forbidden: node \"worker128\" is not allowed to modify node \"worker128.test495.myocp4.com\"" node="worker128.test495.myocp4.com"
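
With many affected nodes, the rejection in the log line above can be pulled apart mechanically instead of read by eye. The following is a minimal sketch, assuming the message wording quoted in this report; `parse_register_error` and `is_fqdn_mismatch` are hypothetical helper names, not part of any OpenShift or Kubernetes tooling.

```python
import re

# Matches the kubelet message quoted in this report. The exact wording is an
# assumption taken from that single log line and may differ in other versions.
# Quotes may appear escaped (\") or plain ("), so the backslash is optional.
_FORBIDDEN = re.compile(
    r'node \\?"(?P<identity>[^"\\]+)\\?" is not allowed to modify '
    r'node \\?"(?P<requested>[^"\\]+)\\?"'
)


def parse_register_error(line):
    """Return (identity, requested) node names from a kubelet log line, or None."""
    m = _FORBIDDEN.search(line)
    if m is None:
        return None
    return m.group("identity"), m.group("requested")


def is_fqdn_mismatch(identity, requested):
    """True when the requested name is the identity name plus a domain suffix."""
    return requested != identity and requested.startswith(identity + ".")
```

For the worker128 line, this yields the short name as the identity the kubelet presented and the FQDN as the node object it tried to register, which is the mismatch the API server refuses.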

The worst part: some nodes did manage to register under the FQDN, so each of them now appears twice, once under each name:

worker029                      NotReady,SchedulingDisabled   worker                 17h    v1.26.5+0001a21
worker029.test495.myocp4.com   Ready                         worker                 120m   v1.26.5+0001a21
worker030                      NotReady,SchedulingDisabled   worker                 17h    v1.26.5+0001a21
worker030.test495.myocp4.com   Ready                         worker                 121m   v1.26.5+0001a21
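
On a 130-node cluster, spotting such pairs by hand is tedious. Below is a minimal sketch that groups a node listing by short hostname and reports names that occur more than once. It assumes the column layout shown in this report (name first, status second); `find_duplicate_nodes` is a hypothetical helper, and the listing text is expected to be captured separately, for example from the output of oc get node.

```python
from collections import defaultdict


def find_duplicate_nodes(listing):
    """Group node names by short hostname; return groups with more than one entry.

    `listing` is text in the column layout shown in this report
    (NAME STATUS ROLES AGE VERSION). A header row, if present, is skipped.
    """
    groups = defaultdict(list)
    for row in listing.splitlines():
        cols = row.split()
        if len(cols) < 2 or cols[0] == "NAME":
            continue  # blank line, malformed row, or header
        name, status = cols[0], cols[1]
        short = name.split(".", 1)[0]  # "worker029.test495..." -> "worker029"
        groups[short].append((name, status))
    return {short: entries for short, entries in groups.items() if len(entries) > 1}
```

Each returned entry lists every node object that shares a short hostname together with its status, so a stale NotReady short-name object and its Ready FQDN twin show up side by side.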

So there are two issues here:
1. Why did the nodes come back from the reboot with an FQDN as the node name?
2. Is this allowed or not? If it is, why does it create a duplicate node instead of updating the existing one?

I collected must-gather logs which are available at:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/rebooted_nodes_return_with_fqdns.tar.gz

Note that I have done this multiple times on 4.12 and never saw this issue before, though I can't say for sure that it is not a regression.

relates to

OCPBUGS-14918 nodes showing duplicate with oc get nodes

Closed

Assignee: Mat Kowalski
Reporter: Boaz Ben Shabat
QA Contact: Sunil Choudhary
Votes: 0
Watchers: 9
Created: 2023/06/07 3:34 PM
Updated: 2023/08/29 3:07 PM
Resolved: 2023/08/29 3:07 PM
