  1. OpenShift Bugs
  2. OCPBUGS-14692

Rebooted nodes return with long FQDN names and create duplicate workers


    • Important
    • No
    • Rejected
    • False

      I'm running a large-scale setup with 130 nodes.
      After applying a KubeletConfig (which triggered reboots), some nodes came back with a long FQDN name. That is, in the output of oc get node, instead of:

      worker009

      it came back as:

      worker009.test495.myocp4.com

      Now these nodes will not rejoin the cluster, for example:

      worker052                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker055                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker076                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker080                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker087                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker088                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker100                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker103                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker109                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker125                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21
      worker128                      NotReady,SchedulingDisabled   worker                 21h     v1.26.5+0001a21

      If we log in as core to worker128, for example, and run:

      journalctl -u kubelet.service --no-pager

      we can see the reason:

      Jun 07 10:26:18 worker128 kubenswrapper[7122]: E0607 10:26:18.247745    7122 kubelet_node_status.go:94] "Unable to register node with API server" err="nodes \"worker128.test495.myocp4.com\" is forbidden: node \"worker128\" is not allowed to modify node \"worker128.test495.myocp4.com\"" node="worker128.test495.myocp4.com"
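
The error above names two identities: the node the kubelet is authenticated as (worker128) and the node object it tries to register (worker128.test495.myocp4.com). When checking many journals, a small script can extract both names and confirm that the rejected name is the FQDN form of the same host. This is only a sketch: the regular expression is derived from the single log line quoted in this report, and the function names are hypothetical.

```python
import re

# Pattern built from the one "is forbidden" line quoted in this report (an assumption
# about the general message format). Quotes may appear escaped (\") or plain (").
FORBIDDEN_RE = re.compile(
    r'node \\?"(?P<identity>[^"\\]+)\\?" is not allowed to modify '
    r'node \\?"(?P<target>[^"\\]+)\\?"'
)


def parse_registration_error(line):
    """Return (identity, target) from a kubelet registration error, or None."""
    match = FORBIDDEN_RE.search(line)
    if match is None:
        return None
    return match.group("identity"), match.group("target")


def is_short_vs_fqdn_mismatch(identity, target):
    """True when target is a dotted name whose first label equals identity."""
    return identity != target and target.split(".", 1)[0] == identity
```

For the worker128 line above, parse_registration_error should yield the short name as the identity and the FQDN as the target, and is_short_vs_fqdn_mismatch should report a mismatch for that pair.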

      But now the worst part: some nodes actually made it through, and now I have duplicate entries that are actually the same nodes:

      worker029                      NotReady,SchedulingDisabled   worker                 17h    v1.26.5+0001a21
      worker029.test495.myocp4.com   Ready                         worker                 120m   v1.26.5+0001a21
      worker030                      NotReady,SchedulingDisabled   worker                 17h    v1.26.5+0001a21
      worker030.test495.myocp4.com   Ready                         worker                 121m   v1.26.5+0001a21
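
On a 130-node cluster, duplicates like these are easy to miss by eye. A minimal sketch for spotting them, assuming the text of oc get node is available as a string: it groups node names by their first DNS label and keeps the groups that contain more than one name. The function name is hypothetical, and the grouping rule (same first label means same host) is an assumption that matches the naming seen in this report.

```python
from collections import defaultdict


def find_duplicate_nodes(oc_get_node_output):
    """Group node names by short hostname; return only groups with several names."""
    groups = defaultdict(list)
    for raw_line in oc_get_node_output.splitlines():
        fields = raw_line.split()
        # Skip blank lines and the header row, if present.
        if not fields or fields[0] == "NAME":
            continue
        name = fields[0]
        short_name = name.split(".", 1)[0]  # first DNS label, e.g. "worker029"
        groups[short_name].append(name)
    return {short: names for short, names in groups.items() if len(names) > 1}
```

The input could come from oc get node, with or without the header row; each returned key is a short hostname and each value lists every node object registered under that host.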

      So there are two issues here:
      1. Why did the nodes return from reboot with a long FQDN name?
      2. Is this allowed or not? If it is, why does it create duplicates instead of just updating the existing node name?

      I collected must-gather logs, which are available at:
      http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/rebooted_nodes_return_with_fqdns.tar.gz

      Note that I have done this multiple times on 4.12 and never saw this issue before, though I can't say for sure that it's not a regression.

              mkowalsk@redhat.com Mat Kowalski
              bbenshab Boaz Ben Shabat
              Sunil Choudhary Sunil Choudhary