OCPBUGS-37289

An issue with an upgrade was reported, with a node stuck in NotReady state due to NetworkManager (NM) issues.

      Description of problem:

          - An issue during upgrade was reported with a node in NotReady state.
          - The node was up and reachable over SSH, but the kubelet was failing.
          - As a workaround, we forced the config onto the node, which fixed the issue.
      However:
          - The issue can be seen on other nodes as well.
          - The "nodeip-configuration.service" fails, which in turn causes the kubelet service to fail (see the diagnostic sketch after this list).

      We gathered a sosreport and see some suspicious NetworkManager messages about IPv6 Duplicate Address Detection failures:

      May 26 03:52:00 <node_hostname> NetworkManager[1330]: <warn>  [1716695520.3453] ipv6ll[cf77b8ea4343619a,ifindex=2]: changed: no IPv6 link local address to retry after Duplicate Address Detection failures (back off)
      May 26 03:52:10 <node_hostname> kubenswrapper[2072]: I0526 03:52:10.297282    2072 kubelet_getters.go:182] "Pod status updated" pod="openshift-vsphere-infra/keepalived-<node_hostname>" status=Running
      May 26 03:52:10 <node_hostname> kubenswrapper[2072]: I0526 03:52:10.297390    2072 kubelet_getters.go:182] "Pod status updated" pod="openshift-vsphere-infra/coredns-<node_hostname>" status=Running
      May 26 03:52:10 <node_hostname> NetworkManager[1330]: <warn>  [1716695530.3469] platform-linux: do-add-ip6-address[2: fe80::ba32:xxxx:xxx:xxxx]: failure 95 (Operation not supported)
      May 26 03:52:12 <node_hostname> NetworkManager[1330]: <warn>  [1716695532.3493] platform-linux: do-add-ip6-address[2: fe80::957a:xxxx:xxxx:xxxx]: failure 95 (Operation not supported)
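
      The "failure 95 (Operation not supported)" on adding an IPv6 link-local address usually indicates that the IPv6 stack is unavailable on the interface. A hedged sketch of checks one could run on the node (generic commands, not taken from the sosreport; ens192 is reused from the script trace further down):

      # If IPv6 is disabled on the kernel command line, the per-interface sysctl
      # tree is missing entirely, which would also explain the missing
      # forwarding file in the configure-ip-forwarding.sh trace below.
      grep -o 'ipv6.disable=1' /proc/cmdline && echo "IPv6 disabled via kernel cmdline"
      sysctl net.ipv6.conf.all.disable_ipv6 2>/dev/null || echo "no IPv6 sysctl tree"
      ls /proc/sys/net/ipv6/conf/ens192/ 2>/dev/null || echo "no IPv6 conf dir for ens192"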

      The reason behind the nodeip-configuration failure appears to be:

      Jul 15 19:29:19 <node_hostname> configure-ip-forwarding.sh[1682]: ++ iface=ens192
      Jul 15 19:29:19 <node_hostname> configure-ip-forwarding.sh[1682]: ++ [[ -z ens192 ]]
      Jul 15 19:29:19 <node_hostname> configure-ip-forwarding.sh[1682]: ++ echo ens192
      Jul 15 19:29:19 <node_hostname> configure-ip-forwarding.sh[1680]: + iface=ens192
      Jul 15 19:29:19 <node_hostname> configure-ip-forwarding.sh[1680]: + echo 'Node IP interface determined as: ens192. Enabling IP forwarding...'
      Jul 15 19:29:19 <node_hostname> configure-ip-forwarding.sh[1680]: Node IP interface determined as: ens192. Enabling IP forwarding...
      Jul 15 19:29:19 <node_hostname> configure-ip-forwarding.sh[1680]: + echo 1
      Jul 15 19:29:19 <node_hostname> configure-ip-forwarding.sh[1680]: + echo 1
      Jul 15 19:29:19 <node_hostname> configure-ip-forwarding.sh[1680]: /usr/local/bin/configure-ip-forwarding.sh: line 44: /proc/sys/net/ipv6/conf/ens192/forwarding: No such file or directory
      Jul 15 19:29:19 <node_hostname> systemd[1]: nodeip-configuration.service: Control process exited, code=exited, status=1/FAILURE
      Jul 15 19:29:19 <node_hostname> systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
      Jul 15 19:29:19 <node_hostname> systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.
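
      For illustration only, a minimal sketch of the failing step with a defensive guard; this is not the actual configure-ip-forwarding.sh and not necessarily the eventual fix. The interface name and the unguarded write come from the trace above:

      # The trace above shows an unguarded write to the IPv6 forwarding knob,
      # which aborts the unit when IPv6 is disabled because the /proc path
      # does not exist.
      iface=ens192    # interface name as determined in the trace above
      echo 1 > "/proc/sys/net/ipv4/conf/${iface}/forwarding"
      if [ -e "/proc/sys/net/ipv6/conf/${iface}/forwarding" ]; then
          echo 1 > "/proc/sys/net/ipv6/conf/${iface}/forwarding"
      else
          echo "IPv6 appears disabled on ${iface}; skipping IPv6 forwarding" >&2
      fi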

      So what we conclude is:

      IPv6 is enabled by default in the OpenShift cluster, and the cluster carries
      additional MachineConfigs that disable it (an illustrative example of such a
      MachineConfig follows).
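
      Illustrative only, since the customer's exact MachineConfig is not part of this report: one common way to disable IPv6 on worker nodes is a kernel-argument MachineConfig along these lines.

      # Illustrative MachineConfig (not the customer's actual config); disabling
      # IPv6 via ipv6.disable=1 removes /proc/sys/net/ipv6 entirely, which would
      # match the missing /proc path in the configure-ip-forwarding.sh trace above.
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        name: 99-worker-disable-ipv6
        labels:
          machineconfiguration.openshift.io/role: worker
      spec:
        kernelArguments:
          - ipv6.disable=1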

      The issue is only observed and reported during an upgrade, with a node in
      NotReady state. So the question is why IPv6 suddenly causes a problem when the
      cluster was initially running fine. After the issue is seen, forcing the config
      onto the node works around it (a sketch of that workaround follows), but the
      customer would like to know the root cause, as this is likely a bug.
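
      For reference, a hedged sketch of the "force the config" workaround as it is commonly applied through the Machine Config Daemon; the exact commands used in this case are not recorded in the report:

      # Touching the MCD force file makes the machine-config-daemon revalidate
      # and reapply the rendered MachineConfig on that node.
      oc debug node/<node_hostname> -- chroot /host touch /run/machine-config-daemon-force
      # Then watch the node recover:
      oc get nodes -w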

      Version-Release number of selected component (if applicable):

      Upgrade 4.13.39 to 4.14.31
          

      How reproducible:

           During an upgrade

      Steps to Reproduce:

          1. Perform an upgrade from 4.13.39 to 4.14.31 (a CLI sketch of this step follows).
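
      A hedged CLI sketch of the upgrade step; whether the customer drove the upgrade from the console or the CLI is not recorded here:

      # Hedged sketch only; channel name and CLI flow are assumptions.
      oc adm upgrade channel stable-4.14
      oc adm upgrade --to=4.14.31
      # The failure shows up as a node stuck in NotReady with kubelet down.
      oc get clusterversion
      oc get nodes -w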
          

      Actual results:

      The node fails to apply the update due to the kubelet failure.
          

      Expected results:

          
      The upgrade shouldn't fail; the node should return to Ready.

      Additional info:

      Logs on drive: https://drive.google.com/drive/folders/1ls1yAByyzK-Z20i0niqyN9w0IjzrT1pd?usp=sharing (a must-gather and a sosreport from the affected node).
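
      For reference, a hedged sketch of how such data is typically collected; the exact invocations used for the attached archives are not recorded here:

      # Cluster-wide diagnostics:
      oc adm must-gather --dest-dir=./must-gather
      # On the affected node (reachable over SSH per the description), a sosreport
      # is typically generated from inside a toolbox container on RHCOS:
      toolbox                 # opens a support container shell on the node
      sos report --batch      # run inside that shell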
          
