OCPBUGS-42841

Nodes are automatically removed from the cluster during a cluster outage


    • Critical

      Description of problem:

          Nodes were automatically removed from a vSphere IPI cluster during a network outage. Once the outage was resolved, the nodes appear to have rejoined the cluster after a reboot.
      

      Version-Release number of selected component (if applicable):

          4.15.24, 4.14.11 (vSphere IPI)

      Steps to Reproduce:

      KCM logs:
      ~~~
      2024-10-01T10:50:16.721201726+00:00 stderr F I1001 10:50:16.721189       1 attach_detach_controller.go:585] "Error removing node from desired-state-of-world" node="ocp-prod-s2xk6-infra-sd78r" err="failed to delete node \"ocp-prod-s2xk6-infra-sd78r\" from list of nodes managed by attach/detach controller--the node still contains 4 volumes in its list of volumes to attach"
      2024-10-01T10:50:16.833345406+00:00 stderr F I1001 10:50:16.832134       1 attach_detach_controller.go:585] "Error removing node from desired-state-of-world" node="ocp-prod-s2xk6-worker-hkxnh" err="failed to delete node \"ocp-prod-s2xk6-worker-hkxnh\" from list of nodes managed by attach/detach controller--the node still contains 1 volumes in its list of volumes to attach"
      2024-10-01T10:50:16.968415270+00:00 stderr F I1001 10:50:16.968408       1 attach_detach_controller.go:585] "Error removing node from desired-state-of-world" node="ocp-prod-s2xk6-worker-k7tfw" err="failed to delete node \"ocp-prod-s2xk6-worker-k7tfw\" from list of nodes managed by attach/detach controller--the node still contains 2 volumes in its list of volumes to attach"
      ~~~     
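
To make the KCM entries above easier to scan, the following is a minimal sketch of a shell helper that lists each affected node with the number of volumes it still has in the attach/detach controller's list. The function name `summarize_detach_errors` and the file name `kcm.log` are hypothetical; the helper only assumes the log lines have the same shape as the ones quoted above.

```shell
# Hypothetical helper: summarize attach/detach node-removal errors.
# Reads kube-controller-manager log text on stdin and prints
# "<node-name> <volume-count>" for every matching log line.
summarize_detach_errors() {
  # Keep only the attach/detach controller error lines.
  grep 'Error removing node from desired-state-of-world' |
    # Extract the node="..." value and the "still contains N volumes" count.
    sed -E 's/.*node="([^"]+)".*still contains ([0-9]+) volumes.*/\1 \2/'
}
```

Assuming the log has been saved to a local file, it could be used as `summarize_detach_errors < kcm.log`.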
      ~~~
      $ oc get nodes | grep 3d
      
      ocp-prod-s2xk6-master-2       Ready                      control-plane,master   3d    v1.28.11+add48d0
      ocp-prod-s2xk6-worker-hkxnh   Ready                      worker                 3d    v1.28.11+add48d0
      ocp-prod-s2xk6-worker-k7tfw   Ready,SchedulingDisabled   worker                 3d    v1.28.11+add48d0
      ocp-prod-s2xk6-worker-rmcsp   Ready,SchedulingDisabled   worker                 3d    v1.28.11+add48d0
      ocp-prod-s2xk6-infra-sd78r    Ready                      infra,worker           3d    v1.28.11+add48d0
      ~~~

      Actual results:

          Nodes are automatically removed from the cluster during the outage.

      Expected results:

          Nodes should only transition to the NotReady state; they should not be removed from the cluster.

      Additional info:

          Must-gather of the affected cluster will be uploaded.

              raryan@redhat.com Rachel Ryan
              rhn-support-dpateriy Divyam Pateriya
              Paige Patton Paige Patton