OpenShift Bugs / OCPBUGS-61077

forced storage detach presents data corruption risk on drivers exposing LUNs directly


    • Bug
    • Resolution: Done
    • Normal
    • 4.20.0
    • Storage / Kubernetes
    • Quality / Stability / Reliability

      Description of problem:

      Intro for force storage detach:
      https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#storage-force-detach-on-timeout
      In drivers that expose LUNs directly, this bypasses the unstage flow (where multipath -f is invoked) and goes straight to unpublish, which unmaps the LUN from the per-node igroup.
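      For illustration, this is the node-side cleanup that gets skipped. A minimal sketch, assuming a typical iSCSI/FC LUN driver (the device name is a placeholder):

          # normally run by the driver during NodeUnstage, before the LUN is unmapped from the igroup
          multipath -f <mpath-device>   # flush the multipath map backing the volume

      With force detach, the controller-side unpublish unmaps the LUN while this map (and any filesystem on it) is still live on the node.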
      
      While it's nice to have autopilot, on this kind of driver the consequence can be data corruption.
      
      We are not the first ones to be confused by this; see the upstream issue:
      https://github.com/kubernetes/kubernetes/issues/120328
      which resulted in a PR that allows disabling this behavior:
      https://github.com/kubernetes/kubernetes/pull/120344
      
      This bug is about considering disabling this behavior by default in OCP.
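      The upstream PR adds a kube-controller-manager flag for this. A minimal sketch of disabling the behavior on a plain kube-controller-manager invocation, assuming the flag name from that PR (how this would be wired through the OCP kube-controller-manager operator is exactly what this bug asks to decide):

          # added to the existing kube-controller-manager arguments
          kube-controller-manager ... --disable-force-detach-on-timeout=true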

      Version-Release number of selected component (if applicable):

          OCP 4.20.0

      How reproducible:

          100%

      Steps to Reproduce:

          1. Look up the force-detach flag in the kube-controller-manager configuration (see the command sketch below)
          

      Actual results:

          Force detach on timeout is enabled (upstream default behavior).

      Expected results:

          Force detach on timeout is disabled by default in OCP.

      Additional info:

      What needs to happen instead: notice the node is not ready, then use an out-of-band means to kill the node (power off the VM or physical machine) and taint the node so that all its volumes get force-detached. This way there is no garbage left on the node as a result of the force detach. The crucial part is that someone actually makes sure the node is dead and then taints it -- not a 6-minute timer expiring while the node is actually perfectly fine and simply unreachable by the API server for a brief moment.
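      For reference, this manual flow matches the non-graceful node shutdown procedure from the page linked above; once the node is confirmed powered off, the taint looks like this (node name is a placeholder):

          oc taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute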
      
      To invoke the forced detach mechanism:
      - create a pod that does nothing, with a volume on said driver
      - open a shell on the node and run exec 3</var/lib/kubelet/pods/pod-uid/volumeDevices/kubernetes.io~csi/pvc-pvc-uid (keep the session open; the open file descriptor keeps the block device busy so unstaging fails)
      - delete the pod
      - observe the errors from the driver while unstaging
      - after 6 minutes, issue a systemctl restart kubelet on the node
      - observe the force detach log with:
      kubectl get pods -n openshift-kube-controller-manager --no-headers | awk '{print $1}' | xargs -I {} sh -c 'kubectl logs -n openshift-kube-controller-manager {} --all-containers --prefix | grep "force detaching"'

              Hemant Kumar (hekumar@redhat.com)
              Alex Kalenyuk (akalenyu)
              Wei Duan
