Type: Feature Request
Resolution: Done
Priority: Major
1. Proposed title of this feature request
Enhance how pods with RWO PVs react when the client node crashes
2. What is the nature and description of the request?
When a pod uses an RWO volume and the client host where the pod is running crashes (power outage, NIC down, ...) and the kubelet service on the broken node becomes unreachable, the pod is left in the Terminating state and no replacement pod can start because the VolumeAttachment persists. We understand this is the expected behaviour in Kubernetes to avoid data corruption, but it means a manual procedure has to be executed to release the VolumeAttachment.
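For illustration, the kind of workload involved is a single-replica Deployment writing to an RWO PVC backed by Ceph RBD. The following manifest is only a sketch; the PVC name, image, and storage class name are assumptions, not values taken from the actual environment:
$ cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-write-pvc                             # assumed PVC name
spec:
  accessModes:
    - ReadWriteOnce                               # RWO: attachable to a single node at a time
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd   # assumed Ceph RBD storage class
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rbd-write-workload-generator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rbd-write-workload-generator
  template:
    metadata:
      labels:
        app: rbd-write-workload-generator
    spec:
      containers:
      - name: writer
        image: registry.access.redhat.com/ubi9/ubi-minimal   # any image that writes to the volume
        command: ["sh", "-c", "while true; do date >> /data/out.log; sleep 5; done"]
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: rbd-write-pvc
EOF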
We would like to raise a Request For Enhancement for an automatic procedure that uses the self-fencing capabilities in Ceph, so that manually releasing the VolumeAttachment is no longer needed. The current manual procedure is described below (a consolidated script is sketched after the steps):
-After 5 minutes, OpenShift detects the node is NotReady and moves the workloads to another node.
-After 20 minutes, the old pods are in the Terminating state and the new pods are in the ContainerCreating state.
-Because the new pods cannot start (they are unable to attach or mount the volumes), we have to perform the following procedure manually:
-Initial scenario:
$ oc get pods
NAME STATUS AGE
rbd-write-workload-generator-6c4d87b4c4-kbrlw ContainerCreating 20s
rbd-write-workload-generator-6c4d87b4c4-mjxwb Terminating 15m
-Examine the error reported on the new pod:
$ oc describe pod rbd-write-workload-generator-6c4d87b4c4-kbrlw
...
Multi-Attach error for volume "pvc-a3f569a7-1fe7-4d2d-b561-090b2426b13d"
Volume is already used by pod(s) rbd-write-workload-generator-6c4d87b4c4-mjxwb
...
-Set environment variables:
$ OLD_POD_NAME="rbd-write-workload-generator-6c4d87b4c4-mjxwb"
$ NEW_POD_NAME="rbd-write-workload-generator-6c4d87b4c4-kbrlw"
$ PV_NAME="pvc-a3f569a7-1fe7-4d2d-b561-090b2426b13d"
-Delete the pod in Terminating state:
$ oc delete pod ${OLD_POD_NAME} --force --grace-period=0
-Get the VolumeAttachment linked to the PersistentVolume mounted by the pod:
$ VOL_ATTACHMENT_NAME=$(oc get volumeattachment -o jsonpath="{.items[?(@.spec.source.persistentVolumeName=='${PV_NAME}')].metadata.name}")
-Delete the VolumeAttachment object:
$ oc delete volumeattachment ${VOL_ATTACHMENT_NAME}
-Delete the pod in ContainerCreating status to force recreation:
$ oc delete pod ${NEW_POD_NAME}
-Wait until the new pod is created:
$ oc get pods
NAME STATUS AGE
rbd-write-workload-generator-6c4d87b4c4-vlrx8 Running 34s
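For reference, the steps above can be collected into a single helper script. This is only a sketch built from the commands in this procedure; the script name and its arguments are hypothetical:
#!/bin/bash
# release-rwo-volume.sh OLD_POD NEW_POD PV_NAME
# Manually releases an RWO volume whose node has crashed so the replacement pod can attach it.
set -euo pipefail

OLD_POD_NAME="$1"   # pod stuck in Terminating on the failed node
NEW_POD_NAME="$2"   # replacement pod stuck in ContainerCreating
PV_NAME="$3"        # PersistentVolume named in the Multi-Attach error

# Force-delete the pod that the failed node can no longer terminate.
oc delete pod "${OLD_POD_NAME}" --force --grace-period=0

# Find and delete the VolumeAttachment that still binds the PV to the dead node.
VOL_ATTACHMENT_NAME=$(oc get volumeattachment -o jsonpath="{.items[?(@.spec.source.persistentVolumeName=='${PV_NAME}')].metadata.name}")
oc delete volumeattachment "${VOL_ATTACHMENT_NAME}"

# Delete the new pod so it is recreated and can attach the volume on the healthy node.
oc delete pod "${NEW_POD_NAME}"
Example invocation with the names from the scenario above:
$ ./release-rwo-volume.sh rbd-write-workload-generator-6c4d87b4c4-mjxwb rbd-write-workload-generator-6c4d87b4c4-kbrlw pvc-a3f569a7-1fe7-4d2d-b561-090b2426b13d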
3. Why does the customer need this? (List the business requirements here)
We can accept a downtime of around 5 minutes until OpenShift detects the failure and creates a new pod, but having to execute a manual procedure to recover from this situation is not desirable; it should be handled automatically by OCP/Kubernetes.
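For context, upstream Kubernetes addresses this class of failure with the non-graceful node shutdown feature (tracked by the linked issues below): once the failed node is marked out of service, its pods are force-deleted and the volumes are detached without the manual steps above. A sketch of how that taint is applied today (the node name is an example); the request here is for this to be triggered automatically, for example driven by Ceph self-fencing:
$ oc adm taint nodes worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
Once the node has recovered or been removed, the taint is deleted:
$ oc adm taint nodes worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-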
- depends on: OCPSTRAT-724 Non-graceful node shutdown (Closed)
- duplicates: RFE-2235 Add non-graceful node shutdown to allow CSI drivers to detach volumes in case of down node (Accepted)