-
Bug
-
Resolution: Done
-
Major
-
None
-
4.12
-
None
Description of problem:
`rprivate` default mount propagation in combination with `hostPath: path: /` breaks CSI driver relying on multipath
How reproducible:
Always
Steps to Reproduce (simplified):
1. ssh to the node.
2. Mount a partition, for instance /dev/{s,v}da2, which on CoreOS is a UEFI FAT partition:
   $ sudo mount /dev/vda2 /mnt
3. Start a debug pod on that node (or any pod that does a hostPath mount of /, such as the node tuning operand pod, the machine config operand, or the filesystem integrity operand):
   $ oc debug nodes/master-2.sharedocp4upi411ovn.lab.upshift.rdu2.redhat.com
4. Unmount the partition on the node.
5. Notice that the debug pod still has a reference to the filesystem:
   $ grep vda2 /proc/*/mountinfo
   /proc/3687945/mountinfo:11219 10837 252:2 / /host/var/mnt rw,relatime - vfat /dev/vda2 rw,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro
6. On the node, although the mount is absent from /proc/mounts, the filesystem is still mounted, as shown by the dirty bit still being set on the FAT filesystem:
   $ sudo fsck -n /dev/vda2
   fsck from util-linux 2.32.1
   fsck.fat 4.1 (2017-01-24)
   0x25: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
Expected results:
The filesystem is unmounted on the host and in the container.
Additional info:
Although the steps above show the behaviour in a simple way, this becomes quite problematic when using multipath on a host mount.
We noticed in a customer environment that we cannot reschedule some pods from an old node to a new node using `oc adm drain` when these pods have a Persistent Volume mount created by the third-party CSI driver block.csi.ibm.com.
The CSI driver uses multipath from CoreOS to manage multipath block devices. However, the multipath daemon blocks the volume removal from the node: the `multipath -f` flush calls from the CSI driver always return busy. (In storage parlance, flushing a multipath device means removing it from the device tree in /dev.)
The multipath flushes always fail because, although the multipath block device is unmounted on the host, the machine-config, file integrity, and node tuning pods do hostPath volume mounts of /, the host root filesystem, and thus get a copy of the mounts.
Because of that mount copy, the kernel sees the filesystem as still in use, although there are no file descriptors open on it, and considers it unsafe to remove the multipath block device. The node CSI driver therefore cannot finish unmounting the volume, which blocks container creation on another node.
We can see these mount copies by looking at /proc/<container pid>/mountinfo:
$ grep mpathes proc/*/mountinfo
proc/3295781/mountinfo:56348 52693 253:42 / /var/lib/kubelet/plugins/kubernetes.io/csi/block.csi.ibm.com/12345/globalmount rw,relatime - xfs /dev/mapper/mpathes rw,seclabel,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota
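The grep above can also be done programmatically. The following is a minimal sketch, assuming the mountinfo field layout documented in proc(5); the names `MountEntry`, `parse_mountinfo_line` and `holders_of` are hypothetical helpers written for this report, not part of any existing tool. It also exposes the optional propagation fields (`shared:N`, `master:N`), which are absent on the line above, as expected for a private mount.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class MountEntry:
    mount_id: int
    parent_id: int
    dev: str           # "major:minor" of the mounted device
    root: str
    mount_point: str
    propagation: list  # optional fields such as "shared:12" or "master:3"; empty for a private mount
    fstype: str
    source: str


def parse_mountinfo_line(line: str) -> MountEntry:
    """Parse one line of /proc/<pid>/mountinfo (layout as described in proc(5))."""
    fields = line.split()
    # A lone "-" separates the optional fields from the filesystem details.
    # The optional fields start at index 6, so search from there.
    sep = fields.index("-", 6)
    return MountEntry(
        mount_id=int(fields[0]),
        parent_id=int(fields[1]),
        dev=fields[2],
        root=fields[3],
        mount_point=fields[4],
        propagation=fields[6:sep],
        fstype=fields[sep + 1],
        source=fields[sep + 2],
    )


def holders_of(source: str, proc_root: str = "/proc"):
    """Yield (pid, MountEntry) for every process whose mount namespace still mounts `source`."""
    for info in Path(proc_root).glob("[0-9]*/mountinfo"):
        try:
            text = info.read_text()
        except OSError:  # process exited in the meantime, or access denied
            continue
        for line in text.splitlines():
            entry = parse_mountinfo_line(line)
            if entry.source == source:
                yield int(info.parent.name), entry
```

Run as root on an affected node, `list(holders_of("/dev/mapper/mpathes"))` would be expected to return the pids of the containers that still hold a copy of the mount after the host has unmounted it.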
cri-o does this mount copy using `rprivate` mount propagation
(see https://github.com/cri-o/cri-o/blob/b098bec2d4d79bdf99c3ce89b0eeb16bfe8b5645/server/container_create_linux.go#L1030).
The semantics of `rprivate` are mapped in `runc`
(https://github.com/opencontainers/runc/blob/ba58ee9c3b9550c3e32b94802b0fb29761955290/libcontainer/specconv/spec_linux.go#L55)
to mount flags passed to the mount(2) system call. From the man page:
MS_REC (since Linux 2.4.11)
    Used in conjunction with MS_BIND to create a recursive bind mount, and in conjunction with the propagation type flags to recursively change the propagation type of all of the mounts in a subtree. See below for further details.
MS_PRIVATE
    Make this mount private. Mount and unmount events do not propagate into or out of this mount.
The key here is the MS_PRIVATE flag. The unmounting of the multipath block device is not propagated to the mount namespace of the containers, which keeps the filesystem mounted indefinitely and prevents flushing of the multipath device.
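To make the flag mapping concrete, here is a small illustrative sketch. The flag values are the ones defined in the Linux header <linux/mount.h>; the table mirrors the propagation names accepted in an OCI runtime spec, and the helper `host_unmount_reaches_container` is a hypothetical name introduced only for this illustration, not runc code.

```python
# Mount flag values as defined in the Linux UAPI header <linux/mount.h>.
MS_REC = 1 << 14      # apply recursively to the whole subtree
MS_PRIVATE = 1 << 18  # no propagation in either direction
MS_SLAVE = 1 << 19    # receive events from the master, send none back
MS_SHARED = 1 << 20   # propagate events in both directions

# Propagation option name -> flags passed to mount(2).
PROPAGATION_FLAGS = {
    "private": MS_PRIVATE,
    "rprivate": MS_PRIVATE | MS_REC,
    "slave": MS_SLAVE,
    "rslave": MS_SLAVE | MS_REC,
    "shared": MS_SHARED,
    "rshared": MS_SHARED | MS_REC,
}


def host_unmount_reaches_container(mode: str) -> bool:
    """Return True if an unmount on the host side is propagated into the container's copy."""
    return bool(PROPAGATION_FLAGS[mode] & (MS_SLAVE | MS_SHARED))


for mode in ("rprivate", "rslave", "rshared"):
    print(f"{mode}: flags={PROPAGATION_FLAGS[mode]:#x}, "
          f"host unmount propagates={host_unmount_reaches_container(mode)}")
```

With `rprivate` the helper returns False, which matches the behaviour described above: the container keeps its copy of the mount after the host unmounts the device.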
Maybe hostPath mounts should use `rslave` mount propagation when we detect an attempt to bind mount /var/lib?
It seems cri-dockerd does something similar, according to https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation
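For reference, the Kubernetes page linked above documents a per-container `mountPropagation` field whose values correspond to Linux propagation modes. The sketch below builds the volume and volumeMount stanzas for a hostPath mount of / that requests `HostToContainer`, which that page describes as equal to `rslave`. The function name `host_root_mount` and the default names `host` and `/host` are assumptions chosen for illustration; whether opting in per pod is the right fix for the operands named in this report is not something this report establishes.

```python
import json

# Kubernetes mountPropagation value -> equivalent Linux propagation mode,
# as described in the Kubernetes volumes documentation.
K8S_TO_LINUX = {
    "None": "rprivate",
    "HostToContainer": "rslave",
    "Bidirectional": "rshared",
}


def host_root_mount(name="host", mount_path="/host", propagation="HostToContainer"):
    """Build the pod-spec fragments for a hostPath mount of / with explicit propagation."""
    if propagation not in K8S_TO_LINUX:
        raise ValueError(f"unknown mountPropagation: {propagation!r}")
    volume = {"name": name, "hostPath": {"path": "/"}}
    volume_mount = {
        "name": name,
        "mountPath": mount_path,
        "mountPropagation": propagation,
    }
    return volume, volume_mount


volume, volume_mount = host_root_mount()
print(json.dumps({"volumes": [volume], "volumeMounts": [volume_mount]}, indent=2))
```

The two fragments belong under `spec.volumes` and `spec.containers[].volumeMounts` of a pod spec respectively.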
- clones
-
OCPBUGS-14946 tuned daemonset rprivate default mount propagation with `hostPath: path: /` volumeMount breaks CSI driver relying on multipath
- Closed
- depends on
-
OCPBUGS-14946 tuned daemonset rprivate default mount propagation with `hostPath: path: /` volumeMount breaks CSI driver relying on multipath
- Closed
- split from
-
OCPBUGS-14351 rprivate default mount propagation in combination with `hostPath: path: /` breaks CSI driver relying on multipath
- Closed
- links to