OCPBUGS-14945

machine-config-daemon rprivate default mount propagation with `hostPath: path: /` breaks CSI driver relying on multipath


      Description of problem:

      The default `rprivate` mount propagation, combined with `hostPath: path: /`, breaks a CSI driver that relies on multipath.
      
      

      How reproducible:

      Always
      
      

      Steps to Reproduce (simplified):

      1. SSH to the node.
      2. Mount a partition, for instance /dev/{s,v}da2, which on CoreOS is a UEFI FAT partition:
          $ sudo mount /dev/vda2 /mnt
      3. Start a debug pod on that node (or any pod that does a hostPath mount of /, such as the node tuning operand pod, the machine config operand, or the file integrity operand):
          $ oc debug nodes/master-2.sharedocp4upi411ovn.lab.upshift.rdu2.redhat.com
      4. Unmount the partition on the node:
          $ sudo umount /mnt
      5. Notice that the debug pod still holds a reference to the filesystem:
          $ grep vda2 /proc/*/mountinfo
      /proc/3687945/mountinfo:11219 10837 252:2 / /host/var/mnt rw,relatime - vfat /dev/vda2 rw,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro
      
      6. On the node, the mount is absent from /proc/mounts, yet the filesystem is still mounted, as shown by the dirty bit still being set on the FAT filesystem:

          $ sudo fsck -n /dev/vda2
      fsck from util-linux 2.32.1
      fsck.fat 4.1 (2017-01-24)
      0x25: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
      
      

      Expected results:

      The filesystem is unmounted both on the host and in the container.
      

      Additional info:

      Although the steps above show the behaviour in a simple way, this becomes quite problematic when using multipath on a host mount.
      In a customer environment, we noticed that some pods cannot be rescheduled from an old node to a new node using oc adm drain when these pods have a Persistent Volume mount created by the third-party CSI driver block.csi.ibm.com.

      The CSI driver uses multipath from CoreOS to manage multipath block devices. However, the multipath daemon blocks removal of the volume from the node: the `multipath -f` flush calls from the CSI driver always return busy. (In storage parlance, flushing a multipath device means removing it from the device tree in /dev.)

      The multipath flushes always fail because, although the multipath block device is unmounted on the host, the machine-config, file integrity, and node tuning pods do hostPath volume mounts of /, the host root filesystem, and thus get a copy of the host mounts.
      Because of that mount copy, the kernel sees the filesystem as still in use even though no file descriptors are open on it, and considers it unsafe to remove the multipath block device. The node CSI driver therefore cannot finish unmounting the volume, which blocks container creation on another node.

      We can see these mount copies by looking at /proc/<container pid>/mountinfo:

      $ grep mpathes proc/*/mountinfo
      proc/3295781/mountinfo:56348 52693 253:42 / /var/lib/kubelet/plugins/kubernetes.io/csi/block.csi.ibm.com/12345/globalmount rw,relatime - xfs /dev/mapper/mpathes rw,seclabel,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota
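
The grep above can be narrowed to an exact match on the mount source field. The following is a minimal sketch, not part of the original report: the function name `pids_holding_device` and its optional second argument (a proc root, useful when reading an sosreport copy) are hypothetical. It prints the PID of every process whose mount namespace still lists the given device.

```shell
# Print the PIDs whose mountinfo still references a given mount source.
# Usage: pids_holding_device <device> [proc_root]
# proc_root defaults to /proc (hypothetical helper, for illustration only).
pids_holding_device() {
    dev="$1"
    proc_root="${2:-/proc}"
    for f in "$proc_root"/[0-9]*/mountinfo; do
        # Skip entries that vanished or are unreadable.
        [ -r "$f" ] || continue
        # In mountinfo, a lone "-" separates the optional fields from
        # "<fstype> <mount source> <super options>", so the mount source
        # is the second field after the "-".
        if awk -v dev="$dev" '
            { for (i = 1; i <= NF; i++) if ($i == "-" && $(i + 2) == dev) found = 1 }
            END { exit !found }
        ' "$f" 2>/dev/null; then
            pid_dir="${f%/mountinfo}"
            echo "${pid_dir##*/}"
        fi
    done
}

# Example (on a node): pids_holding_device /dev/mapper/mpathes
```

Each PID returned can then be mapped to its container, for instance by reading /proc/<pid>/cgroup.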
      

      cri-o makes this mount copy using `rprivate` mount propagation
      (see https://github.com/cri-o/cri-o/blob/b098bec2d4d79bdf99c3ce89b0eeb16bfe8b5645/server/container_create_linux.go#L1030).

      The semantics of `rprivate` are mapped in `runc`
      (https://github.com/opencontainers/runc/blob/ba58ee9c3b9550c3e32b94802b0fb29761955290/libcontainer/specconv/spec_linux.go#L55)
      to mount flags passed to the mount(2) system call:

      MS_REC (since Linux 2.4.11)
                    Used in conjunction with MS_BIND to create a recursive bind mount, and in
                    conjunction with the propagation type flags to recursively change the
                    propagation type of all of the mounts in a subtree. See below for further
                    details.
      
      MS_PRIVATE
                    Make this mount private.  Mount and unmount events do not propagate  into  or
                    out of this mount.
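
To make the flag combination concrete, here is a small sketch of how a propagation name translates into mount(2) flag bits. The numeric values are those defined in <linux/mount.h>; the function name `propagation_flags` is hypothetical, and the mapping only illustrates the idea behind the runc table linked above rather than reproducing it.

```shell
# Mount flag values as defined in <linux/mount.h>.
MS_REC=16384
MS_PRIVATE=$((1 << 18))
MS_SLAVE=$((1 << 19))
MS_SHARED=$((1 << 20))

# Print the numeric mount(2) flags for a propagation name.
# The "r" prefix means the propagation type is applied recursively (MS_REC).
# Hypothetical helper, for illustration only.
propagation_flags() {
    case "$1" in
        private)  echo "$MS_PRIVATE" ;;
        rprivate) echo $((MS_PRIVATE | MS_REC)) ;;
        slave)    echo "$MS_SLAVE" ;;
        rslave)   echo $((MS_SLAVE | MS_REC)) ;;
        shared)   echo "$MS_SHARED" ;;
        rshared)  echo $((MS_SHARED | MS_REC)) ;;
        *)        echo "unknown propagation mode: $1" >&2; return 1 ;;
    esac
}

# Example: propagation_flags rprivate   # prints 278528
```

With `rprivate`, every mount under the bind-mounted tree becomes private in the container's mount namespace, so a later unmount on the host never reaches the copy.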
      

      The key here is the MS_PRIVATE flag. The unmount of the multipath block device is not propagated to the mount namespaces of the containers, so the filesystem stays mounted there indefinitely, which prevents the multipath device from being flushed.

      Maybe hostPath mounts should use `rslave` mount propagation when we detect an attempt to bind mount /var/lib?
      It seems cri-dockerd does something similar, according to https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation
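
For reference, the Kubernetes page linked above describes a per-volumeMount `mountPropagation` field whose `HostToContainer` value corresponds to `rslave`. The sketch below shows what a pod mounting the host root that way could look like. It is an illustration under assumptions, not a proposed patch for any of the affected operands: the pod name, container name, and image are hypothetical placeholders.

```shell
# Print a hypothetical pod manifest that mounts the host root filesystem
# with HostToContainer (rslave) propagation, so that unmounts performed on
# the host are propagated into the container's mount namespace.
print_hostroot_pod() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hostroot-rslave-example        # hypothetical name
spec:
  containers:
  - name: shell
    image: registry.example.com/tools:latest   # placeholder image
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: host-root
      mountPath: /host
      mountPropagation: HostToContainer
  volumes:
  - name: host-root
    hostPath:
      path: /
EOF
}

# Example: print_hostroot_pod | oc apply -f -
```

Repeating the reproduction steps with such a pod in place of the debug pod would show whether the stale vda2 entry still appears in its mountinfo.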

            cdoern@redhat.com Charles Doern
            rhn-support-ekasprzy Emmanuel Kasprzyk
            Sergio Regidor de la Rosa Sergio Regidor de la Rosa