  1. Data Foundation Bugs
  2. DFBUGS-906

Data loss due to concurrent RPC calls (occurrence is very low)


    • Bug
    • Resolution: Unresolved
    • Critical
    • odf-4.18
    • odf-4.18
    • csi-driver
    • None
    • False
    • False
    • Committed
    • ?
    • ?
    • Committed
    • Critical
    • None

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

       

      A race condition in the RBD node plugin may lead to total data loss when a volume is unmounted during a NodeUnpublish RPC call.
      The problem arises when two parallel NodePublish RPC calls are made for the same PVC and the same target path.
      The parallel RPCs cause the node plugin to bind-mount the volume again over the target path, WITHOUT unmounting the previous bind mount.
      Later, the kubelet invokes a /csi.v1.Node/NodeUnpublishVolume call to the node plugin.
      The node plugin unmounts the volume and calls os.RemoveAll on the target path. HOWEVER, since mount was invoked twice on that volume, there is still an active bind mount to the actual RBD volume in that directory, causing os.RemoveAll to recursively remove the entire contents of that volume.

      The occurrence rate is low, but we need to fix this to avoid data loss.
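
      The failure mode described above suggests two defensive measures, sketched below in Go. This is an illustration under stated assumptions, not the node plugin's actual code or the fix that was adopted: the Mounter interface, pathLocks, safeUnpublish, fakeMounter, simulate, and the maxStackedMounts bound are all hypothetical names introduced here. The sketch rejects concurrent operations on the same target path, unmounts until the path is no longer a mount point, and then uses os.Remove, which fails on a non-empty directory, instead of os.RemoveAll.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// Mounter is a hypothetical stand-in for the node's mount utilities.
type Mounter interface {
	IsMountPoint(path string) (bool, error)
	Unmount(path string) error
}

// pathLocks (hypothetical) rejects a second operation on a path that is
// already being worked on, so parallel RPCs cannot interleave.
type pathLocks struct{ busy sync.Map }

func (l *pathLocks) tryAcquire(path string) bool {
	_, loaded := l.busy.LoadOrStore(path, struct{}{})
	return !loaded
}

func (l *pathLocks) release(path string) { l.busy.Delete(path) }

const maxStackedMounts = 8 // arbitrary bound chosen for this sketch

// safeUnpublish unmounts every mount stacked on target, then removes the
// directory only if it is empty.
func safeUnpublish(m Mounter, locks *pathLocks, target string) error {
	if !locks.tryAcquire(target) {
		return fmt.Errorf("operation already in progress for %s", target)
	}
	defer locks.release(target)
	for unmounts := 0; ; unmounts++ {
		mounted, err := m.IsMountPoint(target)
		if err != nil {
			return err
		}
		if !mounted {
			break
		}
		if unmounts == maxStackedMounts {
			return fmt.Errorf("%s still mounted after %d unmounts", target, unmounts)
		}
		if err := m.Unmount(target); err != nil {
			return err
		}
	}
	// os.Remove fails on a non-empty directory, unlike os.RemoveAll, so
	// data exposed by an undetected mount is left in place.
	if err := os.Remove(target); err != nil && !errors.Is(err, os.ErrNotExist) {
		return err
	}
	return nil
}

// fakeMounter models `depth` bind mounts stacked on one path.
type fakeMounter struct{ depth, unmountCalls int }

func (f *fakeMounter) IsMountPoint(string) (bool, error) { return f.depth > 0, nil }

func (f *fakeMounter) Unmount(string) error {
	f.depth--
	f.unmountCalls++
	return nil
}

// simulate runs safeUnpublish against a temporary directory and reports how
// many unmounts were issued and whether the target directory was removed.
func simulate(depth int, leaveFile bool) (unmountCalls int, removed bool, err error) {
	base, err := os.MkdirTemp("", "unpublish-sketch")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(base) // cleanup of the scratch directory only
	target := filepath.Join(base, "target")
	if err = os.Mkdir(target, 0o755); err != nil {
		return 0, false, err
	}
	if leaveFile {
		if err = os.WriteFile(filepath.Join(target, "data"), []byte("x"), 0o600); err != nil {
			return 0, false, err
		}
	}
	m := &fakeMounter{depth: depth}
	err = safeUnpublish(m, &pathLocks{}, target)
	_, statErr := os.Stat(target)
	return m.unmountCalls, errors.Is(statErr, os.ErrNotExist), err
}

func main() {
	fmt.Println(simulate(2, false)) // two stacked mounts, empty directory
	fmt.Println(simulate(0, true))  // no detected mount, data present
}
```

      In the first simulated case both stacked mounts are unmounted before the directory is removed. In the second, the directory still holds a file, so the removal is refused and the data survives. A real fix would presumably also apply the same per-path locking on the NodePublish side, so the double bind mount is never created in the first place.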

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

       

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

       

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

       

       

      Does this issue impact your ability to continue to work with the product?

       

       

      Is there any workaround available to the best of your knowledge?

       

       

      Can this issue be reproduced? If so, please provide the hit rate

       

       

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      1.

      2.

      3.

      The exact date and time when the issue was observed, including timezone details:

       

      Actual results:

       

       

      Expected results:

       

      Logs collected and log location:

       

      Additional info:

       

              ypadia@redhat.com Yati Padia
              mrajanna@redhat.com Madhu R
              Yati Padia
              Yulia Persky