OpenShift Bugs / OCPBUGS-28551

EFS CSI performance degradation due to CPU limits


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Major
    • Affects Versions: 4.10, 4.11, 4.12, 4.13, 4.14, 4.15, 4.16
    • Component: Storage
    • Severity: Important
    • Release Note Text:

      * Previously, CPU limits for the {aws-first} Elastic File Store (EFS) Container Storage Interface (CSI) driver container could cause performance degradation of volumes managed by the {aws-short} EFS CSI Driver Operator. With this release, the CPU limits from the {aws-short} EFS CSI driver container are removed to help prevent potential performance degradation. (link:https://issues.redhat.com/browse/OCPBUGS-28551[*OCPBUGS-28551*])
    • Release Note Type: Bug Fix

      Description of problem:

When an EFS-based volume is mounted by the csi-driver container in the aws-efs-csi-driver-node daemonset, a new stunnel process is also launched. This process, which encrypts the I/O traffic of the NFS filesystem and can be CPU intensive under load, is throttled by the CPU limit (100m) configured on the csi-driver container: https://github.com/openshift/aws-efs-csi-driver-operator/blob/release-4.16/assets/node.yaml#L81-L83

      This CPU throttling leads to severe performance degradation of all volumes managed by the operator.
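
      The limit in question can be inspected on a live cluster. A minimal sketch, assuming the driver daemonset runs in the openshift-cluster-csi-drivers namespace and the container is named csi-driver (verify both on your cluster):

      # Print the resources stanza of the csi-driver container in the EFS node daemonset;
      # on affected versions this is expected to show a CPU limit of 100m.
      oc -n openshift-cluster-csi-drivers get daemonset aws-efs-csi-driver-node \
        -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-driver")].resources}{"\n"}'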

      How reproducible:

Create a pod with an EFS PVC attached and run a simple performance test on the volume, for example:

      fio --ioengine=libaio --iodepth=4 --runtime=60 --bs=1MiB --time_based=1 --filename=file --rw=read --size=2GiB --name=readjob --direct=1

      Repeat the test after removing the CPU limits from the csi-driver container of the aws-efs-csi-driver-node daemonset. This can be done by setting the ClusterCSIDriver/efs.csi.aws.com resource to the Unmanaged state and then editing the daemonset (see the sketch below).
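
      A minimal sketch of that second step, assuming cluster-admin access and the namespace/container names used above (both are assumptions to verify on your cluster):

      # 1. Put the EFS CSI driver under manual control so the operator stops
      #    reconciling the daemonset back to its default resource limits.
      oc patch clustercsidriver efs.csi.aws.com --type=merge \
        -p '{"spec":{"managementState":"Unmanaged"}}'

      # 2. Remove the CPU limit from the csi-driver container; per oc/kubectl
      #    "set resources" semantics, a value of 0 clears the limit.
      oc -n openshift-cluster-csi-drivers set resources daemonset/aws-efs-csi-driver-node \
        -c csi-driver --limits=cpu=0

      # 3. Re-run the fio test inside the pod and compare throughput.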
          

      Results using the default configuration:

      sh-5.2$ fio --ioengine=libaio --iodepth=4 --runtime=60 --bs=1MiB --time_based=1 --filename=file --rw=read --size=2GiB --name=readjob --direct=1
      readjob: (g=0): rw=read, bs=(R) 977KiB-977KiB, (W) 977KiB-977KiB, (T) 977KiB-977KiB, ioengine=libaio, iodepth=4
      <truncated>
      READ: bw=95.2MiB/s (99.9MB/s), 95.2MiB/s-95.2MiB/s (99.9MB/s-99.9MB/s), io=5717MiB (5995MB), run=60031-60031msec

       

Results after removing CPU limits:

      sh-5.2$ fio --ioengine=libaio --iodepth=4 --runtime=60 --bs=1MiB --time_based=1 --filename=file --rw=read --size=2GiB --name=readjob --direct=1
      readjob: (g=0): rw=read, bs=(R) 977KiB-977KiB, (W) 977KiB-977KiB, (T) 977KiB-977KiB, ioengine=libaio, iodepth=4
      <truncated>
      READ: bw=507MiB/s (532MB/s), 507MiB/s-507MiB/s (532MB/s-532MB/s), io=29.7GiB (31.9GB), run=60006-60006msec
      

              Tomas Smetana (rhn-support-tsmetana)
              Raul Sevilla Canavate (rsevilla@redhat.com)
              Rohit Patil