Description of problem:
When an EFS-based volume is mounted by the csi-driver container in the aws-efs-csi-driver-node daemonset, a new stunnel process is also launched. This process encrypts the I/O traffic of the NFS filesystem and can be CPU intensive under load. It is throttled by the CPU limit (100m) configured on the csi-driver container: https://github.com/openshift/aws-efs-csi-driver-operator/blob/release-4.16/assets/node.yaml#L81-L83 This CPU throttling causes severe performance degradation on all volumes managed by the operator.
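To make the effect of the 100m limit concrete, here is a minimal sketch of how a CPU limit translates into a CFS quota. It assumes the default 100 ms CFS period; the function name is hypothetical and only used for illustration.

```python
# Assumption: the default CFS period of 100 ms (100000 us) is in effect.
CFS_PERIOD_US = 100_000


def cpu_limit_to_cfs_quota_us(millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CPU time in microseconds a container may use per CFS period."""
    return millicores * period_us // 1000


# The csi-driver container is limited to 100m.
quota_us = cpu_limit_to_cfs_quota_us(100)
print(f"quota: {quota_us} us per {CFS_PERIOD_US} us period")
```

Under this assumption, everything running in the csi-driver container, including each stunnel process, shares 10 ms of CPU time per 100 ms period and is paused once that budget is spent.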
How reproducible:
Create a pod with an EFS PVC attached and run a simple performance test on this volume, e.g.:
fio --ioengine=libaio --iodepth=4 --runtime=60 --bs=1MiB --time_based=1 --filename=file --rw=read --size=2GiB --name=readjob --direct=1
Repeat the test after removing the CPU limits of the csi-driver container in the aws-efs-csi-driver-node daemonset. This can be done by setting the resource ClusterCSIDriver/efs.csi.aws.com to the Unmanaged state.
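The switch to Unmanaged could be sketched as a merge patch. This is an illustration, not a verified procedure: it assumes the field is spec.managementState, as in other OpenShift operator resources, and it only prints the `oc patch` command rather than running it.

```python
import json
import shlex

# Assumption: managementState sits under spec on the ClusterCSIDriver resource.
patch = {"spec": {"managementState": "Unmanaged"}}

# Build the command as an argument list so the JSON is quoted correctly.
cmd = [
    "oc", "patch", "clustercsidriver", "efs.csi.aws.com",
    "--type=merge", "-p", json.dumps(patch),
]

# Print the command for review instead of executing it.
print(shlex.join(cmd))
```

Once the operator stops reconciling the daemonset, the limits of the csi-driver container can be removed from it by hand.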
Results using the default configuration:
sh-5.2$ fio --ioengine=libaio --iodepth=4 --runtime=60 --bs=1MiB --time_based=1 --filename=file --rw=read --size=2GiB --name=readjob --direct=1
readjob: (g=0): rw=read, bs=(R) 977KiB-977KiB, (W) 977KiB-977KiB, (T) 977KiB-977KiB, ioengine=libaio, iodepth=4
<truncated>
READ: bw=95.2MiB/s (99.9MB/s), 95.2MiB/s-95.2MiB/s (99.9MB/s-99.9MB/s), io=5717MiB (5995MB), run=60031-60031msec
Results after removing the CPU limits:
sh-5.2$ fio --ioengine=libaio --iodepth=4 --runtime=60 --bs=1MiB --time_based=1 --filename=file --rw=read --size=2GiB --name=readjob --direct=1
readjob: (g=0): rw=read, bs=(R) 977KiB-977KiB, (W) 977KiB-977KiB, (T) 977KiB-977KiB, ioengine=libaio, iodepth=4
<truncated>
READ: bw=507MiB/s (532MB/s), 507MiB/s-507MiB/s (532MB/s-532MB/s), io=29.7GiB (31.9GB), run=60006-60006msec
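To quantify the difference between the two runs, the sketch below extracts the read bandwidth from the two fio READ summary lines reported above and computes the ratio. The helper name is hypothetical, and the parsing only handles summaries reported in MiB/s.

```python
import re

# READ summary lines copied from the two fio runs above.
default_run = (
    "READ: bw=95.2MiB/s (99.9MB/s), 95.2MiB/s-95.2MiB/s "
    "(99.9MB/s-99.9MB/s), io=5717MiB (5995MB), run=60031-60031msec"
)
no_limit_run = (
    "READ: bw=507MiB/s (532MB/s), 507MiB/s-507MiB/s "
    "(532MB/s-532MB/s), io=29.7GiB (31.9GB), run=60006-60006msec"
)


def read_bw_mib(summary: str) -> float:
    """Extract the aggregate bandwidth in MiB/s from a fio summary line."""
    match = re.search(r"bw=([\d.]+)MiB/s", summary)
    if match is None:
        raise ValueError("no MiB/s bandwidth found in summary line")
    return float(match.group(1))


speedup = read_bw_mib(no_limit_run) / read_bw_mib(default_run)
print(f"speedup: {speedup:.1f}x")  # prints "speedup: 5.3x"
```

With the reported numbers, removing the CPU limits raises read throughput by roughly a factor of 5.3.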
- blocks: OCPBUGS-28645 EFS CSI performance degradation due to CPU limits (Closed)
- is cloned by: OCPBUGS-28645 EFS CSI performance degradation due to CPU limits (Closed)
- links to: RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update