- Bug
- Resolution: Unresolved
- Critical
- None
For small object sizes of 15 KiB, with 100 buckets of 667 objects each (~1 GB total dataset size), MCG is unable to saturate the underlying RGW performance in a bare-metal ODF environment. We ran tests directly against RGW and then against MCG configured with RGW as the backing store, and we see a significant difference in performance. Here are the summarized results:
Seqread:   1725 OPS on RGW vs 947 OPS on MCG (45% lower with MCG)
Seqwrite:  1138 OPS on RGW vs 240 OPS on MCG (79% lower with MCG)
Randread:  1709 OPS on RGW vs 1037 OPS on MCG (39% lower with MCG)
Randwrite: 1132 OPS on RGW vs 237 OPS on MCG (79% lower with MCG)
From the numbers above it is evident that RGW is not the bottleneck. Note that MCG was tuned according to the KCS article https://access.redhat.com/solutions/6719951, and we did not see any resource bottleneck on the NooBaa DB, core, or endpoint pods.
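For reference, both runs targeted the S3 endpoints exposed in the openshift-storage namespace. As a rough sketch of how the two endpoints and credentials can be looked up (the s3 route and noobaa-admin secret are created by MCG; the RGW route name below is the usual bare-metal ODF default and is an assumption here):
# MCG (NooBaa) S3 endpoint
oc get route s3 -n openshift-storage -o jsonpath='{.spec.host}'
# RGW S3 endpoint (route name assumed; adjust if the CephObjectStore route is named differently)
oc get route ocs-storagecluster-cephobjectstore -n openshift-storage -o jsonpath='{.spec.host}'
# MCG admin S3 credentials
oc get secret noobaa-admin -n openshift-storage -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d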
We have determined that the problem is in the MCG stack. We have tried the following, with no improvement (a sketch of the corresponding commands follows the list):
- Increased the NooBaa DB memory to 8 and eventually to 16
- Increased the minimum endpoint count to 6
- Increased the load on the system by increasing the Cosbench workers (from 88 to 176 to 352) and drivers (from 4 to 8) to push the system harder
- Ran two concurrent loads on the system (this actually splits the write performance to ~115 OPS each, which points to a hard bottleneck in the NooBaa IO stack)
- Increased the PG count on the RGW data pool from the default of 32 to 128, with autoscaling off
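For completeness, the tuning above was applied roughly along these lines. The spec.endpoints and spec.dbResources fields are the documented NooBaa CR knobs; the 16Gi value, the rook-ceph-tools deployment name, and the RGW data pool name are the usual ODF defaults and are assumptions here, not copied from this cluster:
# Raise the minimum (and maximum) endpoint count on the NooBaa CR
oc patch noobaa noobaa -n openshift-storage --type merge -p '{"spec":{"endpoints":{"minCount":6,"maxCount":6}}}'
# Raise the NooBaa DB memory request/limit (unit assumed to be GiB)
oc patch noobaa noobaa -n openshift-storage --type merge -p '{"spec":{"dbResources":{"requests":{"memory":"16Gi"},"limits":{"memory":"16Gi"}}}}'
# Disable PG autoscaling and raise pg_num on the RGW data pool, run via the rook-ceph toolbox
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph osd pool set ocs-storagecluster-cephobjectstore.rgw.buckets.data pg_autoscale_mode off
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph osd pool set ocs-storagecluster-cephobjectstore.rgw.buckets.data pg_num 128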
oc version
Client Version: 4.10.18
Server Version: 4.10.15
Kubernetes Version: v1.23.5+3afdacb
oc get csv
NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
mcg-operator.v4.10.4              NooBaa Operator               4.10.4    mcg-operator.v4.10.3              Succeeded
ocs-operator.v4.10.4              OpenShift Container Storage   4.10.4    ocs-operator.v4.10.3              Succeeded
odf-csi-addons-operator.v4.10.4   CSI Addons                    4.10.4    odf-csi-addons-operator.v4.10.3   Succeeded
odf-operator.v4.10.4              OpenShift Data Foundation     4.10.4    odf-operator.v4.10.3              Succeeded
How reproducible: consistently
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)? No, just poor write performance observed
Is there any workaround available to the best of your knowledge? No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 1
Is this issue reproducible? Yes
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Capture MCG performance
2. Capture underlying RGW performance
3. Compare the two (a rough sketch of such a comparison is shown below)
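As a crude single-object illustration only (this is not the Cosbench workload used for the numbers above; the endpoint, bucket, and credential values are placeholders to be filled in from the routes and secrets listed earlier):
# Export the S3 credentials for the endpoint under test
export AWS_ACCESS_KEY_ID=<access-key> AWS_SECRET_ACCESS_KEY=<secret-key>
# Create a 15 KiB object and a test bucket, then time a write and a read through MCG;
# repeat the same commands with the RGW endpoint to compare per-operation latency
dd if=/dev/urandom of=obj15k bs=1K count=15
aws s3 mb s3://perf-test --endpoint-url https://<mcg-s3-route> --no-verify-ssl
time aws s3 cp obj15k s3://perf-test/obj15k --endpoint-url https://<mcg-s3-route> --no-verify-ssl
time aws s3 cp s3://perf-test/obj15k obj15k.out --endpoint-url https://<mcg-s3-route> --no-verify-ssl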
Additional info:
Output of oc cluster-info dump -n openshift-storage --output-directory="ocs-pod-logs" can be found here: http://perf1.perf.lab.eng.bos.redhat.com/shberry/MCG_rgw/