-
Bug
-
Resolution: Unresolved
-
Major
-
rhos-17.1.4
-
None
-
False
-
-
False
-
None
-
-
Description of problem:
We applied the hotfix (HF) for bug #2296989 in the customer's environment to unblock volume-to-image uploads of very large volumes and to work around the limitations introduced by local conversion. The HF worked as expected (RBD images are no longer downloaded locally for conversion), but it appears to have exposed another issue, previously hidden by bug #2296989, that now affects the customer's workflows.
The customer reproduced it by simultaneously creating multiple volumes from the same 50 GB image and then uploading the created volumes to Glance simultaneously. The same pattern was reproduced consistently: 2-3 volumes were uploaded in ~20-30 minutes, while the remaining ones got stuck. Their uploads continue but stay slow, and the bandwidth freed by the successful uploads is not picked up by the ongoing ones.
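The reproduction described above could be scripted roughly as follows. This is a minimal sketch, not the customer's actual procedure: the image name `source-50g`, the `repro-vol-N` / `repro-img-N` names, and the count of 6 are hypothetical placeholders (the report only says "multiple" volumes), and it assumes the `openstack` CLI is installed and credentials are already loaded in the environment.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

IMAGE = "source-50g"   # hypothetical name of the 50 GB Glance image
VOLUME_SIZE_GB = 50    # matches the image size mentioned in the report
COUNT = 6              # hypothetical; the report only says "multiple"


def create_volume_cmd(index):
    """CLI call that creates one volume from the source image."""
    return ["openstack", "volume", "create", "--image", IMAGE,
            "--size", str(VOLUME_SIZE_GB), f"repro-vol-{index}"]


def upload_volume_cmd(index):
    """CLI call that uploads one volume to Glance as a new image."""
    return ["openstack", "image", "create",
            "--volume", f"repro-vol-{index}", f"repro-img-{index}"]


def run_all(commands):
    """Start all commands at the same time and return their exit codes."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode,
                             commands))


def wait_until_available(name, poll_seconds=10):
    """Poll the volume status until it is 'available'."""
    while True:
        status = subprocess.run(
            ["openstack", "volume", "show", name, "-f", "value", "-c", "status"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if status == "available":
            return
        if status == "error":
            raise RuntimeError(f"volume {name} entered error state")
        time.sleep(poll_seconds)


def main():
    # Step 1: create all volumes from the same image simultaneously.
    run_all([create_volume_cmd(i) for i in range(COUNT)])
    for i in range(COUNT):
        wait_until_available(f"repro-vol-{i}")
    # Step 2: request all uploads to Glance simultaneously.
    run_all([upload_volume_cmd(i) for i in range(COUNT)])
```

Calling `main()` runs both steps. While an upload is in progress the volume status should read `uploading`, so re-running `openstack volume show` per volume and noting when each returns to `available` gives a rough per-upload timeline to compare against the 20-30 minutes seen for the uploads that succeed.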
The RHOSP architecture makes network performance problems hard to troubleshoot: cinder-volume owns the client side of a TCP connection, and HAProxy terminates that client connection and proxies it to the Glance backend. We involved the network support group to figure out which part of this chain is not working as expected, and it looks like Glance is the source of the issue (we will share data and follow-ups privately).
We are looking for help from Glance engineering with this problem: we need to debug Glance's interactions with Ceph and its processing of inbound TCP connections, and then draw consistent conclusions from that. This is not possible with standard Glance logs and our regular debugging methods.
Version-Release number of selected component (if applicable): RHOSP 17.1
How reproducible: consistently; simultaneously create multiple volumes from a single Glance image, then upload the volumes to Glance simultaneously
Actual results: a few volumes are uploaded successfully; the remaining ones upload very slowly and are essentially stuck
Expected results: after the first few uploads succeed, the remaining volume uploads use the freed bandwidth and complete instead of staying very slow and essentially blocked
Additional info: will be provided privately