+++ This bug was initially created as a clone of Bug #2238906 +++
Description of problem:
During large (>500GB) content exports, there is a massive increase in the time taken for the export; the extra time is spent in a very long-running postgres query before any export files are written.
This BZ is raised in relation to support case #03609607.
During a 512GB export, 21 hours were spent on the following processes before any export files were written:
100% CPU util for: "postgres: parallel worker for PID 12890"
PID 12890: "postgres: pulp pulpcore ::1(46170) SELECT"
Version-Release number of selected component (if applicable):
6.11.5.4
How reproducible:
Every time
Steps to Reproduce:
Testing has been done on a 512GB export that contains:
Repositories: 27 (both Red Hat and EPEL)
RPM Packages: ~166,000
Size of export: 512GB
Command run: time hammer content-export complete version --content-view testexport --version 2 --organization-id 1 --chunk-size-gb 200
Actual results:
Time to create: ~25.5hrs
Expected results:
Ideally less than 6hrs to create the export.
Additional info:
Please see the support case #03609607 which has a lot more contextual information.
The customer requires the ability to produce 2.1TB+ exports in a timely manner.
Testing of a 2.1TB export that contains:
Repositories: 151 (both Red Hat and 3rd party repos)
RPM Packages: >200,000
Size of export: 2.1TB
Time to create: ~37hrs
The reason we think this needs to be looked at is the much shorter time taken for a smaller ~300GB export:
Repositories: 9 (both Red Hat and EPEL)
RPM Packages: ~130,000
Size of export: 306GB
Time to create: 3hrs 25mins
There is a non-linear growth in the time taken for larger exports.
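As a rough illustration of that non-linearity, here is a quick Python back-of-envelope throughput calculation using the figures quoted in this report (sizes and durations are the rounded values above, so treat the results as indicative only):
-------------------
# Rough GB/hour throughput for the three exports described in this BZ.
# Figures are the rounded sizes/times quoted above; illustrative only.
exports = [
    ("306 GB, 9 repos, ~130k RPMs", 306, 3 + 25 / 60),    # 3 hrs 25 mins
    ("512 GB, 27 repos, ~166k RPMs", 512, 25.5),           # ~25.5 hrs
    ("2.1 TB, 151 repos, >200k RPMs", 2100, 37.0),         # ~37 hrs
]

for label, size_gb, hours in exports:
    print(f"{label}: {size_gb / hours:.1f} GB/hour")

# Approximate output: 89.6, 20.1 and 56.8 GB/hour respectively, so export
# time clearly does not scale linearly with the amount of content exported.
-------------------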
— Additional comment from on 2023-09-15T02:40:08Z
Hi team,
I am enabling the Escalation flag due to the business impact below:
Business Impact
===============
- Australian Signals Directorate (ASD) is a vital member of Australia’s national security community.
- The issue is severely impacting the customer's ability to update to a supported Satellite release.
- It is blocking RHEL9 from being brought into the environment, as a result, this is preventing a large Red Hat OpenStack instance from being upgraded to 17.
- A $3m standard-to-premium support upgrade deal for OpenStack could be at risk, as the customer is questioning the value of Premium Support based on this case (open for 3+ months)
Background
==========
- Red Hat works across the full spectrum of operations required of contemporary signals intelligence and security agencies. This includes intelligence, cyber security and offensive cyber operations in support of the Australian Government and Australian Defence Force (ADF).
Expectation
===========
- A patch or update which will allow for large imports/exports (2-3TB) to be completed within one business day
- Timeline: 31 Oct 2023
Any input/thoughts are highly appreciated!
Kind regards,
Jennie
Senior Escalation Manager
— Additional comment from on 2023-09-28T02:52:46Z
Hao and I have been looking at this issue. We have found that it is actually somewhat tricky to reproduce; you need a certain combination of repos and a certain scale to get there.
For instance, exporting all of these repos will allow the issue to be reproduced:
Red Hat Enterprise Linux 6 Server Kickstart x86_64 6.10
Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server
Red Hat Enterprise Linux 8 for x86_64 - AppStream Kickstart 8.8
Red Hat Enterprise Linux 8 for x86_64 - AppStream RPMs 8
Red Hat Enterprise Linux 8 for x86_64 - BaseOS Kickstart 8.8
Red Hat Enterprise Linux 8 for x86_64 - BaseOS RPMs 8
But removing any one of them prevents the issue from reproducing. I'm not claiming it's the only combination that will trigger this, just that it's what we stumbled across. It may not have anything to do with the specific repos but rather the scale, number of RPMs, etc.
Hao uncovered that the likely culprit is the implementation of "artifacts_for_version" in pulp_rpm. A tweaked implementation yielded significant benefits in a simple python shell test, and we're currently working on testing it with an actual export and working out the best implementation.
— Additional comment from on 2023-09-28T05:45:23Z
Keith, could you apply this patch (courtesy of Hao Yu) to the reproduction box, try another export and report back on how it goes? Keep the slow query logging as we'll still want to see what else is there.
Ina suggested a few potential further improvements to this query, but we know this works pretty well as-is.
diff --git a/pulp_rpm/app/models/repository.py b/pulp_rpm/app/models/repository.py
index dee378f2..d68e4ebe 100644
--- a/pulp_rpm/app/models/repository.py
+++ b/pulp_rpm/app/models/repository.py
@@ -264,11 +264,11 @@ class RpmRepository(Repository):
             django.db.models.QuerySet: The artifacts that are contained within this version.
         """
-        qs = Artifact.objects.filter(content__pk__in=version.content)
+        artifacts_pk = set(Artifact.objects.filter(content__pk__in=version.content).values_list("pulp_id", flat=True))
         for tree in DistributionTree.objects.filter(pk__in=version.content):
-            qs |= tree.artifacts()
+            artifacts_pk |= set(tree.artifacts().values_list("pulp_id", flat=True))
-        return qs
+        return Artifact.objects.filter(pk__in=artifacts_pk)

     class Meta:
         default_related_name = "%(app_label)s_%(model_name)s"
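For readability, this is roughly what artifacts_for_version ends up looking like once the diff above is applied. This is a sketch only: the method actually lives on RpmRepository in pulp_rpm/app/models/repository.py, and the import paths shown are indicative rather than copied from the patched release.
-------------------
from pulpcore.plugin.models import Artifact
from pulp_rpm.app.models import DistributionTree


def artifacts_for_version(self, version):
    """Return the Artifacts contained in the given repository version.

    Instead of OR-ing querysets together (which Django turns into one huge
    SQL statement full of nested subqueries), collect the artifact primary
    keys in Python and issue a single pk__in filter at the end.
    """
    artifacts_pk = set(
        Artifact.objects.filter(content__pk__in=version.content).values_list(
            "pulp_id", flat=True
        )
    )
    for tree in DistributionTree.objects.filter(pk__in=version.content):
        artifacts_pk |= set(tree.artifacts().values_list("pulp_id", flat=True))
    return Artifact.objects.filter(pk__in=artifacts_pk)
-------------------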
— Additional comment from on 2023-09-28T10:25:11Z
On my test machine, with the repository setup mentioned in comment #2, the long query took about 36 mins to complete.
-------------------
2023-09-28 13:56:01 AEST LOG: duration: 2201511.678 ms statement: SELECT "core_artifact"."pulp_id", "core_artifact"."pulp_created", "core_artifact"."pulp_last_updated", "core_artifact"."file", "core_artifact"."size", "core_artifact"."md5", "core_artifact"."sha1", "core_artifact"."sha224", "core_artifact"."sha256", "core_artifact"."sha384", "core_artifact"."sha512", "core_artifact"."timestamp_of_interest" FROM "core_artifact" INNER JOIN "core_contentartifact" ON ("core_artifact"."pulp_id" = "core_contentartifact"."artifact_id") WHERE ("core_contentartifact"."content_id" IN (SELECT V0."pulp_id" FROM "core_content" V0 INNER JOIN "core_repositorycontent" V1 ON (V0."pulp_id" = V1."content_id") WHERE V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '739a98ff-df61-4bbd-9be6-ede4ef23571b'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL)))) OR "core_contentartifact"."content_id" IN (SELECT V0."pulp_id" FROM "core_content" V0 INNER JOIN "core_repositorycontent" V1 ON (V0."pulp_id" = V1."content_id") WHERE (V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '780b8895-47bb-42f5-8297-b86922852eb3'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))) OR V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = 'be28f8de-8143-47f9-98bc-a2de0eacdd13'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))) OR V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = 'da3015d3-1305-4ab1-9cfc-09fd90c990d4'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))) OR V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '5f47f38d-f342-4bdb-8168-b8f151f391a1'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))) OR V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '1dd5c535-8fff-4f58-a1d5-b50a1a6bbc2b'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))))))
-----------------------
Satellite 6.13
Before patching:
- time hammer content-export complete library --organization redhat
[....................................................................................................................................................................] [100%]
Generated /var/lib/pulp/exports/redhat/Export-Library/2.0/2023-09-28T13-19-18-10-00/metadata.json
real 205m38.375s <==========
user 0m20.745s
sys 0m2.636s
- ls -lrth
total 181G
-rw-r--r--. 1 pulp pulp 181G Sep 28 16:33 export-f77cd79e-156a-45ac-b4d1-7d360fcd07ce-20230928_0319.tar.gz
-rw-r--r--. 1 pulp pulp 333 Sep 28 16:44 export-f77cd79e-156a-45ac-b4d1-7d360fcd07ce-20230928_0319-toc.json
-rw-r--r--. 1 root root 4.4K Sep 28 16:44 metadata.json
After patching:
- time hammer content-export complete library --organization redhat
[....................................................................................................................................................................] [100%]
Generated /var/lib/pulp/exports/redhat/Export-Library/3.0/2023-09-28T16-49-38-10-00/metadata.json
real 162m14.007s <========
user 0m20.101s
sys 0m2.515s
- ls -lrth
total 181G
-rw-r--r--. 1 pulp pulp 181G Sep 28 19:19 export-2e95a898-b264-431f-9c16-d4835966ef2f-20230928_0649.tar.gz
-rw-r--r--. 1 pulp pulp 333 Sep 28 19:31 export-2e95a898-b264-431f-9c16-d4835966ef2f-20230928_0649-toc.json
-rw-r--r--. 1 root root 4.4K Sep 28 19:31 metadata.json
So it is about 43 mins quicker after patching.
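For reference, the arithmetic behind the "about 43 mins" figure, using the two "real" times above:
-------------------
# Wall-clock ("real") times from the before/after runs above, in seconds.
before = 205 * 60 + 38.375   # 205m38.375s -> 12338.375 s
after = 162 * 60 + 14.007    # 162m14.007s ->  9734.007 s

saved = before - after       # 2604.368 s
print(f"saved:    {saved / 60:.1f} min")          # ~43.4 min
print(f"relative: {100 * saved / before:.0f}%")   # ~21% faster than unpatched
-------------------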
— Additional comment from on 2023-09-28T11:53:07Z
@dalley these are the patch changes I made based on Ina's suggestion. Can you please double-check?
--- a/app/models/repository.py 2023-06-15 00:35:27.000000000 +1000
+++ b/app/models/repository.py 2023-09-28 20:29:25.492366973 +1000
@@ -272,11 +272,12 @@
             django.db.models.QuerySet: The artifacts that are contained within this version.
         """
-        qs = Artifact.objects.filter(content__pk__in=version.content)
+        content_pks = set(version.content.values_list("pulp_id", flat=True))
         for tree in DistributionTree.objects.filter(pk__in=version.content):
-            qs |= tree.artifacts()
+            if tree.content():
+                content_pks |= set(tree.content().values_list("pulp_id", flat=True))
-        return qs
+        return Artifact.objects.filter(content__pk__in=content_pks)

     class Meta:
         default_related_name = "%(app_label)s_%(model_name)s"
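Rendered as plain Python for comparison with the sketch in comment #3, this variant collects content PKs (including distribution tree content) and then resolves the artifacts with a single filter. Again a sketch only; imports and the surrounding RpmRepository class are omitted as before.
-------------------
def artifacts_for_version(self, version):
    """Variant from this comment: gather content PKs, filter artifacts once."""
    content_pks = set(version.content.values_list("pulp_id", flat=True))
    for tree in DistributionTree.objects.filter(pk__in=version.content):
        # tree.content() is checked for truthiness first, as in the patch above.
        if tree.content():
            content_pks |= set(tree.content().values_list("pulp_id", flat=True))
    return Artifact.objects.filter(content__pk__in=content_pks)
-------------------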
— Additional comment from on 2023-09-28T18:19:12Z
It looks right and gives the correct number of values. The version in comment #3 seems to be faster, though. I'm going to go ahead and merge this upstream and move on to import for a bit.
— Additional comment from on 2023-09-28T22:36:03Z
But please don't wait for the built packages with this patch before re-testing it on the reproduction system
— Additional comment from on 2023-10-03T06:37:51Z
@dalley Thanks, I'll test this in a couple of days. Just waiting on another export to finish, given the day-long cycle time.
— Additional comment from on 2023-10-04T22:31:39Z
I don't want to call this "fixed" yet because it is possible there are other queries which act as bottlenecks, but the fix described in comment #3 is in pulp_rpm 3.17.19 (and 3.18.18, and 3.19.10, and 3.22.5 for other Sat versions).
It's difficult to sync enough content locally to get the issue to reproduce, as my SSD is only 256 GB. So I'm somewhat dependent on the scale tests Keith is performing to learn whether there are other queries acting as bottlenecks beyond this one. We're working on solving that dependency: Ina Panova has set up a cloud system of our own with a bit more space and hundreds of thousands of packages in the DB, which should help us identify poorly performing queries on our own.
— Additional comment from on 2023-10-06T11:21:27Z
Testing of this on my ~512GB export has dramatically improved the export time: it went from 1541 minutes (~25 hrs) down to 113 minutes (<2 hrs).
I did observe that the super long postgres queries (~12 hrs) have disappeared. There are a number of other postgres queries that get written to the log because they run for more than 1 second. I've attached those to the email chain; not sure if I can attach a log file to this BZ.
— Additional comment from on 2023-10-06T11:23:08Z
Created attachment 1992484
Postgres log of queries
Figured out how to attach the log file
— Additional comment from on 2023-10-06T13:27:39Z
Keith, that is excellent news, thanks for the testing! And thank you Hao for contributing this fix.
I'm now satisfied with calling this BZ fixed. I will still take a look at the new query list however since that may yield a few smaller improvements.
As this will generally affect strategic customers, I'm going to suggest backporting to 6.11 and 6.12 as well, since that is where several of them have gotten stuck in their upgrade processes for one reason or another. Of course, we still want those users to upgrade to newer releases, but as seen with ASD, this set of issues (there are a few other related BZs) prevents them from doing so.
— Additional comment from on 2023-10-16T13:59:26Z
Updating the fixed-in-version, as the additional patch provides another ~30% improvement due to improved queries (or more specifically, fewer queries).
— Additional comment from on 2023-10-16T18:24:50Z
These 3 BZs ought to be tested together, as they involve essentially the same setup process:
https://bugzilla.redhat.com/show_bug.cgi?id=2238906 (this one)
https://bugzilla.redhat.com/show_bug.cgi?id=2228592
https://bugzilla.redhat.com/show_bug.cgi?id=2244521
— Additional comment from on 2023-10-18T13:04:24Z
*** Bug 2228592 has been marked as a duplicate of this bug. ***
— Additional comment from on 2023-10-30T11:29:29Z
Bulk setting Target Milestone = 6.15.0 where sat-6.15.0+ is set.
— Additional comment from on 2023-10-30T21:00:30Z
Created attachment 1996266
RHEL 7 Hotfix RPM for Satellite 6.11.5
A hotfix is now available for Satellite 6.11.5 running on RHEL 7
HOTFIX INSTALL INSTRUCTIONS:
1. Take a complete backup or snapshot of Satellite 6.11.5 server
2. Obtain the Hotfix RPM for Satellite 6.11.5 on RHEL 7 from this attachment
3. # yum install ./tfm-pulpcore-python3-pulp-rpm-3.17.20-2.HOTFIXRHBZ2238906.el7pc.noarch.rpm --disableplugin=foreman-protector
4. # satellite-maintain service restart
— Additional comment from on 2023-10-30T21:03:28Z
Created attachment 1996267
RHEL 8 Hotfix RPM for Satellite 6.11.5
A hotfix is now available for Satellite 6.11.5 running on RHEL 8
HOTFIX INSTALL INSTRUCTIONS:
1. Take a complete backup or snapshot of Satellite 6.11.5 server
2. Obtain the Hotfix RPM for Satellite 6.11.5 on RHEL 8 from this attachment
3. # dnf install ./python38-pulp-rpm-3.17.20-2.HOTFIXRHBZ2238906.el8pc.noarch.rpm --disableplugin=foreman-protector
4. # satellite-maintain service restart
— Additional comment from on 2023-11-06T10:54:42Z
Hi Ian! Any plan to move this BZ to be part of 6.15.0 and 6.14.z?
— Additional comment from on 2023-11-06T14:28:07Z
Hi Shweta, it looks like this BZ is already acked for all Satellite versions.
— Additional comment from on 2023-11-16T15:07:16Z
Hello
Any chance we could get hotfixes for Satellite versions other than 6.11?
— Additional comment from on 2023-11-16T15:12:57Z
Ideally users that are already on 6.12 would upgrade to 6.13 or 6.14, which would gain them these patches (and many other related ones that also benefit performance).
— Additional comment from on 2023-11-16T16:42:38Z
I wouldn't mind asking my customer to upgrade if the fixes were released with errata. Unfortunately, that's not the case for this BZ (or the others you linked to be tested together in https://bugzilla.redhat.com/show_bug.cgi?id=2238906#c14).
At this point, upgrading won't help much.
— Additional comment from on 2023-11-17T01:21:52Z
I could have sworn it was already out, but it seems it's just in line for the next z-stream. OK.
It's not a "simple" cherrypick because the performance issues are spread across a couple different BZs or many different commits addressing different root problems for one BZ. The way we handled the hotfix for 6.11 was to ship upstream pulp releases and probably we would do the same here?
— Additional comment from on 2023-11-17T01:26:39Z
What kind of turnaround are they needing? We have 6.14.1 coming out in 3 weeks.
— Additional comment from on 2023-11-17T13:29:14Z
Hello Daniel,
Thanks for taking the time to address my questions.
Currently, they have 6.12 with some patches applied that helped a lot (they were the customer trying the patch on https://bugzilla.redhat.com/show_bug.cgi?id=2228592, which was closed as a duplicate of this BZ).
I wanted to offer them something better than patches, but when discussing with them yesterday they mentioned planning an upgrade to 6.13 (not 6.14 yet).
They're not in a hurry, and I think it should be OK for them to wait until we have an errata.
— Additional comment from on 2023-11-17T15:13:36Z
Great!
There are some "special undocumented options" available which were added when we did these fixes for a different customer, that basically speeds up the export/import beyond what we could manage in a 100% backwards-compatible way (it requires a bit of handholding to enable). I would say try to make do without that, just using the 6.14.1 whenever it comes out, but if the customer is still unhappy we have them available. All of it (the patches, the behavior behind those options, and more) will be default behavior in 6.15 so whenever that comes out they will be well served to upgrade.
QE Tracker for https://issues.redhat.com/browse/SAT-22343
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2257299