  Satellite / SAT-22343 Long running postgres threads during content-export / SAT-23311

[QE] Long running postgres threads during content-export


    • Type: Sub-task
    • Resolution: Done
    • Component: Pulp

      +++ This bug was initially created as a clone of Bug #2238906 +++

      Description of problem:
      During large (>500GB) content exports, there is a massive increase in time taken for the export and this is due to a very long running postgres thread prior to the export files being written.

      This BZ is raised in relation to support case #03609607.

      During a 512GB export, 21 hours were spent in the following processes before any export files were written:
      100% CPU util for: "postgres: parallel worker for PID 12890"
      PID 12890: "postgres: pulp pulpcore ::1946170) SELECT"
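
      One way to spot such a long-running statement while the export task is running is to query pg_stat_activity; a minimal sketch, assuming psycopg2 is available and using placeholder connection parameters for the pulpcore database:

      import psycopg2

      # Sketch only: list the longest-running active statements in the pulpcore DB.
      conn = psycopg2.connect(dbname="pulpcore", user="pulp", host="localhost")  # placeholder credentials
      with conn.cursor() as cur:
          cur.execute(
              """
              SELECT pid, now() - query_start AS runtime, left(query, 120)
              FROM pg_stat_activity
              WHERE state = 'active'
              ORDER BY runtime DESC
              """
          )
          for pid, runtime, query in cur.fetchall():
              print(pid, runtime, query)
      conn.close()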

      Version-Release number of selected component (if applicable):
      6.11.5.4

      How reproducible:
      Every time

      Steps to Reproduce:
      Testing has been done on a 512GB export that contains:
      Repositories: 27 (both Red Hat and EPEL)
      RPM Packages: ~166,000
      Size of export: 512GB
      Command run: time hammer content-export complete version --content-view testexport --version 2 --organization-id 1 --chunk-size-gb 200

      Actual results:
      Time to create: ~25.5hrs

      Expected results:
      Ideally less than 6hrs to create the export.

      Additional info:
      Please see the support case #03609607 which has a lot more contextual information.
      The customer requires the ability to produce 2.1TB+ exports in a timely manner.
      Testing of exporting a 2.1TB export that contains:
      Repositories: 151 (both Red Hat and 3rd party repos)
      RPM Packages: >200,000
      Size of export: 2.1TB
      Time to create: ~37hrs

      We think this needs to be looked at because of the time taken for a smaller, ~300GB export:
      Repositories: 9 (both Red Hat and EPEL)
      RPM Packages: ~130,000
      Size of export: 306GB
      Time to create: 3hrs 25mins
      There is a non-linear growth in the time taken for larger exports.

      — Additional comment from on 2023-09-15T02:40:08Z

      Hi team,

      I am enabling the Escalation flag due to the business impact below:

      Business Impact
      ===============

      • Australian Signals Directorate (ASD) is a vital member of Australia’s national security community.
      • The issue is severely impacting the customer's ability to update to a supported Satellite release.
      • It is blocking RHEL9 from being brought into the environment; as a result, it is preventing a large Red Hat OpenStack instance from being upgraded to 17.
      • A $3m standard-to-premium support upgrade deal for OpenStack could be at risk, as the customer is questioning the value of Premium Support based on this case (open for 3+ months).

      Background
      ==========

      • Red Hat works across the full spectrum of operations required of contemporary signals intelligence and security agencies. This includes intelligence, cyber security and offensive cyber operations in support of the Australian Government and Australian Defence Force (ADF).

      Expectation
      ===========

      • A patch or update which will allow for large imports/exports (2-3TB) to be completed within one business day
      • Timeline: 31 Oct 2023

      Any input or thoughts are highly appreciated!

      Kind regards,
      Jennie
      Senior Escalation Manager

      — Additional comment from on 2023-09-28T02:52:46Z

      Hao and I have been looking at this issue. We have found that it is actually somewhat tricky to reproduce: you need a certain combination of repos and a certain scale to get there.

      For instance, exporting all of these repos will allow the issue to be reproduced:

      Red Hat Enterprise Linux 6 Server Kickstart x86_64 6.10
      Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server
      Red Hat Enterprise Linux 8 for x86_64 - AppStream Kickstart 8.8
      Red Hat Enterprise Linux 8 for x86_64 - AppStream RPMs 8
      Red Hat Enterprise Linux 8 for x86_64 - BaseOS Kickstart 8.8
      Red Hat Enterprise Linux 8 for x86_64 - BaseOS RPMs 8

      But removing any one of them will not. I'm not claiming it's the only combination that will trigger this, just that it's what we stumbled across. It may not have anything to do with the specific repos but rather the scale, number of RPMs, etc.

      Hao uncovered that the likely culprit is the implementation of "artifacts_for_version" in pulp_rpm. A tweaked implementation yielded significant benefits in a simple Python shell test, and we're currently working on testing it with an actual export and working out the best implementation.
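
      As a rough way to compare the original and tweaked implementations, here is a timing sketch that could be run from pulpcore's Django shell (e.g. `pulpcore-manager shell`); the repository name below is a placeholder, and timings will vary with content scale:

      import time

      from pulp_rpm.app.models import RpmRepository

      # Placeholder repository name; any large, fully synced RPM repository will do.
      repo = RpmRepository.objects.get(name="rhel-8-for-x86_64-baseos-rpms")
      version = repo.latest_version()

      start = time.monotonic()
      artifact_count = repo.artifacts_for_version(version).count()  # .count() forces the query to execute
      print(f"{artifact_count} artifacts in {time.monotonic() - start:.1f}s")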

      — Additional comment from on 2023-09-28T05:45:23Z

      Keith, could you apply this patch (courtesy of Hao Yu) to the reproduction box, try another export and report back on how it goes? Keep the slow query logging as we'll still want to see what else is there.

      Ina suggested a few potential further improvements to this query, but we know this works pretty well as-is.

      diff --git a/pulp_rpm/app/models/repository.py b/pulp_rpm/app/models/repository.py
      index dee378f2..d68e4ebe 100644
      --- a/pulp_rpm/app/models/repository.py
      +++ b/pulp_rpm/app/models/repository.py
      @@ -264,11 +264,11 @@ class RpmRepository(Repository):
                   django.db.models.QuerySet: The artifacts that are contained within this version.

               """
      -        qs = Artifact.objects.filter(content__pk__in=version.content)
      +        artifacts_pk = set(Artifact.objects.filter(content__pk__in=version.content).values_list("pulp_id", flat=True))
               for tree in DistributionTree.objects.filter(pk__in=version.content):
      -            qs |= tree.artifacts()
      +            artifacts_pk |= set(tree.artifacts().values_list("pulp_id", flat=True))
      -        return qs
      +        return Artifact.objects.filter(pk__in=artifacts_pk)

           class Meta:
               default_related_name = "%(app_label)s_%(model_name)s"
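
      For reference, the slow-query logging mentioned above can be enabled via PostgreSQL's log_min_duration_statement; a hedged sketch (superuser access and the connection parameters are assumptions, and editing postgresql.conf plus a reload achieves the same thing):

      import psycopg2

      # Sketch only: log every statement that runs longer than one second.
      conn = psycopg2.connect(dbname="pulpcore", user="postgres", host="localhost")  # placeholder credentials
      conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
      with conn.cursor() as cur:
          cur.execute("ALTER SYSTEM SET log_min_duration_statement = '1s'")
          cur.execute("SELECT pg_reload_conf()")
      conn.close()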

      — Additional comment from on 2023-09-28T10:25:11Z

      On my test machine with the repository setup mentioned in comment #2, the long query took about 36 minutes to complete.
      -------------------
      2023-09-28 13:56:01 AEST LOG: duration: 2201511.678 ms statement: SELECT "core_artifact"."pulp_id", "core_artifact"."pulp_created", "core_artifact"."pulp_last_updated", "core_artifact"."file", "core_artifact"."size", "core_artifact"."md5", "core_artifact"."sha1", "core_artifact"."sha224", "core_artifact"."sha256", "core_artifact"."sha384", "core_artifact"."sha512", "core_artifact"."timestamp_of_interest" FROM "core_artifact" INNER JOIN "core_contentartifact" ON ("core_artifact"."pulp_id" = "core_contentartifact"."artifact_id") WHERE ("core_contentartifact"."content_id" IN (SELECT V0."pulp_id" FROM "core_content" V0 INNER JOIN "core_repositorycontent" V1 ON (V0."pulp_id" = V1."content_id") WHERE V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '739a98ff-df61-4bbd-9be6-ede4ef23571b'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL)))) OR "core_contentartifact"."content_id" IN (SELECT V0."pulp_id" FROM "core_content" V0 INNER JOIN "core_repositorycontent" V1 ON (V0."pulp_id" = V1."content_id") WHERE (V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '780b8895-47bb-42f5-8297-b86922852eb3'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))) OR V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = 'be28f8de-8143-47f9-98bc-a2de0eacdd13'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))) OR V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = 'da3015d3-1305-4ab1-9cfc-09fd90c990d4'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))) OR V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '5f47f38d-f342-4bdb-8168-b8f151f391a1'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))) OR V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '1dd5c535-8fff-4f58-a1d5-b50a1a6bbc2b'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))))))
      -----------------------

      Satellite 6.13

      Before patching:

      # time hammer content-export complete library --organization redhat
        [....................................................................................................................................................................] [100%]
        Generated /var/lib/pulp/exports/redhat/Export-Library/2.0/2023-09-28T13-19-18-10-00/metadata.json

      real 205m38.375s <==========
      user 0m20.745s
      sys 0m2.636s

      # ls -lrth
      total 181G
      -rw-r--r--. 1 pulp pulp 181G Sep 28 16:33 export-f77cd79e-156a-45ac-b4d1-7d360fcd07ce-20230928_0319.tar.gz
      -rw-r--r--. 1 pulp pulp  333 Sep 28 16:44 export-f77cd79e-156a-45ac-b4d1-7d360fcd07ce-20230928_0319-toc.json
      -rw-r--r--. 1 root root 4.4K Sep 28 16:44 metadata.json

      After patching:

      # time hammer content-export complete library --organization redhat
        [....................................................................................................................................................................] [100%]
        Generated /var/lib/pulp/exports/redhat/Export-Library/3.0/2023-09-28T16-49-38-10-00/metadata.json

      real 162m14.007s <========
      user 0m20.101s
      sys 0m2.515s

      # ls -lrth
      total 181G
      -rw-r--r--. 1 pulp pulp 181G Sep 28 19:19 export-2e95a898-b264-431f-9c16-d4835966ef2f-20230928_0649.tar.gz
      -rw-r--r--. 1 pulp pulp  333 Sep 28 19:31 export-2e95a898-b264-431f-9c16-d4835966ef2f-20230928_0649-toc.json
      -rw-r--r--. 1 root root 4.4K Sep 28 19:31 metadata.json

      So it is about 43 mins quicker after patching.

      — Additional comment from on 2023-09-28T11:53:07Z

      @dalley these are the patch changes I made based on Ina's suggestion. Can you please double-check?

      --- a/app/models/repository.py  2023-06-15 00:35:27.000000000 +1000
      +++ b/app/models/repository.py  2023-09-28 20:29:25.492366973 +1000
      @@ -272,11 +272,12 @@
                   django.db.models.QuerySet: The artifacts that are contained within this version.

               """
      -        qs = Artifact.objects.filter(content__pk__in=version.content)
      +        content_pks = set(version.content.values_list("pulp_id", flat=True))
               for tree in DistributionTree.objects.filter(pk__in=version.content):
      -            qs |= tree.artifacts()
      +            if tree.content():
      +                content_pks |= set(tree.content().values_list("pulp_id", flat=True))
      -        return qs
      +        return Artifact.objects.filter(content__pk__in=content_pks)

           class Meta:
               default_related_name = "%(app_label)s_%(model_name)s"

      — Additional comment from on 2023-09-28T18:19:12Z

      It looks right and gives the correct number of values. The version in comment #3 seems to be faster, though. I'm going to go ahead and merge this upstream and move on to import for a bit.

      — Additional comment from on 2023-09-28T22:36:03Z

      But please don't wait for built packages with this patch before re-testing it on the reproduction system.

      — Additional comment from on 2023-10-03T06:37:51Z

      @dalley Thanks, I'll test this in a couple of days. Just waiting on another export to finish and the day long cycle time.

      — Additional comment from on 2023-10-04T22:31:39Z

      I don't want to call this "fixed" yet because it is possible there are other queries which act as bottlenecks, but the fix described in comment #3 is in pulp_rpm 3.17.19 (and 3.18.18, and 3.19.10, and 3.22.5 for other Sat versions).

      It's difficult to sync enough content locally to get the issue to reproduce, as my SSD is only 256 GB. So I'm somewhat dependent on the scale tests Keith is performing to learn if there are other queries acting as bottlenecks beyond this one. We're working on solving that dependency; Ina Panova has set up a cloud system of our own with a bit more space and hundreds of thousands of packages in the DB, which should help with identifying poorly performing queries on our own.
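
      One hedged way to surface the most expensive statements on such a test box is the pg_stat_statements extension (assuming it is installed and created in the pulpcore database; the column name below is for PostgreSQL 13+, older releases use total_time instead of total_exec_time):

      import psycopg2

      # Sketch only: show the ten statements with the highest cumulative execution time.
      conn = psycopg2.connect(dbname="pulpcore", user="postgres", host="localhost")  # placeholder credentials
      with conn.cursor() as cur:
          cur.execute(
              """
              SELECT calls, round(total_exec_time) AS total_ms, left(query, 100)
              FROM pg_stat_statements
              ORDER BY total_exec_time DESC
              LIMIT 10
              """
          )
          for calls, total_ms, query in cur.fetchall():
              print(calls, total_ms, query)
      conn.close()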

      — Additional comment from on 2023-10-06T11:21:27Z

      Testing of this on my ~512GB export has dramatically improved the export time. It went from 1541 minutes (~25 hrs) down to 113 minutes (<2 hrs).

      I did observe that the super long postgres queries (~12 hrs) have disappeared. There are a number of other postgres queries that get put into the log as they run for more than 1 second. I've attached those to the email chain; not sure if I can attach a log file to this BZ.

      — Additional comment from on 2023-10-06T11:23:08Z

      Created attachment 1992484
      Postgres log of queries

      Figured out how to attach the log file

      — Additional comment from on 2023-10-06T13:27:39Z

      Keith, that is excellent news, thanks for the testing! And thank you Hao for contributing this fix.

      I'm now satisfied with calling this BZ fixed. I will still take a look at the new query list, however, since that may yield a few smaller improvements.

      As this will generally affect strategic customers, I'm going to suggest backporting to 6.11 and 6.12 as well, since that is where several of them have gotten stuck in their upgrade processes for one reason or another. Of course, we still want those users to upgrade to newer releases, but as seen with ASD, this set of issues (there are a few other related BZs) prevents them from doing so.

      — Additional comment from on 2023-10-16T13:59:26Z

      Updating the fixed-in-version as the additional patch provides another ~30% improvement due to improved queries (or more specifically, fewer queries).

      — Additional comment from on 2023-10-16T18:24:50Z

      These 3 BZs ought to be tested together, as they involve essentially the same setup process:

      https://bugzilla.redhat.com/show_bug.cgi?id=2238906 (this one)
      https://bugzilla.redhat.com/show_bug.cgi?id=2228592
      https://bugzilla.redhat.com/show_bug.cgi?id=2244521

      — Additional comment from on 2023-10-18T13:04:24Z

      *** Bug 2228592 has been marked as a duplicate of this bug. ***

      — Additional comment from on 2023-10-30T11:29:29Z

      Bulk setting Target Milestone = 6.15.0 where sat-6.15.0+ is set.

      — Additional comment from on 2023-10-30T21:00:30Z

      Created attachment 1996266
      RHEL 7 Hotfix RPM for Satellite 6.11.5

      A hotfix is now available for Satellite 6.11.5 running on RHEL 7

      HOTFIX INSTALL INSTRUCTIONS:

      1. Take a complete backup or snapshot of Satellite 6.11.5 server

      2. Obtain the Hotfix RPM for Satellite 6.11.5 on RHEL 7 from this attachment

      3. # yum install ./tfm-pulpcore-python3-pulp-rpm-3.17.20-2.HOTFIXRHBZ2238906.el7pc.noarch.rpm --disableplugin=foreman-protector

      4. # satellite-maintain service restart

      — Additional comment from on 2023-10-30T21:03:28Z

      Created attachment 1996267
      RHEL 8 Hotfix RPM for Satellite 6.11.5

      A hotfix is now available for Satellite 6.11.5 running on RHEL 8

      HOTFIX INSTALL INSTRUCTIONS:

      1. Take a complete backup or snapshot of Satellite 6.11.5 server

      2. Obtain the Hotfix RPM for Satellite 6.11.5 on RHEL 8 from this attachment

      3. # dnf install ./python38-pulp-rpm-3.17.20-2.HOTFIXRHBZ2238906.el8pc.noarch.rpm --disableplugin=foreman-protector

      4. # satellite-maintain service restart

      — Additional comment from on 2023-11-06T10:54:42Z

      Hi Ian! Any plan to move this BZ to be part of 6.15.0 and 6.14.z?

      — Additional comment from on 2023-11-06T14:28:07Z

      Hi Shweta, it looks like this BZ is already acked for all Satellite versions.

      — Additional comment from on 2023-11-16T15:07:16Z

      Hello

      Any chance we could get hotfixes for Satellite versions other than 6.11?

      — Additional comment from on 2023-11-16T15:12:57Z

      Ideally users that are already on 6.12 would upgrade to 6.13 or 6.14, which would gain them these patches (and many other related ones that also benefit performance).

      — Additional comment from on 2023-11-16T16:42:38Z

      I wouldn't mind asking my customer to upgrade, if the fixes were released with errata. Unfortunately, that's not the case for this BZ (or the others you linked to be tested together on https://bugzilla.redhat.com/show_bug.cgi?id=2238906#c14 )

      At this point, upgrading won't help much.

      — Additional comment from on 2023-11-17T01:21:52Z

      I could have sworn it was already out, but it seems it's just in line for the next z-stream. OK.

      It's not a "simple" cherrypick because the performance issues are spread across a couple of different BZs, or many different commits addressing different root problems for one BZ. The way we handled the hotfix for 6.11 was to ship upstream pulp releases, and we would probably do the same here?

      — Additional comment from on 2023-11-17T01:26:39Z

      What kind of turnaround are they needing? We have 6.14.1 coming out in 3 weeks.

      — Additional comment from on 2023-11-17T13:29:14Z

      Hello Daniel,

      Thanks for taking time to address my questions.

      Currently, they are on 6.12 with some patches applied that helped a lot (they were the customer trying the patch on https://bugzilla.redhat.com/show_bug.cgi?id=2228592, which was closed as a duplicate of this BZ).

      I wanted to offer them something better than patches. But when discussing with them yesterday, they mentioned planning an upgrade to 6.13 (not 6.14 yet).

      They're not in a hurry, and I think it should be OK for them to wait until we have an erratum.

      — Additional comment from on 2023-11-17T15:13:36Z

      Great!

      There are some "special undocumented options" available, added when we did these fixes for a different customer, that basically speed up the export/import beyond what we could manage in a 100% backwards-compatible way (they require a bit of handholding to enable). I would say try to make do without them, just using 6.14.1 whenever it comes out, but if the customer is still unhappy, we have them available. All of it (the patches, the behavior behind those options, and more) will be default behavior in 6.15, so whenever that comes out they will be well served to upgrade.

      QE Tracker for https://issues.redhat.com/browse/SAT-22343
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2257299

              rhn-support-shwsingh Shweta Singh (Inactive)
              satellite-focaccia-bot Focaccia Bot