Project Quay / PROJQUAY-607

changing SERVER_HOSTNAME triggers storage replication and 100% database CPU


      We noticed that our Quay environment was getting slower and slower again.
      Browsing the UI even returned a lot of "500 Internal Server Error" responses.

      Upon further investigation it turned out that the database (Azure PostgreSQL, 4 cores) was running at a constant 100% CPU load.

      Here is what we think happened and what we have done so far:

      • On April 2nd we changed the value of "SERVER_HOSTNAME" in the Quay config file.
        Reason: Up to now we had used quay-azr.cloud.internal as the hostname, to indicate that this Quay instance runs on Azure.
        Since we plan to use only this environment for now, it was decided to rename it to the more generic name "quay.cloud.internal".
        So I generated a new certificate (which covers both the old and the new name) and changed that parameter.
      • On Friday we rolled out a change to the Quay infrastructure and, due to a mistake, also enabled the PROXY_STORAGE feature again.
      • On Sunday we noticed that Quay was nearly unusable.
      • We analyzed what was going on in the DB to figure out what was causing the high load, and it turned out that the queueitem table contained 1.2 million rows that appear to belong to the storage replication feature.

      Example:

      2020-04-17 00:46:36.118757  t  5  cb30ec1f-3b1f-4000-b4cd-c311aee1df46  9404907  imagestoragereplication/c2969e2d-2b49-417c-868a-cda2d9751456/  {"namespace_user_id": 9, "storage_id": "c2969e2d-2b49-417c-868a-cda2d9751456"}
      2020-04-17 00:46:50.402728  t  5  d76b426d-40de-44e4-8b20-6c3113998077  9404908  imagestoragereplication/f6eae9ec-af0a-486f-ab33-bc3f84c95d11/  {"namespace_user_id": 9, "storage_id": "f6eae9ec-af0a-486f-ab33-bc3f84c95d11"}
      2020-04-17 00:46:50.402756  t  5  ab6e2d92-4b92-491a-93ad-881a50d9cf7e  9404909  imagestoragereplication/32402315-1a79-427b-8335-ff7f4affa35d/  {"namespace_user_id": 9, "storage_id": "32402315-1a79-427b-8335-ff7f4affa35d"}
      2020-04-17 00:46:50.402772  t  5  7052f456-5031-4d01-9449-3511beff669a  9404910  imagestoragereplication/8c67cfbf-3cb1-457e-99f2-240f4329b343/  {"namespace_user_id": 9, "storage_id": "8c67cfbf-3cb1-457e-99f2-240f4329b343"}
      2020-04-17 00:46:50.402788  t  5  0ae07b38-24c0-433d-9dc4-e09fcf3d290b  9405002  imagestoragereplication/8b8023ff-3d49-45d7-83b4-4e51b6bd467e/  {"namespace_user_id": 9, "storage_id": "8b8023ff-3d49-45d7-83b4-4e51b6bd467e"}
      2020-04-17 00:46:59.231941  t  5  4a45a00f-9489-4757-a8da-7b3ba4f03503  9405003  imagestoragereplication/3f826614-7083-4946-aeea-1d4693e842b4/  {"namespace_user_id": 9, "storage_id": "3f826614-7083-4946-aeea-1d4693e842b4"}
      2020-04-17 00:46:59.23199   t  5  91f9a43a-b1c9-40b9-b097-27114d85bbb8  9405004  imagestoragereplication/dbdb094c-cdfa-48ed-b74e-7ef933238765/  {"namespace_user_id": 9, "storage_id": "dbdb094c-cdfa-48ed-b74e-7ef933238765"}
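
      For reference, queries along the following lines show where the database load is coming from and how many replication items are queued (a sketch only, assuming PostgreSQL's pg_stat_activity view and the stock Quay queueitem schema):

        -- Show which statements are keeping the database busy:
        SELECT pid, state, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state <> 'idle'
        ORDER BY runtime DESC;

        -- Count the queued storage replication work items:
        SELECT count(*)
        FROM queueitem
        WHERE queue_name LIKE 'imagestoragereplication/%';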

      We have always had the flag FEATURE_STORAGE_REPLICATION set to true, as preparation in case we added more backends later, but so far we have only ever used one Azure Storage Account as the default backend.

      In order to get Quay working again we did the following (see the sketch after this list):

      • Stop all containers
      • Dump the queueitem table to a file
      • Set FEATURE_STORAGE_REPLICATION=false
      • Delete all storage replication rows from the queueitem table
      • Start the containers again
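
      For illustration, the dump and delete steps correspond roughly to the following (a sketch assuming psql against the Azure Postgres database and the stock queueitem schema, not the exact commands that were run):

        -- Back up the affected rows before deleting them (psql client-side copy, single line):
        \copy (SELECT * FROM queueitem WHERE queue_name LIKE 'imagestoragereplication/%') TO 'queueitem_replication.csv' CSV HEADER

        -- Remove the queued storage replication work items:
        DELETE FROM queueitem
        WHERE queue_name LIKE 'imagestoragereplication/%';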

      This has helped a lot, as you can see in the attached graph.
      The graph covers the last 30 days and shows nicely how the load changed.

      So please advise:

      • Why was the storage replication process triggered by the changes we made?
      • What did the replication actually try to do?
      • Why would enabling the PROXY_STORAGE feature cause such a high load?
      • Were the fixes we applied the other day good, or did we cause more trouble?
      • Is there more stuff in the DB that now needs cleanup? (Entries for replication, storagelocation, imagestorage, etc.; see the checks sketched below.)
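
      For context, these are the kinds of read-only checks we could run to see what replication-related state is left, if you can confirm which tables matter (table names below are assumed from the stock Quay schema):

        -- Leftover replication queue items:
        SELECT count(*) FROM queueitem WHERE queue_name LIKE 'imagestoragereplication/%';

        -- Configured storage locations:
        SELECT * FROM imagestoragelocation;

        -- Image placements per storage location:
        SELECT location_id, count(*)
        FROM imagestorageplacement
        GROUP BY location_id;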

      Where are you experiencing the behavior? What environment?
      Production
