Project Quay / PROJQUAY-5502

Regular Quay slowness / server errors disrupt OSUS operations


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • quay.io

      The RH OSUS (aka Cincinnati) instance periodically scrapes the quay.io/openshift-release-dev/ocp-release repo to discover new OCP releases to be added to the OCP upgrade graph. We are seeing a regular event on Monday mornings where the scrapes partly time out and partly fail with various 50x errors:

      [2023-04-17T11:51:59Z INFO graph_builder::graph] graph update triggered
      [2023-04-17T11:56:59Z ERROR graph_builder::graph] Processing all plugins with a timeout of 300s
      [2023-04-17T11:56:59Z ERROR graph_builder::graph] deadline has elapsed
      [2023-04-17T12:01:59Z INFO graph_builder::graph] graph update triggered
      [2023-04-17T12:03:47Z ERROR graph_builder::graph] failed to fetch all release metadata from quay.io/openshift-release-dev/ocp-release
      [2023-04-17T12:03:47Z ERROR graph_builder::graph] fetching manifest and manifestref for openshift-release-dev/ocp-release:4.12.0-0.nightly-multi-2022-08-08-134208: unexpected HTTP status 504 Gateway Timeout
      [2023-04-17T12:08:47Z INFO graph_builder::graph] graph update triggered
      [2023-04-17T12:09:45Z INFO graph_builder::graph] graph update completed, 7169 valid releases
      

      The actual scrape is known to be a quite expensive operation (quay.io/openshift-release-dev/ocp-release has over 7k tags), but it is performed periodically and steadily, without any peaks on our side (it is not request-driven and cannot be triggered externally), and we only see these disruptions in a fairly deterministic time window every week, which hints at interference from some kind of batch job scheduled in that window.
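      For illustration, here is a rough, hypothetical probe of what the scrape amounts to on the registry side: a paginated walk of the v2 tag list for the repo. The anonymous token flow, page size, and pagination parameters follow the standard registry v2 API and are my assumptions, not something lifted from the Cincinnati scraper:

      # Rough probe of the tag-listing cost on quay.io; endpoint paths, the anonymous
      # token flow and the pagination parameters follow the registry v2 spec and are
      # assumptions, not taken from Cincinnati's code.
      import time
      import requests

      REPO = "openshift-release-dev/ocp-release"

      def anon_pull_token():
          # Anonymous pull token for the public repo, via the token endpoint that
          # quay.io advertises in its WWW-Authenticate header.
          resp = requests.get(
              "https://quay.io/v2/auth",
              params={"service": "quay.io", "scope": f"repository:{REPO}:pull"},
              timeout=30,
          )
          resp.raise_for_status()
          return resp.json()["token"]

      def walk_tags(page_size=100):
          headers = {"Authorization": f"Bearer {anon_pull_token()}"}
          url = f"https://quay.io/v2/{REPO}/tags/list"
          last, total = None, 0
          while True:
              params = {"n": page_size}
              if last:
                  params["last"] = last
              start = time.monotonic()
              resp = requests.get(url, params=params, headers=headers, timeout=60)
              elapsed = time.monotonic() - start
              print(f"{resp.status_code} in {elapsed:.2f}s ({total} tags so far)")
              if resp.status_code != 200:
                  break  # the 5xx class of failure from the log shows up here
              tags = resp.json().get("tags") or []
              if not tags:
                  break
              total += len(tags)
              last = tags[-1]
          return total

      if __name__ == "__main__":
          print(f"listed {walk_tags()} tags")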

      The affected time window seems to be between 09:00Z and 12:00Z every Monday. We do not see other regular disruptions like this, but the Monday one is very reliable.
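      A sketch of a standalone probe that could confirm the window: fetch a single manifest (the operation that returned 504 above) every few minutes and log UTC time, weekday, status, and latency. The tag, interval, and accepted media types are illustrative assumptions:

      # Periodic probe of a single manifest fetch, logging UTC time and weekday so
      # failures can be correlated with the Monday 09:00Z-12:00Z window.
      import time
      from datetime import datetime, timezone
      import requests

      REPO = "openshift-release-dev/ocp-release"
      TAG = "4.12.17-x86_64"  # hypothetical tag choice; pick any existing one
      ACCEPT = ", ".join([
          "application/vnd.docker.distribution.manifest.list.v2+json",
          "application/vnd.docker.distribution.manifest.v2+json",
      ])

      def anon_pull_token():
          resp = requests.get(
              "https://quay.io/v2/auth",
              params={"service": "quay.io", "scope": f"repository:{REPO}:pull"},
              timeout=30,
          )
          resp.raise_for_status()
          return resp.json()["token"]

      while True:
          now = datetime.now(timezone.utc)
          start = time.monotonic()
          try:
              resp = requests.get(
                  f"https://quay.io/v2/{REPO}/manifests/{TAG}",
                  headers={"Authorization": f"Bearer {anon_pull_token()}",
                           "Accept": ACCEPT},
                  timeout=60,
              )
              outcome = str(resp.status_code)
          except requests.RequestException as exc:
              outcome = f"error: {exc}"
          elapsed = time.monotonic() - start
          print(f"{now:%Y-%m-%dT%H:%M:%SZ} {now:%a} {outcome} {elapsed:.2f}s", flush=True)
          time.sleep(300)  # arbitrary 5-minute interval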

      I have also created OTA-971 on our side to make Cincinnati use webhooks instead of periodic listing. That will reduce OSUS sensitivity to the problem, but there are other consumers of that repo (like CI) which are affected by it, so it should be investigated on the Quay side, too.
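      For completeness, the rough shape of the webhook-driven approach in OTA-971 would be a small receiver for Quay repository-push notifications that hands only the changed tags to the graph builder. The payload field names below ("repository", "updated_tags") are my recollection of Quay's notification payload and need verification against the Quay docs; the in-process queue is just a stand-in for the real handoff:

      # Minimal sketch of a webhook receiver for Quay "repository push" notifications;
      # payload fields are assumed, the queue stands in for feeding the graph builder.
      import json
      from http.server import BaseHTTPRequestHandler, HTTPServer
      from queue import Queue

      new_tags: Queue = Queue()  # tags the graph builder should (re)scrape

      class QuayPushHandler(BaseHTTPRequestHandler):
          def do_POST(self):
              length = int(self.headers.get("Content-Length", 0))
              try:
                  event = json.loads(self.rfile.read(length) or b"{}")
              except json.JSONDecodeError:
                  self.send_response(400)
                  self.end_headers()
                  return
              repo = event.get("repository", "")
              for tag in event.get("updated_tags", []):
                  new_tags.put((repo, tag))
              self.send_response(204)  # acknowledge quickly; process asynchronously
              self.end_headers()

      if __name__ == "__main__":
          # In a real deployment this would sit behind auth/TLS and feed Cincinnati's
          # graph builder instead of an in-process queue.
          HTTPServer(("0.0.0.0", 8080), QuayPushHandler).serve_forever()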

      Filed as a follow-up to a Slack thread: https://redhat-internal.slack.com/archives/C7WH69HCY/p1679913378496759

              Assignee: Unassigned
              Reporter: Petr Muller (afri@afri.cz)
              Votes: 0
              Watchers: 2
