
RHCLOUD-32866: [Spike] Investigate changes to the upstream apicast base image


      Spike on changes to the upstream apicast base image to verify there wasn't a change that increased CPU consumption or other resource usage.

      UPDATE 07-04-2024

      The root cause of the high latency has been determined. This PR in our APIcast fork, https://github.com/RedHatInsights/APIcast/pull/11, synced the fork with the upstream repo. It pulled in an upstream change that broke the functionality for disabling auto refresh of the apicast configuration.

      Right now we configure apicast with the following:

      APICAST_CONFIGURATION_CACHE: -1
      APICAST_CONFIGURATION_LOADER: boot
      

      Per the apicast docs, this is a valid combination: https://github.com/3scale/APIcast/blob/master/doc/parameters.md
      It means: load the configuration once on boot, and never auto refresh it.

      This bit of code: https://github.com/RedHatInsights/APIcast/pull/11/files#diff-69edea98a4b41fba6e3d5f4fdcb9867158d5bd38eee86b79c2497f9944f482e9R207-R221 broke that behavior by calling the "schedule" method (line 218) once on startup, passing the "handler" method as the scheduled callback. The "handler" method then calls itself recursively at the specified "interval", which is set from APICAST_CONFIGURATION_CACHE in our env: https://github.com/RedHatInsights/APIcast/pull/11/files#diff-69edea98a4b41fba6e3d5f4fdcb9867158d5bd38eee86b79c2497f9944f482e9R204
      With the interval set to -1, apicast calls openresty's "ngx.timer.at" method, which schedules the task to run after <interval> seconds. It appears to interpret -1 as 0, i.e. no delay. ngx.timer.at docs: https://github.com/openresty/lua-nginx-module?tab=readme-ov-file#ngxtimerat
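
      A minimal Lua sketch of the pattern (simplified; the "schedule" and "handler" names follow the PR, but the bodies here are illustrative, not the actual fork code):

      local _M = {}

      -- Schedule "handler" to run after "interval" seconds on a fresh timer.
      -- ngx.timer.at spawns a new Lua coroutine for every timer it fires.
      local function schedule(interval, handler, ...)
        local ok, err = ngx.timer.at(interval, handler, ...)
        if not ok then
          ngx.log(ngx.ERR, "failed to schedule config refresh: ", err)
        end
      end

      -- Timer callback: reload config, then re-arm itself at the same interval.
      local function handler(premature, interval)
        if premature then return end

        -- ... reload the apicast configuration here ...

        -- With interval == -1 (observed to behave like a 0-second delay),
        -- this re-arms immediately, spawning a new coroutine on every pass.
        schedule(interval, handler, interval)
      end

      function _M.init(interval)
        -- Called once on startup; kicks off the recursive chain.
        schedule(interval, handler, interval)
      end

      return _M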

      This results in high latency because ngx.timer.at spawns a new Lua coroutine on every call. This appears to consume our CPU up to its limit even when the gateway is completely idle (shown by deploying in ephemeral and monitoring metrics). That likely starves all other coroutines (like the ones spawned to service requests), which eventually time out since there is no CPU left to handle them. We think this is why calls to the cloudwatch aggregator hung indefinitely and crashed the service, since it probably couldn't respond to k8s health checks.

      Immediate mitigation: we can set "APICAST_CONFIGURATION_CACHE" to a very high value, 30 days (2592000 seconds), to prevent the refresh loop from running nonstop with no delay; the change is shown below. Since we don't actually use the auto-refreshed configuration, it doesn't matter how frequently it runs.
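
      With the mitigation applied, the env config from above becomes:

      APICAST_CONFIGURATION_CACHE: 2592000
      APICAST_CONFIGURATION_LOADER: boot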

      Root cause fix: what we need is for apicast to do its boot-time magic once only, then schedule auto refresh completely separately. As it stands, these conditions are never actually checked: https://github.com/RedHatInsights/APIcast/pull/11/files#diff-69edea98a4b41fba6e3d5f4fdcb9867158d5bd38eee86b79c2497f9944f482e9R204 A sketch of one possible fix follows.
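
      One possible shape for the fix, as a Lua sketch (assumes the same "schedule"/"handler" names as the sketch above; "load_configuration" is a hypothetical stand-in for the boot-time load, not the actual fork code):

      function _M.init(interval)
        -- Always perform the one-time load on boot.
        load_configuration()

        -- Only arm the auto refresh loop for a positive interval;
        -- -1 means "never refresh", so no timer is scheduled at all.
        if interval and interval > 0 then
          schedule(interval, handler, interval)
        end
      end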
