-
Sub-task
-
Resolution: Done
-
Normal
-
None
-
False
-
-
False
-
Unset
-
-
-
Access & Management Sprint 89, Access & Management Sprint 90, Access & Management Sprint 91, Access & Management Sprint 92, Access & Management Sprint 93, Access & Management Sprint 94, A&M Tech Debt Q10, Access & Management Sprint 95, Access & Management Sprint 96, Access & Management Sprint 97, Access & Management Sprint 98, Access & Management Sprint 99, Access & Management Sprint 100
Spike on changes to the upstream APIcast base image to make sure there wasn't a change that increased CPU consumption, etc.
UPDATE 07-04-2024
The root cause of the high latency has been determined. This PR in our fork of APIcast: https://github.com/RedHatInsights/APIcast/pull/11 synced our fork with the upstream repo. This pulled in an upstream change that broke the functionality for disabling auto-refresh of the APIcast configuration.
Right now we configure apicast with the following:
APICAST_CONFIGURATION_CACHE: -1
APICAST_CONFIGURATION_LOADER: boot
Per apicast docs, this is valid: https://github.com/3scale/APIcast/blob/master/doc/parameters.md
This means we want the config loaded once at boot and never auto-refreshed.
This bit of code: https://github.com/RedHatInsights/APIcast/pull/11/files#diff-69edea98a4b41fba6e3d5f4fdcb9867158d5bd38eee86b79c2497f9944f482e9R207-R221 broke that functionality by calling the "schedule" method (line 218) once on startup, passing the "handler" method as the scheduled callback. The "handler" method then calls itself recursively at the specified "interval", which is set from APICAST_CONFIGURATION_CACHE in our env: https://github.com/RedHatInsights/APIcast/pull/11/files#diff-69edea98a4b41fba6e3d5f4fdcb9867158d5bd38eee86b79c2497f9944f482e9R204
With the interval set to -1, apicast calls the "ngx.timer.at" method from OpenResty, which takes the interval var and schedules the task with a delay of <interval> seconds. It appears to interpret -1 as 0, i.e. no delay. ngx.timer.at docs: https://github.com/openresty/lua-nginx-module?tab=readme-ov-file#ngxtimerat
This results in high latency because ngx.timer.at spawns a new Lua coroutine on every call. This appears to consume CPU up to the container limit even when the gateway is completely idle (shown by deploying in ephemeral and monitoring metrics). This likely causes all other coroutines (like the ones spawned to service requests) to eventually time out, as there is no CPU left to handle them. We think this is why calls to the cloudwatch aggregator hung indefinitely and crashed the service, since it probably couldn't respond to k8s health checks.
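To make the failure mode concrete, here is a hypothetical Python model of the behaviour described above (the real code is Lua inside APIcast). The `schedule` function and timer queue are stand-ins for ngx.timer.at, assuming a negative delay behaves as 0:

```python
pending = []     # stand-in timer queue instead of real ngx.timer.at timers
refreshes = 0

def schedule(delay, callback):
    # assumption: like ngx.timer.at appears to, treat a negative delay as 0
    pending.append((max(delay, 0), callback))

def make_handler(interval):
    def handler():
        global refreshes
        refreshes += 1                 # one config refresh per timer firing
        schedule(interval, handler)    # recursive re-scheduling, as in the PR
    return handler

# on boot, the PR schedules the handler once with the interval taken
# from APICAST_CONFIGURATION_CACHE, which is -1 in our env
schedule(-1, make_handler(-1))

# drain the queue for a bounded number of steps: every firing immediately
# enqueues another zero-delay firing, so the loop never goes idle
for _ in range(10_000):
    delay, cb = pending.pop()
    assert delay == 0   # no pause between refreshes -> CPU pegged
    cb()

print(refreshes)  # 10000 back-to-back refreshes with zero delay between them
```

In the real gateway each firing also spawns a fresh coroutine, which is what drives CPU to its limit even when idle.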
Immediate mitigation: we can set "APICAST_CONFIGURATION_CACHE" to a very high value, 30 days / 2592000 seconds, to prevent the refresh from running nonstop with no delay. Since we don't actually use the auto-refreshed configuration, it doesn't matter how frequently this runs.
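As a sketch, the mitigated configuration would look like this (same env vars as the apicast docs linked above, with only the cache value changed):

```
APICAST_CONFIGURATION_LOADER: boot
APICAST_CONFIGURATION_CACHE: 2592000   # 30 days; refresh still scheduled, but harmless
```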
Root cause fix: What we need is for apicast to do its boot-time load once only, and then schedule auto refresh completely separately. As it stands, these conditions are never actually checked: https://github.com/RedHatInsights/APIcast/pull/11/files#diff-69edea98a4b41fba6e3d5f4fdcb9867158d5bd38eee86b79c2497f9944f482e9R204
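A minimal sketch of the intended fix, in Python for illustration (names here are hypothetical, not the actual APIcast Lua identifiers): load once at boot, and only arm the refresh timer when the interval is a real, positive value.

```python
def boot(load_config, schedule, interval):
    load_config()          # one-shot load at startup (loader = "boot")
    if interval > 0:
        # only arm auto refresh for a positive interval; -1 ("never
        # refresh") and 0 no longer schedule the timer at all
        schedule(interval)

# with our settings (interval = -1), no refresh timer is ever scheduled
armed = []
boot(load_config=lambda: None, schedule=armed.append, interval=-1)
print(armed)  # [] -> nothing armed, config loaded once and left alone
```

With a guard like this, the recursive "handler" path is simply never entered for our configuration, instead of firing with zero delay.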