Bug
Resolution: Cannot Reproduce
Normal
4.16.z
Quality / Stability / Reliability
Description of problem:
The OOM killer is repeatedly killing the OpenShift `keda-operator` pod.
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
No
Steps to Reproduce:
1. 2. 3.
Actual results:
The OOM killer is killing the `keda-operator` pod multiple times.
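For reference, one way to confirm the OOM kills could be to check the container status, e.g. with a query along the lines of the sketch below (the `app=keda-operator` label selector is an assumption and may need adjusting); `lastState.terminated.reason` should report `OOMKilled` for the affected container:
~~~
# NOTE: sketch only; the label selector app=keda-operator is an assumption.
$ oc get pods -n openshift-keda -l app=keda-operator \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'
~~~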
Expected results:
The `keda-operator` pod should run without any issue.
Additional info:
- Custom Metrics Autoscaler Operator version: 2.15.1-6
- In the keda-operator pod logs, I can see the following errors (a reachability check for the Thanos endpoint is sketched after this list):
~~~
$ oc logs keda-operator-xx
2025-05-30T08:09:26.903239942Z 2025-05-30T08:09:26Z ERROR scale_handler error getting metric for trigger {"scaledObject.Namespace": "distribox-pdt", "scaledObject.Name": "scaledobject-distribox-proxy-csz", "trigger": "prometheusScaler", "error": "Get \"http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=avg%28rate%28nginx_http_requests_total%7Bnamespace%3D~%22distribox-pdt%22%2C+pod_name%3D~%22distribox-proxy-csz.%2A%22%2Cprometheus%3D%22tooling%22%7D%5B5m%5D%29%29&time=2025-05-30T08:09:23Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
2025-05-30T08:09:26.903239942Z /src/pkg/scaling/scale_handler.go:553
2025-05-30T08:09:27.560969117Z /src/pkg/scaling/scale_handler.go:524
2025-05-30T08:09:27.589319812Z 2025-05-30T08:09:27Z ERROR scale_handler error getting metric for trigger {"scaledObject.Namespace": "lcp-qcp1", "scaledObject.Name": "gtw-so", "trigger": "jboss_thread_pool_active_count", "error": "Get \"http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=sum%28jboss_thread_pool_active_count%7Bapp_instance%3D~%22gtw%22%2C+container_name%3D~%22gtw%22%2C+pod_name%3D~%22gtw-rollout-.%2A%22%2C+phase%3D%22DEV%22%2C+name%3D%22ama-thread-pool%22%2C+namespace%3D%22lcp-qcp1%22%7D%29&time=2025-05-30T08:09:24Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
2025-05-30T08:09:27.589319812Z /src/pkg/scaling/scale_handler.go:553
2025-05-30T08:09:29.209557987Z /src/pkg/scaling/scale_handler.go:758
2025-05-30T08:09:29.209557987Z /src/pkg/scaling/scale_handler.go:633
2025-05-30T08:09:29.209557987Z 2025-05-30T08:09:29Z ERROR scale_handler error getting scale decision {"scaledObject.Namespace": "lcp-qcp", "scaledObject.Name": "lcp-so", "scaler": "edge_client_outbound_http_freeChannels", "error": "Get \"http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=sum%28edge_client_outbound_http_freeChannels%7Bcontainer_name%3D%22edge%22%2C+pod_name%3D~%22lcp-rollout-.%2A%22%2C+phase%3D%22DEV%22%2C+namespace%3D%22lcp-qcp%22+%7D%29&time=2025-05-30T08:09:26Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
2025-05-30T08:09:29.712190063Z 2025-05-30T08:09:29Z ERROR scale_handler error getting metric for trigger {"scaledObject.Namespace": "lcp-qcp", "scaledObject.Name": "lcp-so", "trigger": "edge_client_outbound_http_freeChannels", "error": "scaler with id 1 not found, len = 0, cache has been probably already invalidated"}
~~~
- When I checked the namespaces, I did not find any ScaledObject present in them.
- The following command also returns an empty output:
~~~
$ oc get scaledObject --all-namespaces -o jsonpath="{.items[].spec.triggers[].metadata.namespace}"
~~~
- It appears that the keda-operator pod's memory usage experienced a noticeable spike.
- I am attaching the result of the query below:
~~~
sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", cluster="", namespace="openshift-keda", container!="", image!="", pod="keda-operator-64459b5476-mzwr8"})
~~~
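Since the log excerpt above is dominated by `context deadline exceeded` errors against the Thanos query endpoint, a quick reachability check from inside the cluster might look like the following untested sketch (the test image, the `/-/healthy` path, and the ability to reach the `argos` namespace from `openshift-keda` are assumptions):
~~~
# NOTE: untested sketch; image choice and the /-/healthy path are assumptions,
# and NetworkPolicies may block cross-namespace traffic to the argos service.
$ oc run thanos-reachability-test -n openshift-keda --rm -it --restart=Never \
    --image=registry.access.redhat.com/ubi9/ubi -- \
    curl -sv --max-time 5 "http://thanos-query-main.argos.svc.cluster.local:10902/-/healthy"
~~~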
Workaround:
As a workaround, we increase the `resources.limits` value of the operator container, but after some days the issue occurs again.
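For illustration, raising the limit could be done roughly as in the sketch below (the 2Gi value and the container name are placeholders, not the values actually used). If the Deployment is managed by the operator/OLM, a direct edit like this may be reconciled back, so a persistent change may need to go through the operator's own configuration resource (e.g. the KedaController CR, if it exposes resource settings):
~~~
# NOTE: sketch only; 2Gi and the container name are placeholders for illustration.
$ oc set resources deployment/keda-operator -n openshift-keda \
    --containers=keda-operator --limits=memory=2Gi
~~~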