OpenShift Bugs / OCPBUGS-57483

OOM killer is killing the OpenShift keda-operator pod


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Normal
    • Affects Version/s: 4.16.z
    • Component/s: Pod Autoscaler
    • Quality / Stability / Reliability

      Description of problem:

      The OOM killer is repeatedly killing the OpenShift "keda-operator" pod.

      Version-Release number of selected component (if applicable):

      4.16.z    

      How reproducible:

      Not reproducible on demand.

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      The OOM killer is killing the `keda-operator` pod multiple times.

      Expected results:

      The `keda-operator` pod should run without any issue.    

      Additional info:

      - Custom Metrics Autoscaler Operator - 2.15.1-6
      
      - In the `keda-operator` pod logs, I can see repeated scaler errors:
      ~~~
      $ oc logs keda-operator-xx 
      
      2025-05-30T08:09:26.903239942Z 2025-05-30T08:09:26Z    ERROR    scale_handler    error getting metric for trigger    {"scaledObject.Namespace": "distribox-pdt", "scaledObject.Name": "scaledobject-distribox-proxy-csz", "trigger": "prometheusScaler", "error": "Get \"http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=avg%28rate%28nginx_http_requests_total%7Bnamespace%3D~%22distribox-pdt%22%2C+pod_name%3D~%22distribox-proxy-csz.%2A%22%2Cprometheus%3D%22tooling%22%7D%5B5m%5D%29%29&time=2025-05-30T08:09:23Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
      2025-05-30T08:09:26.903239942Z     /src/pkg/scaling/scale_handler.go:553
      2025-05-30T08:09:27.560969117Z     /src/pkg/scaling/scale_handler.go:524
      2025-05-30T08:09:27.589319812Z 2025-05-30T08:09:27Z    ERROR    scale_handler    error getting metric for trigger    {"scaledObject.Namespace": "lcp-qcp1", "scaledObject.Name": "gtw-so", "trigger": "jboss_thread_pool_active_count", "error": "Get \"http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=sum%28jboss_thread_pool_active_count%7Bapp_instance%3D~%22gtw%22%2C+container_name%3D~%22gtw%22%2C+pod_name%3D~%22gtw-rollout-.%2A%22%2C+phase%3D%22DEV%22%2C+name%3D%22ama-thread-pool%22%2C+namespace%3D%22lcp-qcp1%22%7D%29&time=2025-05-30T08:09:24Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
      2025-05-30T08:09:27.589319812Z     /src/pkg/scaling/scale_handler.go:553
      2025-05-30T08:09:29.209557987Z     /src/pkg/scaling/scale_handler.go:758
      2025-05-30T08:09:29.209557987Z     /src/pkg/scaling/scale_handler.go:633
      2025-05-30T08:09:29.209557987Z 2025-05-30T08:09:29Z    ERROR    scale_handler    error getting scale decision    {"scaledObject.Namespace": "lcp-qcp", "scaledObject.Name": "lcp-so", "scaler": "edge_client_outbound_http_freeChannels", "error": "Get \"http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=sum%28edge_client_outbound_http_freeChannels%7Bcontainer_name%3D%22edge%22%2C+pod_name%3D~%22lcp-rollout-.%2A%22%2C+phase%3D%22DEV%22%2C+namespace%3D%22lcp-qcp%22+%7D%29&time=2025-05-30T08:09:26Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
      
      2025-05-30T08:09:29.712190063Z 2025-05-30T08:09:29Z    ERROR    scale_handler    error getting metric for trigger    {"scaledObject.Namespace": "lcp-qcp", "scaledObject.Name": "lcp-so", "trigger": "edge_client_outbound_http_freeChannels", "error": "scaler with id 1 not found, len = 0, cache has been probably already invalidated"}
      ~~~    
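      - The errors above are client-side timeouts against the Thanos query endpoint. As a quick reachability and latency check from inside the cluster, a sketch along these lines could be used (the image and the trivial `up` query are only illustrative; the endpoint is the one taken from the log messages above):
      ~~~
      # Run a throwaway pod and time a simple query against the Thanos query endpoint
      $ oc run thanos-curl-test -n openshift-keda --rm -it --restart=Never \
          --image=registry.access.redhat.com/ubi9/ubi -- \
          curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
          'http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=up'
      ~~~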
      
      - When I checked the namespaces referenced in these errors, I did not find any ScaledObject present in them.
      
      - The following command returns an empty output:
      ~~~
      $ oc get scaledObject --all-namespaces -o jsonpath="{.items[*].spec.triggers[*].metadata.namespace}"
      ~~~
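      - To cross-check which ScaledObjects actually exist, and which trigger types they use, against the namespaces named in the operator errors, a sketch like this could help:
      ~~~
      # List every ScaledObject with its namespace, name and trigger types
      $ oc get scaledobject --all-namespaces \
          -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.triggers[*].type}{"\n"}{end}'
      ~~~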
      
      - It appears that the KEDA operator pod experienced a noticeable spike in memory usage.
      
      - I'm attaching the result of the below query:
      ~~~
      sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", cluster="", namespace="openshift-keda", container!="", image!="",pod="keda-operator-64459b5476-mzwr8"})
      ~~~
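      - To confirm that the kills are memory-related, the container's last termination reason and restart count can be checked (assuming the single keda-operator container in the pod); for the pod above this is expected to show `OOMKilled`:
      ~~~
      # Last termination reason and restart count for the keda-operator container
      $ oc get pod keda-operator-64459b5476-mzwr8 -n openshift-keda \
          -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}'
      ~~~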
      

      Workaround: 
      As a workaround, we increased the `resources.limits` value of the operator container, but the issue reoccurs after a few days.
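
      For reference, a minimal sketch of one way to raise the operator's memory limit, assuming the limits are managed through the `KedaController` custom resource (`spec.operator.resources`) and using purely illustrative values:
      ~~~
      # Raise the memory request/limit for the keda-operator container via the KedaController CR
      $ oc -n openshift-keda patch kedacontroller keda --type merge \
          -p '{"spec":{"operator":{"resources":{"requests":{"memory":"500Mi"},"limits":{"memory":"2Gi"}}}}}'
      ~~~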

              Assignee: Joel Smith (joelsmith.redhat)
              Reporter: Harshal Thakare (rhn-support-hthakare)
              Paul Rozehnal
              Votes: 0
              Watchers: 3
