OpenShift Bugs / OCPBUGS-57483

OOM killer is killing the OpenShift keda-operator pod


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Normal
    • Affects Version/s: 4.16.z
    • Component/s: Pod Autoscaler
    • Quality / Stability / Reliability

      Description of problem:

      The OOM killer is repeatedly killing the OpenShift "keda-operator" pod.

      Version-Release number of selected component (if applicable):

      4.16.z    

      How reproducible:

      Not reproducible on demand.

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      The OOM killer is killing the `keda-operator` pod multiple times.

      Expected results:

      The `keda-operator` pod should run without any issue.    

      Additional info:

      - Custom Metrics Autoscaler Operator - 2.15.1-6
      
      - In the `keda-operator` pod logs, I can see repeated scaler errors:
      ~~~
      $ oc logs keda-operator-xx 
      
      2025-05-30T08:09:26.903239942Z 2025-05-30T08:09:26Z    ERROR    scale_handler    error getting metric for trigger    {"scaledObject.Namespace": "distribox-pdt", "scaledObject.Name": "scaledobject-distribox-proxy-csz", "trigger": "prometheusScaler", "error": "Get \"http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=avg%28rate%28nginx_http_requests_total%7Bnamespace%3D~%22distribox-pdt%22%2C+pod_name%3D~%22distribox-proxy-csz.%2A%22%2Cprometheus%3D%22tooling%22%7D%5B5m%5D%29%29&time=2025-05-30T08:09:23Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
      2025-05-30T08:09:26.903239942Z     /src/pkg/scaling/scale_handler.go:553
      2025-05-30T08:09:27.560969117Z     /src/pkg/scaling/scale_handler.go:524
      2025-05-30T08:09:27.589319812Z 2025-05-30T08:09:27Z    ERROR    scale_handler    error getting metric for trigger    {"scaledObject.Namespace": "lcp-qcp1", "scaledObject.Name": "gtw-so", "trigger": "jboss_thread_pool_active_count", "error": "Get \"http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=sum%28jboss_thread_pool_active_count%7Bapp_instance%3D~%22gtw%22%2C+container_name%3D~%22gtw%22%2C+pod_name%3D~%22gtw-rollout-.%2A%22%2C+phase%3D%22DEV%22%2C+name%3D%22ama-thread-pool%22%2C+namespace%3D%22lcp-qcp1%22%7D%29&time=2025-05-30T08:09:24Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
      2025-05-30T08:09:27.589319812Z     /src/pkg/scaling/scale_handler.go:553
      2025-05-30T08:09:29.209557987Z     /src/pkg/scaling/scale_handler.go:758
      2025-05-30T08:09:29.209557987Z     /src/pkg/scaling/scale_handler.go:633
      2025-05-30T08:09:29.209557987Z 2025-05-30T08:09:29Z    ERROR    scale_handler    error getting scale decision    {"scaledObject.Namespace": "lcp-qcp", "scaledObject.Name": "lcp-so", "scaler": "edge_client_outbound_http_freeChannels", "error": "Get \"http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=sum%28edge_client_outbound_http_freeChannels%7Bcontainer_name%3D%22edge%22%2C+pod_name%3D~%22lcp-rollout-.%2A%22%2C+phase%3D%22DEV%22%2C+namespace%3D%22lcp-qcp%22+%7D%29&time=2025-05-30T08:09:26Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
      
      2025-05-30T08:09:29.712190063Z 2025-05-30T08:09:29Z    ERROR    scale_handler    error getting metric for trigger    {"scaledObject.Namespace": "lcp-qcp", "scaledObject.Name": "lcp-so", "trigger": "edge_client_outbound_http_freeChannels", "error": "scaler with id 1 not found, len = 0, cache has been probably already invalidated"}
      ~~~    
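      - The errors above are client-side timeouts against the Thanos query endpoint. As a quick reachability and latency check from inside the cluster, a sketch along these lines could be used (the image and the trivial `up` query are only illustrative; the endpoint is the one taken from the log messages above):
      ~~~
      # Run a throwaway pod and time a simple query against the Thanos query endpoint
      $ oc run thanos-curl-test -n openshift-keda --rm -it --restart=Never \
          --image=registry.access.redhat.com/ubi9/ubi -- \
          curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
          'http://thanos-query-main.argos.svc.cluster.local:10902/api/v1/query?query=up'
      ~~~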
      
      - When I checked the namespaces referenced in these errors, I did not find any ScaledObject present in them.
      
      - The following command returns an empty output:
      ~~~
      $ oc get scaledObject --all-namespaces -o jsonpath="{.items[*].spec.triggers[*].metadata.namespace}"
      ~~~
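      - To cross-check which ScaledObjects actually exist, and which trigger types they use, against the namespaces named in the operator errors, a sketch like this could help:
      ~~~
      # List every ScaledObject with its namespace, name and trigger types
      $ oc get scaledobject --all-namespaces \
          -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.triggers[*].type}{"\n"}{end}'
      ~~~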
      
      - It appears that the KEDA operator pod experienced a noticeable spike in memory usage.
      
      - I'm attaching the result of the below query:
      ~~~
      sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", cluster="", namespace="openshift-keda", container!="", image!="",pod="keda-operator-64459b5476-mzwr8"})
      ~~~
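      - To confirm that the kills are memory-related, the container's last termination reason and restart count can be checked (assuming the single keda-operator container in the pod); for the pod above this is expected to show `OOMKilled`:
      ~~~
      # Last termination reason and restart count for the keda-operator container
      $ oc get pod keda-operator-64459b5476-mzwr8 -n openshift-keda \
          -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}'
      ~~~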
      

      Workaround: 
      As a workaround, we increased the `resources.limits` value of the operator container, but the issue reoccurs after a few days.
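
      For reference, a minimal sketch of one way to raise the operator's memory limit, assuming the limits are managed through the `KedaController` custom resource (`spec.operator.resources`) and using purely illustrative values:
      ~~~
      # Raise the memory request/limit for the keda-operator container via the KedaController CR
      $ oc -n openshift-keda patch kedacontroller keda --type merge \
          -p '{"spec":{"operator":{"resources":{"requests":{"memory":"500Mi"},"limits":{"memory":"2Gi"}}}}}'
      ~~~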

              Assignee: Joel Smith (joelsmith.redhat)
              Reporter: Harshal Thakare (rhn-support-hthakare)
              Paul Rozehnal
              Votes: 0
              Watchers: 3
