  OpenShift Bugs / OCPBUGS-29690

haproxy oom - troubleshoot process


    • +
    • Critical
    • Yes
    • 5
    • Sprint 250, Sprint 251, Sprint 252, Sprint 253
    • 4
    • Rejected
    • False
      * Previously, the load balancing algorithm did not differentiate between active and inactive services when determining weights, and it employed the `random` algorithm excessively in environments with many inactive services or environments routing backends with weight 0. This led to increased memory usage and a higher risk of excessive memory consumption. With this release, changes are made to optimize traffic direction towards active services only and prevent unnecessary use of the `random` algorithm with higher weights, reducing the potential for excessive memory consumption. (link:https://issues.redhat.com/browse/OCPBUGS-29690[*OCPBUGS-29690*])
    • Bug Fix
    • Done

      Description of problem:

    Routers are restarting due to memory issues.

      Version-Release number of selected component (if applicable):

          OCP 4.12.45

      How reproducible:

          not easy

      Routers restart due to memory issues:
      ~~~
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Liveness probe error: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Liveness probe failed: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      3h40m       Normal    Killing      pod/router-default-56c9f67f66-j8xwn                        Container router failed liveness probe, will be restarted
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: HTTP probe failed with statuscode: 500...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: HTTP probe failed with statuscode: 500
      ~~~
      
      The node hosts only the router replica, and Prometheus confirms that the routers consume all of the memory in a short period of time (~20G within an hour).
      
      At some point, the number of haproxy processes increases, eventually consuming all memory resources and causing a service disruption in a production environment.
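      A hedged way to confirm the process growth (the pod name below is the example from the events above, and the commands assume `pgrep` and `ps` are available in the router image; adjust for the affected replica):

      ~~~
      # Count haproxy processes inside a router pod
      oc -n openshift-ingress exec router-default-56c9f67f66-j8xwn -- pgrep -c haproxy

      # List haproxy processes with resident memory and age, largest first
      oc -n openshift-ingress exec router-default-56c9f67f66-j8xwn -- ps -o pid,rss,etime,args -C haproxy --sort=-rss
      ~~~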
      
      Because the console is one of the services with the highest activity according to the router stats, the customer has so far been deleting the console pod, which decreases the haproxy process count from 45 to 12.
      
      The customer wants guidance on how to identify the process that is consuming the memory; haproxy monitoring is enabled, but no dashboard is available.
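      As a sketch for narrowing down the memory consumer without a dashboard, the standard cadvisor metrics can be queried per pod (metric and label names below are the usual cadvisor ones; verify them against the cluster's Prometheus):

      ~~~
      # Working-set memory per router pod
      sum by (pod) (container_memory_working_set_bytes{namespace="openshift-ingress", container="router"})

      # Approximate memory growth rate per router pod over the last hour
      deriv(container_memory_working_set_bytes{namespace="openshift-ingress", container="router"}[1h])
      ~~~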
      
      Router stats from when the router had 8G, 6G, and 3G of memory available have been requested.

      Additional info:

       The customer claims that this happens only in OCP 4.12.45; another active cluster is still on version 4.10.39 and does not show the problem. The upgrade is blocked because of this.
      
      Requested action:
      * `hard-stop-after` might be an option, but the customer expects information about the side effects of this configuration.
      * How can the console connections be reset from haproxy?
      * Is there any documentation about haproxy Prometheus queries?
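      Regarding `hard-stop-after`: one known mechanism is the ingress operator annotation sketched below (the value is an example; verify the annotation and supported values against the documentation for the cluster's OpenShift version). The main side effect is that old haproxy processes are forcefully terminated once the grace period expires, so long-lived connections (for example, websockets such as the console's) that have not drained by then are dropped.

      ~~~
      # Example only: force old haproxy processes to exit 30 minutes after a reload
      oc -n openshift-ingress-operator annotate ingresscontrollers/default \
          ingress.operator.openshift.io/hard-stop-after=30m
      ~~~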

              amcdermo@redhat.com Andrew McDermott
              rhn-support-pescorza Pamela Lizeth Escorza Gil
              Shudi Li Shudi Li
              Melvin Joseph
              Votes: 0
              Watchers: 12

                Created:
                Updated:
                Resolved: