  1. OpenShift Bugs
  2. OCPBUGS-32977

haproxy oom - troubleshoot process


    • Critical
    • No
    • 1
    • Sprint 252, Sprint 253
    • 2
    • Rejected
    • False
      * Previously, the load balancing algorithm did not differentiate between active and inactive services when determining weights, and it employed the `random` algorithm excessively in environments with many inactive services or environments routing backends with weight 0. This led to increased memory usage and a higher risk of excessive memory consumption. With this release, changes are made to optimize traffic direction towards active services only and prevent unnecessary use of the `random` algorithm with higher weights, reducing the potential for excessive memory consumption. (link:https://issues.redhat.com/browse/OCPBUGS-32977[*OCPBUGS-32977*])
      ________________________
      Cause:
      Previously, the load balancing algorithm did not differentiate between active and inactive services when determining weights, and it employed the "random" algorithm excessively in environments with many inactive services or those routing backends with weight 0.

      Consequence:
      This led to increased memory usage and a higher risk of excessive memory consumption.

      Fix:
      Enhanced the service filtering logic in load balancing: Inactive services are now excluded when calculating weights. When there is only one active service, its weight is set to 1 to direct traffic exclusively to it. Improved template logic to more accurately handle service activity for algorithm settings.

      Results:
      These changes optimise traffic direction towards active services only and prevent unnecessary use of the "random" algorithm with higher weights, thereby reducing the potential for excessive memory consumption.
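
The Fix described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the router's actual template code: the function names `effective_weights` and `choose_algorithm` are hypothetical, and the `leastconn` fallback is an assumed default chosen only for the example.

```python
from typing import Dict


def effective_weights(service_weights: Dict[str, int]) -> Dict[str, int]:
    """Return weights for active services only (hypothetical helper)."""
    # A service with weight 0 receives no traffic, so treat it as inactive
    # and leave it out of the weight calculation.
    active = {name: w for name, w in service_weights.items() if w > 0}
    # With exactly one active service the magnitude of its weight is
    # irrelevant, so pin it to 1 and send all traffic there.
    if len(active) == 1:
        (name,) = active
        return {name: 1}
    return active


def choose_algorithm(service_weights: Dict[str, int],
                     default: str = "leastconn") -> str:
    """Pick a balance algorithm name (assumed policy, for illustration)."""
    # Assumption: weighted `random` is only needed when traffic is split
    # across more than one active service.
    active = effective_weights(service_weights)
    return "random" if len(active) > 1 else default
```

For example, a route with one service at weight 100 and two at weight 0 ends up with a single backend of weight 1 and does not select `random`, while a route split across two active services still does.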
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-29690. The following is the description of the original issue:

      Description of problem:

          Routers are restarting due to memory issues

      Version-Release number of selected component (if applicable):

          OCP 4.12.45

      How reproducible:

          not easy
      Router restarts due to memory issues:
      ~~~
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Liveness probe error: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Liveness probe failed: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      3h40m       Normal    Killing      pod/router-default-56c9f67f66-j8xwn                        Container router failed liveness probe, will be restarted
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: HTTP probe failed with statuscode: 500...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: HTTP probe failed with statuscode: 500
      ~~~
      
      The node hosts only the router replica, and Prometheus shows that the routers consume all available memory in a short period of time (~20G within an hour).
      
      At some point, the number of haproxy processes increases until they consume all memory resources, leading to a service disruption in a production environment.
      
      Because the console is one of the services with the highest activity according to the router stats, the customer's workaround so far is to delete the console pod, after which the number of haproxy processes decreases from 45 to 12.
      
      The customer would like guidance on how to identify the process that is consuming the memory. haproxy monitoring is enabled, but no dashboard is available.
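
One possible way to see which haproxy process holds the memory is to rank the processes in the router pod by resident set size. The sketch below assumes that `ps` is available in the router container and was run as `ps -eo pid,rss,etime,comm`. The helper name `haproxy_by_rss` is hypothetical, and the sample values in the usage comment are made up for illustration, not taken from the customer's cluster.

```python
from typing import List, Tuple


def haproxy_by_rss(ps_output: str) -> List[Tuple[int, int, str]]:
    """Parse `ps -eo pid,rss,etime,comm` text.

    Returns (pid, rss_kib, elapsed) for every haproxy process,
    sorted by resident memory, largest first.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # ignore malformed or truncated lines
        pid, rss, etime, comm = parts
        if comm.strip() != "haproxy":
            continue
        rows.append((int(pid), int(rss), etime))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows


# Illustrative usage with made-up sample output:
# sample = """  PID    RSS     ELAPSED COMMAND
#   120 512000    02:10:05 haproxy
#   340 128000       05:12 haproxy
# """
# for pid, rss_kib, elapsed in haproxy_by_rss(sample):
#     print(pid, rss_kib, elapsed)
```

A long elapsed time on a large process may indicate an old haproxy instance that is still serving long-lived connections after a reload, but that should be confirmed against the router stats rather than assumed.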
      
      Router stats captured when the router has 8G, 6G, and 3G of memory available have been requested.

      Additional info:

       The customer reports that this happens only in OCP 4.12.45; their other active cluster is still on version 4.10.39 and does not show the problem. The upgrade is blocked because of this.
      
      Requested action:
      * hard-stop-after might be an option, but the customer expects information about the side effects of this configuration.
      * How can the console connections be reset from haproxy?
      * Is there any documentation about haproxy Prometheus queries?

              amcdermo@redhat.com Andrew McDermott
              openshift-crt-jira-prow OpenShift Prow Bot
              Shudi Li Shudi Li
              Melvin Joseph