OpenShift Logging / LOG-4745

Set default limits for Vector if no user-defined values

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • Fix Version: Logging 6.0.0
    • Affects Versions: Logging 5.6.z, Logging 5.7.z, Logging 5.8.z
    • Component: Log Collection
    • Release Note Text: Before this change, the Vector collector deployments were unrestricted in their use of memory and CPU. This enhancement defaults the requests and limits to values recommended by the upstream Vector documentation.
    • Release Note Type: Enhancement
    • Sprint: Log Collection - Sprint 258
    • Severity: Important

      Description of problem:

      When the user does not define values for limits.cpu and limits.memory, Vector has no defaults. Since it is designed to work in memory, it can exhaust an OCP node (including master nodes).

      The high memory usage is reported in https://issues.redhat.com/browse/LOG-4536, where Vector pods were able to use up to 160GB of memory.

      Therefore, when the user does not define values for the collector's limits.cpu and limits.memory, the default installation should set some values to prevent Vector from consuming all the memory or CPU available on the nodes and impacting the cluster.
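      For reference, a minimal sketch of how an administrator can already set these values through the ClusterLogging custom resource (assuming the Logging 5.x-style spec.collection.resources field; the exact path can differ between releases, and the values below simply reuse the defaults reported later in this issue):

      apiVersion: logging.openshift.io/v1
      kind: ClusterLogging
      metadata:
        name: instance
        namespace: openshift-logging
      spec:
        managementState: Managed
        collection:
          type: vector
          resources:
            # limits/requests applied to the collector pods
            limits:
              cpu: "6"
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 64Mi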

      Version-Release number of selected component (if applicable):

      Logging 5.6, 5.7 and 5.8

      How reproducible:

      Always

      Steps to Reproduce:

      Deploy Vector without user-defined values and verify that the collector pods have no limits.cpu or limits.memory values.
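      One way to check is the same command used in the verification comment further down (the openshift-logging namespace is assumed):

      oc -n openshift-logging get ds collector -o yaml | yq '.spec.template.spec.containers[].resources'

      Before this change, the output is expected to show no limits stanza at all.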

      Actual results:

      When the user does not define limits.cpu and limits.memory, no default values are assigned; the Vector collector can then use all the memory and CPU of the nodes, and it does, causing problems on the OCP nodes and the workloads running on top of them. This is not a safe configuration out of the box for the collector.

      Expected results:

      When the user does not define limits.cpu and limits.memory, the installation should assign sensible default values so that the collector is safely configured out of the box and cannot exhaust the nodes' memory and CPU.


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Logging for Red Hat OpenShift - 6.0.0), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHBA-2024:6693


            Anping Li added a comment - edited

            Verified, the collector DaemonSet is given the default values below:
            oc get ds collector -o yaml|yq '.spec.template.spec.containers[].resources'
            limits:
              cpu: "6"
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 64Mi


            CPaaS Service Account mentioned this issue in merge request !4229 of openshift-logging / Log Collection Midstream on branch openshift-logging-6.0-rhel-9_upstream_7b15316e65c42cf59a56dc706ff37862:

            Updated US source to: 44af754 LOG-4745: Default collector resources when undefined


            Oscar Casal Sanchez added a comment - edited

            Hello jamparke@redhat.com,

            The bugs referenced above list more cases impacted by the missing limits, and we have additional support cases that usually never reach you: the problem appears, a support case is opened, and we tell the customer to set limits.

            The "default" should only be set when the user has not defined limits; with this logic, values that a customer has defined are never touched. For the rest of the users, the Release Notes are the right place to announce this change.

            For the users not setting limits, if this change is applied and the pods reach the "default" limits, then: if it is memory, an alert is triggered in the cluster; if it is CPU, CPU throttling occurs and another alert exists for that. The alerts plus the Release Notes are therefore the correct way of protecting users from a bigger impact and letting them know how to do the right thing, while delivering by default a product with better values that does not impact the whole cluster.

            Best regards,

            Oscar
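            To make the alerting behaviour described above concrete, here is a hypothetical PrometheusRule sketch; the rule name, threshold, and the "collector" container label are assumptions, not something the operator ships. CPU throttling can be watched similarly via container_cpu_cfs_throttled_periods_total or the CPUThrottlingHigh alert shipped with the monitoring stack.

            apiVersion: monitoring.coreos.com/v1
            kind: PrometheusRule
            metadata:
              name: collector-resource-usage   # hypothetical name
              namespace: openshift-logging
            spec:
              groups:
                - name: collector-resources
                  rules:
                    - alert: CollectorMemoryNearLimit   # hypothetical alert
                      expr: |
                        max by (namespace, pod) (container_memory_working_set_bytes{namespace="openshift-logging", container="collector"})
                        /
                        max by (namespace, pod) (kube_pod_container_resource_limits{namespace="openshift-logging", container="collector", resource="memory"})
                        > 0.9
                      for: 10m
                      labels:
                        severity: warning
                      annotations:
                        summary: "Collector pod {{ $labels.pod }} is using more than 90% of its memory limit."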


            Jeffrey Cantrill added a comment -

            Found this reference about recommendations: https://vector.dev/docs/reference/configuration/sources/kubernetes_logs/#resource-limits The caveat is how we address upgrades for people who are unaffected by the original reason this issue was opened. jamparke@redhat.com I defer to you if you wish to re-open this issue, given I see only one associated customer case.

            Jeffrey Cantrill added a comment -

            Not setting limits is for CORE components and Logging is not a core component,

            I disagree with this statement. While it is technically true that it is not a core component, that is not how people treat monitoring and logging. Additionally, I believe it is reasonable to follow the guidelines given the intent of log collection. We have seen throughout the history of logging, dating back to 3.x, that users either set a limit or use the one provided out of the box and then do not revisit it until logging starts to fail, usually because they have expanded the number of nodes or workloads in the cluster but have not considered that they also have to scale the logging system to support it. The expectation is that logging more or less scales infinitely.

            The second thing is that fluentd had limits by default when nobody set them explicitly

            This was not done because we had any data to support the values it provided or to ensure the collectors did not consume too much memory. Defaults were added in 3.9 (maybe 3.11) to address collector pods being evicted from nodes where the scheduler could not provide them any memory. The solution proposed was to set a request/limit to ensure the collector had exactly some memory, would land on nodes, and would not be evicted.

            we want exactly the Logging components to be restarted, killed, or degraded without impacting the vital functions of the cluster

            We made the decision many releases ago to push more configuration into the hands of the admins. There is nothing keeping them from setting a memory limit when deploying logging. Engineering should provide better data on how the collector behaves under certain cluster conditions, but IMO the original statement I quoted remains true for log collection. The workloads and number of nodes will increase, log volume will vary, ... logging "cannot anticipate how they scale in usage in all customer environments, so setting one-size-fits-all limits is not possible."

            A day-2 tool is not a core OpenShift component and RH shouldn't allow it to use all the memory and all the CPU by default when it's installed.

            Again, we are not a core component, but that is not the view, IMO, of many customers given the value they place on the log infrastructure. We should:

            • Document some performance numbers based upon cluster scenarios
            • Document to customers that they should consider setting a limit


            Jeffrey Cantrill added a comment -

            Looking at the kube recommendations, we should not be setting limits, maybe requests. The interesting points are:

            • Components cannot anticipate how they scale in usage in all customer environments, so setting one-size-fits-all limits is not possible.
            • Setting static limits prevents administrators from responding to changes in their clusters’ needs, such as by resizing control plane nodes to provide more resources.
            • We do not want cluster components to be restarted based on their resource consumption (for example, being killed due to an out-of-memory condition). We need to detect and handle those cases more gracefully, without degrading cluster performance.

            Therefore, cluster components SHOULD NOT be configured with resource limits.
            However, cluster components MUST declare resource requests for both CPU and memory.
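            To make the quoted guideline concrete, a requests-only resources stanza for the collector container would look like the sketch below (values are illustrative only, reusing the requests reported in the verification comment):

            resources:
              requests:
                cpu: 500m
                memory: 64Mi
              # no limits section, per the guideline above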

            Limits/Requests for memory were originally added to components in 3.x to:

            • ensure the logging components had all the memory they would require at boot and not be evicted

            I don't recall there ever being solid reasoning for the values chosen then. Closing this issue as WONTDO given the Kubernetes recommendations. For now I do not see the need to provide a request.


            Jeffrey Cantrill added a comment -

            rhn-engineering-aconway thoughts here? This presents a situation where we introduce defaults and then alter the behavior of customers who do not have issues. We originally spec'd default limit/request for components:

            • to ensure they were able to be scheduled on a node (3.11 were smaller)
            • and would not be killed by OOM since they in theory have all their memory reserved.

            I hesitate to add a default when admins already have the control they need. It may be that we need to do more testing to figure out the relationship between containers, log rate, and memory. Maybe that info could drive some sort of "predictive defaults".


            Jeffrey Cantrill added a comment -

            What should that limit be? Fluentd deployments default a memory limit not because that value is rooted in any specific testing but because in 3.11 we:

            • wanted to guarantee the collector would land on the nodes with memory it needed
            • that was the smallest value we thought was sane to reserve

            It would be interesting to see what a "steady state" condition may be after the buffer changes land


              Assignee: Jeffrey Cantrill (jcantril@redhat.com)
              Reporter: Oscar Casal Sanchez (rhn-support-ocasalsa)
              Votes: 0
              Watchers: 6