
      Hello team,

      With the release of COO 1.1 we are getting a few cases where the observability-operator and perses-operator pods are getting OOMKilled.

      The default memory limit for the observability operator is 150Mi, whereas for the perses operator it is 128Mi.

      I would suggest setting the default limit for these two operator pods to at least 500Mi.
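
      For anyone triaging these cases, a minimal sketch of how to confirm the OOMKills; the namespace and pod name are assumptions and should be adjusted to the actual install:

      % oc -n openshift-operators get pods | grep -E 'observability-operator|perses-operator'
      # prints "OOMKilled" if the container was killed for exceeding its memory limit
      % oc -n openshift-operators get pod <operator-pod-name> \
          -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'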

       

            [COO-784] Increase memory limit for COO and perses pods

            Mark Stalpinski added a comment -

            Also ran into this today and completed the workaround, so good right now.

            Jan Fajerski added a comment -

            rhn-support-pripatil We are still analyzing the specifics; more memory profiles are always welcome.

            COO 1.1 ships significantly more features, including a new operator (Perses). This requires additional resources the operators must keep track of via Watches. Currently we suspect the increased memory requirements stem from that aspect.

            Prithviraj Patil added a comment -

            Hello team / jfajersk@redhat.com,

            One of my customers is also experiencing the same issue.
            I have one query, could you please answer it:

            The COO was running fine before this update with 150Mi. With this new update, the memory jump from 150Mi to 512Mi is nearly 4x. It seems more like an issue with this update than a resource constraint.

            So could you please confirm what changed with this new update?

            Regards,
            Prithviraj Patil

            Hongyan Li added a comment -

            Test passed with the PR:

            % oc -n coo get deployment perses-operator  -oyaml | grep -A6 resources:      
                    resources:
                      limits:
                        cpu: 500m
                        memory: 512Mi
                      requests:
                        cpu: 100m
                        memory: 128Mi
            % oc -n coo get deployment observability-operator -oyaml | grep -A6 resources: 
                    resources:
                      limits:
                        cpu: 400m
                        memory: 512Mi
                      requests:
                        cpu: 100m
                        memory: 256Mi 


            Hongyan Li added a comment -

            RCA:
            QE clusters usually have 3 master nodes and 3 worker nodes, and I have never seen the issue there. COO is a multi-namespace operator, so I suspect the affected cluster environment has more namespaces, which has an effect on the OOM of the COO pods. The Performance QE team has multi-node cluster environments which may have more namespaces; this scenario may be covered by them.
            The issue is seen on a cluster which has 29 nodes.

            Sonigra Saurab added a comment -

            Changes made. I see why you recommended adding it at the subscription level: even if the operator auto-upgrades, that config stays and the customer need not make the changes again in the new CSV.

            Sonigra Saurab added a comment -

            Jan, each of the components has a different limit and request for CPU & memory; if I add the changes directly to the sub, it takes those values as the default request and limit.

            But I get your point. I think a request of 50m CPU and 150Mi memory plus a limit of 500m CPU and 512Mi memory at the sub level should be good.
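
            A minimal sketch of the subscription-level change being discussed, using the values Sonigra proposed; the subscription name and namespace are assumptions and should be adjusted to the actual install:

            % oc -n openshift-operators patch subscription cluster-observability-operator \
                --type merge -p '{"spec":{"config":{"resources":{
                  "requests":{"cpu":"50m","memory":"150Mi"},
                  "limits":{"cpu":"500m","memory":"512Mi"}}}}}'
            # OLM re-renders the operator Deployments from the Subscription config,
            # so the values survive operator auto-upgrades to a new CSV.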

            Jan Fajerski added a comment -

            Regarding the KCS: is there a benefit to adjusting the CSV over setting this in the subscription, as OLM documents? https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#resources

            Sonigra Saurab added a comment -

            KCS

            Sonigra Saurab added a comment -

            I have asked the customer to set the limit to 512Mi on a temporary basis and see if the same problem still happens.
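
            A minimal sketch of that temporary workaround via the installed CSV; the CSV name and the deployment/container indexes are assumptions and must be checked against the cluster, and OLM will revert the change on the next upgrade, which is why the subscription-level config is preferred longer term:

            % oc -n openshift-operators get csv | grep observability-operator
            % oc -n openshift-operators patch csv <observability-operator-csv-name> --type json -p '
              [{"op": "replace",
                "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory",
                "value": "512Mi"}]'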
