OCPBUGS-50952

redhat-operators pod consuming a lot of the master node's CPU


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Normal
    • Affects Version/s: 4.15
    • Component/s: OLM / Registry
    • Severity: Moderate
    • Sprint: Glaceon OLM Sprint 267

      Description of problem:

      The redhat-operators pod is consuming a lot of the master node's CPU.

      Version-Release number of selected component (if applicable):

      OCP 4.15.35

      How reproducible:

      Not reproducible

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      The registry-server container (redhat-operators pod) is consuming 2.61 CPU cores on the master node where it is running (moving the pod to another node has the same result).
      
      # cat sosreport-ent-xkjjh-master-0/sos_commands/crio/crictl_stats | sort -nk3 | tail -2 | column -t
      CONTAINER      NAME             CPU %   MEM      DISK     INODES 
      61c30b2fd9211  registry-server  261.19  54.72MB  8.192kB  14
      f8b08095674f1  kube-apiserver   272.09  9.659GB  245.8kB  27
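
      The same per-container figures can be checked live on the affected node, for example (a sketch; the node name is taken from the sosreport above, and <container-id> is a placeholder for the ID returned by the first command):

      # oc debug node/ent-xkjjh-master-0 -- chroot /host crictl ps --name registry-server -q
      # oc debug node/ent-xkjjh-master-0 -- chroot /host crictl stats <container-id>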

      Expected results:

      redhat-operators pod should not consume that much CPU.    

      Additional info:

      We collected CPU and memory profiles with pprof and these are the results:

      • CPU profile
        # go tool pprof -top profile.pb                 
        
        File: opm
        Build ID: c34f21afc19ad938b07c9b76ac976d40e72bab9a
        Type: cpu
        Time: Feb 12, 2025 at 9:46am (CET)
        Duration: 30s, Total samples = 77.25s (257.50%)
        Showing nodes accounting for 70.31s, 91.02% of 77.25s total
        Dropped 342 nodes (cum <= 0.39s)
              flat  flat%   sum%        cum   cum%
            12.69s 16.43% 16.43%     38.33s 49.62%  runtime.scanobject
        ...
             2.57s  3.33% 52.57%     48.25s 62.46%  runtime.gcDrain
        ...
      • Memory profile
        # go tool pprof -alloc_space -top heap_profile.pb            
        File: opm
        Build ID: c34f21afc19ad938b07c9b76ac976d40e72bab9a
        Type: alloc_space
        Time: Feb 12, 2025 at 11:38am (CET)
        Showing nodes accounting for 1351884.79MB, 98.63% of 1370616.60MB total
        Dropped 334 nodes (cum <= 6853.08MB)
              flat  flat%   sum%        cum   cum%
        694944.35MB 50.70% 50.70% 694944.35MB 50.70%  encoding/json.unquoteBytes
        646676.62MB 47.18% 97.88% 1341620.97MB 97.88%  encoding/json.(*decodeState).literalStore
         8589.47MB  0.63% 98.51%  8589.47MB  0.63%  google.golang.org/protobuf/proto.MarshalOptions.marshal
        ...
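
      For reference, profiles like these can be pulled from the running pod along the following lines, assuming the opm process exposes Go's standard net/http/pprof endpoint (how that endpoint is enabled and which port it listens on depend on how opm is started in this image, so 6060 is a placeholder):

      # oc port-forward -n openshift-marketplace <redhat-operators-pod> 6060:6060
      # go tool pprof -top 'http://localhost:6060/debug/pprof/profile?seconds=30'
      # go tool pprof -alloc_space -top http://localhost:6060/debug/pprof/heap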

         

      It seems that the opm binary running inside the redhat-operators pod is allocating a lot of memory while decoding JSON values, which causes the Go garbage collector to consume most of the CPU.
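
      One way to tie the two profiles together using the data already collected (a sketch; the function names are taken from the outputs above) is to sort the CPU profile by cumulative time and peek at the callers of the JSON-decoding hot spots in the heap profile:

      # go tool pprof -top -cum profile.pb
      # go tool pprof -alloc_space -peek literalStore heap_profile.pb
      # go tool pprof -alloc_space -peek unquoteBytes heap_profile.pb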

      I asked the customer to dump the bundles from the redhat-operators CatalogSource by executing:

      oc run -n openshift-marketplace grpcurl -ti --rm --image=quay.io/gmeghnag/grpcurl --command -q -- grpcurl -plaintext -d '{}' redhat-operators.openshift-marketplace.svc:50051 api.Registry/ListBundles > bundles.json

      However, I don't see anything unusual in that bundles.json (its size is not significantly different from the one collected from my lab cluster).
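
      A quick way to compare the dumps is to count the bundles and look at the largest csvJson payloads, assuming jq is available and that the grpcurl output uses the api.Registry field names csvName and csvJson:

      # jq -r '.csvName' bundles.json | wc -l
      # jq -r '"\(.csvJson|length)\t\(.csvName)"' bundles.json | sort -nr | head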

        Assignee: Jordan Keister (rh-ee-jkeister)
        Reporter: Gabriel Meghnagi (rhn-support-gmeghnag)
        QA Contact: Jian Zhang