Bug
Resolution: Duplicate
Normal
None
4.14.z
Moderate
None
2
NE Sprint 259
1
Rejected
False
Description of problem:
The customer is seeing a NodeSystemSaturation warning alert in their cluster, indicating high CPU utilization by `router-default-*` pods. Investigation shows that one of the `router-default-*` pods, hosted on node `ip-10-191-2-115.us-east-2.compute.internal`, is consuming far more CPU than the other:

$ oc adm top pod --namespace=openshift-ingress
NAME                              CPU(cores)   MEMORY(bytes)
router-default-566cd8d4f9-4rkpl   1079m        1134Mi
router-default-566cd8d4f9-9rjt2   69m          978Mi

$ oc get pod -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
router-default-566cd8d4f9-4rkpl   1/1     Running   0          6d6h   10.129.14.15   ip-10-191-2-115.us-east-2.compute.internal   <none>           <none>
router-default-566cd8d4f9-9rjt2   1/1     Running   0          6d6h   10.128.14.26   ip-10-191-3-245.us-east-2.compute.internal   <none>           <none>

Errors were also found with the canary route, and the customer was advised to follow this KCS: https://access.redhat.com/solutions/7049958

2024-08-22T16:07:46.340Z ERROR operator.init controller/controller.go:265 Reconciler error {"controller": "canary_controller", "object": {"name":"default","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "default", "reconcileID": "60c36bb2-175b-44f7-848f-f36486e7adf1", "error": "failed to ensure canary route: failed to update canary route openshift-ingress-canary/canary: Route.route.openshift.io \"canary\" is invalid: spec.subdomain: Invalid value: \"canary-openshift-ingress-canary\": field is immutable"}

The canary route issue has since been resolved, but the high CPU consumption persists. The case severity is 3; the customer reports that it is impacting cluster performance.
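To quantify the imbalance over time rather than from a single `oc adm top` snapshot, the per-pod CPU rate can be graphed in the cluster's monitoring console. A minimal PromQL sketch, assuming the standard cAdvisor metrics exposed by the platform monitoring stack and that the router container is named "router":

sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="openshift-ingress", pod=~"router-default-.*", container="router"}[5m])
)

If the two series diverge while traffic stays flat, the saturation is coming from the router process itself (e.g., reload churn) rather than from request load.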
Version-Release number of selected component (if applicable):
4.14.27
How reproducible:
n/a
Steps to Reproduce:
n/a
Actual results:
CPU usage spikes on one of the two router-default pods.
Expected results:
CPU usage stays stable and roughly even across the router-default replicas.
Additional info:
ID: 2538ee2sg8k2545usf8k3oa5dshetlko
External ID: 0b0b1edf-429a-4d96-8473-94d2d0897fc2
Name: dragon-rosa-dev
Domain Prefix: dragon-rosa-dev
Display Name: dragon-rosa-dev
State: ready
API URL: https://api.dragon-rosa-dev.qrvr.p1.openshiftapps.com:6443
API Listening: external
Console URL: https://console-openshift-console.apps.dragon-rosa-dev.qrvr.p1.openshiftapps.com
Cluster History URL: https://cloud.redhat.com/openshift/details/s/2SpDpWh2hK3BZt9yMVwz2sXwkqQ#clusterHistory
Control Plane Replicas: 3
Infra Replicas: 3
Compute Replicas: 0
Product: rosa
Subscription type: standard
Provider: aws
Version: 4.14.27
Region: us-east-2
Multi-az: true
PrivateLink: false
STS: true
Subnet IDs: [subnet-0c89fa69d00344a0b subnet-03509303eb4511acb subnet-0e790c49254221d54 subnet-0239df7d945d7cdfc subnet-0443b00d271e446ec subnet-0d8d92b411e76eb6f]
CCS: true
HCP: false
Existing VPC: true
Channel Group: stable
Cluster Admin: true
Organization: Dragon DevOps
Creator: dragondevops@ibm.com

Please review the attached must-gather report for any potential bugs (secrets.yaml has been intentionally removed).
There doesn't appear to be any significant change in HTTP requests or connection counts over the same timeframe, so the spike is not directly tied to load (see the PromQL sketch after these links): https://drive.google.com/file/d/1Ki-8r3ijFItqiBjlTPFiWBpqL4HVZnit/view?usp=drive_link
CPU usage: https://drive.google.com/file/d/1hqvLCPLxYG-CQSe27qN98BMLYcUHAaLJ/view?usp=drive_link
Memory usage: https://drive.google.com/file/d/1EV-rIIKxFk_AdkTWktZ83m1abGnCVRb-/view?usp=drive_link
EC2 instance resources serving the two pods above (note the CPU utilization spike): https://drive.google.com/file/d/1Px2SBOMuqHDcVWJTOnyT7yfUqlgFQsMg/view?usp=drive_link
Must-gather report: https://drive.google.com/file/d/18YdtQyppymv9pL6AvHZdapsGnDFarcf2/view?usp=drive_link
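To back the "no load change" observation with numbers, the per-pod request rate can be compared over the same window. A sketch, with the caveat that the exact haproxy_* metric names exposed by the router's exporter are an assumption and can vary by version:

sum by (pod) (
  rate(haproxy_server_http_responses_total{pod=~"router-default-.*"}[5m])
)

A roughly flat, symmetric pair of series here, while container CPU diverges, would support the conclusion that the spike is not load-driven.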
Acceptance criteria:
- Review the must-gather for potential bugs (see the sketch after this list)
- Seek SRE-P assistance if anything else is needed from the cluster and to run remediation
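A minimal starting point for the must-gather review; the directory layout (namespaces/<ns>/pods/<pod>/<container>/<container>/logs/current.log) is the usual must-gather convention but should be verified against the actual archive:

$ tar -xf must-gather.tar.gz
$ # ingress operator reconcile errors
$ grep -ri "reconciler error" must-gather*/*/namespaces/openshift-ingress-operator/ | head
$ # router pod logs: reload churn or config errors can drive CPU
$ grep -riE "reload|error" must-gather*/*/namespaces/openshift-ingress/pods/ | head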