Bug
Resolution: Duplicate
Normal
None
4.14.z
Moderate
None
2
NE Sprint 259
1
Rejected
False
Description of problem:
The customer is seeing a NodeSystemSaturation warning alert in their cluster, indicating high CPU utilization by `router-default-*` pods. Investigation shows that one of the `router-default-*` pods, hosted on node `ip-10-191-2-115.us-east-2.compute.internal`, is consuming far more CPU than the other:

$ oc adm top pod --namespace=openshift-ingress
NAME                              CPU(cores)   MEMORY(bytes)
router-default-566cd8d4f9-4rkpl   1079m        1134Mi
router-default-566cd8d4f9-9rjt2   69m          978Mi

$ oc get pod -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
router-default-566cd8d4f9-4rkpl   1/1     Running   0          6d6h   10.129.14.15   ip-10-191-2-115.us-east-2.compute.internal   <none>           <none>
router-default-566cd8d4f9-9rjt2   1/1     Running   0          6d6h   10.128.14.26   ip-10-191-3-245.us-east-2.compute.internal   <none>           <none>

Errors were also found with the canary route, and the customer was advised to follow this KCS: https://access.redhat.com/solutions/7049958

2024-08-22T16:07:46.340Z ERROR operator.init controller/controller.go:265 Reconciler error {"controller": "canary_controller", "object": {"name":"default","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "default", "reconcileID": "60c36bb2-175b-44f7-848f-f36486e7adf1", "error": "failed to ensure canary route: failed to update canary route openshift-ingress-canary/canary: Route.route.openshift.io \"canary\" is invalid: spec.subdomain: Invalid value: \"canary-openshift-ingress-canary\": field is immutable"}

The canary route issue has since been resolved, but the high CPU consumption persists. The case severity is 3; the customer reports that it is impacting cluster performance.
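To quantify the imbalance over time rather than from a single `oc adm top` snapshot, the per-pod CPU rate can be graphed in the cluster's monitoring console. A minimal PromQL sketch, assuming the standard cAdvisor metrics exposed by the platform monitoring stack and that the router container is named "router":

sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="openshift-ingress", pod=~"router-default-.*", container="router"}[5m])
)

If the two series diverge while traffic stays flat, the saturation is coming from the router process itself (e.g., reload churn) rather than from request load.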
Version-Release number of selected component (if applicable):
4.14.27
How reproducible:
n/a
Steps to Reproduce:
n/a
Actual results:
CPU usage spikes on one of the two router-default pods.
Expected results:
CPU usage stays stable and roughly even across the router-default replicas.
Additional info:
ID: 2538ee2sg8k2545usf8k3oa5dshetlko
External ID: 0b0b1edf-429a-4d96-8473-94d2d0897fc2
Name: dragon-rosa-dev
Domain Prefix: dragon-rosa-dev
Display Name: dragon-rosa-dev
State: ready
API URL: https://api.dragon-rosa-dev.qrvr.p1.openshiftapps.com:6443
API Listening: external
Console URL: https://console-openshift-console.apps.dragon-rosa-dev.qrvr.p1.openshiftapps.com
Cluster History URL: https://cloud.redhat.com/openshift/details/s/2SpDpWh2hK3BZt9yMVwz2sXwkqQ#clusterHistory
Control Plane Replicas: 3
Infra Replicas: 3
Compute Replicas: 0
Product: rosa
Subscription type: standard
Provider: aws
Version: 4.14.27
Region: us-east-2
Multi-az: true
PrivateLink: false
STS: true
Subnet IDs: [subnet-0c89fa69d00344a0b subnet-03509303eb4511acb subnet-0e790c49254221d54 subnet-0239df7d945d7cdfc subnet-0443b00d271e446ec subnet-0d8d92b411e76eb6f]
CCS: true
HCP: false
Existing VPC: true
Channel Group: stable
Cluster Admin: true
Organization: Dragon DevOps
Creator: dragondevops@ibm.com

Please review the attached must-gather report for any potential bugs (secrets.yaml has been intentionally removed).
There doesn't appear to be any significant change in HTTP requests or connection counts over the same timeframe, so the spike is not directly tied to load (see the PromQL sketch after these links): https://drive.google.com/file/d/1Ki-8r3ijFItqiBjlTPFiWBpqL4HVZnit/view?usp=drive_link
CPU usage: https://drive.google.com/file/d/1hqvLCPLxYG-CQSe27qN98BMLYcUHAaLJ/view?usp=drive_link
Memory usage: https://drive.google.com/file/d/1EV-rIIKxFk_AdkTWktZ83m1abGnCVRb-/view?usp=drive_link
EC2 instance resources serving the two pods above (note the CPU utilization spike): https://drive.google.com/file/d/1Px2SBOMuqHDcVWJTOnyT7yfUqlgFQsMg/view?usp=drive_link
Must-gather report: https://drive.google.com/file/d/18YdtQyppymv9pL6AvHZdapsGnDFarcf2/view?usp=drive_link
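To back the "no load change" observation with numbers, the per-pod request rate can be compared over the same window. A sketch, with the caveat that the exact haproxy_* metric names exposed by the router's exporter are an assumption and can vary by version:

sum by (pod) (
  rate(haproxy_server_http_responses_total{pod=~"router-default-.*"}[5m])
)

A roughly flat, symmetric pair of series here, while container CPU diverges, would support the conclusion that the spike is not load-driven.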
Acceptance criteria:
- Review the must-gather for potential bugs (see the sketch after this list)
- Seek SRE-P assistance if anything else is needed from the cluster and to run remediation
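A minimal starting point for the must-gather review; the directory layout (namespaces/<ns>/pods/<pod>/<container>/<container>/logs/current.log) is the usual must-gather convention but should be verified against the actual archive:

$ tar -xf must-gather.tar.gz
$ # ingress operator reconcile errors
$ grep -ri "reconciler error" must-gather*/*/namespaces/openshift-ingress-operator/ | head
$ # router pod logs: reload churn or config errors can drive CPU
$ grep -riE "reload|error" must-gather*/*/namespaces/openshift-ingress/pods/ | head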