OCPBUGS-39291: High CPU Utilization in openshift-ingress pod


    • Severity: Moderate
    • Sprint: NE Sprint 259
    • Resolution: Rejected

      Description of problem:

      The customer is experiencing a Node System Saturation warning alert in their cluster, indicating high CPU utilization by `router-default-*` pods.
      
      Further investigation revealed that one of the `router-default-*` pods, the one hosted on node `ip-10-191-2-115.us-east-2.compute.internal`, is consuming far more CPU than the other.
      
      
      $ oc adm top pod --namespace=openshift-ingress
      NAME                              CPU(cores)   MEMORY(bytes)   
      router-default-566cd8d4f9-4rkpl   1079m        1134Mi          
      router-default-566cd8d4f9-9rjt2   69m          978Mi    
      
      $ oc get pod -n openshift-ingress -o wide
      NAME                              READY   STATUS    RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
      router-default-566cd8d4f9-4rkpl   1/1     Running   0          6d6h   10.129.14.15   ip-10-191-2-115.us-east-2.compute.internal   <none>           <none>
      router-default-566cd8d4f9-9rjt2   1/1     Running   0          6d6h   10.128.14.26   ip-10-191-3-245.us-east-2.compute.internal   <none>           <none>
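
      To narrow down where the CPU is going on the busy pod, a minimal drill-down sketch (assuming the pod name from the output above, and that `ps` is available inside the router image):

      $ oc adm top pod --namespace=openshift-ingress --containers
      $ oc exec -n openshift-ingress router-default-566cd8d4f9-4rkpl -- ps aux --sort=-%cpu | head
      $ oc logs -n openshift-ingress router-default-566cd8d4f9-4rkpl --since=1h | grep -icE 'reload|error'   # frequent reloads are a common cause of router CPU spikes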
      
      
      Errors were found with the canary route, and the customer was advised to follow KCS https://access.redhat.com/solutions/7049958:
      
      2024-08-22T16:07:46.340Z    ERROR    operator.init    controller/controller.go:265    Reconciler error    {"controller": "canary_controller", "object": {"name":"default","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "default", "reconcileID": "60c36bb2-175b-44f7-848f-f36486e7adf1", "error": "failed to ensure canary route: failed to update canary route openshift-ingress-canary/canary: Route.route.openshift.io \"canary\" is invalid: spec.subdomain: Invalid value: \"canary-openshift-ingress-canary\": field is immutable"}
      
      The canary route issue has since been resolved, but the high CPU consumption persists.
      
      The support case severity is 3; the customer says the issue is impacting the performance of the cluster.

      Version-Release number of selected component (if applicable):

      4.14.27

      How reproducible:

      n/a

      Steps to Reproduce:

      n/a 

      Actual results:

      CPU usage spikes on one of the router-default pods

      Expected results:

      CPU usage stays at its normal baseline, without spikes

      Additional info:

      ID:                     2538ee2sg8k2545usf8k3oa5dshetlko
      External ID:            0b0b1edf-429a-4d96-8473-94d2d0897fc2
      Name:                   dragon-rosa-dev
      Domain Prefix:          dragon-rosa-dev
      Display Name:           dragon-rosa-dev
      State:                  ready 
      API URL:                https://api.dragon-rosa-dev.qrvr.p1.openshiftapps.com:6443
      API Listening:          external
      Console URL:            https://console-openshift-console.apps.dragon-rosa-dev.qrvr.p1.openshiftapps.com
      Cluster History URL:    https://cloud.redhat.com/openshift/details/s/2SpDpWh2hK3BZt9yMVwz2sXwkqQ#clusterHistory
      Control Plane:
                              Replicas: 3
      Infra:
                              Replicas: 3
      Compute:
                              Replicas: 0
      Product:                rosa
      Subscription type:      standard
      Provider:               aws
      Version:                4.14.27
      Region:                 us-east-2
      Multi-az:               true
      PrivateLink:            false
      STS:                    true
      Subnet IDs:             [subnet-0c89fa69d00344a0b subnet-03509303eb4511acb subnet-0e790c49254221d54 subnet-0239df7d945d7cdfc subnet-0443b00d271e446ec subnet-0d8d92b411e76eb6f]
      CCS:                    true
      HCP:                    false
      Existing VPC:           true
      Channel Group:          stable
      Cluster Admin:          true
      Organization:           Dragon DevOps
      Creator:                dragondevops@ibm.com
      
      
      Please review the attached must-gather report for any potential bugs (secrets.yaml has been intentionally removed).

       

      There doesn't seem to be a significant change in HTTP requests or in the number of connections over the same timeframe, so the spike does not appear to be directly connected to load (see the query sketch below the links): https://drive.google.com/file/d/1Ki-8r3ijFItqiBjlTPFiWBpqL4HVZnit/view?usp=drive_link

      CPU usage: https://drive.google.com/file/d/1hqvLCPLxYG-CQSe27qN98BMLYcUHAaLJ/view?usp=drive_link

      Memory usage: https://drive.google.com/file/d/1EV-rIIKxFk_AdkTWktZ83m1abGnCVRb-/view?usp=drive_link

      EC2 instance resources serving the two pods above (note the CPU utilization spike): https://drive.google.com/file/d/1Px2SBOMuqHDcVWJTOnyT7yfUqlgFQsMg/view?usp=drive_link
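
      To back up the "not load-related" observation, the following console queries (Observe → Metrics) compare per-pod CPU with router traffic; the haproxy metric names are assumptions based on what the router metrics endpoint typically exposes:

      # CPU rate per router pod
      sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="openshift-ingress", pod=~"router-default-.*"}[5m]))
      # frontend connection rate per router pod (metric name assumed)
      sum by (pod) (rate(haproxy_frontend_connections_total{namespace="openshift-ingress"}[5m]))
      # current frontend sessions per router pod (metric name assumed)
      sum by (pod) (haproxy_frontend_current_sessions{namespace="openshift-ingress"})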

       

      MUST GATHER REPORT https://drive.google.com/file/d/18YdtQyppymv9pL6AvHZdapsGnDFarcf2/view?usp=drive_link

       

      Acceptance criteria:

      • Review the must-gather for potential bugs (see the inspection sketch below)
      • Seek SRE-P assistance if anything else is needed from the cluster and to run remediation
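
      A minimal sketch for skimming the extracted must-gather for router-related problems; the archive name is assumed and the paths follow the usual must-gather layout (adjust to the actual directory names in this report):

      $ tar xzf must-gather.tar.gz && cd must-gather*/
      $ grep -ri 'Reconciler error' */namespaces/openshift-ingress-operator/pods/ | tail
      $ grep -ci 'reload' */namespaces/openshift-ingress/pods/router-default-*/router/router/logs/current.log   # reload churn is a common CPU driver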

              Candace Holman (cholman@redhat.com)
              Tomas Dabasinskas (todabasi.openshift)
              Hongan Li