  1. OpenShift Bugs
  2. OCPBUGS-56432

Excessive API calls made by the OLM operator cause pods to get stuck in ContainerCreating and nodes to go NotReady


    • Bug
    • Resolution: Done
    • Critical
    • 4.16.z
    • Etcd
    • Quality / Stability / Reliability
    • False
    • Rejected

      The customer has multiple clusters running OpenShift v4.16.36 and v4.16.38.

      They run a large fleet of Ansible pods on these clusters. After upgrading from 4.15 to 4.16.36/38, they began to see pods periodically getting stuck in the ContainerCreating state.

      On several occasions, nodes have gone into the NotReady state and were later replaced by MachineHealthCheck.

      This is happening fleet-wide. The customer is using Azure Red Hat OpenShift.

      We can see periodic spikes in etcd, and the application logs indicate:

      `"error":"leader election lost"`

      Checking the API calls made on the cluster over the last 2 days, the following results are concerning:

      user_username                                                                                 verb    occurrences
      system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount        update  1820155
      system:serviceaccount:kube-system:deployment-controller                                       update  1212919
      system:serviceaccount:openshift-machine-config-operator:machine-config-operator               get     1077703
      system:serviceaccount:openshift-gitops-operator:openshift-gitops-operator-controller-manager  update  895522
      system:serviceaccount:aap-jobs:aap-jobs                                                       get     760996
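
For anyone who wants to reproduce a breakdown like the one above, here is a minimal sketch in Python. It is not the query that produced the table (the report does not say how that was generated). It assumes the input is kube-apiserver audit log lines in the standard JSON-lines form, where each event carries `user.username`, `verb`, and `stage`. The helper names `count_api_calls` and `top_callers` are hypothetical. Counting only `ResponseComplete` events is also an assumption, made so that a request logged at several stages is not counted more than once.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def count_api_calls(lines: Iterable[str]) -> Counter:
    """Count audit events per (username, verb) pair.

    Assumes one JSON audit event per line. Lines that are empty or
    not valid JSON objects are skipped rather than treated as errors.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or non-JSON line
        if not isinstance(event, dict):
            continue
        # Assumption: count a request once, at its final stage.
        # Events without a stage field are counted as-is.
        if event.get("stage", "ResponseComplete") != "ResponseComplete":
            continue
        user = (event.get("user") or {}).get("username", "<unknown>")
        verb = event.get("verb", "<unknown>")
        counts[(user, verb)] += 1
    return counts


def top_callers(counts: Counter, n: int = 5) -> List[Tuple[str, str, int]]:
    """Return the n busiest (username, verb, occurrences) rows."""
    return [(user, verb, total) for (user, verb), total in counts.most_common(n)]
```

Passing an open audit log file (for example one taken from the must-gather) to `count_api_calls` and printing the rows from `top_callers` gives a table in the same shape as the one above, which makes it easy to compare clusters or time windows.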
      

      I am adding a must-gather and sos-report from the cluster as well.

      Could this be a regression of https://issues.redhat.com/browse/OCPBUGS-48696?

              bluddy Ben Luddy
              rhn-support-maupadhy Madhusudan Upadhyay
              Ge Liu Ge Liu