The customer has multiple clusters running OpenShift v4.16.36 and v4.16.38.
They run a large fleet of Ansible pods on these clusters. After upgrading from 4.15 to 4.16.36/38, they started observing pods periodically getting stuck in the ContainerCreating state.
On several occasions, nodes go into NotReady state and are later replaced by MachineHealthCheck.
This is happening fleet-wide. The customer is using Azure Red Hat OpenShift.
We can see periodic spikes in etcd, and the application logs indicate:
`"error":"leader election lost"`
Checking the API calls made on the cluster over the last 2 days, the following results were concerning:
| user_username | verb | occurrences |
| --- | --- | --- |
| system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount | update | 1820155 |
| system:serviceaccount:kube-system:deployment-controller | update | 1212919 |
| system:serviceaccount:openshift-machine-config-operator:machine-config-operator | get | 1077703 |
| system:serviceaccount:openshift-gitops-operator:openshift-gitops-operator-controller-manager | update | 895522 |
| system:serviceaccount:aap-jobs:aap-jobs | get | 760996 |
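For reference, here is a minimal sketch of how per-user/per-verb counts like the table above can be derived from the kube-apiserver audit logs. The directory layout and file naming are assumptions (e.g. audit logs gathered separately via `oc adm must-gather -- /usr/bin/gather_audit_logs`); adjust the path to match the actual archive.

```python
#!/usr/bin/env python3
"""Count API calls per (username, verb) from kube-apiserver audit logs.

Assumption: the argument points at a directory of *.log / *.log.gz audit files,
where each line is one JSON-encoded Kubernetes audit event.
"""
import gzip
import json
import sys
from collections import Counter
from pathlib import Path

# Hypothetical default path inside an unpacked must-gather; adjust as needed.
audit_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("audit_logs/kube-apiserver")

counts = Counter()
for path in audit_dir.rglob("*"):
    if path.suffix not in (".log", ".gz"):
        continue
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", errors="replace") as fh:
        for line in fh:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or partial lines
            username = event.get("user", {}).get("username", "<unknown>")
            verb = event.get("verb", "<unknown>")
            counts[(username, verb)] += 1

# Print the top callers, mirroring the table above.
for (username, verb), n in counts.most_common(20):
    print(f"{n:>10}  {verb:<8}  {username}")
```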
I am attaching a must-gather and an sosreport from the cluster as well.
Could this be a regression of https://issues.redhat.com/browse/OCPBUGS-48696?