Resolution: Done-Errata
CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255
Release Note Not Required
Description of problem:
- Pods managed by DaemonSets are being evicted.
- As a result, pods of some OCP components, such as the CSI drivers (and possibly others), are evicted before the application pods. The application pods then go into an Error status because the CSI pod can no longer tear down their volumes.
- Because the application pods remain in Error status, the drain operation also fails after the maxPodGracePeriod.
Version-Release number of selected component (if applicable):
- 4.11
How reproducible:
- Wait for a new scale-down event
Steps to Reproduce:
1. Wait for a new scale-down event.
2. Monitor the CSI pods (or DNS, or ingress...). They are evicted and, because they come from DaemonSets, they are scheduled again as new pods.
3. More evidence can be found in the kube-api audit logs.
Actual results:
- From audit logs we can see that pods are evicted by the cluster-autoscaler:

    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Metadata",
    "auditID": "ec999193-2c94-4710-a8c7-ff9460e30f70",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/openshift-cluster-csi-drivers/pods/aws-efs-csi-driver-node-2l2xn/eviction",
    "verb": "create",
    "user": {
      "username": "system:serviceaccount:openshift-machine-api:cluster-autoscaler",
      "uid": "44aa427b-58a4-438a-b56e-197b88aeb85d",
      "groups": [
        "system:serviceaccounts",
        "system:serviceaccounts:openshift-machine-api",
        "system:authenticated"
      ],
      "extra": {
        "authentication.kubernetes.io/pod-name": [ "cluster-autoscaler-default-5d4c54c54f-dx59s" ],
        "authentication.kubernetes.io/pod-uid": [ "d57837b1-3941-48da-afeb-179141d7f265" ]
      }
    },
    "sourceIPs": [ "" ],
    "userAgent": "cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format",
    "objectRef": {
      "resource": "pods",
      "namespace": "openshift-cluster-csi-drivers",
      "name": "aws-efs-csi-driver-node-2l2xn",
      "apiVersion": "v1",
      "subresource": "eviction"
    },
    "responseStatus": {
      "metadata": {},
      "status": "Success",
      "code": 201

    ## Even if they come from a daemonset
    $ oc get ds -n openshift-cluster-csi-drivers
    NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
    aws-ebs-csi-driver-node   8         8         8       8            8           kubernetes.io/os=linux   146m
    aws-efs-csi-driver-node   8         8         8       8            8           kubernetes.io/os=linux   127m
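To make the audit-log evidence easier to collect, here is a minimal Python sketch that scans audit events for successful pod evictions created by the cluster-autoscaler service account. It assumes the audit log has been saved locally with one complete JSON event per line. The function name `autoscaler_evictions` and the file name `audit.log` are hypothetical; the field names and the service account are taken from the event shown above.

```python
import json

# Service account seen in the audit event from this report.
AUTOSCALER_USER = "system:serviceaccount:openshift-machine-api:cluster-autoscaler"


def autoscaler_evictions(lines):
    """Yield (namespace, pod_name) for each successful eviction
    requested by the cluster-autoscaler service account."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(event, dict):
            continue
        ref = event.get("objectRef") or {}
        user = event.get("user") or {}
        status = event.get("responseStatus") or {}
        if (
            event.get("verb") == "create"
            and ref.get("resource") == "pods"
            and ref.get("subresource") == "eviction"
            and user.get("username") == AUTOSCALER_USER
            and status.get("code") == 201
        ):
            yield ref.get("namespace"), ref.get("name")
```

Possible usage, assuming the log was saved as `audit.log`: `with open("audit.log") as f: print(sorted(set(autoscaler_evictions(f))))`. Comparing the resulting pod names against the DaemonSets in each namespace shows which DaemonSet pods were evicted.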
Expected results:
- DaemonSet pods should not be evicted.
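The expected behavior can be illustrated with a short Python sketch that excludes DaemonSet-owned pods from an eviction list by inspecting `metadata.ownerReferences`. This is only an illustration of the expectation, not the cluster-autoscaler's actual logic. The helper names `is_daemonset_pod` and `pods_to_evict` are hypothetical, and pods are assumed to be plain dictionaries such as the items returned by `oc get pods -o json`.

```python
def is_daemonset_pod(pod):
    """Return True if any owner reference of the pod is a DaemonSet."""
    owners = (pod.get("metadata") or {}).get("ownerReferences") or []
    return any(owner.get("kind") == "DaemonSet" for owner in owners)


def pods_to_evict(pods):
    """Keep only pods that are not managed by a DaemonSet."""
    return [pod for pod in pods if not is_daemonset_pod(pod)]
```

With this filter, a pod such as aws-efs-csi-driver-node-2l2xn, which is owned by the aws-efs-csi-driver-node DaemonSet, would stay on the node while application pods are drained, so volume teardown could still complete.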
Additional info:
- links to
RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update