- Bug
- Resolution: Duplicate
- Major
- None
- 4.13
Description of problem:
OpenShift Container Platform 4 clusters running the clusterresourceoverride-operator will see frequent errors when pods are created or updated while a node reports nodeCondition=[DiskPressure], because the respective webhook pod cannot be reached (it is being evicted). Below is an easy way to trigger the condition without having to create an actual diskPressure condition, by simply simulating constant restarts of the clusterresourceoverride pods.

$ cat /tmp/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostname
spec:
  containers:
  - name: hostname
    image: quay.io/rhn_support_sreber/hostname:latest
    resources:
      requests:
        memory: "128Mi"
        cpu: "250m"
      limits:
        memory: "256Mi"
        cpu: "500m"

$ oc get ns project-1 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: kube:admin
    openshift.io/sa.scc.mcs: s0:c26,c20
    openshift.io/sa.scc.supplemental-groups: 1000690000/10000
    openshift.io/sa.scc.uid-range: 1000690000/10000
  creationTimestamp: "2023-07-28T11:06:57Z"
  labels:
    clusterresourceoverrides.admission.autoscaling.openshift.io/enabled: "true"
    kubernetes.io/metadata.name: project-1
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.24
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.24
  name: project-1
  resourceVersion: "558511"
  uid: 75cc7d03-3f6d-4c9e-a5f5-a7a2435a89f1
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

$ oc get pod -n resourceoverride
NAME                                                READY   STATUS    RESTARTS   AGE
clusterresourceoverride-2rb6p                       1/1     Running   0          8m23s
clusterresourceoverride-4kd2k                       1/1     Running   0          4m58s
clusterresourceoverride-bqbjh                       1/1     Running   0          5s
clusterresourceoverride-operator-595cc699cf-bp6t5   1/1     Running   0          142m

$ while true; do POD=`oc get pod -n resourceoverride | grep -v clusterresourceoverride-operator | grep -v "^NAME" | tail -1 | awk '{print $1}'`; oc delete pod $POD -n resourceoverride; sleep 3; done
pod "clusterresourceoverride-4xwz7" deleted
pod "clusterresourceoverride-vx7g5" deleted
pod "clusterresourceoverride-7f7f6" deleted
pod "clusterresourceoverride-59fwc" deleted
pod "clusterresourceoverride-mfk6z" deleted
pod "clusterresourceoverride-zs77p" deleted
pod "clusterresourceoverride-wdzwv" deleted
pod "clusterresourceoverride-btf79" deleted
pod "clusterresourceoverride-jvrjx" deleted
pod "clusterresourceoverride-wmlpn" deleted
pod "clusterresourceoverride-q4zjt" deleted
pod "clusterresourceoverride-wr285" deleted
pod "clusterresourceoverride-z9hsn" deleted
pod "clusterresourceoverride-hfwcg" deleted
pod "clusterresourceoverride-5dnzk" deleted
pod "clusterresourceoverride-9cdtn" deleted
pod "clusterresourceoverride-k2cdv" deleted
pod "clusterresourceoverride-9qtpq" deleted
pod "clusterresourceoverride-tb2qk" deleted
pod "clusterresourceoverride-bqbjh" deleted

$ while true; do oc delete pod -n project-1 hostname; oc apply -f /tmp/pod.yaml -n project-1; sleep 1; done
[...]
pod "hostname" deleted Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused Error from server (NotFound): pods "hostname" not found Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused Error from server (NotFound): pods "hostname" not found pod/hostname created pod "hostname" deleted pod/hostname created pod "hostname" deleted So clusterresourceoverride pods should either have a priorityClass assigned that prevents the pods from being evicted or potentially even better, call a service instead of localhost to make sure an endpoint is called that is considered healthy and running.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.11 and OpenShift Container Platform 4.13
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.
2. Install the ClusterResourceOverride Operator according to https://docs.openshift.com/container-platform/4.13/nodes/clusters/nodes-cluster-overcommit.html (a sample custom resource is shown after this list).
3. Either create DiskPressure on one of the OpenShift Container Platform 4 Control-Plane Nodes, or constantly delete one of the clusterresourceoverride pods to simulate the evictions a DiskPressure situation would cause (as shown in the description).
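For reference, step 2 boils down to creating a ClusterResourceOverride custom resource along the lines of the linked documentation (the override percentages are example values) and labeling the target namespace, as already visible on project-1 in the description:

apiVersion: operator.autoscaling.openshift.io/v1
kind: ClusterResourceOverride
metadata:
  name: cluster
spec:
  podResourceOverride:
    spec:
      memoryRequestToLimitPercent: 50
      cpuRequestToLimitPercent: 25
      limitCPUToMemoryPercent: 200

$ oc label ns project-1 clusterresourceoverrides.admission.autoscaling.openshift.io/enabled=true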
Actual results:
$ while true; do oc delete pod -n project-1 hostname; oc apply -f /tmp/pod.yaml -n project-1; sleep 1; done
[...]
pod "hostname" deleted
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
Expected results:
Pods are created and updated as long as at least one of the three clusterresourceoverride pods is running, no matter on which OpenShift Container Platform 4 Control-Plane Node it is scheduled.
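The expected behaviour would follow naturally from the service-based wiring suggested in the description. A sketch, assuming a Service fronting the webhook pods (the resource name, Service name, port, and rules below are assumptions, not the operator's current configuration):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: clusterresourceoverrides.admission.autoscaling.openshift.io  # assumed name
webhooks:
- name: clusterresourceoverrides.admission.autoscaling.openshift.io
  clientConfig:
    # Instead of a url of https://localhost:9400/..., route through a Service so
    # the API server reaches whichever webhook pod is currently healthy.
    service:
      namespace: resourceoverride            # namespace observed above
      name: clusterresourceoverride          # hypothetical Service selecting the webhook pods
      path: /apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides
      port: 443                              # assumed Service port forwarding to 9400
    caBundle: <base64-encoded serving CA>    # required so the API server can verify the certificate
  sideEffects: None
  admissionReviewVersions: ["v1"]
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["pods"]

The "Error from server (InternalError)" output above implies the webhook currently fails closed (failurePolicy: Fail), which is why an unreachable endpoint blocks pod creation entirely; routing through a Service would keep admission working as long as any endpoint is healthy.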
Additional info: