Bug · Resolution: Duplicate · Major · 4.13 · Quality / Stability / Reliability · Moderate · Rejected
Description of problem:
On OpenShift Container Platform 4 with the clusterresourceoverride-operator installed, a node reporting nodeCondition=[DiskPressure] causes frequent errors when pods are created or updated, because the respective clusterresourceoverride webhook pod cannot be reached (it is being evicted).
Below is an easy way to trigger the condition without actually creating DiskPressure, by simply simulating constant restarts of the clusterresourceoverride pods.
$ cat /tmp/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostname
spec:
  containers:
  - name: hostname
    image: quay.io/rhn_support_sreber/hostname:latest
    resources:
      requests:
        memory: "128Mi"
        cpu: "250m"
      limits:
        memory: "256Mi"
        cpu: "500m"
$ oc get ns project-1 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: kube:admin
    openshift.io/sa.scc.mcs: s0:c26,c20
    openshift.io/sa.scc.supplemental-groups: 1000690000/10000
    openshift.io/sa.scc.uid-range: 1000690000/10000
  creationTimestamp: "2023-07-28T11:06:57Z"
  labels:
    clusterresourceoverrides.admission.autoscaling.openshift.io/enabled: "true"
    kubernetes.io/metadata.name: project-1
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.24
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.24
  name: project-1
  resourceVersion: "558511"
  uid: 75cc7d03-3f6d-4c9e-a5f5-a7a2435a89f1
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
$ oc get pod -n resourceoverride
NAME                                                READY   STATUS    RESTARTS   AGE
clusterresourceoverride-2rb6p                       1/1     Running   0          8m23s
clusterresourceoverride-4kd2k                       1/1     Running   0          4m58s
clusterresourceoverride-bqbjh                       1/1     Running   0          5s
clusterresourceoverride-operator-595cc699cf-bp6t5   1/1     Running   0          142m
$ while true; do POD=`oc get pod -n resourceoverride | grep -v clusterresourceoverride-operator | grep -v "^NAME" | tail -1 | awk '{print $1}'`; oc delete pod $POD -n resourceoverride; sleep 3; done
pod "clusterresourceoverride-4xwz7" deleted
pod "clusterresourceoverride-vx7g5" deleted
pod "clusterresourceoverride-7f7f6" deleted
pod "clusterresourceoverride-59fwc" deleted
pod "clusterresourceoverride-mfk6z" deleted
pod "clusterresourceoverride-zs77p" deleted
pod "clusterresourceoverride-wdzwv" deleted
pod "clusterresourceoverride-btf79" deleted
pod "clusterresourceoverride-jvrjx" deleted
pod "clusterresourceoverride-wmlpn" deleted
pod "clusterresourceoverride-q4zjt" deleted
pod "clusterresourceoverride-wr285" deleted
pod "clusterresourceoverride-z9hsn" deleted
pod "clusterresourceoverride-hfwcg" deleted
pod "clusterresourceoverride-5dnzk" deleted
pod "clusterresourceoverride-9cdtn" deleted
pod "clusterresourceoverride-k2cdv" deleted
pod "clusterresourceoverride-9qtpq" deleted
pod "clusterresourceoverride-tb2qk" deleted
pod "clusterresourceoverride-bqbjh" deleted
$ while true; do oc delete pod -n project-1 hostname; oc apply -f /tmp/pod.yaml -n project-1; sleep 1; done
[...]
pod "hostname" deleted
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
pod/hostname created
pod "hostname" deleted
pod/hostname created
pod "hostname" deleted
Given this, the clusterresourceoverride pods should either be assigned a priorityClass that prevents them from being evicted, or (potentially even better) the webhook should be called through a Service instead of localhost, so that the request is routed to an endpoint that is known to be healthy and running.
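A minimal sketch of the first suggestion, assuming the webhook pods are managed by a DaemonSet named clusterresourceoverride in the resourceoverride namespace (both names are assumptions derived from the pod names above); a high priorityClassName makes these pods among the last candidates for node-pressure eviction:
# Sketch only: if the operator reconciles the DaemonSet, it may revert a manual patch,
# which is why the change would belong in the operator itself.
$ oc patch daemonset clusterresourceoverride -n resourceoverride \
    --type merge \
    -p '{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'
For the second suggestion, the webhook registration would reference a Service instead of a fixed localhost URL, so the kube-apiserver can reach any healthy endpoint. A hedged sketch of such a clientConfig (object and Service names are illustrative, not the operator's actual resources):
  clientConfig:
    service:
      name: clusterresourceoverride      # assumed Service selecting the webhook pods
      namespace: resourceoverride
      path: /apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides
      port: 9400
    caBundle: <service-ca-bundle>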
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.11 and OpenShift Container Platform 4.13
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4
2. Install ClusterResourceOverride according to https://docs.openshift.com/container-platform/4.13/nodes/clusters/nodes-cluster-overcommit.html
3. Create either DiskPressure on one of the OpenShift Container Platform 4 control-plane node(s) (a sketch for inducing DiskPressure follows this list) or simply delete one of the clusterresourceoverride pods constantly to simulate a DiskPressure situation (as shown in the description)
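One possible way to induce DiskPressure for step 3, as a sketch only (node name and file size are placeholders; pick a size close to the node's free space, and clean up afterwards to recover the node):
$ oc debug node/<control-plane-node> -- chroot /host fallocate -l 100G /var/tmp/fill-disk
$ oc get node <control-plane-node> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
# Cleanup once done:
$ oc debug node/<control-plane-node> -- chroot /host rm -f /var/tmp/fill-disk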
Actual results:
$ while true; do oc delete pod -n project-1 hostname; oc apply -f /tmp/pod.yaml -n project-1; sleep 1; done
[...]
pod "hostname" deleted
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
Expected results:
Pods are created and updated successfully as long as at least one of the three clusterresourceoverride pods is running, regardless of which OpenShift Container Platform 4 control-plane node it runs on.
Additional info: