Bug · Resolution: Duplicate · Major · 4.13 · Quality / Stability / Reliability · Moderate · Rejected
Description of problem:
On OpenShift Container Platform 4 with the clusterresourceoverride-operator installed, a node reporting nodeCondition=[DiskPressure] causes frequent errors when pods are created or updated, because the respective clusterresourceoverride webhook pod cannot be reached (it is being evicted).
Below is an easy way to trigger the condition without actually creating DiskPressure, by simply simulating constant restarts of the clusterresourceoverride pods.
$ cat /tmp/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostname
spec:
  containers:
  - name: hostname
    image: quay.io/rhn_support_sreber/hostname:latest
    resources:
      requests:
        memory: "128Mi"
        cpu: "250m"
      limits:
        memory: "256Mi"
        cpu: "500m"
$ oc get ns project-1 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: kube:admin
    openshift.io/sa.scc.mcs: s0:c26,c20
    openshift.io/sa.scc.supplemental-groups: 1000690000/10000
    openshift.io/sa.scc.uid-range: 1000690000/10000
  creationTimestamp: "2023-07-28T11:06:57Z"
  labels:
    clusterresourceoverrides.admission.autoscaling.openshift.io/enabled: "true"
    kubernetes.io/metadata.name: project-1
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.24
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.24
  name: project-1
  resourceVersion: "558511"
  uid: 75cc7d03-3f6d-4c9e-a5f5-a7a2435a89f1
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
$ oc get pod -n resourceoverride
NAME                                                READY   STATUS    RESTARTS   AGE
clusterresourceoverride-2rb6p                       1/1     Running   0          8m23s
clusterresourceoverride-4kd2k                       1/1     Running   0          4m58s
clusterresourceoverride-bqbjh                       1/1     Running   0          5s
clusterresourceoverride-operator-595cc699cf-bp6t5   1/1     Running   0          142m
$ while true; do POD=`oc get pod -n resourceoverride | grep -v clusterresourceoverride-operator | grep -v "^NAME" | tail -1 | awk '{print $1}'`; oc delete pod $POD -n resourceoverride; sleep 3; done
pod "clusterresourceoverride-4xwz7" deleted
pod "clusterresourceoverride-vx7g5" deleted
pod "clusterresourceoverride-7f7f6" deleted
pod "clusterresourceoverride-59fwc" deleted
pod "clusterresourceoverride-mfk6z" deleted
pod "clusterresourceoverride-zs77p" deleted
pod "clusterresourceoverride-wdzwv" deleted
pod "clusterresourceoverride-btf79" deleted
pod "clusterresourceoverride-jvrjx" deleted
pod "clusterresourceoverride-wmlpn" deleted
pod "clusterresourceoverride-q4zjt" deleted
pod "clusterresourceoverride-wr285" deleted
pod "clusterresourceoverride-z9hsn" deleted
pod "clusterresourceoverride-hfwcg" deleted
pod "clusterresourceoverride-5dnzk" deleted
pod "clusterresourceoverride-9cdtn" deleted
pod "clusterresourceoverride-k2cdv" deleted
pod "clusterresourceoverride-9qtpq" deleted
pod "clusterresourceoverride-tb2qk" deleted
pod "clusterresourceoverride-bqbjh" deleted
$ while true; do oc delete pod -n project-1 hostname; oc apply -f /tmp/pod.yaml -n project-1; sleep 1; done
[...]
pod "hostname" deleted
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
pod/hostname created
pod "hostname" deleted
pod/hostname created
pod "hostname" deleted
Given this, the clusterresourceoverride pods should either be assigned a priorityClass that prevents them from being evicted, or (potentially even better) the webhook should be called through a Service instead of localhost, so that the request is routed to an endpoint that is known to be healthy and running.
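A minimal sketch of the first suggestion, assuming the webhook pods are managed by a DaemonSet named clusterresourceoverride in the resourceoverride namespace (both names are assumptions derived from the pod names above); a high priorityClassName makes these pods among the last candidates for node-pressure eviction:
# Sketch only: if the operator reconciles the DaemonSet, it may revert a manual patch,
# which is why the change would belong in the operator itself.
$ oc patch daemonset clusterresourceoverride -n resourceoverride \
    --type merge \
    -p '{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'
For the second suggestion, the webhook registration would reference a Service instead of a fixed localhost URL, so the kube-apiserver can reach any healthy endpoint. A hedged sketch of such a clientConfig (object and Service names are illustrative, not the operator's actual resources):
  clientConfig:
    service:
      name: clusterresourceoverride      # assumed Service selecting the webhook pods
      namespace: resourceoverride
      path: /apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides
      port: 9400
    caBundle: <service-ca-bundle>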
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.11 and OpenShift Container Platform 4.13
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4
2. Install ClusterResourceOverride according to https://docs.openshift.com/container-platform/4.13/nodes/clusters/nodes-cluster-overcommit.html
3. Create either DiskPressure on one of the OpenShift Container Platform 4 control-plane node(s) (a sketch for inducing DiskPressure follows this list) or simply delete one of the clusterresourceoverride pods constantly to simulate a DiskPressure situation (as shown in the description)
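One possible way to induce DiskPressure for step 3, as a sketch only (node name and file size are placeholders; pick a size close to the node's free space, and clean up afterwards to recover the node):
$ oc debug node/<control-plane-node> -- chroot /host fallocate -l 100G /var/tmp/fill-disk
$ oc get node <control-plane-node> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
# Cleanup once done:
$ oc debug node/<control-plane-node> -- chroot /host rm -f /var/tmp/fill-disk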
Actual results:
$ while true; do oc delete pod -n project-1 hostname; oc apply -f /tmp/pod.yaml -n project-1; sleep 1; done
[...]
pod "hostname" deleted
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
Error from server (NotFound): pods "hostname" not found
Expected results:
Pods are created and updated successfully as long as at least one of the three clusterresourceoverride pods is running, regardless of which OpenShift Container Platform 4 control-plane node it runs on.
Additional info: