-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.14
-
Important
-
No
-
4
-
False
-
Description of problem:
This occurs while installing the infrastructure cluster of a DPU cluster. After the nodes are ready, install the DPU operator and label the nodes to add them to the dpu MCP. Sometimes the dpu MCP becomes degraded and one of the bf worker nodes hangs in the 'Ready,SchedulingDisabled' state. Installing the DPU operator via a subscription increases the reproduction frequency, and rebooting the bf worker nodes may let them recover.
Version-Release number of selected component (if applicable):
4.14.0-ec.2
How reproducible:
Sometimes (more often when the DPU operator is installed via a subscription)
Steps to Reproduce:
1. Install the infrastructure cluster with the tool at https://github.com/bn222/cluster-deployment-automation and skip the post config.

2. Install the DPU operator via a subscription.

3. Configure the DPU-related resources:

    # cat manifests/infra/tenantcluster-dpu.yaml
    apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        pod-security.kubernetes.io/enforce: privileged
        pod-security.kubernetes.io/audit: privileged
        pod-security.kubernetes.io/warn: privileged
        security.openshift.io/scc.podSecurityLabelSync: "false"
        openshift.io/run-level: "0"
      name: tenantcluster-dpu

    # cat manifests/infra/dpuclusterconfig.yaml
    apiVersion: dpu.openshift.io/v1alpha1
    kind: DpuClusterConfig
    metadata:
      name: dpuclusterconfig-sample
      namespace: tenantcluster-dpu
    spec:
      poolName: dpu
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/dpu-worker: ""

    # oc patch mcp dpu --type=json -p='[{"op": "replace", "path": "/spec/maxUnavailable", "value": 2}]'

4. Label the nodes:

    oc label node bf-worker0 node-role.kubernetes.io/dpu-worker=
    oc label node bf-worker1 node-role.kubernetes.io/dpu-worker=
    oc label node bf-worker0 network.operator.openshift.io/dpu=
    oc label node bf-worker1 network.operator.openshift.io/dpu=

5. The dpu MCP starts updating and may remain in the Updating and Degraded state:

    # oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    dpu                                                         False     True       True       2              0                   0                     1                      150m
    master   rendered-master-56780fd8d330e435efbab9537b437345   True      False      False      3              3                   3                     0                      3h42m
    worker   rendered-worker-94b8d439eb3c112161c2685ca897bc10   True      False      False      0              0                   0                     0                      3h42m

    # oc get nodes
    NAME                    STATUS                     ROLES                         AGE     VERSION
    bf-worker0              Ready                      dpu-worker,worker             168m    v1.27.2+55f2dbe
    bf-worker1              Ready,SchedulingDisabled   dpu-worker,worker             168m    v1.27.2+55f2dbe
    infracluster-master-1   Ready                      control-plane,master,worker   3h22m   v1.27.2+55f2dbe
    infracluster-master-2   Ready                      control-plane,master,worker   3h48m   v1.27.2+55f2dbe
    infracluster-master-3   Ready                      control-plane,master,worker   3h49m   v1.27.2+55f2dbe
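To spot the failure state from step 5 quickly when re-running the reproduction, the two listings can be filtered with a small helper. This is only a sketch and not part of the original report: the function names `degraded_pools` and `cordoned_nodes` are made up for illustration, and it assumes the default table output of `oc get mcp` and `oc get nodes` as shown above.

```shell
# Print the name of every MachineConfigPool whose DEGRADED column is True.
# Reads `oc get mcp` table output on stdin.
# Columns are counted from the right because CONFIG can be empty
# (as it is for the dpu pool), which shifts left-to-right field numbers.
degraded_pools() {
    awk 'NR > 1 && NF >= 9 && $(NF-5) == "True" { print $1 }'
}

# Print the name of every node whose STATUS includes SchedulingDisabled.
# Reads `oc get nodes` table output on stdin.
cordoned_nodes() {
    awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage against a live cluster (assumes `oc` is already logged in):
#   oc get mcp   | degraded_pools
#   oc get nodes | cordoned_nodes
```

A cordoned node is expected briefly during a normal MCP rollout, so the output is only a sign of this bug when the same node stays listed while the pool is reported as degraded.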
Actual results:
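The dpu MCP update fails: the pool stays in the Updating and Degraded state, and bf-worker1 remains 'Ready,SchedulingDisabled'.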
Expected results:
The dpu MCP should update successfully.
Additional info: