-
Bug
-
Resolution: Obsolete
-
Undefined
-
None
-
4.14
-
Quality / Stability / Reliability
-
False
-
-
4
-
Important
-
No
Description of problem:
This occurs while installing the infrastructure cluster of a DPU cluster. After the nodes are ready, install the dpu operator and label the nodes to add them to the dpu mcp. Sometimes the dpu mcp becomes degraded and one of the bf worker nodes hangs in the 'Ready,SchedulingDisabled' status. Installing the dpu operator via a subscription increases the reproduction frequency, and rebooting the bf worker nodes may make them recover.
Version-Release number of selected component (if applicable):
4.14.0-ec.2
How reproducible:
Steps to Reproduce:
1. Install the infrastructure cluster via the tool https://github.com/bn222/cluster-deployment-automation and skip the post config.
2. Install the dpu operator via a subscription.
3. Apply the dpu-related configuration:
# cat manifests/infra/tenantcluster-dpu.yaml
apiVersion: v1
kind: Namespace
metadata:
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
    security.openshift.io/scc.podSecurityLabelSync: "false"
    openshift.io/run-level: "0"
  name: tenantcluster-dpu
# cat manifests/infra/dpuclusterconfig.yaml
apiVersion: dpu.openshift.io/v1alpha1
kind: DpuClusterConfig
metadata:
  name: dpuclusterconfig-sample
  namespace: tenantcluster-dpu
spec:
  poolName: dpu
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/dpu-worker: ""
# oc patch mcp dpu --type=json -p='[{"op": "replace", "path": "/spec/maxUnavailable", "value": 2}]'
4. Label the nodes:
oc label node bf-worker0 node-role.kubernetes.io/dpu-worker=
oc label node bf-worker1 node-role.kubernetes.io/dpu-worker=
oc label node bf-worker0 network.operator.openshift.io/dpu=
oc label node bf-worker1 network.operator.openshift.io/dpu=
5. The dpu mcp starts updating and may remain in the updating and degraded state.
# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
dpu                                                         False     True       True       2              0                   0                     1                      150m
master   rendered-master-56780fd8d330e435efbab9537b437345   True      False      False      3              3                   3                     0                      3h42m
worker   rendered-worker-94b8d439eb3c112161c2685ca897bc10   True      False      False      0              0                   0                     0                      3h42m
# oc get nodes
NAME                    STATUS                     ROLES                         AGE     VERSION
bf-worker0              Ready                      dpu-worker,worker             168m    v1.27.2+55f2dbe
bf-worker1              Ready,SchedulingDisabled   dpu-worker,worker             168m    v1.27.2+55f2dbe
infracluster-master-1   Ready                      control-plane,master,worker   3h22m   v1.27.2+55f2dbe
infracluster-master-2   Ready                      control-plane,master,worker   3h48m   v1.27.2+55f2dbe
infracluster-master-3   Ready                      control-plane,master,worker   3h49m   v1.27.2+55f2dbe
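To spot the failure state in step 5 without reading the tables by eye, the output can be filtered with a small script. The following is only a sketch: the helper names `degraded_pools` and `cordoned_nodes` are hypothetical (not part of `oc`), and it assumes the default column layout shown above. Columns in the mcp output are counted from the right because the CONFIG column can be empty, as it is for the dpu pool here.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical and assume the default
# column layout of `oc get mcp` and `oc get nodes` shown in this report.

# Print the names of pools whose DEGRADED column is "True".
# Reads `oc get mcp` text on stdin. DEGRADED is the 6th column from the
# right, so an empty CONFIG column does not shift the lookup.
degraded_pools() {
    awk 'NF >= 9 && $(NF-5) == "True" { print $1 }'
}

# Print the names of nodes whose STATUS contains "SchedulingDisabled".
# Reads `oc get nodes` text on stdin.
cordoned_nodes() {
    awk 'NF >= 5 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage against a live cluster (not executed here):
#   oc get mcp   | degraded_pools
#   oc get nodes | cordoned_nodes
```

Applied to the output captured in this report, the first helper would report the dpu pool and the second would report bf-worker1.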
Actual results:
The dpu mcp update fails.
Expected results:
The dpu mcp update should complete successfully.
Additional info: