OpenShift Bugs / OCPBUGS-16387

dpu mcp degraded after applying DpuClusterConfig and mcp patch


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.14
    • Component: Networking / SR-IOV
    • Severity: Important

      Description of problem:

      While installing the infrastructure cluster of a DPU cluster: after the nodes are ready, install the dpu operator and label the nodes to add them to the dpu mcp. Sometimes the dpu mcp becomes degraded, and one of the bf worker nodes hangs in the 'Ready,SchedulingDisabled' state.

      Installing the dpu operator via a Subscription increases the reproduction rate, and rebooting the bf worker nodes may recover them.
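
The reboot workaround mentioned above can be scripted. A minimal sketch follows, assuming `oc debug node/<name>` with a chroot into /host is an acceptable way to reboot a node from the cluster side (the report only confirms that a reboot helps, not this exact mechanism). `OC` defaults to `echo` so the sketch dry-runs without a cluster; set `OC=oc` to actually run it:

```shell
# Workaround sketch: reboot a stuck BlueField worker from the cluster side.
# OC=echo makes this a dry run; set OC=oc against a real cluster.
OC=${OC:-echo}
reboot_node() {
  # 'oc debug node/<name>' plus 'chroot /host' is a common way to run
  # host commands on an OpenShift node (assumption, not from this report).
  "$OC" debug "node/$1" -- chroot /host systemctl reboot
}
reboot_node bf-worker1
```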

       

      Version-Release number of selected component (if applicable):

      4.14.0-ec.2

      How reproducible:

       

      Steps to Reproduce:

      1. Install the infrastructure cluster via the tool https://github.com/bn222/cluster-deployment-automation and skip the post-install configuration.
      2. Install the dpu operator via a Subscription.
      3. Apply the dpu-related configuration:
      # cat manifests/infra/tenantcluster-dpu.yaml
      apiVersion: v1
      kind: Namespace
      metadata:
        labels:
          pod-security.kubernetes.io/enforce: privileged
          pod-security.kubernetes.io/audit: privileged
          pod-security.kubernetes.io/warn: privileged
          security.openshift.io/scc.podSecurityLabelSync: "false"
          openshift.io/run-level: "0"
        name: tenantcluster-dpu
      # cat manifests/infra/dpuclusterconfig.yaml
      apiVersion: dpu.openshift.io/v1alpha1
      kind: DpuClusterConfig
      metadata:
        name: dpuclusterconfig-sample
        namespace: tenantcluster-dpu
      spec:
        poolName: dpu
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/dpu-worker: "" 
      
      # oc patch mcp dpu --type=json -p='[{"op": "replace", "path": "/spec/maxUnavailable", "value": 2}]'
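
A malformed payload here (for example a stray quote inside the JSON) makes `oc patch` fail with a parse error. The patch document can be sanity-checked locally before it is applied; this sketch assumes python3 is available and needs no cluster access:

```shell
# Validate the JSON patch payload locally before handing it to 'oc patch'.
patch='[{"op": "replace", "path": "/spec/maxUnavailable", "value": 2}]'
echo "$patch" | python3 -m json.tool >/dev/null && echo "patch payload is valid JSON"
```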
      
      4. Label the nodes:
        oc label node bf-worker0 node-role.kubernetes.io/dpu-worker=
        oc label node bf-worker1 node-role.kubernetes.io/dpu-worker=
        
        oc label node bf-worker0 network.operator.openshift.io/dpu=
        oc label node bf-worker1 network.operator.openshift.io/dpu=
      
      5. The dpu mcp starts updating and may become stuck in the Updating and Degraded states:
      # oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      dpu                                                         False     True       True       2              0                   0                     1                      150m
      master   rendered-master-56780fd8d330e435efbab9537b437345   True      False      False      3              3                   3                     0                      3h42m
      worker   rendered-worker-94b8d439eb3c112161c2685ca897bc10   True      False      False      0              0                   0                     0                      3h42m
      
      # oc get nodes
      NAME                    STATUS                     ROLES                         AGE     VERSION
      bf-worker0              Ready                      dpu-worker,worker             168m    v1.27.2+55f2dbe
      bf-worker1              Ready,SchedulingDisabled   dpu-worker,worker             168m    v1.27.2+55f2dbe
      infracluster-master-1   Ready                      control-plane,master,worker   3h22m   v1.27.2+55f2dbe
      infracluster-master-2   Ready                      control-plane,master,worker   3h48m   v1.27.2+55f2dbe
      infracluster-master-3   Ready                      control-plane,master,worker   3h49m   v1.27.2+55f2dbe 
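
To spot which node is stuck, the node listing can be filtered for SchedulingDisabled. The sketch below runs against a captured copy of the listing above; on a live cluster, pipe `oc get nodes` into the same awk filter:

```shell
# Print the names of nodes whose STATUS column contains SchedulingDisabled.
# The here-doc is a captured sample; on a cluster use: oc get nodes | awk ...
awk 'NR > 1 && $2 ~ /SchedulingDisabled/ {print $1}' <<'EOF'
NAME                    STATUS                     ROLES                         AGE     VERSION
bf-worker0              Ready                      dpu-worker,worker             168m    v1.27.2+55f2dbe
bf-worker1              Ready,SchedulingDisabled   dpu-worker,worker             168m    v1.27.2+55f2dbe
EOF
```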

      Actual results:

      The dpu mcp update fails; one bf worker node stays in 'Ready,SchedulingDisabled'.

      Expected results:

      The dpu mcp should update successfully.

      Additional info:

       

            Assignee: Salvatore Daniele (sdaniele@redhat.com)
            Reporter: Ying Wang (rhn-support-yingwang)
            Votes: 0
            Watchers: 4