Data Foundation Bugs / DFBUGS-637

[2254035] OSD pods scheduling is inconsistent after adding osd placement spec

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.13
    • Component: ocs-operator
      Description of problem (please be as detailed as possible and provide log
      snippets):
      On IBM Cloud ROKS clusters, we were validating the multiple-device-set feature and observed inconsistent OSD pod scheduling. We followed this article to create the device sets:

      https://access.redhat.com/articles/6214381

      This issue has been observed on both 4.13 and 4.14 ROKS clusters, each with the latest available ODF version.

      We created a ROKS cluster with 3 workers of flavor 16x64G from IBM Cloud and, after cluster creation, installed our add-on to deploy ODF. By default this creates a single device set named "ocs-deviceset" with the storage class "ibmc-vpc-block-metro-10iops-tier", and all OSD pods are evenly spread across the available workers.
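      The device set spec below was taken from the StorageCluster CR; a minimal way to dump it, assuming the usual ODF namespace, would be something like:
      ##########################################
      oc get storagecluster -n openshift-storage -o yaml
      ##########################################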
      ##########################################

      - config: {}
        count: 1
        dataPVCTemplate:
          metadata: {}
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 512Gi
            storageClassName: ibmc-vpc-block-metro-10iops-tier
            volumeMode: Block
          status: {}
        name: ocs-deviceset
        placement: {}
        portable: true
        preparePlacement: {}
        replica: 3
        resources: {}
        ###########################################
        Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=default
        NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
        10.241.0.7 Ready master,worker 8h v1.26.9+aa37255 10.241.0.7 10.241.0.7 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        10.241.128.7 Ready master,worker 8h v1.26.9+aa37255 10.241.128.7 10.241.128.7 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        10.241.64.6 Ready master,worker 8h v1.26.9+aa37255 10.241.64.6 10.241.64.6 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        ##########################################
        rook-ceph-osd-0-794885b46f-c2dx8 2/2 Running 0 7h39m 172.17.89.250 10.241.64.6 <none> <none>
        rook-ceph-osd-1-8699d65d57-88z2g 2/2 Running 0 7h39m 172.17.66.223 10.241.128.7 <none> <none>
        rook-ceph-osd-2-6b48c9b99-k8tb6 2/2 Running 0 7h38m 172.17.68.230 10.241.0.7 <none> <none>
        ##########################################
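      For reference, the pod listings in this report were presumably produced with something like the following (namespace and label assumed; Rook labels its OSD pods with app=rook-ceph-osd):
      ##########################################
      oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide
      ##########################################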

      Let's add another device set by editing the StorageCluster CR as per the above article, except that the deviceClass parameter is omitted; the storage class is "ibmc-vpc-block-metro-5iops-tier". In this case the OSD pods are scheduled on the nodes listed above and are spread across the zones. (A sketch of the edit step follows the pod listing below.)
      ##########################################

      - config: {}
        count: 1
        dataPVCTemplate:
          metadata: {}
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 512Gi
            storageClassName: ibmc-vpc-block-metro-5iops-tier
            volumeMode: Block
          status: {}
        name: ocs-deviceset-2
        placement: {}
        portable: true
        preparePlacement: {}
        replica: 3
        resources: {}
        ##########################################
        rook-ceph-osd-3-549df4f77d-l7w5s 2/2 Running 0 7h8m 172.17.89.249 10.241.64.6 <none> <none>
        rook-ceph-osd-4-56464899-qk2bl 2/2 Running 0 7h8m 172.17.66.232 10.241.128.7 <none> <none>
        rook-ceph-osd-5-7bb8c4b8c4-zszfr 2/2 Running 0 7h7m 172.17.68.238 10.241.0.7 <none> <none>
        ##########################################
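      As referenced above, a second device set such as ocs-deviceset-2 is added by editing the StorageCluster CR directly; a minimal sketch, assuming the default CR name created by the add-on:
      ##########################################
      oc edit storagecluster ocs-storagecluster -n openshift-storage
      # then append the new entry under spec.storageDeviceSets
      ##########################################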

      Now create a worker pool named "deviceset-3" with 3 workers from the IBM Cloud UI and add the following labels to its nodes (a labeling sketch follows the label list). Then create another device set with deviceClass "deviceset-3", storage class "ibmc-vpc-block-metro-5iops-tier", and placement policies as well. In this case the OSD pods either get scheduled across only 2 zones or all land on a single worker, depending on the affinity condition.
      ##########################################
      cluster.ocs.openshift.io/openshift-storage: ""
      cluster.ocs.openshift.io/openshift-storage-device-class: deviceset-3
      ##########################################
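      As mentioned above, these labels can be applied to the workers of the new pool with a single command, e.g. (node names taken from the listing further below):
      ##########################################
      oc label node 10.241.0.9 10.241.128.12 10.241.64.11 \
        cluster.ocs.openshift.io/openshift-storage="" \
        cluster.ocs.openshift.io/openshift-storage-device-class=deviceset-3
      ##########################################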

      - config: {}
        count: 1
        dataPVCTemplate:
          metadata: {}
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 512Gi
            storageClassName: ibmc-vpc-block-metro-5iops-tier
            volumeMode: Block
          status: {}
        deviceClass: deviceset-3
        name: ocs-deviceset-3
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cluster.ocs.openshift.io/openshift-storage-device-class
                  operator: In
                  values:
                  - deviceset-3
        ##########################################
        Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-3
        NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
        10.241.0.9 Ready master,worker 7h23m v1.26.9+aa37255 10.241.0.9 10.241.0.9 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        10.241.128.12 Ready master,worker 7h23m v1.26.9+aa37255 10.241.128.12 10.241.128.12 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        10.241.64.11 Ready master,worker 7h23m v1.26.9+aa37255 10.241.64.11 10.241.64.11 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        ##########################################
        rook-ceph-osd-6-6b456f7844-jp4x2 2/2 Running 0 6h14m 172.17.110.209 10.241.64.11 <none> <none>
        rook-ceph-osd-7-55b98ff548-v4rsh 2/2 Running 0 6h14m 172.17.110.212 10.241.64.11 <none> <none>
        rook-ceph-osd-8-b45474c5f-6vnqv 2/2 Running 0 6h13m 172.17.110.214 10.241.64.11 <none> <none>
        ##########################################
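      All three OSDs of ocs-deviceset-3 landed on 10.241.64.11. Assuming the metro storage classes use WaitForFirstConsumer volume binding, the zone in which each OSD prepare pod runs pins the zone its PVC binds to, so checking where the prepare pods ran is a useful diagnostic (label taken from Rook's conventions):
      ##########################################
      oc get pods -n openshift-storage -l app=rook-ceph-osd-prepare -o wide
      ##########################################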

      Same steps as the previous scenario but with a different storage class, "ibmc-vpc-block-metro-general-purpose". In this case the OSD pods are all distributed across zones as expected.
      ##########################################

      - config: {}
        count: 1
        dataPVCTemplate:
          metadata: {}
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 512Gi
            storageClassName: ibmc-vpc-block-metro-general-purpose
            volumeMode: Block
          status: {}
        deviceClass: deviceset-4
        name: ocs-deviceset-4
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cluster.ocs.openshift.io/openshift-storage-device-class
                  operator: In
                  values:
                  - deviceset-4
        portable: true
        preparePlacement: {}
        replica: 3
        resources: {}
        ##########################################
        Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-4
        NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
        10.241.0.10 Ready master,worker 7h57m v1.26.9+aa37255 10.241.0.10 10.241.0.10 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        10.241.128.13 Ready master,worker 7h56m v1.26.9+aa37255 10.241.128.13 10.241.128.13 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        10.241.64.12 Ready master,worker 7h57m v1.26.9+aa37255 10.241.64.12 10.241.64.12 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        ##########################################
        rook-ceph-osd-9-67dd868dc8-jhw4q 2/2 Running 0 4h56m 172.17.116.72 10.241.128.13 <none> <none>
        rook-ceph-osd-10-54d5b69df5-mvvzj 2/2 Running 0 4h56m 172.17.125.8 10.241.64.12 <none> <none>
        rook-ceph-osd-11-548ff94bdb-sp7cv 2/2 Running 0 4h56m 172.17.75.137 10.241.0.10 <none> <none>
        ##########################################
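      To confirm the spread, each node's zone can be listed via the standard topology label, e.g.:
      ##########################################
      oc get nodes -L topology.kubernetes.io/zone -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-4
      ##########################################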

      Same steps as the previous scenario with the same storage class, "ibmc-vpc-block-metro-general-purpose". In this case the OSD pods are distributed unevenly, with 2 OSDs in the same zone (both on node 10.241.64.13).
      ##########################################

      - config: {}
        count: 1
        dataPVCTemplate:
          metadata: {}
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 512Gi
            storageClassName: ibmc-vpc-block-metro-general-purpose
            volumeMode: Block
          status: {}
        deviceClass: deviceset-5
        name: ocs-deviceset-5
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cluster.ocs.openshift.io/openshift-storage-device-class
                  operator: In
                  values:
                  - deviceset-5
        portable: true
        preparePlacement: {}
        replica: 3
        resources: {}
        ##########################################
        Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-5
        NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
        10.241.0.11 Ready master,worker 6h30m v1.26.9+aa37255 10.241.0.11 10.241.0.11 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        10.241.128.14 Ready master,worker 6h30m v1.26.9+aa37255 10.241.128.14 10.241.128.14 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        10.241.64.13 Ready master,worker 6h30m v1.26.9+aa37255 10.241.64.13 10.241.64.13 Red Hat Enterprise Linux 8.8 (Ootpa) 4.18.0-477.27.1.el8_8.x86_64 cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
        ##########################################
        rook-ceph-osd-12-6fc6c68645-cwdwz 2/2 Running 0 4h1m 172.17.91.201 10.241.64.13 <none> <none>
        rook-ceph-osd-13-6f6cb46d4f-55xsz 2/2 Running 0 4h1m 172.17.91.203 10.241.64.13 <none> <none>
        rook-ceph-osd-14-7988b69947-csrkl 2/2 Running 0 4h 172.17.103.72 10.241.0.11 <none> <none>
        ##########################################
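      The zone each backing volume bound to can also be inspected; a sketch, assuming Rook's usual PVC label:
      ##########################################
      oc get pvc -n openshift-storage -l ceph.rook.io/DeviceSet=ocs-deviceset-5
      oc describe pv <pv-name> | grep -A3 'Node Affinity'
      ##########################################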

      Version of all relevant components (if applicable):
      Latest ODF 4.13 & 4.14

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      We are assessing the multiple-device-set feature of ODF on IBM Cloud for customers.

      Is there any workaround available to the best of your knowledge?
      If we include podAntiAffinity rules for the OSD prepare jobs, the OSD pods are scheduled as expected (the placement snippet is below; a preparePlacement sketch follows it).
      ##############################################
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-8
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd
                  - rook-ceph-osd-prepare
              topologyKey: topology.kubernetes.io/zone
            weight: 100
        ##############################################
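      Since the intent is to influence the OSD prepare jobs specifically, the same anti-affinity block can presumably also be carried in the device set's preparePlacement field (a sketch, not verified here):
      ##############################################
      preparePlacement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd
                  - rook-ceph-osd-prepare
              topologyKey: topology.kubernetes.io/zone
            weight: 100
      ##############################################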

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      3

      Is this issue reproducible?
      Yes, tried on 2 clusters in 2 different environments:
      4.13 on the production environment
      4.14 on the internal staging environment

      Can this issue be reproduced from the UI?
      NA

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      All scenarios are detailed in the description

      Actual results:
      OSD pods spread across zones only when the device sets use different storage classes; when a storage class is reused, the OSDs can land in the same zone or even on the same node.

      Expected results:
      OSD pods should be scheduled across nodes from different zones.

      Additional info:
      NA

              mparida@redhat.com Malay Kumar Parida
              jira-bugzilla-migration RH Bugzilla Integration
              Elad Ben Aharon Elad Ben Aharon