Data Foundation Bugs / DFBUGS-2361

[4.18] ODF with RDR and multipath fails during upgrade from 4.17 to 4.18 - OSD migration fails as ceph-volume fails to fetch the multipath device


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • odf-4.18.7
    • odf-4.18
    • rook
    • None
    • Important
    • Proposed
    • None

       

      Description of problem:

      Upgrade from 4.17 to 4.18 fails for OpenShift/ODF clusters where RDR is enabled and the OSDs are backed by multipath devices.

      [client.admin]
      keyring = /var/lib/rook/openshift-storage/client.admin.keyring
      2025-03-07 14:01:38.402172 I | cephcmd: destroying osd.0 and cleaning its backing device
      2025-03-07 14:01:38.402508 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list  --format json
      2025-03-07 14:01:39.578802 D | cephosd: {}
      2025-03-07 14:01:39.578963 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
      2025-03-07 14:01:39.579037 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list --format json
      2025-03-07 14:01:53.916338 D | cephosd: {
          "5858fb3b-278d-4332-be6d-1bcdada54327": {
              "ceph_fsid": "8206fd72-4080-4bf2-9ada-fa209686e101",
              "device": "/dev/sdh",
              "osd_id": 0,
              "osd_uuid": "5858fb3b-278d-4332-be6d-1bcdada54327",
              "type": "bluestore-rdr"
          }
      }
      2025-03-07 14:01:53.916771 I | cephosd: 1 ceph-volume raw osd devices configured on this node
      2025-03-07 14:01:53.916820 I | cephosd: destroying osd.0
      2025-03-07 14:01:53.916873 D | exec: Running command: ceph osd destroy osd.0 --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
      2025-03-07 14:01:54.678150 I | cephosd: successfully destroyed osd.0
      2025-03-07 14:01:54.678295 I | cephosd: zap OSD.0 path "/dev/sdh"
      2025-03-07 14:01:54.678336 D | exec: Running command: stdbuf -oL ceph-volume lvm zap /dev/sdh --destroy
      2025-03-07 14:01:56.700741 C | rookcmd: failed to destroy OSD 0.: fa

      Here the device should be "/dev/mapper/mpathb" because the disk is multipath-enabled. Since ceph-volume reports the underlying path "/dev/sdh" instead, zapping the device fails and, as a result, the OSD migration fails as well. The multipath topology can be confirmed on the node as sketched below.
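      A minimal sketch for confirming the mapping from the node shell (the device names sdh/mpathb match this report but will differ per node):

      sh-5.1# multipath -ll                  # lists multipath maps; /dev/sdh should appear as a path under mpathb
      sh-5.1# lsblk -o NAME,TYPE /dev/sdh    # shows the mpath device-mapper holder stacked on top of sdh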


       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      Bare Metal (IBM Z)

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Internal (LSO), RDR environment

       

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      OCP: 4.18.3
      ODF: 4.18.0-rhodf
      ACM: 2.12.2

      Is there any workaround available to the best of your knowledge?

      No

       

      Can this issue be reproduced? If so, please provide the hit rate

      Yes, always

       

      Can this issue be reproduced from the UI?

      Yes

      If this is a regression, please provide more details to justify this:

       

      Steps to Reproduce:

      1. Deploy ODF 4.17.5-rhodf with multipath devices on an IBM Z or x86 environment

      2. Upgrade to 4.18.0-rhodf

      3. Observe the migration status of the OSDs; the OSD for which migration was attempted is stuck in CrashLoopBackOff state (see the commands sketched below)
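      A sketch of the commands used to observe step 3 (namespace and label as in a default ODF deployment; the exact prepare pod name will differ per cluster):

      # oc get pods -n openshift-storage -l app=rook-ceph-osd      # one OSD pod shows CrashLoopBackOff
      # oc logs -n openshift-storage <rook-ceph-osd-prepare-pod>   # shows the ceph-volume zap failure below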

       

      The exact date and time when the issue was observed, including timezone details:

       

      Actual results:

      OSD migration fails and the OSD pod is in CrashLoopBackOff state

       

      Expected results:

      OSD migration should succeed and the OSD store type should change from bluestore-rdr to bluestore; this can be verified as sketched below.
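      A sketch for verifying the expected result, running the ceph command from the toolbox pod and ceph-volume on the OSD node (osd.0 as in this report):

      sh-5.1# ceph osd metadata 0 | grep osd_objectstore   # should report "bluestore" once migration completes
      sh-5.1# ceph-volume raw list --format json           # "type" should read "bluestore" instead of "bluestore-rdr"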

      Logs collected and log location:

       

      Additional info:


       

      # oc logs rook-ceph-osd-prepare-243d93cc209f61f44545ac4620752c4b-gsmxb
      Defaulted container "provision" out of: provision, copy-bins (init), blkdevmapper (init)
      mon_data_avail_warn                = 15
      mon_warn_on_pool_no_redundancy     = false
      bluestore_prefer_deferred_size_hdd = 0
      [osd]
      osd_memory_target_cgroup_limit_ratio = 0.8
      [client.admin]
      keyring = /var/lib/rook/openshift-storage/client.admin.keyring
      2025-03-07 14:01:38.402172 I | cephcmd: destroying osd.0 and cleaning its backing device
      2025-03-07 14:01:38.402508 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list  --format json
      2025-03-07 14:01:39.578802 D | cephosd: {}
      2025-03-07 14:01:39.578963 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
      2025-03-07 14:01:39.579037 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list --format json
      2025-03-07 14:01:53.916338 D | cephosd: {
          "5858fb3b-278d-4332-be6d-1bcdada54327": {
              "ceph_fsid": "8206fd72-4080-4bf2-9ada-fa209686e101",
              "device": "/dev/sdh",
              "osd_id": 0,
              "osd_uuid": "5858fb3b-278d-4332-be6d-1bcdada54327",
              "type": "bluestore-rdr"
          }
      }
      2025-03-07 14:01:53.916771 I | cephosd: 1 ceph-volume raw osd devices configured on this node
      2025-03-07 14:01:53.916820 I | cephosd: destroying osd.0
      2025-03-07 14:01:53.916873 D | exec: Running command: ceph osd destroy osd.0 --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
      2025-03-07 14:01:54.678150 I | cephosd: successfully destroyed osd.0
      2025-03-07 14:01:54.678295 I | cephosd: zap OSD.0 path "/dev/sdh"
      2025-03-07 14:01:54.678336 D | exec: Running command: stdbuf -oL ceph-volume lvm zap /dev/sdh --destroy
      2025-03-07 14:01:56.700741 C | rookcmd: failed to destroy OSD 0.: fa
      Traceback (most recent call last):
        File "/usr/sbin/ceph-volume", line 33, in <module>
          sys.exit(load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')())
        File "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 54, in __init__
          self.main(self.argv)
        File "/usr/lib/python3.9/site-packages/ceph_volume/decorators.py", line 59, in newfunc
          return f(*a, **kw)
        File "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 166, in main
          terminal.dispatch(self.mapper, subcommand_args)
        File "/usr/lib/python3.9/site-packages/ceph_volume/terminal.py", line 194, in dispatch
          instance.main()
        File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main
          terminal.dispatch(self.mapper, self.argv)
        File "/usr/lib/python3.9/site-packages/ceph_volume/terminal.py", line 194, in dispatch
          instance.main()
        File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/zap.py", line 431, in main
          self.zap()
        File "/usr/lib/python3.9/site-packages/ceph_volume/decorators.py", line 16, in is_root
          return func(*a, **kw)
        File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/zap.py", line 307, in zap
          self.zap_raw_device(device)
        File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/zap.py", line 289, in zap_raw_device
          zap_device(device.path)
        File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/zap.py", line 25, in zap_device
          zap_bluestore(path)
        File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/zap.py", line 36, in zap_bluestore
          process.run([
        File "/usr/lib/python3.9/site-packages/ceph_volume/process.py", line 147, in run
          raise RuntimeError(msg)
      RuntimeError: command returned non-zero exit status: 1.: exit status 1
      

       

       

      For comparison, a bare ceph-volume raw list reports the underlying block device paths, while listing the specific deviceset mount path (second output below) resolves the multipath mapper device:

      sh-5.1# ceph-volume raw list --format json
      {
          "230473aa-4d5a-406e-89ca-02e47bf7a98f": {
              "ceph_fsid": "6fdb7424-efc1-4916-a9d5-5597b9be8d87",
              "device": "/dev/loop1",
              "osd_id": 1,
              "osd_uuid": "230473aa-4d5a-406e-89ca-02e47bf7a98f",
              "type": "bluestore-rdr"
          },
          "643a3711-8d53-457d-8c25-f633d35026a5": {
              "ceph_fsid": "4db013b4-272a-4fc6-beeb-d0fdac6cbbc5",
              "device": "/dev/sdh",
              "osd_id": 2,
              "osd_uuid": "643a3711-8d53-457d-8c25-f633d35026a5",
              "type": "bluestore-rdr"
          }
      }

       

      2025-03-10 10:48:46.421588 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-localblock-0-data-16rtbs --format json
      2025-03-10 10:48:46.827620 D | cephosd: {
          "230473aa-4d5a-406e-89ca-02e47bf7a98f": {
              "ceph_fsid": "6fdb7424-efc1-4916-a9d5-5597b9be8d87",
              "device": "/dev/mapper/mpathb",
              "osd_id": 1,
              "osd_uuid": "230473aa-4d5a-406e-89ca-02e47bf7a98f",
              "type": "bluestore-rdr"
          }
      }
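
      The mapping that ceph-volume fails to resolve is visible directly in sysfs (a sketch using the device names from this report; dm-0 is illustrative):

      sh-5.1# ls /sys/block/sdh/holders/    # e.g. dm-0, the device-mapper node stacked on sdh
      sh-5.1# cat /sys/block/dm-0/dm/name   # e.g. mpathb, the multipath alias of that dm node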

      Attachments:
        1. osd-0-prepare.log (15 kB)
        2. osd-1-prepare.log (15 kB)
        3. osd-2-prepare.log (15 kB)

              Assignee: Santosh Pillai (sapillai)
              Reporter: Sravika Balusu (sravikab2)