Data Foundation Bugs / DFBUGS-920

[RDR] Migration of OSDs from bluestore-rdr to bluestore is stalled

    • Type: Bug
    • Resolution: Unresolved
    • Component: rook

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

[RDR] Migration of OSDs from bluestore-rdr to bluestore stalls after upgrading RDR from 4.17 to 4.18: only 3 OSDs were migrated and 3 still report bluestore-rdr.

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

VMware, UPI

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

       RDR

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

OCP: 4.18
ODF: 4.18
Ceph versions:

{
    "mon": {
        "ceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)": 3
    },
    "mgr": {
        "ceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)": 2
    },
    "osd": {
        "ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable)": 2,
        "ceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)": 3
    },
    "mds": {
        "ceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)": 2
    },
    "rbd-mirror": {
        "ceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)": 1
    },
    "rgw": {
        "ceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)": 1
    },
    "overall": {
        "ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable)": 2,
        "ceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)": 12
    }
}
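For reference, a minimal sketch of how the per-daemon version map above can be collected, assuming the rook-ceph-tools toolbox Deployment is enabled in the openshift-storage namespace:

# Assumption: the rook-ceph-tools toolbox Deployment exists in openshift-storage.
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph versions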

       

      Does this issue impact your ability to continue to work with the product?

       yes

       

      Is there any workaround available to the best of your knowledge?

       

       

      Can this issue be reproduced? If so, please provide the hit rate

       

       

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

1. Deploy an RDR cluster on version 4.17.

2. Perform add capacity.

3. Upgrade OCP from 4.17 to 4.18.

4. Upgrade RDR from 4.17 to 4.18 (a verification sketch follows these steps).
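
A minimal verification sketch for step 4; the CephCluster name (ocs-storagecluster-cephcluster) and the status field path (status.storage.osd.storeType) are assumptions based on the default ODF/Rook layout:

# Assumed field path: Rook reports OSD counts per backing store here.
oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
  -o jsonpath='{.status.storage.osd.storeType}{"\n"}'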

      The exact date and time when the issue was observed, including timezone details:

       

      Actual results:

Only 3 OSDs were migrated to bluestore; 3 still report bluestore-rdr:

oc get cephcluster -o yaml | grep bluestore
type: bluestore
bluestore: 3
bluestore-rdr: 3
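
A hedged way to list which individual OSDs still report the old backing store, assuming the rook-ceph-tools toolbox is available and that ceph osd metadata exposes the store type in its osd_objectstore field:

# List each OSD with its reported objectstore (jq assumed available on the client).
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd metadata \
    | jq -r '.[] | "osd.\(.id) \(.osd_objectstore)"'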
       ### snippet from rook logs

      2024-11-21 19:34:10.065644 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:34:15.184412 E | ceph-nodedaemon-controller: node reconcile failed: failed to create ceph-exporter metrics service: failed to create service rook-ceph-exporter. Internal error occurred: resource quota evaluation timed out
      2024-11-21 19:34:28.676025 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:160} {StateName:active+undersized+degraded+remapped+backfill_wait Count:8} {StateName:active+undersized+degraded+remapped+backfilling Count:1}]"
      2024-11-21 19:34:51.891976 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:162} {StateName:active+undersized+degraded+remapped+backfill_wait Count:6} {StateName:active+undersized+degraded+remapped+backfilling Count:1}]"
      2024-11-21 19:34:59.390948 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:162} {StateName:active+undersized+degraded+remapped+backfill_wait Count:6} {StateName:active+undersized+degraded+remapped+backfilling Count:1}]"
      2024-11-21 19:35:10.064635 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:35:30.121899 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:162} {StateName:active+undersized+degraded+remapped+backfill_wait Count:6} {StateName:active+undersized+degraded+remapped+backfilling Count:1}]"
      2024-11-21 19:35:52.362217 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:163} {StateName:active+undersized+degraded+remapped+backfill_wait Count:5} {StateName:active+undersized+degraded+remapped+backfilling Count:1}]"
      2024-11-21 19:36:00.732123 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:163} {StateName:active+undersized+degraded+remapped+backfill_wait Count:5} {StateName:active+undersized+degraded+remapped+backfilling Count:1}]"
      2024-11-21 19:36:10.065530 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:36:31.396184 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:163} {StateName:active+undersized+degraded+remapped+backfill_wait Count:5} {StateName:active+undersized+degraded+remapped+backfilling Count:1}]"
      2024-11-21 19:36:52.495706 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:166} {StateName:active+undersized+degraded+remapped+backfill_wait Count:2} {StateName:active+undersized+degraded+remapped+backfilling Count:1}]"
      2024-11-21 19:37:01.953644 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:167} {StateName:active+undersized+degraded+remapped+backfill_wait Count:1} {StateName:active+undersized+degraded+remapped+backfilling Count:1}]"
      2024-11-21 19:37:10.065134 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:37:32.622636 I | clusterdisruption-controller: all PGs are active+clean. Restoring default OSD pdb settings
      2024-11-21 19:37:32.622812 I | clusterdisruption-controller: creating the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
      2024-11-21 19:37:32.664172 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-rack-rack1" with maxUnavailable=0 for "rack" failure domain "rack1"
      2024-11-21 19:37:32.677792 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-rack-rack2" with maxUnavailable=0 for "rack" failure domain "rack2"
      2024-11-21 19:38:10.065589 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:39:10.064666 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:40:10.065554 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:41:10.065576 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:42:10.065328 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:43:10.065312 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:44:10.065513 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:45:10.064839 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:46:10.065625 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:47:10.064786 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:48:10.065417 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:49:10.065659 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:50:10.064705 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:51:10.065698 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:52:10.065527 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:53:10.065237 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:54:10.064653 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:55:10.064844 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:56:10.064635 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:57:10.064910 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:58:10.065671 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 19:59:10.064731 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:00:10.064634 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:01:10.065276 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:02:10.065290 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:03:10.064634 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:04:10.064705 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:05:10.065685 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:06:10.065626 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:07:10.065526 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:08:10.065294 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:09:10.065439 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:10:10.065650 I | op-osd: waiting... 0 of 1 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated
      2024-11-21 20:10:39.606836 W | op-mon: failed to get the list of monitor canary deployments. failed to list deployments with labelSelector app=rook-ceph-mon,mon_canary=true: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
      2024-11-21 20:10:49.651680 W | op-mon: failed to check mon health. failed to check if the service "c" is already exported: failed to get exported service "rook-ceph-mon-c": rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
      2024-11-21 20:11:07.829353 E | cephclient: failed to set ceph block pool "ocs-storagecluster-cephblockpool" mirroring status. failed to update object "openshift-storage/ocs-storagecluster-cephblockpool" status: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
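
The operator keeps repeating "waiting... 0 of 1 OSD prepare jobs have finished processing", so the next diagnostic step would be to inspect that prepare job. A hedged sketch, assuming the usual Rook label app=rook-ceph-osd-prepare on the prepare jobs and pods:

# Inspect the OSD prepare job the operator is waiting on (label assumed).
oc -n openshift-storage get jobs,pods -l app=rook-ceph-osd-prepare
oc -n openshift-storage logs -l app=rook-ceph-osd-prepare --tail=50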
      

      Expected results:

All OSDs should be migrated from bluestore-rdr to bluestore.
       

      Logs collected and log location:

       

      Additional info:

       

Assignee: Santosh Pillai (sapillai)
Reporter: Pratik Surve (prsurve@redhat.com)