Data Foundation Bugs / DFBUGS-728

[2319878] New crush rule causing rebalance


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Critical
    • Affects Version: odf-4.16.4
    • Component: rook

      Description of problem (please be as detailed as possible and provide log
      snippets):

      • We noticed that some new crush rules get pushed out with an update to ODF 4.16:

      .mgr_host_ssd
      ocs-storagecluster-cephblockpool_host_ssd
      .rgw.root_host_ssd
      ocs-storagecluster-cephobjectstore.rgw.otp_host_ssd
      ocs-storagecluster-cephobjectstore.rgw.meta_host_ssd
      ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec_host_ssd
      ocs-storagecluster-cephobjectstore.rgw.buckets.index_host_ssd
      ocs-storagecluster-cephobjectstore.rgw.control_host_ssd
      ocs-storagecluster-cephobjectstore.rgw.log_host_ssd
      ocs-storagecluster-cephobjectstore.rgw.buckets.data_host_ssd
      ocs-storagecluster-cephfilesystem-metadata_host_ssd
      ocs-storagecluster-cephfilesystem-data0_host_ssd

      This has resulted in at least one case of a large rebalance after updating ODF (some read-only commands for inspecting the new rules and watching the backfill are sketched after the status output below):

      cluster:
        id:     c745f785-45cc-4c32-b62d-67fd61b87321
        health: HEALTH_WARN
                1 nearfull osd(s)
                Low space hindering backfill (add storage if this doesn't resolve itself): 7 pgs backfill_toofull
                12 pool(s) nearfull
                1 daemons have recently crashed

      services:
        mon: 3 daemons, quorum a,c,e (age 13h)
        mgr: a(active, since 13h), standbys: b
        mds: 1/1 daemons up, 1 hot standby
        osd: 12 osds: 12 up (since 13h), 12 in (since 19M); 125 remapped pgs
        rgw: 1 daemon active (1 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   12 pools, 281 pgs
        objects: 8.31M objects, 8.9 TiB
        usage:   27 TiB used, 21 TiB / 48 TiB avail
        pgs:     12427371/24917634 objects misplaced (49.874%)
                 156 active+clean
                 116 active+remapped+backfill_wait
                 7   active+remapped+backfill_wait+backfill_toofull
                 2   active+remapped+backfilling
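
      For reference, here is a sketch of a few read-only checks that can be run from the rook-ceph toolbox pod to list the new rules, see which rule each pool now uses, and watch the backfill/capacity state. The rule and pool names are taken from this report and may differ on other clusters; the toolbox must be enabled.

      # List all crush rules, including the new *_host_ssd ones:
      ceph osd crush rule ls
      ceph osd crush rule dump ocs-storagecluster-cephblockpool_host_ssd
      # Show which crush rule each pool currently uses:
      for pool in $(ceph osd pool ls); do
          echo -n "$pool: "
          ceph osd pool get "$pool" crush_rule
      done
      # Per-OSD utilisation (the nearfull OSD) and the PGs stuck backfill_toofull:
      ceph osd df tree
      ceph pg ls backfill_toofull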

      • Here are the rook-ceph-operator logs during the event (a sketch of how to check the CR specs driving these rule changes follows the log excerpt):

      2024-10-16T02:37:37.807980427Z 2024-10-16 02:37:37.807746 I | cephclient: creating a new crush rule for changed failure domain ("host"-->"rack") on crush rule "replicated_rule"
      2024-10-16T02:37:37.807980427Z 2024-10-16 02:37:37.807771 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "replicated_rule"
      2024-10-16T02:37:58.842770528Z 2024-10-16 02:37:58.842722 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephblockpool_rack"
      2024-10-16T02:40:58.441261067Z 2024-10-16 02:40:58.441216 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule ".rgw.root_rack"
      2024-10-16T02:40:58.456827394Z 2024-10-16 02:40:58.456778 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.meta_rack"
      2024-10-16T02:40:58.458467757Z 2024-10-16 02:40:58.458423 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.control_rack"
      2024-10-16T02:40:58.464400294Z 2024-10-16 02:40:58.464360 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.otp_rack"
      2024-10-16T02:40:58.470327637Z 2024-10-16 02:40:58.470228 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.index_rack"
      2024-10-16T02:40:58.477802276Z 2024-10-16 02:40:58.477762 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.log_rack"
      2024-10-16T02:40:58.484785130Z 2024-10-16 02:40:58.484742 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec_rack"
      2024-10-16T02:41:03.696760054Z 2024-10-16 02:41:03.696712 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.data_rack"
      2024-10-16T02:43:30.629001666Z 2024-10-16 02:43:30.628941 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephfilesystem-metadata_rack"
      2024-10-16T02:43:34.196044470Z 2024-10-16 02:43:34.195994 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephfilesystem-data0_rack"
      2024-10-16T02:37:37.807980427Z 2024-10-16 02:37:37.807779 I | cephclient: crush rule "replicated_rule" will no longer be used by pool ".mgr"
      2024-10-16T02:37:58.842770528Z 2024-10-16 02:37:58.842744 I | cephclient: crush rule "ocs-storagecluster-cephblockpool_rack" will no longer be used by pool "ocs-storagecluster-cephblockpool"
      2024-10-16T02:40:58.441261067Z 2024-10-16 02:40:58.441238 I | cephclient: crush rule ".rgw.root_rack" will no longer be used by pool ".rgw.root"
      2024-10-16T02:40:58.456827394Z 2024-10-16 02:40:58.456810 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.meta_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.meta"
      2024-10-16T02:40:58.458467757Z 2024-10-16 02:40:58.458446 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.control_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.control"
      2024-10-16T02:40:58.464400294Z 2024-10-16 02:40:58.464385 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.otp_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.otp"
      2024-10-16T02:40:58.470327637Z 2024-10-16 02:40:58.470254 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.index_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.buckets.index"
      2024-10-16T02:40:58.477802276Z 2024-10-16 02:40:58.477785 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.log_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.log"
      2024-10-16T02:40:58.484785130Z 2024-10-16 02:40:58.484769 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec"
      2024-10-16T02:41:03.696760054Z 2024-10-16 02:41:03.696736 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.data_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.buckets.data"
      2024-10-16T02:43:30.629001666Z 2024-10-16 02:43:30.628966 I | cephclient: crush rule "ocs-storagecluster-cephfilesystem-metadata_rack" will no longer be used by pool "ocs-storagecluster-cephfilesystem-metadata"
      2024-10-16T02:43:34.196044470Z 2024-10-16 02:43:34.196026 I | cephclient: crush rule "ocs-storagecluster-cephfilesystem-data0_rack" will no longer be used by pool "ocs-storagecluster-cephfilesystem-data0"
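
      For context, the failureDomain and deviceClass values the operator reconciles against come from the CephBlockPool and StorageCluster specs. A quick, read-only way to check them is sketched below; the resource names and the openshift-storage namespace are assumed from this report.

      oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool \
          -o jsonpath='{.spec.failureDomain}{" "}{.spec.deviceClass}{"\n"}'
      oc -n openshift-storage get storagecluster ocs-storagecluster -o yaml \
          | grep -iE 'deviceClass|failureDomain'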

      Version of all relevant components (if applicable):
      ODF 4.16
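
      In case it helps triage, the exact operator and Ceph daemon versions can be confirmed with the read-only commands below (namespace assumed to be openshift-storage; ceph versions run from the toolbox pod):

      oc -n openshift-storage get csv
      ceph versions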

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Yes, it temporarily degrades the ODF cluster until backfilling completes

      Is there any workaround available to the best of your knowledge?

      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      3

      Is this issue reproducible?

      I haven't been able to reproduce it in the lab yet, but have tried once and will be trying again.

      Can this issue be reproduced from the UI?

      No

      If this is a regression, please provide more details to justify this:

      No

      Actual results:
      Ceph has to go through a rebalance after upgrading to ODF 4.16

      Expected results:
      Ceph doesn't have to go through a rebalance after upgrading to ODF 4.16

      Additional info:
      It would be very helpful if engineering could shed some light on a) whether the rebalance is expected behaviour and b) if it is expected, whether there is a recommended way to mitigate this interruption; this is my main question/concern. A possible mitigation is sketched below.
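
      If the rebalance turns out to be expected, one option we could offer the customer is to pause or throttle data movement around the update window. This is only a sketch and would need engineering's confirmation; the commands are run from the rook-ceph toolbox pod, and the flags must be unset afterwards.

      # Pause data movement while the new rules are applied:
      ceph osd set norebalance
      ceph osd set nobackfill
      # ...apply/complete the ODF update...
      # Re-enable data movement when ready for the backfill to run:
      ceph osd unset nobackfill
      ceph osd unset norebalance
      # Or throttle rather than pause (on recent Ceph releases using the mClock
      # scheduler this may additionally require
      # osd_mclock_override_recovery_settings=true to take effect):
      ceph config set osd osd_max_backfills 1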

      Relevant Attachments in supportshell:
      [bmcmurra@supportshell-1 03961471]$ ll
      total 132
      drwxrwxrwx+ 3 yank yank 54 Oct 16 16:02 0010-inspect-lbs-i2l.tar.gz
      drwxrwxrwx+ 3 yank yank 55 Oct 16 16:02 0020-inspect-openshift-storage.tar.gz
      -rw-rw-rw-+ 1 yank yank 112257 Oct 16 17:11 0030-image.png
      drwxrwxrwx+ 3 yank yank 59 Oct 16 17:24 0040-must-gather-openshift-logging.tar.gz
      drwxrwxrwx+ 3 yank yank 59 Oct 17 18:22 0050-must-gather-openshift-storage.tar.gz

      Let me know if you require any more data than what's already in supportshell.

      Thanks

      Brandon McMurray
      Technical Support Engineer, RHCE
      Software Defined Storage and Openshift Data Foundation

              Travis Nielsen (tnielsen@redhat.com)
              Brandon McMurray (rhn-support-bmcmurra)
              Neha Berry
              Votes: 0
              Watchers: 11
