Data Foundation Bugs / DFBUGS-243

[2317473] OSD Flapping and Crash after unsetting nobackfill and norecovery

    • Bug
    • Resolution: Unresolved
    • Critical
    • odf-4.18
    • odf-4.14
    • ceph/RADOS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets):

      We completed the actions below for the customer. After we removed the nobackfill and norecover flags at the end, the OSDs continued to flap and eventually crashed and went down.
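
For reference, "removing" the flags means unsetting the cluster-wide OSD flags. A minimal sketch of that step, assuming the standard `ceph osd unset` CLI is available with admin access; the helper names `run` and `unset_recovery_flags` and the `DRY_RUN` switch are hypothetical and exist only for illustration. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands by default (DRY_RUN=1); set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

unset_recovery_flags() {
  local flag
  # Unset one flag at a time so the cluster's reaction can be watched in between.
  for flag in nobackfill norecover; do
    run ceph osd unset "$flag"
  done
  # The "flags" line of the dump shows which flags are still set.
  run ceph osd dump
}

unset_recovery_flags
```

Unsetting the flags one at a time, with a pause to watch the OSD logs, makes it easier to tell which flag's removal triggers the flapping.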

      We were able to capture logs, though; see comments #218 and #219 in the case:

      https://gss--c.vf.force.com/apex/Case_View?id=5006R0000210QMRQA2&sfdc.override=1#filters_Comment,Transcript,File%20Attachment,External%20Tracker%20Update/pageNumber_1


      #actions done

      ```
      Action Plan:
      1) Start osd-0, see how it behaves. We want to follow [1] to get the osd back up and running from its sleep state
      2) If osd-0 is crashing after bringing the osds up, we need to gather the osd logs. DO NOT proceed!
      3) If osd-0 stays up and running, we need to move on to the other osds [7,0,2,3] (pick one we haven't done)
      4) Stop one osd at a time, back up the PG [2] and the snap object [3], then remove the snap object from the osd [4].
      5) Start the down osd again. If it crashes, get the logs and stop.
      6) Repeat from step 3 until we've removed the object from all osds

      NOTES:

      • Please do not remove nobackfill/norecover at this stage. We can do this later.

      [1] https://access.redhat.com/solutions/6523031
      [2] ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{id}/ --pgid 2.14 --op export --file /tmp/pg-2-14-osd{id}.bin
      [3] ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{id}/ --pgid 2.14 '["2.14",{"oid":"rbd_data.1b8f1e13abd0c1.00000000000672bc","key":"","snapid":8,"hash":4178376852,"max":0,"pool":2,"namespace":"","max":0}]' get-bytes > ceph-osd{id}-rbd_data.1b8f1e13abd0c1.00000000000672bc.snapid-8.bin
      [4] ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{id}/ --pgid 2.14 '{"oid":"rbd_data.1b8f1e13abd0c1.00000000000672bc","key":"","snapid":8,"hash":4178376852,"max":0,"pool":2,"namespace":"","max":0}' remove
      ```
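
Steps 4 and 5 of the plan can be sketched as a dry-run helper that only prints the commands for one OSD, so they can be reviewed before anything is run. The function name `print_osd_commands` and the variables `PGID`, `OBJ_NAME`, and `OBJ` are hypothetical; the paths, PG id, and object spec are taken from the plan. How the OSD is stopped and started depends on the environment, so that part is left as a comment rather than guessed.

```shell
#!/usr/bin/env bash
# Sketch only: prints the backup/remove commands for one OSD; it does not run them.
PGID="2.14"
OBJ_NAME="rbd_data.1b8f1e13abd0c1.00000000000672bc"
# Object spec exactly as given in the action plan.
OBJ='{"oid":"rbd_data.1b8f1e13abd0c1.00000000000672bc","key":"","snapid":8,"hash":4178376852,"max":0,"pool":2,"namespace":"","max":0}'

print_osd_commands() {
  local id="$1"
  local data_path="/var/lib/ceph/osd/ceph-${id}/"
  echo "# OSD ${id} must be stopped before running the commands below"
  # [2] export the whole PG as a backup
  echo "ceph-objectstore-tool --data-path ${data_path} --pgid ${PGID} --op export --file /tmp/pg-2-14-osd${id}.bin"
  # [3] save the bytes of the snap object
  echo "ceph-objectstore-tool --data-path ${data_path} --pgid ${PGID} '[\"${PGID}\",${OBJ}]' get-bytes > ceph-osd${id}-${OBJ_NAME}.snapid-8.bin"
  # [4] remove the snap object from this OSD
  echo "ceph-objectstore-tool --data-path ${data_path} --pgid ${PGID} '${OBJ}' remove"
  echo "# start OSD ${id} again; if it crashes, collect the logs and stop"
}

print_osd_commands 7
```

Calling the helper once per OSD from step 3 keeps to the plan's "one osd at a time" constraint, and both backups are printed ahead of the `remove` command.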

      Version of all relevant components (if applicable):

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:

      Expected results:

      Additional info:

              rhn-support-pdhange Prashant Dhange
              rhn-support-allee Alexander Lee
              Prashant Dhange
              Elad Ben Aharon