OpenShift Edge Enablement / OCPEDGE-70

Software RAID via mdadm on LVMS

    • Product / Portfolio Work
    • OCPSTRAT-495 LVM Storage support software RAID
    • 0% To Do, 0% In Progress, 100% Done
    • 2023-12-05:
      Dev - Green - Work on this epic has focused on enabling software RAID externally via mdadm, through workaround documentation
      Docs - Green - Not started
      QE - Green - Test in Progress
    • M
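
The status note above describes enabling software RAID outside of LVMS via mdadm. A minimal sketch of that workaround follows, assuming the array is created on the node first and then handed to LVMS through a device path. The device names (`/dev/md0`, `/dev/sdb`, `/dev/sdc`) and both helper functions are hypothetical; the code only builds the command line and a `deviceSelector` fragment, it does not run anything.

```python
from typing import Dict, List


def mdadm_create_argv(md_device: str, level: int, members: List[str]) -> List[str]:
    """Build (but do not run) an mdadm command that creates a software RAID array."""
    if len(members) < 2:
        raise ValueError("a redundant array needs at least two member devices")
    return [
        "mdadm", "--create", md_device,
        f"--level={level}",
        f"--raid-devices={len(members)}",
        *members,
    ]


def device_selector_fragment(md_device: str) -> Dict[str, List[str]]:
    """Hypothetical deviceSelector fragment that points a device class at the md array."""
    return {"paths": [md_device]}


if __name__ == "__main__":
    # Device names are placeholders for illustration only.
    argv = mdadm_create_argv("/dev/md0", 1, ["/dev/sdb", "/dev/sdc"])
    print(" ".join(argv))
    print(device_selector_fragment("/dev/md0"))
```

With this approach LVMS only sees a single block device; redundancy, resync, and repair stay entirely with mdadm on the node.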

      Epic Goal

      • Enable Resilient Setup of LVM Clusters per Node
      • Allow Dynamic Configuration of various RAID arrays to use as underlying volumes for the provisioned PVCs
      • Allow LVM based Software RAID
      • Non-goal: high availability across nodes
      • Non-goal: hardware RAID controller support in the API; such devices can be selected at the deviceSelector level

      Why is this important?

      • A resilient per-node setup is critical for production layouts, because without it the failure of a single disk within a node can destroy data.
      • RAID makes it much easier to keep drive failures from impacting workloads. This will be the first time we can offer a seamless recovery path without redeploying the LVMCluster and using a VolumeSnapshot.

      Scenarios

      1. Set up a thin pool in the LVMCluster and make it highly available per node through RAID1, RAID5, or RAID10 (the levels most commonly used in real production environments)
      2. Optional: Speed up / Aggregate volumes with RAID0
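
To make the trade-off between the RAID levels in these scenarios concrete, here is a rough sketch that checks the member count and estimates usable capacity for equally sized disks. The minimum device counts are the conventional ones and are an assumption of this sketch, as is the two-copy RAID10 layout; metadata overhead is ignored, and the function name is hypothetical.

```python
from typing import Dict

# Conventional minimum member counts (an assumption of this sketch).
MIN_DEVICES: Dict[str, int] = {"raid0": 2, "raid1": 2, "raid5": 3, "raid10": 4}


def usable_capacity_gib(level: str, device_count: int, device_size_gib: int) -> int:
    """Estimate usable capacity for equally sized members, ignoring metadata overhead."""
    if level not in MIN_DEVICES:
        raise ValueError(f"unsupported RAID level: {level}")
    if device_count < MIN_DEVICES[level]:
        raise ValueError(
            f"{level} needs at least {MIN_DEVICES[level]} devices, got {device_count}"
        )
    if level == "raid0":
        # Striping only: all capacity is usable, there is no redundancy.
        return device_count * device_size_gib
    if level == "raid1":
        # Every member holds a full copy of the data.
        return device_size_gib
    if level == "raid5":
        # One member's worth of capacity is used for parity.
        return (device_count - 1) * device_size_gib
    # raid10 with two copies: half of the members hold mirrors.
    if device_count % 2 != 0:
        raise ValueError("this sketch assumes an even member count for raid10")
    return (device_count // 2) * device_size_gib


if __name__ == "__main__":
    for level in ("raid0", "raid1", "raid5", "raid10"):
        print(level, usable_capacity_gib(level, 4, 100), "GiB")
```

A check of this kind is also roughly what a per-node validation of the selected devices would have to perform before a RAID configuration is accepted.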

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • ...

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      Open questions:

      1. Validation of devices in deviceSelector / optionalDeviceSelector would have to be newly introduced. If the selected devices do not allow the requested RAID configuration, we have to report or forbid this accordingly. For multi-node clusters, the validation would have to be extended to run per node. This could make the status in the LVMCluster quite complex, so we need a proper API design.
      2. Extending a RAID array is not easily possible after it has been converted to a thin pool, because RAID requires resynchronization. In addition, the metadata pool cannot be extended via the API once it has been created. This makes things difficult because a simple "extend" of volumes is no longer possible, and there are now specific requirements on when a RAID pool can be extended.
      3. Creating a RAID array takes much more reconciliation time before it is synced (while initial synchronization and zeroing are active), especially with large disks. This can lead to issues and can even congest node resources.
      4. LVM2 does not support creating a thin pool directly as a RAID array, so we have to create 2 RAID pools, one for metadata and one for data, to make everything work. This has implications for recovery and maintenance in operation. We would need very detailed guides that explain the whole procedure and make it safe, with particular emphasis on resyncing and repairing.
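
The two-step construction described in question 4 can be sketched as a command sequence: two RAID logical volumes are created first, then combined into a thin pool with `lvconvert`. This is an illustration under assumptions, not the LVMS implementation. The volume group and LV names, the sizes, and the helper function are hypothetical, RAID1 with one mirror is chosen only as an example, and the code builds the commands without running them.

```python
from typing import List


def raid_thin_pool_commands(
    vg: str,
    data_lv: str,
    meta_lv: str,
    data_size: str,
    meta_size: str,
) -> List[List[str]]:
    """Return the LVM commands (as argv lists) that build a thin pool on RAID1 LVs."""
    return [
        # 1. RAID1 LV that will hold the thin pool data.
        ["lvcreate", "--type", "raid1", "--mirrors", "1",
         "--size", data_size, "--name", data_lv, vg],
        # 2. RAID1 LV that will hold the thin pool metadata.
        ["lvcreate", "--type", "raid1", "--mirrors", "1",
         "--size", meta_size, "--name", meta_lv, vg],
        # 3. Combine both LVs into a single thin pool.
        ["lvconvert", "--type", "thin-pool",
         "--poolmetadata", f"{vg}/{meta_lv}", f"{vg}/{data_lv}"],
    ]


if __name__ == "__main__":
    # All names and sizes are placeholders for illustration only.
    for argv in raid_thin_pool_commands("vg1", "pool_data", "pool_meta", "100G", "1G"):
        print(" ".join(argv))
```

Because data and metadata end up on separate RAID volumes, each one has its own sync and repair state, which is part of why recovery needs careful documentation.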

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              rh-ee-jmoller Jakob Moeller (Inactive)
              rhn-support-cscribne Chad Scribner
              None
              Rahul Deore Rahul Deore
              Daniel Macpherson Daniel Macpherson