OCPBUGS-35943

[release-4.14] LVMS: LVMCluster is degraded after SNO cluster reboot due to mismatch of block device names (/dev/sdx)


      This is an LVMS Bug Report:

      Please create & attach a must-gather as indicated by this guide to collect LVMS-relevant data from the cluster (the guide links to the latest version; use older versions of the documentation for older OCP releases as applicable).

      Please make sure that you describe your storage configuration in detail. List all devices that you plan to work with for LVMS as well as any relevant machine configuration data to make it easier for an engineer to help out.

      Description of problem:

      LVM Storage is degraded after an SNO node reboot because the block device names (/dev/sdX) can change across reboots.
      

      Version-Release number of selected component (if applicable):

       LVM operator version: 4.14.6

      Steps to Reproduce:

       1. Add an additional disk to the SNO node
       2. Install the LVM Storage operator
       3. Configure the LVMCluster object as below to automatically discover the disks:
      
      apiVersion: lvm.topolvm.io/v1alpha1
      kind: LVMCluster
      metadata:
      .....
      .....
      spec:
        storage:
          deviceClasses:
          - fstype: xfs
            name: vg1
            thinPoolConfig:
              name: thin-pool-1
              overprovisionRatio: 10
              sizePercent: 90
       
      4. After successful configuration, reboot the SNO node (the device name change may take a few reboots).
      
      5. The LVMCluster is now degraded:
      
      apiVersion: lvm.topolvm.io/v1alpha1
      kind: LVMCluster
      metadata:
      .....
      .....
      spec:
        storage:
          deviceClasses:
          - fstype: xfs
            name: vg1
            thinPoolConfig:
              name: thin-pool-1
              overprovisionRatio: 10
              sizePercent: 90
      status:
        deviceClassStatuses:
        - name: vg1
          nodeStatus:
          - devices:
            - /dev/sdc
            node: 98-f2-b3-20-8c-4c
            reason: 'failed to create/extend volume group vg1: failed to create or extend
              volume group "vg1" using command ''/usr/sbin/vgextend vg1 /dev/sda'': exit
              status 5'
            status: Degraded
        state: Degraded

      Actual results:

       LVM storage cannot be used; the topolvm-node pod is in CrashLoopBackOff state.

      Expected results:

      The LVM storage cluster should work consistently even after a node reboot.
      

      Additional info:

      Instead of using kernel device names for `vg` operations, the device paths of the disks can be used to keep the storage configuration persistent across reboots. However, the customer can't use the snippet below as the `LVMCluster` configuration because they would need to know the disks ahead of time:
      ~~~
      spec:
        storage:
          deviceClasses:
            - name: vg1
              deviceSelector:
                paths:
                - /dev/disk/by-path/pci-0000:87:00.0-nvme-1
                - /dev/disk/by-path/pci-0000:88:00.0-nvme-1
              thinPoolConfig:
                name: thin-pool-1
                sizePercent: 90
                overprovisionRatio: 10
      ~~~
      
      A script similar to the one in [1] can help dynamically discover additional disks and use their device paths to make them ready for LVM storage; a rough sketch follows the reference below.
      
      [1] https://docs.openshift.com/container-platform/4.13/scalability_and_performance/recommended-performance-scale-practices/recommended-etcd-practices.html#move-etcd-different-disk_recommended-etcd-practices
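      For illustration only, a minimal Python sketch of the idea (this is not the script from [1], and the filtering of the root/installation disk is left out): it maps the stable /dev/disk/by-path symlinks to the kernel devices they currently resolve to, skipping partitions, so the by-path names could be plugged into the LVMCluster deviceSelector instead of /dev/sdX names.
      ~~~
      #!/usr/bin/env python3
      # Illustrative sketch only: print the stable /dev/disk/by-path names for
      # whole disks together with the kernel device (/dev/sdX, /dev/nvmeXnY)
      # they currently resolve to. The by-path names are what would go into the
      # LVMCluster deviceSelector.paths list; the root/installation disk still
      # has to be excluded separately before using the output.
      import os

      BY_PATH = "/dev/disk/by-path"

      def stable_device_paths():
          """Map each whole-disk by-path symlink to its current kernel device."""
          mapping = {}
          for entry in sorted(os.listdir(BY_PATH)):
              if "-part" in entry:
                  # Skip partition symlinks (e.g. ...-nvme-1-part1); LVMS
                  # consumes whole disks in this configuration.
                  continue
              link = os.path.join(BY_PATH, entry)
              mapping[link] = os.path.realpath(link)  # e.g. /dev/sda
          return mapping

      if __name__ == "__main__":
          for link, device in stable_device_paths().items():
              print(f"{link} -> {device}")
      ~~~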
      
