  1. OpenShift Bugs
  2. OCPBUGS-11375

NVMe disk by-id rename breaks LSO/ODF

Details

    • Important
    • No
    • 3
    • Sprint 234 - Team OSInt, Sprint 235
    • 2
    • Approved
    • False

    Description

      Description of problem:

      I suspect this is an underlying RHEL kernel bug/change, but so far I have only tested on OCP/RHCOS. I will update if I can confirm.
      
      The upgrade to 9.2 using 4.13-rc.2 / 4.13.0-0.nightly-2023-03-29-235439 on a baremetal cluster got stuck on my 2nd worker node. I discovered that the OSD PDBs were preventing further progress because booting into the new kernel caused a disk rename:
      8.6 was:  /dev/disk/by-id/nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC__S3B1NA0JC00067
      9.2 now: /dev/disk/by-id/nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC_______S3B1NA0JC00067
      
      I am using LSO autodiscovery, so I did not hardcode my disk by-ids at install time.
      
      I have other HDDs in the system that are not used by LSO/ODF, and their by-ids were not renamed (only the sdX devices that the symlinks point to changed), so this may be specific to NVMe by-id naming.
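The two by-id names above differ only in the length of the underscore run before the serial number. A minimal sketch of how one might check whether an old and a new by-id name refer to the same device under that assumption follows. The helper names `normalize_by_id` and `same_device_name` are hypothetical, and collapsing underscores is only a comparison heuristic, not a description of how udev builds these names.

```python
import re


def normalize_by_id(name: str) -> str:
    """Collapse every run of underscores into a single underscore.

    Assumption: the only difference between the 8.6 and 9.2 names is the
    amount of underscore padding, as seen in this report.
    """
    return re.sub(r"_+", "_", name)


def same_device_name(old: str, new: str) -> bool:
    """Return True if two by-id names match once padding is ignored."""
    return normalize_by_id(old) == normalize_by_id(new)


# Names taken from this report (8.6 vs 9.2).
old = "nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC__S3B1NA0JC00067"
new = "nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC_______S3B1NA0JC00067"

print(same_device_name(old, new))  # True
```

A check like this could flag local PVs whose recorded path no longer exists but has a padded equivalent, before the node is drained.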

      Version-Release number of selected component (if applicable):

      4.13.0-rc.2
      5.14.0-284.4.1.el9_2.x86_64

      How reproducible:

      The by-id renaming happened on all 4 of my workers with NVMe drives.

      Steps to Reproduce:

      1. Install 8.6-based OCP + LSO + ODF
      2. Upgrade to 9.2-based OCP
      3. Check for OSD pods stuck in Init; the pod events show:
        Warning  FailedMapVolume  <invalid> (x6 over 0s)  kubelet            MapVolume.EvalHostSymlinks failed for volume "local-pv-4a847404" : lstat /dev/disk/by-id/nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC__S3B1NA0JC00067: no such file or directory 
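To find every affected volume, the event text in step 3 can be parsed for the PV name and the missing device path. The following is a rough sketch that assumes the message has exactly the shape shown above; the regular expression and the function name `parse_failed_map_volume` are illustrative and not part of any Kubernetes API.

```python
import re

# Pattern written against the single event message quoted in this report.
EVENT_RE = re.compile(
    r'MapVolume\.EvalHostSymlinks failed for volume "(?P<volume>[^"]+)"\s*:\s*'
    r"lstat (?P<path>\S+): no such file or directory"
)


def parse_failed_map_volume(message: str):
    """Return (volume_name, missing_path), or None if the message differs."""
    match = EVENT_RE.search(message)
    if match is None:
        return None
    return match.group("volume"), match.group("path")


message = (
    'MapVolume.EvalHostSymlinks failed for volume "local-pv-4a847404" : '
    "lstat /dev/disk/by-id/"
    "nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC__S3B1NA0JC00067: "
    "no such file or directory"
)

print(parse_failed_map_volume(message))
```

The extracted path is the 8.6-style name, which confirms that the PV still references the link that disappeared after the reboot.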

      Actual results:

      The upgrade stalled. I could recover by manually deleting the storage PDBs, but LSO and the StorageCluster need to be reinstalled.

      Expected results:

      Upgrade to new kernel does not disrupt storage

      Additional info:

      Example vim diff output, 8.6 on the left and 9.2 on the right:

        lrwxrwxrwx. 1 root root  nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC__S3B1NA0JC00084 -> ../../nvme0n1                |  lrwxrwxrwx. 1 root root  nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC_______S3B1NA0JC00084 -> ../../nvme0n1
        lrwxrwxrwx. 1 root root  nvme-eui.334231304ac000840025384100000002 -> ../../nvme0n1                                |  lrwxrwxrwx. 1 root root  nvme-eui.334231304ac000840025384100000002 -> ../../nvme0n1
        lrwxrwxrwx. 1 root root  scsi-36d09466073c253002300be27de2fb838 -> ../../sda                                       |  lrwxrwxrwx. 1 root root  scsi-36d09466073c253002300be27de2fb838 -> ../../sdc
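The comparison above can also be done mechanically. This sketch parses two `ls -l` style listings of `/dev/disk/by-id` and sorts each old link into renamed, retargeted, unchanged, or missing. The function names and the report layout are hypothetical, and the rename match again assumes that only the underscore padding differs.

```python
import re

LINK_RE = re.compile(r"(\S+) -> (\S+)\s*$")


def parse_links(listing: str) -> dict:
    """Map each symlink name to its target from `ls -l` style lines."""
    links = {}
    for line in listing.splitlines():
        match = LINK_RE.search(line)
        if match:
            links[match.group(1)] = match.group(2)
    return links


def collapse(name: str) -> str:
    """Heuristic key: treat any run of underscores as one underscore."""
    return re.sub(r"_+", "_", name)


def compare(old: dict, new: dict) -> dict:
    """Classify every old link by what happened to it in the new listing."""
    new_by_key = {collapse(name): name for name in new}
    report = {"renamed": [], "retargeted": [], "unchanged": [], "missing": []}
    for name, target in old.items():
        if name in new:
            kind = "unchanged" if new[name] == target else "retargeted"
            report[kind].append(name)
        elif collapse(name) in new_by_key:
            report["renamed"].append((name, new_by_key[collapse(name)]))
        else:
            report["missing"].append(name)
    return report


# Listings copied from the diff in this report.
OLD = """\
lrwxrwxrwx. 1 root root  nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC__S3B1NA0JC00084 -> ../../nvme0n1
lrwxrwxrwx. 1 root root  nvme-eui.334231304ac000840025384100000002 -> ../../nvme0n1
lrwxrwxrwx. 1 root root  scsi-36d09466073c253002300be27de2fb838 -> ../../sda
"""
NEW = """\
lrwxrwxrwx. 1 root root  nvme-Dell_Express_Flash_PM1725a_3.2TB_AIC_______S3B1NA0JC00084 -> ../../nvme0n1
lrwxrwxrwx. 1 root root  nvme-eui.334231304ac000840025384100000002 -> ../../nvme0n1
lrwxrwxrwx. 1 root root  scsi-36d09466073c253002300be27de2fb838 -> ../../sdc
"""

report = compare(parse_links(OLD), parse_links(NEW))
for kind, entries in report.items():
    print(kind, entries)
```

On the listings from this report, the model-and-serial NVMe link lands in `renamed`, the `nvme-eui.` link in `unchanged`, and the `scsi-` link in `retargeted`, which matches what the diff shows by eye.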

            People

              jlebon1@redhat.com Jonathan Lebon
              jhopper@redhat.com Jenifer Abrams
              Michael Nguyen Michael Nguyen