OCPSTRAT-723

[Risk] vSphere CSI migration risk in OpenShift 4.14

      Description

      OCPSTRAT-82 vSphere CSI migration will occur in OpenShift 4.14 despite known associated risks that we cannot fix proactively, because we depend on VMware to provide the fix. So far, VMware has not prioritized a fix for the "CNS volume not found in-memory but exists in database" (91752) bug.

      Feature Goal

      For OpenShift on vSphere clusters born in OpenShift 4.12 or earlier, customers will be required to undergo the vSphere CSI migration after upgrading to OpenShift 4.14.

      Background for Context

      During OpenShift 4.13 development, we identified three bugs related to the vSphere CSI migration. These were reported to VMware. 

      These bugs are referenced in this KCS.

      • CNS volume not found in-memory but exists in the database.

      We have observed that during migration, when VMDKs are migrated to FCDs, attaching the converted volume fails.

      We have only observed this issue during scale testing of OpenShift on vSphere. The scale tests were run with 33 worker nodes and 3 master nodes, with approximately 900 pods in the cluster using vSphere in-tree volumes.

      VMware has published an advisory to address this issue - https://kb.vmware.com/s/article/91752

       

      • After migration to FCDs (first-class disks), a disk can get into a state where attachment fails.

      After migration to FCD, a disk has been observed to get into a state where attachment permanently fails. This is documented under Known Issues, with a workaround, in the vSphere driver 3.0 release notes - https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/rn/vmware-vsphere-container-storage-plugin-30-release-notes/index.html

      Both of the above issues are covered in the upstream issue - https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/2165 . The errors look similar but have different underlying causes. We also have a downstream Jira issue for this - https://issues.redhat.com/browse/OCPBUGS-5817

       

      • The last two bugs require a fix in vSphere itself as well as a fix in the CSI driver. The CSI driver fix depends on the vSphere fix.
      • We do not have visibility into how frequently our customers may experience these bugs or how many customers may be impacted.
      • In OpenShift 4.13, new clusters have CSI migration enabled by default, while existing clusters remained unchanged, with the intention that the vSphere CSI migration would occur in OpenShift 4.14.
      • An opt-in option for vSphere CSI migration was made available in OpenShift 4.13 for customers who would like to explicitly enable migration - See documentation for more details.
      • We have an internal FAQ that summarises the 4.13 situation.
      • Following a VMware support case we opened, VMware implemented a fix in vSphere 8.0u1 but has not yet backported it to vSphere 7.0, which impacts our Feature goals in OpenShift 4.14 (OCPSTRAT-82 Finalise vSphere CSI migration).
        • We have asked VMware to resolve these bugs, to backport the 8.0u1 fix to 7.0, and additionally to patch the vSphere CSI driver migration process to protect and register FCDs together.
          • We believe this could fix the two remaining issues.
          • VMware confirmed upstream that vSphere 8.0u1 environments will not hit the FCD issue.
      • VMware is not prioritizing fixes for the bugs for which workarounds were provided in OCP 4.13.
        • We keep pushing VMware for a resolution and an ETA.
      • We have not been able to reproduce these bugs in OpenShift 4.14, but we suspect they still exist.

       

      UPDATE - August 22nd

       

      VMware has fixed the two remaining issues:

      "CNS volume not found in-memory but exists in database" was introduced in 8.0 and has been fixed in 8.0u2 (GA 8/31). The issue is not present in 7.0.

      "After migration to FCDs (first-class disks) the disk can get in a state where attachment fails" has been fixed in 8.0u1 and 7.0p07 (maps to 7.0 U3L).
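
      The fix matrix above can be expressed as a small lookup. The following is a minimal sketch, not the check shipped in OpenShift: the version string format (for example "7.0 U3L" or "8.0u2"), the helper names parse_version and known_issues, and the handling of releases other than 7.0 and 8.0 are all assumptions made for illustration.

```python
import re
from typing import List, NamedTuple

FCD_ISSUE = "FCD attachment failure after migration"
CNS_ISSUE = "CNS volume not found in-memory but exists in database"


class VSphereVersion(NamedTuple):
    major: int    # 7 or 8
    update: int   # the N in "uN"; 0 when no update is given
    patch: str    # lowercase patch letter such as "l", or "" if none


# Assumed input format: "<major>.0", optionally followed by "u<N>" and
# one patch letter, e.g. "7.0 U3L", "8.0u2", "8.0".
_PATTERN = re.compile(r"^\s*(\d+)\.0\s*(?:u(\d+)([a-z]?))?\s*$", re.IGNORECASE)


def parse_version(text: str) -> VSphereVersion:
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised vSphere version: {text!r}")
    return VSphereVersion(
        major=int(match.group(1)),
        update=int(match.group(2) or 0),
        patch=(match.group(3) or "").lower(),
    )


def known_issues(version: VSphereVersion) -> List[str]:
    """Return the migration issues still unfixed in this vSphere version."""
    issues = []
    if version.major == 7:
        # FCD issue fixed in 7.0 U3L; the CNS issue does not exist in 7.0.
        if (version.update, version.patch) < (3, "l"):
            issues.append(FCD_ISSUE)
    elif version.major == 8:
        # FCD issue fixed in 8.0u1; CNS issue fixed in 8.0u2.
        if version.update < 1:
            issues.append(FCD_ISSUE)
        if version.update < 2:
            issues.append(CNS_ISSUE)
    else:
        # The update above only covers 7.0 and 8.0.
        raise ValueError(f"no fix data for vSphere {version.major}.0")
    return issues
```

      For example, known_issues(parse_version("8.0u1")) reports only the CNS issue, while "7.0 U3L" and "8.0u2" report none.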

      Considerations for OpenShift 4.14

       

      In 4.14 the vSphere CSI migration will be enabled for all clusters and we removed the opt-in option we introduced in 4.13 (OCPSTRAT-82).

      The table below summarises which environments are at risk. As mentioned earlier, the issues are not 100% reproducible. Environments at risk are clusters that currently use in-tree PVs and will automatically enable CSI migration after upgrading to 4.14.

       

       

      Environments              CSI migration status   At risk?
      New OCP 4.14 clusters     Enabled                NO
      New OCP 4.13 clusters     Enabled                NO
      Upgraded 4.13 clusters    Disabled               YES if forced upgrade
      OCP 4.12 clusters         Disabled               YES if forced upgrade

       

      Options being discussed for OpenShift 4.14

      Here are the different options:

      1. Do nothing in code; only document and advertise that updating to vSphere 7.0 U3L+ or 8.0u2+ is strongly recommended.
        1. Likely insufficient: customers may miss or overlook the recommendation.
      2. Raise an alert but don’t block upgrades
        1. Better than option 1, but it can still be overlooked.
      3. Block upgrades
        1. Safest option, but it may be badly perceived: customers may complain that we are forcing them to upgrade vSphere (OCP may not be the only workload running on vSphere).
      4. Block upgrades with an option to bypass with an admin ack (storage team preferred option)
        1. Upgrades are blocked, but an admin can explicitly ack and choose to ignore the block.

       
      Additional details:

      • We can add checks to limit which clusters we block upgrades on. For example, we can check whether a cluster uses in-tree volumes. Clusters that only use CSI won’t have their upgrades blocked.
      • We need to add code in both 4.12 (for EUS to EUS) and 4.13. No additional code is required in 4.14.
      • We can’t selectively block 4.12 to 4.14 upgrades without also blocking 4.12 to 4.13 upgrades. This is another good reason to choose option 4.
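
      As an illustration of the in-tree check mentioned above, the sketch below scans a PersistentVolume list (the parsed JSON from listing PVs) for volumes that use the in-tree vSphere volume source, spec.vsphereVolume. It is an assumption about how such a check could look, not the code added to OpenShift. The helper name in_tree_vsphere_pvs is hypothetical, and the sketch only inspects PVs, not inline pod volumes or StorageClasses.

```python
from typing import List


def in_tree_vsphere_pvs(pv_list: dict) -> List[str]:
    """Return the names of PVs backed by the in-tree vSphere volume plugin.

    pv_list is a Kubernetes PersistentVolumeList as a plain dict.
    In-tree vSphere PVs carry a spec.vsphereVolume source, whereas
    CSI-provisioned PVs carry spec.csi instead.
    """
    names = []
    for pv in pv_list.get("items", []):
        if "vsphereVolume" in pv.get("spec", {}):
            names.append(pv["metadata"]["name"])
    return names


# Hypothetical sample data: one in-tree PV and one CSI PV.
sample = {
    "items": [
        {
            "metadata": {"name": "pv-intree"},
            "spec": {"vsphereVolume": {"volumePath": "[ds] kubevols/a.vmdk"}},
        },
        {
            "metadata": {"name": "pv-csi"},
            "spec": {"csi": {"driver": "csi.vsphere.vmware.com",
                             "volumeHandle": "example-handle"}},
        },
    ]
}

in_tree = in_tree_vsphere_pvs(sample)
```

      With the sample data, only pv-intree is returned, so a cluster like this one would be a candidate for the upgrade block, while a cluster with an empty result would not.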

       

      As of August 25th, it's been decided to go with option 4. We are tracking the 4.12 and 4.13 work with OCPBUGS-18131 and OCPBUGS-18132.

      With option 4, customers can't ignore the warning, as their environment will be un-upgradable. The attached message will summarise the reason and link to the KCS. An admin can still upgrade without the right vSphere version by giving an explicit ack. This means that while we are greatly limiting the odds, customers may still hit the issues. If so, VMware recommends applying the workarounds and contacting their support if the issue persists or if the customer requires any additional help.
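
      The behaviour of option 4 can be summarised as a small decision function. This is a sketch under stated assumptions only: the ClusterState fields and the upgrade_decision helper are hypothetical names, and the real gating mechanism is the work tracked in OCPBUGS-18131 and OCPBUGS-18132.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClusterState:
    uses_in_tree_volumes: bool   # cluster still has in-tree vSphere volumes
    vsphere_has_fixes: bool      # vSphere is at 7.0 U3L+ or 8.0u2+
    admin_acknowledged: bool     # admin gave the explicit ack


def upgrade_decision(state: ClusterState) -> Tuple[bool, str]:
    """Return (allowed, reason) for an upgrade under option 4."""
    if not state.uses_in_tree_volumes:
        return True, "cluster only uses CSI volumes"
    if state.vsphere_has_fixes:
        return True, "vSphere version contains the fixes"
    if state.admin_acknowledged:
        # Allowed, but the cluster may still hit the known issues.
        return True, "admin acknowledged the risk"
    return False, "blocked: update vSphere or acknowledge the risk"
```

      The order of the checks reflects the text above: CSI-only clusters are never blocked, and the ack is consulted only when the cluster is actually at risk.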

      Collateral to be updated for OpenShift 4.14

      • New vSphere 4.14 FAQ
      • KCS
        • Update 7011684, 7011685 & 7011681 solutions
          • Mention that the KCM bug is fixed in OCP 4.14
          • Mention that the "CNS volume not found in-memory" issue is fixed in 8.0u2 (strongly recommend using this version) and is not present in vSphere 7
          • Mention that the FCD (first-class disk) attachment failure issue is fixed in 8.0u1 and 7.0p07 (maps to 7.0 U3L); strongly recommend using one of these versions
      • Release notes
      • CSI migration doc
      • TE & knowledge transfer via PLMCORE-6191
      • Mention in What's New in OCP 4.14

      Teams to be included in Discussion

      Storage, SPLAT, OTA, Docs, ODF, CEE

       

            Ju Lim (julim)
            Margaret Dineen (rhn-support-mdineen)
            Gregory Charot, Hemant Kumar, Jan Safranek, Jonathan Dobson, Joseph Callen, Ramon Acedo, Richard Vanderpool
            Stephanie Stout
            Eric Rich