Type: Risk
Priority: Critical
Resolution: Done

Description
OCPSTRAT-82: vSphere CSI migration will occur in OpenShift 4.14 despite known associated risks that we cannot proactively fix, as we depend on VMware to provide the fix. So far, VMware is not prioritizing a fix for the "CNS volume not found in-memory but exists in database" bug (KB 91752).
Feature Goal
For OpenShift on vSphere clusters born in OpenShift 4.12 or earlier, customers will be required to undergo the vSphere CSI migration after upgrading to OpenShift 4.14.
Background for Context
During OpenShift 4.13 development, we identified three bugs related to the vSphere CSI migration. These were reported to VMware.
These bugs are referenced in this KCS.
- kube-controller-manager (KCM) gets restarted while it is detaching an in-tree vSphere volume from a node
  - Upstream issue: https://github.com/kubernetes/kubernetes/issues/117091
  - Fixed by Red Hat upstream and included in 4.14 (see OCPBUGS-16166)
- CNS volume not found in-memory but exists in the database.
  We have observed that during migration, when VMDKs get migrated to FCDs, attach of the converted volume fails.
  We have only observed this issue during scale testing of OpenShift on vSphere. The scale tests were run with 33 worker nodes and 3 masters, with approximately 900 pods in the cluster using vSphere in-tree volumes.
  VMware has published an advisory to address this issue: https://kb.vmware.com/s/article/91752
- After migration to FCDs (first-class disks), the disk can get into a state where attachment permanently fails.
  This is documented as a known issue with a workaround in the vSphere driver 3.0 release notes: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/rn/vmware-vsphere-container-storage-plugin-30-release-notes/index.html
  Both of the above issues are covered in the upstream issue https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/2165. The errors look similar but have different underlying causes. We also have a downstream JIRA issue for this: https://issues.redhat.com/browse/OCPBUGS-5817
- The last two bugs require a fix in vSphere itself as well as a fix in the CSI driver; the CSI driver fix depends on the vSphere fix.
- We do not have visibility into how frequently these bugs may be experienced by our customers or how many customers may be impacted.
- In OpenShift 4.13, new clusters have CSI migration enabled by default, while existing clusters remained unchanged, with the intention that the vSphere CSI migration would occur in OpenShift 4.14.
- An opt-in option for vSphere CSI migration was made available in OpenShift 4.13 for customers who would like to explicitly enable migration; see the documentation for more details.
- We have an internal FAQ that summarises the 4.13 situation.
- Following a VMware support case we opened, VMware implemented a fix in vSphere 8.0u1, but has not yet backported fixes to vSphere 7.0, which impacts our feature goals in OpenShift 4.14 (OCPSTRAT-82: Finalise vSphere CSI migration).
- We have asked VMware to resolve these bugs, to backport the fix in 8.0u1 to 7.0, and additionally to patch the vSphere CSI driver migration process to protect and register FCDs together.
  - We believe this could fix the two remaining issues.
  - VMware confirmed upstream that vSphere 8.0u1 environments will not hit the FCD issue.
- VMware is not prioritizing fixes to the bugs for which workarounds were provided in OCP 4.13.
- We keep pushing VMware for a resolution and an ETA.
- We have not been able to reproduce these bugs in OpenShift 4.14, but we suspect they still exist.
UPDATE - August 22nd
VMware fixed the two remaining issues:
“CNS volume not found in-memory but exists in database” only affects vSphere 8.0 onward and has been fixed in 8.0u2 (GA 8/31). The issue is not present in 7.0.
“FCD (first-class disk) in a state where attachment fails” has been fixed in 8.0u1 and in 7.0p07 (which maps to 7.0 U3L).
Considerations for OpenShift 4.14
In 4.14, the vSphere CSI migration will be enabled for all clusters, and we removed the opt-in option we introduced in 4.13 (OCPSTRAT-82).
The table below summarises which environments are at risk. As mentioned earlier, the issues are not 100% reproducible; environments at risk are clusters that are currently using in-tree PVs and will automatically enable CSI migration after upgrading to 4.14 (one way to check for in-tree PVs is sketched after the table).
| Environments | CSI migration status | At risk? |
|---|---|---|
| New OCP 4.14 clusters | Enabled | No |
| New OCP 4.13 clusters | Enabled | No |
| Upgraded 4.13 clusters | Disabled | Yes, if upgrade is forced |
| OCP 4.12 clusters | Disabled | Yes, if upgrade is forced |
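To gauge exposure before upgrading, an admin can check whether the cluster still has PVs backed by the in-tree vSphere plugin. Below is a minimal client-go sketch, not part of this feature's actual code; it assumes a standard kubeconfig and relies only on the stock Kubernetes PV API, where in-tree vSphere volumes carry a `vsphereVolume` source and CSI-provisioned ones use the `csi.vsphere.vmware.com` driver.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Use the same kubeconfig that oc/kubectl would use.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pvs, err := client.CoreV1().PersistentVolumes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	inTree, csi := 0, 0
	for _, pv := range pvs.Items {
		switch {
		case pv.Spec.VsphereVolume != nil:
			// In-tree vSphere volume: will be migrated to CSI on upgrade to 4.14.
			inTree++
			fmt.Printf("in-tree PV %s (VMDK path: %s)\n", pv.Name, pv.Spec.VsphereVolume.VolumePath)
		case pv.Spec.CSI != nil && pv.Spec.CSI.Driver == "csi.vsphere.vmware.com":
			// Already provisioned by the vSphere CSI driver: not affected.
			csi++
		}
	}
	fmt.Printf("%d in-tree vSphere PV(s), %d vSphere CSI PV(s)\n", inTree, csi)
}
```

A cluster where this reports zero in-tree vSphere PVs is not in the at-risk rows above, and would not be blocked under the checks discussed below.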
Options being discussed for OpenShift 4.14
Here are the different options:
1. Do nothing in code; only document / advertise the fact that updating to vSphere 7.0 U3L+ or 8.0u2+ is strongly recommended.
   - Likely not enough; customers may miss or overlook the recommendation.
2. Raise an alert but don't block upgrades.
   - Better than option 1, but can still be overlooked.
3. Block upgrades.
   - Safest option, but can be badly perceived; customers may complain that we are forcing them to upgrade vSphere (OCP may not be the only workload running on vSphere).
4. Block upgrades with an option to bypass via an admin ack (the storage team's preferred option).
   - Upgrades are blocked, but an admin can explicitly ack the gate and choose to proceed.
Additional details:
- We can add checks to limit which clusters we block upgrades on. For example, we can check whether a cluster uses in-tree volumes; clusters only using CSI won't have their upgrades blocked.
- We need to add code in both 4.12 (for EUS-to-EUS upgrades) and 4.13. No additional code is required in 4.14.
- We can't selectively block 4.12-to-4.14 upgrades without also blocking 4.12-to-4.13 upgrades. This means 4.12-to-4.13 upgrades would also be blocked, which is another good reason to choose option 4.
As of August 25th, it has been decided to go with option 4. We are tracking the 4.12 and 4.13 work in OCPBUGS-18131 and OCPBUGS-18132.
With option 4, customers can't ignore the warning, as their environment will be un-upgradable until they act. The attached message will summarise the reason and link to the KCS. An admin can still upgrade without the right vSphere version by providing an explicit ack (sketched below). This means that while we greatly limit the odds, customers may still hit the issues; if so, VMware recommends applying the workarounds and contacting their support if the issue persists or if the customer requires any additional help.
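For reference, this relies on the standard CVO admin-ack mechanism: the Cluster Version Operator publishes required gates in the `admin-gates` ConfigMap (namespace `openshift-config-managed`), and an admin acknowledges a gate by setting its key to `"true"` in the `admin-acks` ConfigMap (namespace `openshift-config`), typically with an `oc patch`. Here is a minimal client-go sketch of the same two steps; the gate key used below is a hypothetical placeholder, since the real key is defined by the OCPBUGS-18131/18132 work and must be read from `admin-gates`.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// 1. List the gates the CVO currently requires an ack for.
	gates, err := client.CoreV1().ConfigMaps("openshift-config-managed").
		Get(context.TODO(), "admin-gates", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for key, msg := range gates.Data {
		fmt.Printf("gate %q: %s\n", key, msg)
	}

	// 2. Acknowledge a gate by setting its key to "true" in admin-acks.
	// "ack-4.13-vsphere-version" is a PLACEHOLDER key: use the actual key
	// reported by the admin-gates ConfigMap above.
	patch := []byte(`{"data":{"ack-4.13-vsphere-version":"true"}}`)
	if _, err := client.CoreV1().ConfigMaps("openshift-config").
		Patch(context.TODO(), "admin-acks", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("gate acknowledged; the CVO can proceed with the upgrade")
}
```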
Collateral to be updated for OpenShift 4.14
- New vSphere 4.14 FAQ
- KCS
  - Update solutions 7011684, 7011685 & 7011681
  - Mention the KCM bug is fixed in OCP 4.14
  - Mention the "CNS volume not found in-memory" issue is fixed in 8.0u2 (strongly recommend this version) and is not present in vSphere 7
  - Mention the "FCD in a state where attachment fails" issue is fixed in 8.0u1 and 7.0p07 (maps to 7.0 U3L); strongly recommend these versions
- Release notes
- CSI migration doc
- TE & knowledge transfer via PLMCORE-6191
- Mention in "What's new in OCP 4.14"
Teams to be included in Discussion
Storage, SPLAT, OTA, Docs, ODF, CEE
Links
- Depends on:
  - OCPBUGS-18131: Block upgrade to 4.13 from 4.12 for older versions with admin-ack (Closed)
  - OCPBUGS-18132: Block upgrade to 4.14 from 4.13 for older versions with admin-ack (Closed)
- Relates to:
  - OCPSTRAT-82: Finalise vSphere CSI migration (Closed)