Type: Risk
Priority: Critical
Resolution: Done

Description
OCPSTRAT-82: vSphere CSI migration will occur in OpenShift 4.14 despite known associated risks that we cannot proactively fix, as we depend on VMware to provide the fix. So far, VMware is not prioritizing a fix for the "CNS volume not found in-memory but exists in database" bug (KB 91752).
Feature Goal
For OpenShift on vSphere clusters born in OpenShift 4.12 or earlier, customers will be required to undergo the vSphere CSI migration after upgrading to OpenShift 4.14.
Background for Context
During OpenShift 4.13 development, we identified three bugs related to the vSphere CSI migration. These were reported to VMware.
These bugs are referenced in this KCS.
- kube-controller-manager (KCM) gets restarted while it is detaching an in-tree vSphere volume from a node
  - Upstream issue: https://github.com/kubernetes/kubernetes/issues/117091
  - Fixed by Red Hat upstream and included in 4.14 (see OCPBUGS-16166)
- CNS volume not found in-memory but exists in the database.
  We have observed that during migration, when VMDKs get migrated to FCDs, attach of the converted volume fails.
  We have only observed this issue during scale testing of OpenShift on vSphere. The scale tests were run with 33 worker nodes and 3 masters, with approximately 900 pods in the cluster using vSphere in-tree volumes.
  VMware has published an advisory to address this issue: https://kb.vmware.com/s/article/91752
- After migration to FCDs (first-class disks), the disk can get into a state where attachment permanently fails.
  This is documented as a known issue with a workaround in the vSphere driver 3.0 release notes: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/rn/vmware-vsphere-container-storage-plugin-30-release-notes/index.html
  Both of the above issues are covered in the upstream issue https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/2165. The errors look similar but have different underlying causes. We also have a downstream JIRA issue for this: https://issues.redhat.com/browse/OCPBUGS-5817
- The last two bugs require a fix in vSphere itself as well as a fix in the CSI driver; the CSI driver fix depends on the vSphere fix.
- We do not have visibility into how frequently these bugs may be experienced by our customers or how many customers may be impacted.
- In OpenShift 4.13, new clusters have CSI migration enabled by default, while existing clusters remained unchanged, with the intention that the vSphere CSI migration would occur in OpenShift 4.14.
- An opt-in option for vSphere CSI migration was made available in OpenShift 4.13 for customers who would like to explicitly enable migration; see the documentation for more details.
- We have an internal FAQ that summarises the 4.13 situation.
- Following a VMware support case we opened, VMware implemented a fix in vSphere 8.0u1, but has not yet backported fixes to vSphere 7.0, which impacts our feature goals in OpenShift 4.14 (OCPSTRAT-82: Finalise vSphere CSI migration).
- We have asked VMware to resolve these bugs, to backport the fix in 8.0u1 to 7.0, and additionally to patch the vSphere CSI driver migration process to protect and register FCDs together.
  - We believe this could fix the two remaining issues.
  - VMware confirmed upstream that vSphere 8.0u1 environments will not hit the FCD issue.
- VMware is not prioritizing fixes to the bugs for which workarounds were provided in OCP 4.13.
- We keep pushing VMware for a resolution and an ETA.
- We have not been able to reproduce these bugs in OpenShift 4.14, but we suspect they still exist.
UPDATE - August 22nd
VMware fixed the two remaining issues:
“CNS volume not found in-memory but exists in database” only affects vSphere 8.0 onward and has been fixed in 8.0u2 (GA 8/31). The issue is not present in 7.0.
“FCD (first-class disk) in a state where attachment fails” has been fixed in 8.0u1 and in 7.0p07 (which maps to 7.0 U3L).
Considerations for OpenShift 4.14
In 4.14, the vSphere CSI migration will be enabled for all clusters, and we removed the opt-in option we introduced in 4.13 (OCPSTRAT-82).
The table below summarises which environments are at risk. As mentioned earlier, the issues are not 100% reproducible; environments at risk are clusters that are currently using in-tree PVs and will automatically enable CSI migration after upgrading to 4.14 (one way to check for in-tree PVs is sketched after the table).
| Environments | CSI migration status | At risk? |
|---|---|---|
| New OCP 4.14 clusters | Enabled | No |
| New OCP 4.13 clusters | Enabled | No |
| Upgraded 4.13 clusters | Disabled | Yes, if upgrade is forced |
| OCP 4.12 clusters | Disabled | Yes, if upgrade is forced |
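To gauge exposure before upgrading, an admin can check whether the cluster still has PVs backed by the in-tree vSphere plugin. Below is a minimal client-go sketch, not part of this feature's actual code; it assumes a standard kubeconfig and relies only on the stock Kubernetes PV API, where in-tree vSphere volumes carry a `vsphereVolume` source and CSI-provisioned ones use the `csi.vsphere.vmware.com` driver.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Use the same kubeconfig that oc/kubectl would use.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pvs, err := client.CoreV1().PersistentVolumes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	inTree, csi := 0, 0
	for _, pv := range pvs.Items {
		switch {
		case pv.Spec.VsphereVolume != nil:
			// In-tree vSphere volume: will be migrated to CSI on upgrade to 4.14.
			inTree++
			fmt.Printf("in-tree PV %s (VMDK path: %s)\n", pv.Name, pv.Spec.VsphereVolume.VolumePath)
		case pv.Spec.CSI != nil && pv.Spec.CSI.Driver == "csi.vsphere.vmware.com":
			// Already provisioned by the vSphere CSI driver: not affected.
			csi++
		}
	}
	fmt.Printf("%d in-tree vSphere PV(s), %d vSphere CSI PV(s)\n", inTree, csi)
}
```

A cluster where this reports zero in-tree vSphere PVs is not in the at-risk rows above, and would not be blocked under the checks discussed below.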
Options being discussed for OpenShift 4.14
Here are the different options:
1. Do nothing in code; only document / advertise the fact that updating to vSphere 7.0 U3L+ or 8.0u2+ is strongly recommended.
   - Likely not enough; customers may miss or overlook the recommendation.
2. Raise an alert but don't block upgrades.
   - Better than option 1, but can still be overlooked.
3. Block upgrades.
   - Safest option, but can be badly perceived; customers may complain that we are forcing them to upgrade vSphere (OCP may not be the only workload running on vSphere).
4. Block upgrades with an option to bypass via an admin ack (the storage team's preferred option).
   - Upgrades are blocked, but an admin can explicitly ack the gate and choose to proceed.
Additional details:
- We can add checks to limit which clusters we block upgrades on. For example, we can check whether a cluster uses in-tree volumes; clusters only using CSI won't have their upgrades blocked.
- We need to add code in both 4.12 (for EUS-to-EUS upgrades) and 4.13. No additional code is required in 4.14.
- We can't selectively block 4.12-to-4.14 upgrades without also blocking 4.12-to-4.13 upgrades. This means 4.12-to-4.13 upgrades would also be blocked, which is another good reason to choose option 4.
As of August 25th, it has been decided to go with option 4. We are tracking the 4.12 and 4.13 work in OCPBUGS-18131 and OCPBUGS-18132.
With option 4, customers can't ignore the warning, as their environment will be un-upgradable until they act. The attached message will summarise the reason and link to the KCS. An admin can still upgrade without the right vSphere version by providing an explicit ack (sketched below). This means that while we greatly limit the odds, customers may still hit the issues; if so, VMware recommends applying the workarounds and contacting their support if the issue persists or if the customer requires any additional help.
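For reference, this relies on the standard CVO admin-ack mechanism: the Cluster Version Operator publishes required gates in the `admin-gates` ConfigMap (namespace `openshift-config-managed`), and an admin acknowledges a gate by setting its key to `"true"` in the `admin-acks` ConfigMap (namespace `openshift-config`), typically with an `oc patch`. Here is a minimal client-go sketch of the same two steps; the gate key used below is a hypothetical placeholder, since the real key is defined by the OCPBUGS-18131/18132 work and must be read from `admin-gates`.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// 1. List the gates the CVO currently requires an ack for.
	gates, err := client.CoreV1().ConfigMaps("openshift-config-managed").
		Get(context.TODO(), "admin-gates", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for key, msg := range gates.Data {
		fmt.Printf("gate %q: %s\n", key, msg)
	}

	// 2. Acknowledge a gate by setting its key to "true" in admin-acks.
	// "ack-4.13-vsphere-version" is a PLACEHOLDER key: use the actual key
	// reported by the admin-gates ConfigMap above.
	patch := []byte(`{"data":{"ack-4.13-vsphere-version":"true"}}`)
	if _, err := client.CoreV1().ConfigMaps("openshift-config").
		Patch(context.TODO(), "admin-acks", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("gate acknowledged; the CVO can proceed with the upgrade")
}
```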
Collateral to be updated for OpenShift 4.14
- New vSphere 4.14 FAQ
- KCS
  - Update solutions 7011684, 7011685 & 7011681
  - Mention the KCM bug is fixed in OCP 4.14
  - Mention the "CNS volume not found in-memory" issue is fixed in 8.0u2 (strongly recommend this version) and is not present in vSphere 7
  - Mention the "FCD in a state where attachment fails" issue is fixed in 8.0u1 and 7.0p07 (maps to 7.0 U3L); strongly recommend these versions
- Release notes
- CSI migration doc
- TE & knowledge transfer via PLMCORE-6191
- Mention in "What's new in OCP 4.14"
Teams to be included in Discussion
Storage, SPLAT, OTA, Docs, ODF, CEE
Links
- Depends on:
  - OCPBUGS-18131: Block upgrade to 4.13 from 4.12 for older versions with admin-ack (Closed)
  - OCPBUGS-18132: Block upgrade to 4.14 from 4.13 for older versions with admin-ack (Closed)
- Relates to:
  - OCPSTRAT-82: Finalise vSphere CSI migration (Closed)