Type: Bug
Resolution: Done-Errata
Severity: Critical
Status: CLOSED
Sprint: Storage Core Sprint 220
Priority: Urgent
+++ This bug was initially created as a clone of Bug #2088476 +++
+++ This bug was initially created as a clone of Bug #2021354 +++
Description of problem:
Attempted to restore an online snapshot of a Windows 2k19 server VM; the restore hung with a Pending PVC, apparently due to a mismatch between the old and new volume sizes (they must be identical).
Version-Release number of selected component (if applicable):
OCP 4.9.0
CNV 4.9.0
How reproducible:
Will reinstall to retest.
Steps to Reproduce:
1. Online snapshot Windows VM
2. Shut down VM
3. Restore from snapshot
Actual results:
ProvisioningFailed warning
Expected results:
Restored VM restarts from snapshot
Additional info:
oc describe pvc restore-552369e5-42cf-49d2-9d00-e35602a7cb17-rootdisk
Name: restore-552369e5-42cf-49d2-9d00-e35602a7cb17-rootdisk
Namespace: default
StorageClass: cnv-integration-svm
Status: Pending
Volume:
Labels: app=containerized-data-importer
app.kubernetes.io/component=storage
app.kubernetes.io/managed-by=cdi-controller
app.kubernetes.io/part-of=hyperconverged-cluster
app.kubernetes.io/version=v4.9.0
cdi-controller=cdi-tmp-fb49b48e-86aa-4905-96e0-9c759e411317
cdi.kubevirt.io=cdi-smart-clone
Annotations: k8s.io/CloneOf: true
k8s.io/SmartCloneRequest: true
restore.kubevirt.io/name: wintest-2021-11-8-02-restore-4hnh7u
volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
DataSource:
APIGroup: snapshot.storage.k8s.io
Kind: VolumeSnapshot
Name: vmsnapshot-9f5f7b71-d04d-4726-ba04-1ade131cd353-volume-rootdisk
Used By: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Provisioning 6m28s (x13 over 11m) csi.trident.netapp.io_trident-csi-5fddc99d78-qmjpr_d18d90d8-405e-43bd-800d-297afff97bd3 External provisioner is provisioning volume for claim "default/restore-552369e5-42cf-49d2-9d00-e35602a7cb17-rootdisk"
Warning ProvisioningFailed 6m28s (x13 over 11m) csi.trident.netapp.io_trident-csi-5fddc99d78-qmjpr_d18d90d8-405e-43bd-800d-297afff97bd3 failed to provision volume with StorageClass "cnv-integration-svm": error getting handle for DataSource Type VolumeSnapshot by Name vmsnapshot-9f5f7b71-d04d-4726-ba04-1ade131cd353-volume-rootdisk: requested volume size 10383777792 is less than the size 12058169344 for the source snapshot vmsnapshot-9f5f7b71-d04d-4726-ba04-1ade131cd353-volume-rootdisk
Normal ExternalProvisioning 5m58s (x26 over 11m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "csi.trident.netapp.io" or manually created by system administrator
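The numbers in the ProvisioningFailed event above tell the whole story: the restore PVC requests fewer bytes than the snapshot's restoreSize, so the CSI provisioner refuses the claim. A minimal sketch of the arithmetic, using the byte values copied from the event:

```python
GiB = 1024 ** 3

# Values taken verbatim from the ProvisioningFailed event above.
requested = 10_383_777_792     # restore PVC spec.resources.requests.storage
restore_size = 12_058_169_344  # VolumeSnapshot status.restoreSize

print(f"requested:   {requested / GiB:.2f} GiB")                     # 9.67 GiB
print(f"restoreSize: {restore_size / GiB:.2f} GiB")                  # 11.23 GiB
print(f"shortfall:   {(restore_size - requested) / GiB:.2f} GiB")    # 1.56 GiB

# Provisioners require requested >= restoreSize, hence the Pending PVC.
assert requested < restore_size
```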
NAME SOURCEKIND SOURCENAME PHASE READYTOUSE CREATIONTIME ERROR
virtualmachinesnapshot.snapshot.kubevirt.io/wintest-2021-11-8 VirtualMachine wintest Succeeded true 58m
virtualmachinesnapshot.snapshot.kubevirt.io/wintest-2021-11-8-02 VirtualMachine wintest Succeeded true 55m
NAME TARGETKIND TARGETNAME COMPLETE RESTORETIME ERROR
virtualmachinerestore.snapshot.kubevirt.io/wintest-2021-11-8-02-restore-4hnh7u VirtualMachine wintest false
oc describe vmrestore wintest-2021-11-8-02-restore-4hnh7u
Name: wintest-2021-11-8-02-restore-4hnh7u
Namespace: default
Labels: <none>
Annotations: <none>
API Version: snapshot.kubevirt.io/v1alpha1
Kind: VirtualMachineRestore
Metadata:
Creation Timestamp: 2021-11-08T23:17:16Z
Generation: 3
Managed Fields:
API Version: snapshot.kubevirt.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:spec:
.:
f:target:
.:
f:apiGroup:
f:kind:
f:name:
f:virtualMachineSnapshotName:
Manager: Mozilla
Operation: Update
Time: 2021-11-08T23:17:16Z
API Version: snapshot.kubevirt.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:ownerReferences:
.:
k:
:
f:status:
.:
f:complete:
f:conditions:
f:restores:
Manager: virt-controller
Operation: Update
Time: 2021-11-08T23:17:16Z
Owner References:
API Version: kubevirt.io/v1
Block Owner Deletion: true
Controller: true
Kind: VirtualMachine
Name: wintest
UID: 89daea3d-8712-4baf-b565-47b1ece4ff23
Resource Version: 5624615
UID: 552369e5-42cf-49d2-9d00-e35602a7cb17
Spec:
Target:
API Group: kubevirt.io
Kind: VirtualMachine
Name: wintest
Virtual Machine Snapshot Name: wintest-2021-11-8-02
Status:
Complete: false
Conditions:
Last Probe Time: <nil>
Last Transition Time: 2021-11-08T23:17:16Z
Reason: Creating new PVCs
Status: True
Type: Progressing
Last Probe Time: <nil>
Last Transition Time: 2021-11-08T23:17:16Z
Reason: Waiting for new PVCs
Status: False
Type: Ready
Restores:
Persistent Volume Claim: restore-552369e5-42cf-49d2-9d00-e35602a7cb17-rootdisk
Volume Name: rootdisk
Volume Snapshot Name: vmsnapshot-9f5f7b71-d04d-4726-ba04-1ade131cd353-volume-rootdisk
Events: <none>
— Additional comment from Adam Litke on 2021-11-10 12:36:25 UTC —
Shelly, please take a look.
— Additional comment from Chandler Wilkerson on 2021-11-11 15:51:24 UTC —
Additional debugging:
I created a new VM, win-resize-test with a 40Gi root disk and used virtctl guestfs to resize the OS into the larger PVC, reserving about 5.5% for overhead as CDI does.
I was able to snapshot and restore this VM without issue.
Additionally, just to ensure the environment, I created a RHEL8 VM using default settings. It performed a snapshot and restore without issue.
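The 5.5% reservation Chandler mentions corresponds to CDI's default filesystem-overhead fraction (0.055) for filesystem-mode PVCs. A rough sketch of that sizing (`usable_bytes` is a hypothetical helper, not CDI's actual rounding logic):

```python
OVERHEAD = 0.055  # CDI's default filesystem-overhead fraction (assumption: default config)
GiB = 1024 ** 3

def usable_bytes(pvc_size: int, overhead: float = OVERHEAD) -> int:
    """Bytes left for the OS disk after reserving filesystem overhead."""
    return int(pvc_size * (1 - overhead))

# A 40 GiB PVC, as in the win-resize-test experiment above.
pvc = 40 * GiB
print(f"40 GiB PVC -> ~{usable_bytes(pvc) / GiB:.2f} GiB usable for the OS disk")
```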
— Additional comment from Yan Du on 2021-11-17 13:38:31 UTC —
Hi, Chandler
Do you mean it works and we can close the bug?
— Additional comment from Chandler Wilkerson on 2021-11-17 13:43:16 UTC —
(In reply to Yan Du from comment #3)
> Hi, Chandler
> Do you mean it works and we can close the bug?
No, I mean that the underlying system is capable of performing VM snapshots on both a RHEL8 VM, and a Windows VM that has had its PVC resized, but the original problem persists.
I will try increasing the size of the Windows boot source image and see if it helps.
— Additional comment from Chandler Wilkerson on 2021-11-17 22:45:07 UTC —
It is possible the size mismatch is coming from a bug in the UI.
I recreated the base win2k19 image with a 22Gi DV.
$ oc -n openshift-virtualization-os-images get dv win2k19 -o yaml| grep storage
cdi.kubevirt.io/storage.bind.immediate.requested: "true"
cdi.kubevirt.io/storage.clone.token: eyJhbGciOiJQUzI1NiIsImtpZCI6IiJ9.eyJleHAiOjE2MzcxNjg2MTUsImlhdCI6MTYzNzE2ODMxNSwiaXNzIjoiY2RpLWFwaXNlcnZlciIsIm5hbWUiOiJ3aW5kb3dzLWluc3RhbGwtcm9vdGRpc2siLCJuYW1lc3BhY2UiOiJrdWJldmlydC1naXRvcHMiLCJuYmYiOjE2MzcxNjgzMTUsIm9wZXJ0YXRpb24iOiJDbG9uZSIsInBhcmFtcyI6eyJ0YXJnZXROYW1lIjoid2luMmsxOSIsInRhcmdldE5hbWVzcGFjZSI6Im9wZW5zaGlmdC12aXJ0dWFsaXphdGlvbi1vcy1pbWFnZXMifSwicmVzb3VyY2UiOnsiZ3JvdXAiOiIiLCJyZXNvdXJjZSI6InBlcnNpc3RlbnR2b2x1bWVjbGFpbXMiLCJ2ZXJzaW9uIjoidjEifX0.ucsCMmluPrSXQaOfhY6e_CLC8b0d56zgoXop82FnDU7mcVvGUj0cdT0asxDZ5I0_2nUS5DrqviDR2BYU79yhkwAmDLbY5NroumV9CufIqBjnjeQpVX50Pzh0dB-0byNTxZR8HEqdBCGq8QYJgNU9C_Cva0OpDBQmFzPqJotv9oVlUHnM-gQi__t59KpJIJrhArAO95KnsNHVKN2jEvzGzQT0YAsz67cvXxo-xzCZN0Md-rfofM-TRyNmBAmfO3ugjUQAP09APXoTj6k814TDD46Ry_I0Br-5QT2Isv0TLhevr_tvsM9HhQL3IwtVbuYpbcJ2_Jzm7aiCIbwZzjWsAw
{"apiVersion":"cdi.kubevirt.io/v1beta1","kind":"DataVolume","metadata":{"annotations":
,"name":"win2k19","namespace":"openshift-virtualization-os-images"},"spec":{"pvc":{"accessModes":["ReadWriteOnce"],"resources":{"requests":
{"storage":"22Gi"}}},"source":{"pvc":{"name":"windows-install-rootdisk","namespace":"kubevirt-gitops"}}}}
storage: 22Gi
and its pvc:
$ oc -n openshift-virtualization-os-images get pvc win2k19
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
win2k19 Bound pvc-353b979e-a9bf-4fb4-8518-e379dc0affc4 22Gi RWO cnv-integration-svm 5h29m
I then create a VM using the console wizard, and it says it is creating a VM with 9G storage.
Looking at the resultant DV:
$ oc get dv win2k19-magnificent-manatee -o yaml | grep storage
cdi.kubevirt.io/storage.clone.token: eyJhbGciOiJQUzI1NiIsImtpZCI6IiJ9.eyJleHAiOjE2MzcxODgyMjgsImlhdCI6MTYzNzE4NzkyOCwiaXNzIjoiY2RpLWFwaXNlcnZlciIsIm5hbWUiOiJ3aW4yazE5IiwibmFtZXNwYWNlIjoib3BlbnNoaWZ0LXZpcnR1YWxpemF0aW9uLW9zLWltYWdlcyIsIm5iZiI6MTYzNzE4NzkyOCwib3BlcnRhdGlvbiI6IkNsb25lIiwicGFyYW1zIjp7InRhcmdldE5hbWUiOiJ3aW4yazE5LW1hZ25pZmljZW50LW1hbmF0ZWUiLCJ0YXJnZXROYW1lc3BhY2UiOiJkZWZhdWx0In0sInJlc291cmNlIjp7Imdyb3VwIjoiIiwicmVzb3VyY2UiOiJwZXJzaXN0ZW50dm9sdW1lY2xhaW1zIiwidmVyc2lvbiI6InYxIn19.q0Dpvygf9kAIxC4Gfh8s0KNxKfR0p_YkR9S4eCT2D4HToCHVcRo07R23OMXHb-e7SdU9O9vUjSsQ5kXJ1jmSIyBjHroExvL6FYU3wU0GFsZtmkSM3bLLgTO4x6BR6ZkHqJQ34m5MOUxdTSJ0ogyB2gQ_gn0JGp-bnVzCRhsRVWw5pnv3t8jm1CVOtDtm2QZxgvpafXrdPoTYAFhHjmlh81fs0EnP5wUpR_Nu1FMYy0VOq4Y2kH0a5fGB5WUGUGjU-PeQ1KCIc6_OWFlIplmVXSq_i-yv8nhftdRokfh51NzPO-JXDqL2FgIBK13EgcQGyziVxbbpgDLoa4kIQnS9iw
storage: 10186796Ki
storageClassName: cnv-integration-svm
The storage is way low, however:
$ oc get pvc win2k19-magnificent-manatee
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
win2k19-magnificent-manatee Bound pvc-b422dcd1-87d9-4d38-b0f0-6276b14a288b 22Gi RWO cnv-integration-svm 7m51s
Snapshot:
$ oc get vmsnapshot,volumesnapshot,pvc | grep manatee
virtualmachinesnapshot.snapshot.kubevirt.io/win2k19-magnificent-manatee-2021-11-17 VirtualMachine win2k19-magnificent-manatee Succeeded true 4m32s
volumesnapshot.snapshot.storage.k8s.io/vmsnapshot-b73f4fd6-e71e-49bf-be8c-5d0fef96e960-volume-win2k19-magnificent-manatee true win2k19-magnificent-manatee 10100400Ki csi-snapclass snapcontent-6025e952-779e-44b8-87f3-5037de4ca7b1 4m33s 4m33s
persistentvolumeclaim/win2k19-magnificent-manatee Bound pvc-b422dcd1-87d9-4d38-b0f0-6276b14a288b 22Gi RWO cnv-integration-svm 17m
It then creates a restore of insufficient size:
$ oc get pvc restore-21fef1e1-62b9-4caa-bfed-e38a7538d99d-win2k19-magnificent-manatee -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
annotations:
k8s.io/CloneOf: "true"
k8s.io/SmartCloneRequest: "true"
restore.kubevirt.io/name: win2k19-magnificent-manatee-2021-11-17-restore-fmtfni
volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
creationTimestamp: "2021-11-17T22:43:34Z"
finalizers:
- kubernetes.io/pvc-protection
labels:
app: containerized-data-importer
app.kubernetes.io/component: storage
app.kubernetes.io/managed-by: cdi-controller
app.kubernetes.io/part-of: hyperconverged-cluster
app.kubernetes.io/version: v4.9.0
cdi-controller: cdi-tmp-f8dfe527-ade8-48a5-ad20-9bf514829cac
cdi.kubevirt.io: cdi-smart-clone
name: restore-21fef1e1-62b9-4caa-bfed-e38a7538d99d-win2k19-magnificent-manatee
namespace: default
ownerReferences:
- apiVersion: kubevirt.io/v1
blockOwnerDeletion: true
controller: true
kind: VirtualMachine
name: win2k19-magnificent-manatee
uid: 2ae73de8-d362-49d1-aad3-86f21cb8a08b
resourceVersion: "14966926"
uid: 1e43066c-7f6a-4505-b914-3d53b547b3cd
spec:
accessModes:
- ReadWriteOnce
dataSource:
apiGroup: snapshot.storage.k8s.io
kind: VolumeSnapshot
name: vmsnapshot-b73f4fd6-e71e-49bf-be8c-5d0fef96e960-volume-win2k19-magnificent-manatee
resources:
requests:
storage: 10080872Ki
storageClassName: cnv-integration-svm
volumeMode: Filesystem
status:
phase: Pending
— Additional comment from Yan Du on 2021-11-24 13:35:00 UTC —
Hi, Shelly, Could you please help take a look?
— Additional comment from on 2021-11-29 07:44:36 UTC —
I tried looking at it again. Will talk to Chandler to get more info.
— Additional comment from Yan Du on 2022-01-26 13:28:17 UTC —
Shelly, is there any updates for this bug?
— Additional comment from Adam Litke on 2022-01-31 13:18:50 UTC —
We are unable to reproduce this bug and will close it. If it continues to be an issue please reopen it.
— Additional comment from Chandler Wilkerson on 2022-03-31 12:38:37 UTC —
This is still an issue; if it's a matter of access to the cluster, I can provide that. (Apologies if I missed this earlier, and/or had a broken cluster at the time...)
— Additional comment from Chandler Wilkerson on 2022-03-31 13:42:15 UTC —
Here is another example:
My Windows install job creates a VM with the following dataVolumeTemplate:
dataVolumeTemplates:
- metadata:
name: windows-install-rootdisk
spec:
pvc:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 22Gi
source:
blank: {}
The installer ends up writing around 10Gi to the disk, and I use the following DV to clone it into openshift-virtualization-os-images:
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
name: win2k19
namespace: openshift-virtualization-os-images
annotations:
cdi.kubevirt.io/storage.bind.immediate.requested: "true"
kubevirt.ui/provider: Microsoft
spec:
source:
pvc:
namespace: kubevirt-gitops
name: windows-install-rootdisk
storage:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 22Gi
This ends up creating the following PVC in openshift-virtualization-os-images:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
annotations:
cdi.kubevirt.io/ownedByDataVolume: openshift-virtualization-os-images/win2k19
cdi.kubevirt.io/readyForTransfer: "true"
cdi.kubevirt.io/smartCloneSnapshot: kubevirt-gitops/cdi-tmp-96428e41-993d-4c45-9542-db80a83388ca
cdi.kubevirt.io/storage.condition.running: "False"
cdi.kubevirt.io/storage.condition.running.message: Clone Complete
cdi.kubevirt.io/storage.condition.running.reason: Completed
cdi.kubevirt.io/storage.populatedFor: win2k19
k8s.io/CloneOf: "true"
k8s.io/SmartCloneRequest: "true"
pv.kubernetes.io/bind-completed: "yes"
pv.kubernetes.io/bound-by-controller: "yes"
volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
volume.kubernetes.io/storage-provisioner: csi.trident.netapp.io
creationTimestamp: "2022-03-30T18:47:20Z"
finalizers:
- kubernetes.io/pvc-protection
labels:
alerts.k8s.io/KubePersistentVolumeFillingUp: disabled
app: containerized-data-importer
app.kubernetes.io/component: storage
app.kubernetes.io/managed-by: cdi-controller
app.kubernetes.io/part-of: hyperconverged-cluster
app.kubernetes.io/version: 4.10.0
cdi-controller: cdi-tmp-96428e41-993d-4c45-9542-db80a83388ca
cdi.kubevirt.io: cdi-smart-clone
name: win2k19
namespace: openshift-virtualization-os-images
ownerReferences:
- apiVersion: cdi.kubevirt.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: DataVolume
name: win2k19
uid: 96428e41-993d-4c45-9542-db80a83388ca
resourceVersion: "17272464"
uid: b08eea7e-ef6f-411f-a20f-ceb81735eaea
spec:
accessModes:
- ReadWriteOnce
dataSource:
apiGroup: snapshot.storage.k8s.io
kind: VolumeSnapshot
name: cdi-tmp-96428e41-993d-4c45-9542-db80a83388ca
resources:
requests:
storage: 10377400Ki
storageClassName: cnv-integration-svm
volumeMode: Filesystem
volumeName: pvc-1784ae57-2465-42b4-bcb9-d710f83271c2
status:
accessModes:
- ReadWriteOnce
capacity:
storage: 22Gi
phase: Bound
Note that under spec.resources.requests.storage, something has reduced the size to 10377400Ki, which is about 9.89 GiB.
— Additional comment from Adam Litke on 2022-05-09 12:20:34 UTC —
Chandler,
— Additional comment from Adam Litke on 2022-05-09 19:02:22 UTC —
Shelly, I thought that this might be a dup of 2064936 but that seems impossible since this bug is related to VMSnapshotRestore which is a kubevirt feature and 2064936 deals with filesystem overhead calculations made by CDI. Please work with Chandler to further diagnose.
— Additional comment from Michael Henriksen on 2022-05-10 13:25:34 UTC —
regarding: https://bugzilla.redhat.com/show_bug.cgi?id=2021354#c11
I am not sure that the PVC is undersized. The PVC size in the status (status.capacity.storage) is (correctly) reported as 22Gi.
Regarding the value in `spec.resources.requests.storage`. When doing a "smart clone" that value is initially set from the `status.restoreSize` of the snapshot here [1]. Since it was stated that approximately 10G was written to the blank disk, this initial value could make sense.
The smart clone process then extends the PVC to the requested size by updating the PVC spec.resources.requests.storage. It is strange that the PVC does not reflect this update. But based on the current value of status.capacity.storage it appears to have been resized correctly.
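The smart-clone sizing steps Michael describes can be modeled as a toy sequence (this is a sketch for illustration, not CDI's actual code; `smart_clone_requests` is a hypothetical function): the clone PVC starts at the snapshot's restoreSize, then is extended to the DataVolume's requested size.

```python
Ki, GiB = 1024, 1024 ** 3

def smart_clone_requests(restore_size: int, target_size: int) -> list[int]:
    """Sequence of spec.resources.requests.storage values during a smart clone."""
    sizes = [restore_size]          # PVC created from the VolumeSnapshot
    if target_size > restore_size:  # then expanded to the requested size
        sizes.append(target_size)
    return sizes

# Numbers from the comment above: ~10G written to a blank disk, 22Gi requested.
steps = smart_clone_requests(10_377_400 * Ki, 22 * GiB)
print([s // Ki for s in steps])  # [10377400, 23068672] (in Ki)
```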
— Additional comment from Michael Henriksen on 2022-05-10 19:14:46 UTC —
After digging into this for a bit, I can see how the DataVolume in [1] would be problematic if it was part of a VMSnapshot+VMRestore.
The main issue is that the requested PVC size (10377400Ki) is smaller than the actual PVC size (22Gi), and the snapshot/restore controllers do not handle this totally valid situation correctly. The restore controller will create a 10377400Ki PVC. This is obviously problematic because the snapshot may be up to 22G.
There are a couple of flawed assumptions in the current snapshot/restore logic:
1. The "status.restoreSize" of a VolumeSnapshot equals "spec.resources.requests.storage" of source PVC
2. The source PVC "spec.resources.requests.storage" equals "status.capacity.storage"
With some provisioners (Ceph RBD), the above is true. But clearly that is not always the case, as here with NetApp Trident.
I believe this issue can be addressed as follows:
1. VirtualMachineSnapshots should include "status.capacity.storage" for each PVC. Not necessarily the entire PVC status, but at least that part
2. The VM Restore controller has to restore PVCs more intelligently
A. If storage class supports expansion
i. Create target PVC with initial size of VolumeSnapshot "status.restoreSize"
ii. Expand PVC to have size equal to source PVC "status.capacity.storage" if necessary
B. If expansion not supported
i. Create target PVC with initial size of PVC "status.capacity.storage"
ii. Hope for the best (works fine with trident)
The remaining question is how to handle restoring from old VM snapshots. There are a couple of options:
1. A validating webhook can check, for each volume in the VMSnapshot, that the VolumeSnapshot "status.restoreSize" > PVC "spec.resources.requests.storage" and reject the restore if so
2. Instead of rejecting, create new target PVC some X% bigger than VolumeSnapshot "status.restoreSize"
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2021354#c11
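The restore-size selection Michael proposes can be sketched as follows (hypothetical function name, plain integers standing in for resource quantities; not the actual virt-controller code):

```python
def restore_pvc_sizes(restore_size: int, source_capacity: int,
                      allow_expansion: bool) -> tuple[int, int]:
    """Return (initial request, final size) for the restored PVC."""
    if allow_expansion:
        # A: create at the snapshot's restoreSize, then expand to the
        # source PVC's status.capacity.storage if necessary.
        return restore_size, max(restore_size, source_capacity)
    # B: no expansion support -- create directly at the source capacity
    # and hope the provisioner accepts a request >= restoreSize
    # (this works fine with Trident).
    return source_capacity, source_capacity

GiB = 1024 ** 3
print(restore_pvc_sizes(11 * GiB, 22 * GiB, allow_expansion=True))
print(restore_pvc_sizes(11 * GiB, 22 * GiB, allow_expansion=False))
```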
— Additional comment from Michael Henriksen on 2022-05-11 14:24:50 UTC —
One more important issue to note. Smart clone does not update "spec.resources.requests.storage" if "status.capacity.storage" is >= desired target size. This is how we ended up with the PVC in [1]. We should fix that. The code is here: https://github.com/kubevirt/containerized-data-importer/blob/main/pkg/controller/datavolume-controller.go#L1337-L1348
Although this bug would be fixed with only the above fix, it has exposed some flaws in the snapshot/restore logic. Specifically, handling PVCs with more data than "spec.resources.requests.storage".
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2021354#c11
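The skip condition Michael points out can be modeled as a toy function (an illustrative sketch, not CDI's real code): smart clone leaves spec.resources.requests.storage untouched whenever status.capacity.storage already covers the target, so the spec can stay at the snapshot's restoreSize even though the volume is larger.

```python
def next_requests(current_requests: int, capacity: int, target: int,
                  fixed: bool) -> int:
    """Value of spec.resources.requests.storage after the smart-clone step."""
    if not fixed and capacity >= target:
        return current_requests           # buggy: stale, undersized spec value
    return max(current_requests, target)  # fixed: spec tracks the target size

Ki, GiB = 1024, 1024 ** 3
# Numbers from the PVC in [1]: spec requests 10377400Ki, capacity 22Gi.
print(next_requests(10_377_400 * Ki, 22 * GiB, 22 * GiB, fixed=False))  # stale
print(next_requests(10_377_400 * Ki, 22 * GiB, 22 * GiB, fixed=True))
```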
— Additional comment from on 2022-05-15 07:20:27 UTC —
As Michael mentioned I'm working on the fix in Smart clone. The flaws in the snapshot restore process will be considered and handled.
— Additional comment from on 2022-05-19 12:57:47 UTC —
Adding here a link to the bug we opened as a result of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=2086825