OpenShift Virtualization / CNV-30083

[2216038] Datavolumes take a long time to be created, jwt token expires


    • MODIFIED
    • Release Notes
    • If you simultaneously clone more than 1000 VMs using the provided DataSources in the openshift-virtualization-os-images namespace, it is possible that not all of the VMs will move to a running state. (BZ#2216038)

      As a workaround, deploy VMs in smaller batches.
    • Known Issue
    • Done
    • Storage Core Sprint 240, Storage Core Sprint 241, Storage Core Sprint 242, Storage Core Sprint 243, Storage Core Sprint 246, Storage Core Sprint 247
    • High

      Description of problem:

      Frequently when creating virtual machines we see that they are stuck in Pending forever. During investigation we saw DataVolumes that fail to be created and are stuck forever with no status update. Looking at the logs of the cdi-deployment pod in openshift-cnv, we see entries like this:
      ```

      {"level":"error","ts":<timestamp>,"logger":"controller.datavolume-controller","msg":"Reconciler error", "name":<dv name>,"namespace":<namespace>,"error":"error verifying token: square/go-jose/jwt: validation failed, token is expired (exp)","errorVerbose":"square/go-jose/jwt validation failed, token is expired\nerror verifying token\nkubevirt.io/containerized-data-importer/pkg/controller/.validateCloneTokenDW <snip>"}

      ```
      After some investigation we found the token in the DV annotations:
      ```
      metadata:
        annotations:
          cdi.kubevirt.io/storage.clone.token: <jwt>
      ```

      The JWT is in fact expired, and is only valid for 5 minutes.
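      For reference, here is a minimal sketch (not part of the original report) of how one could read the exp claim straight out of the annotation value, using only the Python standard library; the token string is whatever is stored under cdi.kubevirt.io/storage.clone.token:
      ```
      import base64
      import json
      import time

      def jwt_exp(token: str) -> int:
          """Return the 'exp' claim of a JWT without verifying its signature."""
          payload_b64 = token.split(".")[1]
          # base64url segments in a JWT have their padding stripped; restore it.
          payload_b64 += "=" * (-len(payload_b64) % 4)
          return json.loads(base64.urlsafe_b64decode(payload_b64))["exp"]

      token = "<jwt>"  # paste the annotation value here
      exp = jwt_exp(token)
      print("exp:", time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(exp)))
      print("expired" if exp < time.time() else "still valid")
      ```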
      The status of the DV is just empty like this:
      ```
      status: {}
      ```
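      A quick way to spot DataVolumes that are stuck like this (a sketch assuming cluster access and the kubernetes Python client; not part of the original report):
      ```
      from kubernetes import client, config

      config.load_kube_config()  # or load_incluster_config() when run inside a pod
      api = client.CustomObjectsApi()

      # DataVolumes are CDI custom resources: cdi.kubevirt.io/v1beta1, plural "datavolumes".
      dvs = api.list_cluster_custom_object(group="cdi.kubevirt.io", version="v1beta1", plural="datavolumes")
      for dv in dvs["items"]:
          if not dv.get("status"):  # status missing or {} -> never reconciled
              meta = dv["metadata"]
              print(meta["namespace"], meta["name"], meta["creationTimestamp"])
      ```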

      This doesn't reproduce all the time.

      In the initial set of logs it took a little more than 8 minutes from when the DV was created (as can be seen in the creationTimestamp) until the first log entry: 2023-06-05T06:44:58Z -> 1685948026.3881886 (2023-06-05T06:53:46Z)
      Regarding the time sync issue suggested, I verified there is no difference in the time between different nodes in the cluster, and they are all connected to the same NTP server.
      The logs from this time are already attached, including the datavolume yaml.

      dv-expire.tar.gz/dv.yaml: creationTimestamp: "2023-06-05T06:44:58Z"

      ===
      cdi-extended.tar.gz/cdi-deployment.log
      ===

      {"level":"info","ts":1685948026.3881886,"logger":"controller.datavolume-controller","msg":"Initializing transfer","Datavolume":"mongodb/affected-vm-1-rootdisk"}

      {"level":"error","ts":1685948026.3953767,"logger":"controller.datavolume-controller","msg":"Reconciler error","name":"affected-vm-1-rootdisk","namespace":"mongodb","error":"error verifying token: square/go-jose/jwt: validation failed, token is expired (exp)","errorVerbose":"square/go-jose/jwt: validation failed, token is expired (exp)\nerror verifying token\nkubevirt.io/containerized-data-importer/pkg/controller.validateCloneTokenDV\n\t/remote-source/app/pkg/controller/util.go:876\nkubevirt.io/containerized-data-importer/pkg/controller.(*DatavolumeReconciler).initTransfer\n\t/remote-source/app/pkg/controller/datavolume-controller.go:1156\nkubevirt.io/containerized-data-importer/pkg/controller.(*DatavolumeReconciler).doCrossNamespaceClone\n\t/remote-source/app/pkg/controller/datavolume-controller.go:896\nkubevirt.io/containerized-data-importer/pkg/controller.(*DatavolumeReconciler).reconcileSmartClonePvc\n\t/remote-source/app/pkg/

      The customer then managed to reproduce this issue in a pre-prod online environment by creating a few hundred VMs. 100 of the VMs have a DataVolume configuration that doesn't work - it tries to copy the PVC from a different storage class. The rest of the VMs are completely regular - they should be created normally, but as can be seen there are 292 DataVolumes that took more than 5 minutes to be acknowledged and are stuck in limbo since the JWT expired.
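      To give a concrete idea of what such a bulk clone amounts to, here is a rough, simplified sketch that creates many DataVolumes cloning from one of the provided DataSources (the actual repro created VMs, which in turn create these DataVolumes; the names, namespace, count and size below are illustrative assumptions, not taken from the customer environment):
      ```
      from kubernetes import client, config

      config.load_kube_config()
      api = client.CustomObjectsApi()

      for i in range(1000):  # illustrative count; the report mentions hundreds to 1000+ clones
          dv = {
              "apiVersion": "cdi.kubevirt.io/v1beta1",
              "kind": "DataVolume",
              "metadata": {"name": f"clone-test-{i}", "namespace": "default"},
              "spec": {
                  # Clone from a boot-source DataSource, as the VMs in the report do.
                  "sourceRef": {
                      "kind": "DataSource",
                      "name": "fedora",  # assumed DataSource name
                      "namespace": "openshift-virtualization-os-images",
                  },
                  "storage": {"resources": {"requests": {"storage": "30Gi"}}},
              },
          }
          api.create_namespaced_custom_object(
              group="cdi.kubevirt.io", version="v1beta1",
              namespace="default", plural="datavolumes", body=dv,
          )
      ```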

      Attached to the case are a CNV must-gather as well as an OpenShift must-gather from the reproduction.

            mhenriks@redhat.com Michael Henriksen
            shaselde@redhat.com Sean Haselden
            Dalia Frank