- Bug
- Resolution: Unresolved
- Blocker
- CNV v4.18.13
- None
- Quality / Stability / Reliability
- 0.42
- False
-
- False
- None
- Release Notes
- virt-handler fails to create virt-launcher client connection after upgrading from 4.17.27 to 4.18.13
- Known Issue
- Proposed
-
- Critical
- Customer Reported
- None
Description of problem:
After upgrading the OpenShift Virtualization operator from 4.17.27 to 4.18.13 (the latest version in the stable channel), virt-handler fails to communicate with virt-launcher with the following error:
{"component":"virt-handler","kind":"VirtualMachineInstance","level":"error","msg":"Synchronizing the VirtualMachineInstance failed.","name":"rhel9-blue-crawdad-23","namespace":"nijin-cnv","pos":"vm.go:2154","reason":"unable to create virt-launcher client connection: can not add ghost record when entry already exists with differing socket file location","timestamp":"2025-09-25T09:55:09.747376Z","uid":"899d636f-9c2b-4e4c-ac48-9dd008c42bd7"}
The following code block throws this error:
if ok && record.SocketFile != socketFile {
    return fmt.Errorf("can not add ghost record when entry already exists with differing socket file location")
}
Here record.SocketFile is the value read from the ghost record file, while socketFile is derived from the "active Pod" UUID in the VMI status, which is then used to locate the socket file path under "/var/lib/kubelet". And both of them match here:
# cat /var/run/kubevirt-private/ghost-records/899d636f-9c2b-4e4c-ac48-9dd008c42bd7
{"name":"rhel9-blue-crawdad-23","namespace":"nijin-cnv","socketFile":"/pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets/launcher-sock","uid":"899d636f-9c2b-4e4c-ac48-9dd008c42bd7"}
# oc get vmi rhel9-blue-crawdad-23 -o yaml | yq '.status.activePods'
fb404e62-1665-4498-be76-d6d01263860e: openshift-worker-leo-0

openshift-worker-leo-0 ~]# ls -laR /var/lib/kubelet/ | grep -A 4 "/pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets"
/var/lib/kubelet/pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets:
total 0
drwxrwsrwx. 2 root 107  27 Sep 25 08:46 .
drwxr-xr-x. 9 root root 140 Sep 25 08:46 ..
srwxr-xr-x. 1 107  107   0 Sep 25 08:46 launcher-sock
But the comparison is still failing.
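For illustration only, below is a minimal, self-contained Go sketch of the check pattern quoted above. GhostRecord and addGhostRecord are simplified, hypothetical stand-ins rather than the actual KubeVirt implementation; only the comparison itself mirrors the quoted code. The point is that the stored socket path must match the freshly derived one byte for byte, so any formatting difference in the path string is enough to produce this error.

package main

import "fmt"

// GhostRecord mirrors the JSON layout of the files under
// /var/run/kubevirt-private/ghost-records/ (hypothetical, simplified type).
type GhostRecord struct {
    Name       string `json:"name"`
    Namespace  string `json:"namespace"`
    SocketFile string `json:"socketFile"`
    UID        string `json:"uid"`
}

// ghostRecords stands in for the records loaded from disk when virt-handler starts.
var ghostRecords = map[string]GhostRecord{}

// addGhostRecord rejects an entry whose socket path differs, as a raw string,
// from the one already stored for the same UID.
func addGhostRecord(name, namespace, socketFile, uid string) error {
    record, ok := ghostRecords[uid]
    if ok && record.SocketFile != socketFile {
        return fmt.Errorf("can not add ghost record when entry already exists with differing socket file location")
    }
    ghostRecords[uid] = GhostRecord{Name: name, Namespace: namespace, SocketFile: socketFile, UID: uid}
    return nil
}

func main() {
    // Pre-existing record, e.g. read back from disk after an upgrade or restart.
    ghostRecords["uid-1"] = GhostRecord{SocketFile: "/pods/abc/volumes/kubernetes.io~empty-dir/sockets/launcher-sock", UID: "uid-1"}
    // Identical path: accepted (prints <nil>).
    fmt.Println(addGhostRecord("vm", "ns", "/pods/abc/volumes/kubernetes.io~empty-dir/sockets/launcher-sock", "uid-1"))
    // Any byte-level difference in the path: rejected with the error above.
    fmt.Println(addGhostRecord("vm", "ns", "/pods/xyz/volumes/kubernetes.io~empty-dir/sockets/launcher-sock", "uid-1"))
}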
The problem is caused by a difference in how 4.17.27 and 4.18.13 save the ghost record data: 4.17.27 writes the socket path with a single slash (/pods/), whereas 4.18.13 writes it with a double slash (//pods/).
Newly started VM in 4.18.13:
# cat /var/run/kubevirt-private/ghost-records/467f0c19-33d5-4a7e-af0d-ea0e9ec29047
{"name":"rhel9-apricot-mastodon-79","namespace":"nijin-cnv","socketFile":"//pods/260a2291-0a05-48f5-8b64-2f320d84d4c8/volumes/kubernetes.io~empty-dir/sockets/launcher-sock","uid":"467f0c19-33d5-4a7e-af0d-ea0e9ec29047"}
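A small, self-contained Go illustration (not KubeVirt code) of why the pre-upgrade record and the newly derived path fail the exact string comparison even though they point at the same socket: the two path strings are equivalent once cleaned, but they are not equal as raw strings.

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    // Socket path as stored in the ghost record written by 4.17.27 (single slash).
    oldRecord := "/pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets/launcher-sock"
    // Socket path as derived at runtime by 4.18.13 (double slash).
    newDerived := "//pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets/launcher-sock"

    fmt.Println("raw strings equal:  ", oldRecord == newDerived)                                 // false -> ghost record check fails
    fmt.Println("cleaned paths equal:", filepath.Clean(oldRecord) == filepath.Clean(newDerived)) // true  -> same file on disk
}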
The difference comes from a recent patch applied in 4.17.27 [1]. However, this change hasn't landed in the 4.18 stable channel, where the latest version is 4.18.13. The corresponding change [2] is present in 4.18.16, which is available in the candidate channel, and I cannot reproduce the problem after changing the virt-handler image version to 4.18.16.
[1] https://github.com/kubevirt/kubevirt/pull/15522/commits/07795d383a0f882d6525a196c3f4bd80fcbec35e
[2] https://github.com/kubevirt/kubevirt/pull/15418/commits/8d5ea940a29a60f0e3e6e31d05ba5a922ad89d89
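For context, the sketch below shows one common way a leading double slash can creep into a constructed path in Go: naive string concatenation with a root prefix versus filepath.Join. This is only an illustration of the kind of formatting difference involved; it is not a claim about what the referenced patches [1] and [2] actually change.

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    root := "/" // hypothetical base prefix
    rel := "/pods/260a2291-0a05-48f5-8b64-2f320d84d4c8/volumes/kubernetes.io~empty-dir/sockets/launcher-sock"

    fmt.Println(root + rel)               // //pods/... (double slash, like the 4.18.13 record)
    fmt.Println(filepath.Join(root, rel)) // /pods/...  (single slash, like the 4.17.27 record)
}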
Version-Release number of selected component (if applicable):
OpenShift Virtualization 4.18.13
How reproducible:
100%
Steps to Reproduce:
1. Create an OpenShift Virtualization 4.17.27 cluster and start a few VMs.
2. Upgrade the cluster to the latest 4.18 release, which is 4.18.13.
3. After the upgrade, VM live migration fails. Check the virt-handler logs for the error "unable to create virt-launcher client connection: can not add ghost record when entry already exists with differing socket file location".
Actual results:
All user actions on existing VMs, such as live migration and shutdown, fail because virt-handler's communication with virt-launcher is broken. This causes delays in tight upgrade windows and requires manual intervention to resolve.
Expected results:
virt-handler continues to communicate with virt-launcher after the upgrade, and VM operations such as live migration and shutdown succeed.
Additional info:
As mentioned above, this should be fixed in the next OpenShift Virtualization stable release. However, since we already have an affected customer (and may see more), it would be good to get an acknowledgement from engineering and QE verification.