- Bug
- Resolution: Unresolved
- Blocker
- CNV v4.18.13
- None
- Quality / Stability / Reliability
- 0.42
- False
-
- False
- None
- Release Notes
- virt-handler fails to create virt-launcher client connection after upgrading from 4.17.27 to 4.18.13
- Known Issue
- Proposed
-
- Critical
- Customer Reported
- None
Description of problem:
After upgrading the OpenShift Virtualization operator from 4.17.27 to 4.18.13 (the latest version in the stable channel), virt-handler fails to communicate with virt-launcher with the following error:
{"component":"virt-handler","kind":"VirtualMachineInstance","level":"error","msg":"Synchronizing the VirtualMachineInstance failed.","name":"rhel9-blue-crawdad-23","namespace":"nijin-cnv","pos":"vm.go:2154","reason":"unable to create virt-launcher client connection: can not add ghost record when entry already exists with differing socket file location","timestamp":"2025-09-25T09:55:09.747376Z","uid":"899d636f-9c2b-4e4c-ac48-9dd008c42bd7"}
The following code block throws this error:
if ok && record.SocketFile != socketFile {
    return fmt.Errorf("can not add ghost record when entry already exists with differing socket file location")
}
Here record.SocketFile is the value read from the ghost record file, while socketFile is derived from the "active Pod" UUID in the VMI status, which is then used to locate the socket file path under "/var/lib/kubelet". And both of them match here:
# cat /var/run/kubevirt-private/ghost-records/899d636f-9c2b-4e4c-ac48-9dd008c42bd7
{"name":"rhel9-blue-crawdad-23","namespace":"nijin-cnv","socketFile":"/pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets/launcher-sock","uid":"899d636f-9c2b-4e4c-ac48-9dd008c42bd7"}
# oc get vmi rhel9-blue-crawdad-23 -o yaml | yq '.status.activePods'
fb404e62-1665-4498-be76-d6d01263860e: openshift-worker-leo-0

openshift-worker-leo-0 ~]# ls -laR /var/lib/kubelet/ | grep -A 4 "/pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets"
/var/lib/kubelet/pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets:
total 0
drwxrwsrwx. 2 root 107  27 Sep 25 08:46 .
drwxr-xr-x. 9 root root 140 Sep 25 08:46 ..
srwxr-xr-x. 1 107  107   0 Sep 25 08:46 launcher-sock
But the comparison is still failing.
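For illustration only, below is a minimal, self-contained Go sketch of the check pattern quoted above. GhostRecord and addGhostRecord are simplified, hypothetical stand-ins rather than the actual KubeVirt implementation; only the comparison itself mirrors the quoted code. The point is that the stored socket path must match the freshly derived one byte for byte, so any formatting difference in the path string is enough to produce this error.

package main

import "fmt"

// GhostRecord mirrors the JSON layout of the files under
// /var/run/kubevirt-private/ghost-records/ (hypothetical, simplified type).
type GhostRecord struct {
    Name       string `json:"name"`
    Namespace  string `json:"namespace"`
    SocketFile string `json:"socketFile"`
    UID        string `json:"uid"`
}

// ghostRecords stands in for the records loaded from disk when virt-handler starts.
var ghostRecords = map[string]GhostRecord{}

// addGhostRecord rejects an entry whose socket path differs, as a raw string,
// from the one already stored for the same UID.
func addGhostRecord(name, namespace, socketFile, uid string) error {
    record, ok := ghostRecords[uid]
    if ok && record.SocketFile != socketFile {
        return fmt.Errorf("can not add ghost record when entry already exists with differing socket file location")
    }
    ghostRecords[uid] = GhostRecord{Name: name, Namespace: namespace, SocketFile: socketFile, UID: uid}
    return nil
}

func main() {
    // Pre-existing record, e.g. read back from disk after an upgrade or restart.
    ghostRecords["uid-1"] = GhostRecord{SocketFile: "/pods/abc/volumes/kubernetes.io~empty-dir/sockets/launcher-sock", UID: "uid-1"}
    // Identical path: accepted (prints <nil>).
    fmt.Println(addGhostRecord("vm", "ns", "/pods/abc/volumes/kubernetes.io~empty-dir/sockets/launcher-sock", "uid-1"))
    // Any byte-level difference in the path: rejected with the error above.
    fmt.Println(addGhostRecord("vm", "ns", "/pods/xyz/volumes/kubernetes.io~empty-dir/sockets/launcher-sock", "uid-1"))
}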
The problem is caused by a difference in how 4.17.27 and 4.18.13 save the ghost record data: 4.17.27 writes the socket path with a single slash (/pods/), whereas 4.18.13 writes it with a double slash (//pods/).
Newly started VM in 4.18.13:
# cat /var/run/kubevirt-private/ghost-records/467f0c19-33d5-4a7e-af0d-ea0e9ec29047
{"name":"rhel9-apricot-mastodon-79","namespace":"nijin-cnv","socketFile":"//pods/260a2291-0a05-48f5-8b64-2f320d84d4c8/volumes/kubernetes.io~empty-dir/sockets/launcher-sock","uid":"467f0c19-33d5-4a7e-af0d-ea0e9ec29047"}
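A small, self-contained Go illustration (not KubeVirt code) of why the pre-upgrade record and the newly derived path fail the exact string comparison even though they point at the same socket: the two path strings are equivalent once cleaned, but they are not equal as raw strings.

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    // Socket path as stored in the ghost record written by 4.17.27 (single slash).
    oldRecord := "/pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets/launcher-sock"
    // Socket path as derived at runtime by 4.18.13 (double slash).
    newDerived := "//pods/fb404e62-1665-4498-be76-d6d01263860e/volumes/kubernetes.io~empty-dir/sockets/launcher-sock"

    fmt.Println("raw strings equal:  ", oldRecord == newDerived)                                 // false -> ghost record check fails
    fmt.Println("cleaned paths equal:", filepath.Clean(oldRecord) == filepath.Clean(newDerived)) // true  -> same file on disk
}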
The difference comes from a recent patch applied in 4.17.27 [1]. However, this change hasn't landed in the 4.18 stable channel, where the latest version is 4.18.13. The corresponding change [2] is present in 4.18.16, which is available in the candidate channel, and I cannot reproduce the problem after changing the virt-handler image version to 4.18.16.
[1] https://github.com/kubevirt/kubevirt/pull/15522/commits/07795d383a0f882d6525a196c3f4bd80fcbec35e
[2] https://github.com/kubevirt/kubevirt/pull/15418/commits/8d5ea940a29a60f0e3e6e31d05ba5a922ad89d89
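For context, the sketch below shows one common way a leading double slash can creep into a constructed path in Go: naive string concatenation with a root prefix versus filepath.Join. This is only an illustration of the kind of formatting difference involved; it is not a claim about what the referenced patches [1] and [2] actually change.

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    root := "/" // hypothetical base prefix
    rel := "/pods/260a2291-0a05-48f5-8b64-2f320d84d4c8/volumes/kubernetes.io~empty-dir/sockets/launcher-sock"

    fmt.Println(root + rel)               // //pods/... (double slash, like the 4.18.13 record)
    fmt.Println(filepath.Join(root, rel)) // /pods/...  (single slash, like the 4.17.27 record)
}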
Version-Release number of selected component (if applicable):
OpenShift Virtualization 4.18.13
How reproducible:
100%
Steps to Reproduce:
1. Create an OpenShift Virtualization 4.17.27 cluster and start a few VMs.
2. Upgrade the cluster to the latest 4.18 release, which is 4.18.13.
3. After the upgrade, VM live migration fails. Check the virt-handler logs for the error "unable to create virt-launcher client connection: can not add ghost record when entry already exists with differing socket file location".
Actual results:
All user actions on existing VMs, such as live migration and shutdown, fail because virt-handler's communication with virt-launcher is broken. This causes delays in tight upgrade windows and requires manual intervention to resolve.
Expected results:
virt-handler continues to communicate with virt-launcher after the upgrade, and VM operations such as live migration and shutdown succeed.
Additional info:
As mentioned above, this should be fixed in the next OpenShift Virtualization stable release. However, since we already have an affected customer (and may see more), it would be good to get an acknowledgement from engineering and QE verification.