-
Bug
-
Resolution: Unresolved
-
Blocker
-
None
-
CNV v4.18.17
-
None
-
Quality / Stability / Reliability
-
5
-
False
-
-
False
-
None
-
-
CNV Storage Sprint 283, CNV Storage Sprint 284, CNV Storage Sprint 285
-
Important
-
Customer Escalated, Customer Reported
-
None
Description of problem:
If there are high i/o activity in Windows VMs, the filesystem freeze may take more than 5 seconds to complete. In this case, the vmsnapshot is marked as "QuiesceFailed" although the freeze was successful later:
The virt-controller logs have following events at the time of event:
2025-12-04T19:36:20.443068912Z {"component":"virt-controller","level":"info","msg":"Freeze VMI win2k19-fuchsia-pig-14","pos":"virtualmachineinstance_expansion.go:144","timestamp":"2025-12-04T19:36:20.414101Z"}
2025-12-04T19:36:25.635087403Z {"component":"virt-controller","level":"info","msg":"Freezing vmi win2k19-fuchsia-pig-14 took 5.220903423s","pos":"util.go:44","timestamp":"2025-12-04T19:36:25.634982Z"}
2025-12-04T19:36:25.635087403Z {"component":"virt-controller","level":"error","msg":"Failed freezing vm win2k19-fuchsia-pig-14: Internal error occurred: unexpected return code 400 (400 Bad Request), message: server error. command Freeze failed: \"LibvirtError(Code=86, Domain=10, Message='Guest agent is not responding: Guest agent not available for now')\"","pos":"source.go:353","timestamp":"2025-12-04T19:36:25.635052Z"}
Following is the virt-launcher logs reporting the failure:
2025-12-04T19:36:20.601184269Z {"component":"virt-launcher","level":"info","msg":"dom=0x7f75ac005ae0, (VM: name=nijin-cnv_win2k19-fuchsia-pig-14, uuid=a85f2e53-9331-5a89-a65a-e896c2e71e69), cmd={\"execute\":\"guest-fsfreeze-freeze\"}, timeout=-1, flags=0x0","pos":"virDomainQemuAgentCommand:292","subcomponent":"libvirt","thread":"33","timestamp":"2025-12-04T19:36:20.601000Z"}
2025-12-04T19:36:20.603434503Z {"component":"virt-launcher","level":"info","msg":"Send command '{\"execute\":\"guest-fsfreeze-freeze\"}' for write, seconds = -1","pos":"qemuAgentCommandFull:1033","subcomponent":"libvirt","thread":"33","timestamp":"2025-12-04T19:36:20.603000Z"}
2025-12-04T19:36:25.621186672Z {"component":"virt-launcher","level":"error","msg":"Failed to freeze vmi, virError(Code=86, Domain=10, Message='Guest agent is not responding: Guest agent not available for now')","pos":"manager.go:1759","timestamp":"2025-12-04T19:36:25.621072Z"}
Immediately after 3 seconds, the "guest-fsfreeze-status" shows the freeze status as successful and returned "frozen":
2025-12-04T19:36:28.124403964Z {"component":"virt-launcher","level":"info","msg":"Send command '{\"execute\":\"guest-fsfreeze-status\"}' for write, seconds = -1","pos":"qemuAgentCommandFull:1033","subcomponent":"libvirt","thread":"35","timestamp":"2025-12-04T19:36:28.124000Z"}
2025-12-04T19:36:28.124878268Z {"component":"virt-launcher","level":"info","msg":"Line [{\"return\": \"frozen\"}]","pos":"qemuAgentIOProcessLine:212","subcomponent":"libvirt","thread":"94","timestamp":"2025-12-04T19:36:28.124000Z"}
The qemu-ga logs also shows successful freeze event:
1764876980.587270: debug: disabling command: guest-get-load
QEMU Guest Agent VSS Provider[4504]: requester_freeze begin
QEMU Guest Agent VSS Provider[4504]: get_vss_backup_type begin
QEMU Guest Agent VSS Provider[4504]: get_reg_dword_value begin
QEMU Guest Agent VSS Provider[4504]: get_reg_dword_value end
QEMU Guest Agent VSS Provider[4504]: get_vss_backup_type end
....
.....
QEMU Guest Agent VSS Provider[4504]: requester_freeze do snapshot set
QEMU Guest Agent VSS Provider[4300]: DllMain begin, reason = 2
QEMU Guest Agent VSS Provider[4300]: DllMain end
QEMU Guest Agent VSS Provider[4504]: requester_freeze end successful <===
QEMU Guest Agent VSS Provider[4504]: requester_thaw begin
QEMU Guest Agent VSS Provider[4504]: WaitForAsync begin
1764876994.447025: warning: logging re-enabled due to filesystem unfreeze 1764876994.447025: debug: enabling command: guest-get-time
The unfreeze was also successful:
2025-12-04T19:36:34.216719147Z {"component":"virt-launcher","level":"info","msg":"dom=0x7f75b00035a0, (VM: name=nijin-cnv_win2k19-fuchsia-pig-14, uuid=a85f2e53-9331-5a89-a65a-e896c2e71e69), cmd={\"execute\":\"guest-fsfreeze-thaw\"}, timeout=-1, flags=0x0","pos":"virDomainQemuAgentCommand:292","subcomponent":"libvirt","thread":"34","timestamp":"2025-12-04T19:36:34.216000Z"}
2025-12-04T19:36:34.470180474Z {"component":"virt-launcher","level":"info","msg":"Line [{\"return\": \"thawed\"}]","pos":"qemuAgentIOProcessLine:212","subcomponent":"libvirt","thread":"94","timestamp":"2025-12-04T19:36:34.470000Z"}
However, the VM snapshot have status "QuiesceFailed":
% oc get vmsnapshot win2k19-fuchsia-pig-14-snapshot-20251205-010539 -o yaml |yq '.status.indications'
- GuestAgent
- Online
- QuiesceFailed
Looks like the default QEMU_AGENT_WAIT_TIME is not enough for the freeze request.
Version-Release number of selected component (if applicable):
OpenShift Virtualization 4.18.17
How reproducible:
Observed in customer environment and reproducible in the lab
Steps to Reproduce:
1. Used the attached Powershell script to do some parallel writes. & C:\IOLoadTestParallel.ps1 -FileSizeMB 100 -DurationSeconds 300 2. Took the snapshot of the Windows VM. The vmsnapshot status was "QuiesceFailed"
Actual results:
VM snapshot shows "QuiesceFailed" if the freeze takes more than 5 seconds in Windows VMs
Expected results:
Since the freeze was successful later, I think we should initiate the freeze request with a longer timeout.
Additional info:
Attaching virt-controller, virt-launcher and qemu-ga logs
- duplicates
-
CNV-74868 Snapshot incorrectly labeled "QuiesceFailed" when freeze succeeds after retries (False Negative)
-
- Closed
-
- is cloned by
-
CNV-79424 [4.21] VM snapshot shows "QuiesceFailed" if the freeze operation takes more than 5 seconds in Windows VMs
-
- POST
-
-
CNV-79425 [4.20] VM snapshot shows "QuiesceFailed" if the freeze operation takes more than 5 seconds in Windows VMs
-
- POST
-
-
CNV-78509 [4.19] VM snapshot shows "QuiesceFailed" if the freeze operation takes more than 5 seconds in Windows VMs
-
- Closed
-
- is duplicated by
-
CNV-74868 Snapshot incorrectly labeled "QuiesceFailed" when freeze succeeds after retries (False Negative)
-
- Closed
-
- is related to
-
CNV-78635 [CLOSED LOOP for] VM snapshot shows "QuiesceFailed" if the freeze operation takes more than 5 seconds in Windows VMs
-
- New
-
- links to