Uploaded image for project: 'OpenShift Virtualization'
  1. OpenShift Virtualization
  2. CNV-79425

[4.20] VM snapshot shows "QuiesceFailed" if the freeze operation takes more than 5 seconds in Windows VMs

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Blocker Blocker
    • None
    • CNV v4.18.17
    • None
    • Important
    • Customer Escalated, Customer Reported
    • None

      Description of problem:

      If there are high i/o activity in Windows VMs, the filesystem freeze may take more than 5 seconds to complete. In this case, the vmsnapshot is marked as "QuiesceFailed" although the freeze was successful later:

      The virt-controller logs have following events at the time of event:

       

      2025-12-04T19:36:20.443068912Z {"component":"virt-controller","level":"info","msg":"Freeze VMI win2k19-fuchsia-pig-14","pos":"virtualmachineinstance_expansion.go:144","timestamp":"2025-12-04T19:36:20.414101Z"}
      2025-12-04T19:36:25.635087403Z {"component":"virt-controller","level":"info","msg":"Freezing vmi win2k19-fuchsia-pig-14 took 5.220903423s","pos":"util.go:44","timestamp":"2025-12-04T19:36:25.634982Z"}
       
      2025-12-04T19:36:25.635087403Z {"component":"virt-controller","level":"error","msg":"Failed freezing vm win2k19-fuchsia-pig-14: Internal error occurred: unexpected return code 400 (400 Bad Request), message: server error. command Freeze failed: \"LibvirtError(Code=86, Domain=10, Message='Guest agent is not responding: Guest agent not available for now')\"","pos":"source.go:353","timestamp":"2025-12-04T19:36:25.635052Z"}
      

      Following is the virt-launcher logs reporting the failure:

       

      2025-12-04T19:36:20.601184269Z {"component":"virt-launcher","level":"info","msg":"dom=0x7f75ac005ae0, (VM: name=nijin-cnv_win2k19-fuchsia-pig-14, uuid=a85f2e53-9331-5a89-a65a-e896c2e71e69), cmd={\"execute\":\"guest-fsfreeze-freeze\"}, timeout=-1, flags=0x0","pos":"virDomainQemuAgentCommand:292","subcomponent":"libvirt","thread":"33","timestamp":"2025-12-04T19:36:20.601000Z"}
      2025-12-04T19:36:20.603434503Z {"component":"virt-launcher","level":"info","msg":"Send command '{\"execute\":\"guest-fsfreeze-freeze\"}' for write, seconds = -1","pos":"qemuAgentCommandFull:1033","subcomponent":"libvirt","thread":"33","timestamp":"2025-12-04T19:36:20.603000Z"}
       
      2025-12-04T19:36:25.621186672Z {"component":"virt-launcher","level":"error","msg":"Failed to freeze vmi, virError(Code=86, Domain=10, Message='Guest agent is not responding: Guest agent not available for now')","pos":"manager.go:1759","timestamp":"2025-12-04T19:36:25.621072Z"}
      

      Immediately after 3 seconds, the "guest-fsfreeze-status" shows the freeze status as successful and returned "frozen":

      2025-12-04T19:36:28.124403964Z {"component":"virt-launcher","level":"info","msg":"Send command '{\"execute\":\"guest-fsfreeze-status\"}' for write, seconds = -1","pos":"qemuAgentCommandFull:1033","subcomponent":"libvirt","thread":"35","timestamp":"2025-12-04T19:36:28.124000Z"}
      
      2025-12-04T19:36:28.124878268Z {"component":"virt-launcher","level":"info","msg":"Line [{\"return\": \"frozen\"}]","pos":"qemuAgentIOProcessLine:212","subcomponent":"libvirt","thread":"94","timestamp":"2025-12-04T19:36:28.124000Z"}

      The qemu-ga logs also shows successful freeze event:

       

      1764876980.587270: debug: disabling command: guest-get-load
      QEMU Guest Agent VSS Provider[4504]: requester_freeze begin
      QEMU Guest Agent VSS Provider[4504]: get_vss_backup_type begin
      QEMU Guest Agent VSS Provider[4504]: get_reg_dword_value begin
      QEMU Guest Agent VSS Provider[4504]: get_reg_dword_value end
      QEMU Guest Agent VSS Provider[4504]: get_vss_backup_type end
      ....
      .....
      QEMU Guest Agent VSS Provider[4504]: requester_freeze do snapshot set
      QEMU Guest Agent VSS Provider[4300]: DllMain begin, reason = 2
      QEMU Guest Agent VSS Provider[4300]: DllMain end
      QEMU Guest Agent VSS Provider[4504]: requester_freeze end successful    <===
      QEMU Guest Agent VSS Provider[4504]: requester_thaw begin 
      QEMU Guest Agent VSS Provider[4504]: WaitForAsync begin
      1764876994.447025: warning: logging re-enabled due to filesystem unfreeze 1764876994.447025: debug: enabling command: guest-get-time
      

      The unfreeze was also successful:

       

      2025-12-04T19:36:34.216719147Z {"component":"virt-launcher","level":"info","msg":"dom=0x7f75b00035a0, (VM: name=nijin-cnv_win2k19-fuchsia-pig-14, uuid=a85f2e53-9331-5a89-a65a-e896c2e71e69), cmd={\"execute\":\"guest-fsfreeze-thaw\"}, timeout=-1, flags=0x0","pos":"virDomainQemuAgentCommand:292","subcomponent":"libvirt","thread":"34","timestamp":"2025-12-04T19:36:34.216000Z"}
      2025-12-04T19:36:34.470180474Z {"component":"virt-launcher","level":"info","msg":"Line [{\"return\": \"thawed\"}]","pos":"qemuAgentIOProcessLine:212","subcomponent":"libvirt","thread":"94","timestamp":"2025-12-04T19:36:34.470000Z"}
      

      However, the VM snapshot have status "QuiesceFailed":

       % oc get vmsnapshot win2k19-fuchsia-pig-14-snapshot-20251205-010539 -o yaml |yq '.status.indications'
      - GuestAgent
      - Online
      - QuiesceFailed

      Looks like the default QEMU_AGENT_WAIT_TIME is not enough for the freeze request.

      Version-Release number of selected component (if applicable):

      OpenShift Virtualization 4.18.17

      How reproducible:

      Observed in customer environment and reproducible in the lab

      Steps to Reproduce:

      1. Used the attached Powershell script to do some parallel writes.
      
      & C:\IOLoadTestParallel.ps1 -FileSizeMB 100 -DurationSeconds 300
       
      2. Took the snapshot of the Windows VM. The vmsnapshot status was "QuiesceFailed"
      

      Actual results:

      VM snapshot shows "QuiesceFailed" if the freeze takes more than 5 seconds in Windows VMs

      Expected results:

      Since the freeze was successful later, I think we should initiate the freeze request with a longer timeout.

      Additional info:

      Attaching virt-controller, virt-launcher and qemu-ga logs

              rh-ee-nassouli Noam Assouline
              rhn-support-nashok Nijin Ashok
              Natalie Gavrielov Natalie Gavrielov
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Created:
                Updated: