RHEL-7542

The path to the guest agent socket file can become too long and cause problems

    • libvirt-9.9.0-1.el9
    • Important
    • ZStream
    • rhel-sst-virtualization
    • ssg_virtualization
    • QE ack, Dev ack
    • Red Hat OpenShift Virtualization
    • 9.7.0

      +++ This bug was initially created as a clone of Bug #2168346 +++

      --------------------------------------------------
      Description of problem:
      --------------------------------------------------
      As part of an OOM investigation, I deliberately tried to hit OOM on a VM by straining its memory with a heavy I/O workload. Something unexpected happened: instead of just an OOM kill, which causes a QEMU reboot, the VM failed to run again:

      NAME            AGE   STATUS             READY
      rhel82-vm0001   26h   CrashLoopBackOff   False
      rhel82-vm0002   26h   Running            True
      rhel82-vm0003   26h   Stopped            False

      Pod logs:

      {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"rhel82-vm0001","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-212-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:49.591084Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"} {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel82-vm0001, open /run/libvirt/qemu/run/default_rhel82-vm0001.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-02-08T18:36:49.664614Z"} {"component":"virt-launcher","level":"error","msg":"internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-213-default_rhel82-vm000/org.qemu.guest_agent.0' too long","pos":"qemuOpenChrChardevUNIXSocket:5223","subcomponent":"libvirt","thread":"30","timestamp":"2023-02-08T18:36:50.627000Z"} {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to start VirtualMachineInstance with flags 0.","name":"rhel82-vm0001","namespace":"default","pos":"manager.go:880","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-213-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:50.628244Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"} {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"rhel82-vm0001","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-213-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:50.628304Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"} {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel82-vm0001, open /run/libvirt/qemu/run/default_rhel82-vm0001.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-02-08T18:36:50.663725Z"} {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel82-vm0001, open /run/libvirt/qemu/run/default_rhel82-vm0001.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-02-08T18:36:51.663684Z"} {"component":"virt-launcher","level":"error","msg":"internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0' too long","pos":"qemuOpenChrChardevUNIXSocket:5223","subcomponent":"libvirt","thread":"29","timestamp":"2023-02-08T18:36:51.663000Z"} {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to start VirtualMachineInstance with flags 0.","name":"rhel82-vm0001","namespace":"default","pos":"manager.go:880","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:51.664818Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"} {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"rhel82-vm0001","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm00 
0/org.qemu.guest_agent.0' too long')","timestamp":"2023-02-08T18:36:51.664884Z","uid":"0a6404c5-2ba7-4cc0-ad2a-307018174023"}

      OOM record:
      [Wed Feb 8 12:19:15 2023] worker invoked oom-killer: gfp_mask=0x620100(GFP_NOIO|__GFP_HARDWALL|__GFP_WRITE), order=0, oom_score_adj=979
      [Wed Feb 8 12:19:15 2023] oom_kill_process.cold.32+0xb/0x10
      [Wed Feb 8 12:19:15 2023] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
      [Wed Feb 8 12:19:15 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-a2021e5dd93338ba5e39cef21c773838a294ab95a466c7887054e9e24f72e8e4.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6553054d_e923_4628_b36c_c6754eb6e0b1.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6553054d_e923_4628_b36c_c6754eb6e0b1.slice/crio-a2021e5dd93338ba5e39cef21c773838a294ab95a466c7887054e9e24f72e8e4.scope,task=qemu-kvm,pid=3196344,uid=107
      [Wed Feb 8 12:19:15 2023] Memory cgroup out of memory: Killed process 3196344 (qemu-kvm) total-vm:64560756kB, anon-rss:58285188kB, file-rss:17672kB, shmem-rss:4kB, UID:107 pgtables:115428kB oom_score_adj:979
      [Wed Feb 8 12:19:15 2023] oom_reaper: reaped process 3196344 (qemu-kvm), now anon-rss:0kB, file-rss:68kB, shmem-rss:4kB

      --------------------------------------------------
      Version-Release number of selected component (if applicable):
      --------------------------------------------------
      kubevirt-hyperconverged-operator.v4.11.3
      local-storage-operator.v4.12.0-202301042354
      mcg-operator.v4.11.4
      ocs-operator.v4.11.4

      --------------------------------------------------
      How reproducible:
      --------------------------------------------------

      No idea, but the current state is persistent throughout.

      --------------------------------------------------
      Steps to Reproduce:
      --------------------------------------------------

      1. Strain the VM with a heavy-duty workload (see the memory-straining sketch after this list)
      2. Reach OOM
      3. Repeat
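
      For step 1, a hypothetical example of a memory-straining workload (not part of the original report; assumes it is compiled and run inside the guest). It keeps allocating and touching memory until the allocator gives up, which inside a constrained memory cgroup typically ends with the OOM killer firing, as in the record below:

      /* Hypothetical memory-straining helper (illustration only, not from the report):
       * allocate 64 MiB chunks and touch every byte so the memory becomes resident.
       * Inside a memory-limited cgroup this usually triggers the OOM killer. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      int main(void)
      {
          const size_t chunk = 64UL << 20;   /* 64 MiB per allocation */
          size_t total = 0;

          for (;;) {
              char *p = malloc(chunk);
              if (p == NULL)
                  break;                     /* allocator refused before the OOM killer fired */
              memset(p, 0xA5, chunk);        /* touch the pages so they count toward RSS */
              total += chunk;
              fprintf(stderr, "allocated %zu MiB\n", total >> 20);
          }
          return 0;
      }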

      --------------------------------------------------
      Actual results:
      --------------------------------------------------
      The VM no longer boots.

      --------------------------------------------------
      Expected results:
      --------------------------------------------------
      The VM reboots and starts normally.

      --------------------------------------------------
      logs:
      --------------------------------------------------
      I collected both the must-gather and the SOS report from the specific node that ran the VM:
      http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/vm_doesnt_boot_after_oom.tar.gz

      — Additional comment from Yan Du on 2023-02-15 13:14:50 UTC —

      It doesn't look like a storage component issue; moving to the Virt component.

      — Additional comment from Jed Lejosne on 2023-02-27 18:55:41 UTC —

      There's an interesting error there:
      internal error: UNIX socket path '/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-214-default_rhel82-vm000/org.qemu.guest_agent.0' too long

      This is indeed 108 characters long, one more than the 107 allowed by Linux. I think "214" here is the number of times the VM has been rebooted.
      This means VMs can only be rebooted 98 times. We need to address that.
      I don't see why, as far as libvirt is concerned, VMs couldn't just be called "vm" instead of "<namespace>_<VMI name>".
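
      For reference, the limit comes from the kernel's UNIX-socket address structure: on Linux, struct sockaddr_un.sun_path is 108 bytes, so a bindable socket path can hold at most 107 characters plus the terminating NUL. A minimal C sketch (not taken from libvirt or KubeVirt; the path is copied verbatim from the error above) that checks the reported path against that limit:

      /* Minimal sketch: roughly the constraint behind the
       * "UNIX socket path ... too long" error quoted above. */
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <sys/un.h>

      int main(void)
      {
          /* Path copied verbatim from the error message above. */
          const char *path =
              "/var/run/kubevirt-private/libvirt/qemu/channel/target/"
              "domain-214-default_rhel82-vm000/org.qemu.guest_agent.0";

          struct sockaddr_un sun;
          size_t len = strlen(path);
          size_t max = sizeof(sun.sun_path) - 1;   /* 107 usable characters on Linux */

          printf("path length: %zu, usable sun_path length: %zu -> %s\n",
                 len, max, len > max ? "too long" : "fits");
          return 0;
      }

      Compiled and run on Linux, this should report a length of 108 against a usable limit of 107, matching the failure in the log.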

      — Additional comment from Jed Lejosne on 2023-02-27 19:44:51 UTC —

      (In reply to Jed Lejosne from comment #2)
      > [...]
      > This means VMs can only be rebooted 98 times. We need to address that.

      This is actually incorrect: VMs need to actually crash for that number to increase, so it's not such a big deal.
      However, @bbenshab@redhat.com, please give more information on how you managed to trigger the OOM killer.
      If that was solely by doing things from inside the guest, then that's a problem. No matter what guests do, it should never cause virt-launcher to run out of memory...

      — Additional comment from Boaz on 2023-02-28 10:29:42 UTC —

      @jlejosne@redhat.com this was an investigation of a customer case that is described here:
      https://docs.google.com/document/d/1bMWAkw7Scp98XgXmtVH-vRD_YOjB2xUW7KUcs8SeUSg

              mprivozn@redhat.com Michal Privoznik
              jelejosne Jed Lejosne
              Yingshun Cui
              Fangge Jin Fangge Jin