-
Bug
-
Resolution: Unresolved
-
rhos-ops-day1day2-upgrades
-
Critical
We are performing a large-scale OpenStack adoption from RHOSP 17.1 to RHOSO 18.0 in an environment with 250+ nodes and 10k+ VMs.
During data plane adoption, while the tripleo-cleanup service was running across compute nodes, a few running instances were unexpectedly terminated. The affected VMs were not part of the cleanup operation and were in ACTIVE state at the time.
On the compute nodes, libvirt logs show the corresponding qemu-kvm processes receiving SIGTERM (signal 15), after which the instances were forcefully shut down. No user-initiated actions were performed on these instances.
This behavior occurs during execution of the tripleo-cleanup service and results in unexpected workload disruption during adoption. We observed that approximately 7 of the 10k+ VMs were terminated during this time.
Logs for one of the terminated instances:
// vm got a signal 15 from pid 93929
[root@computer660-63 ~]# head -20 /var/log/containers/libvirt/qemu/instance-0000109a.log
2026-01-11T16:50:57.629336Z qemu-kvm: terminating on signal 15 from pid 93929 (<unknown process>)
2026-01-11 16:50:57.881+0000: shutting down, reason=shutdown

// pid 93929 has selinux context "container_runtime_t"
[root@computer660-63 ~]# grep "93929" /var/log/audit/audit.log | head -20
type=AVC msg=audit(1768150257.628:1250469): avc: denied { search } for pid=93649 comm="qemu-kvm" name="93929" dev="proc" ino=489139156 scontext=system_u:system_r:svirt_t:s0:c83,c933 tcontext=unconfined_u:unconfined_r:container_runtime_t:s0 tclass=dir permissive=0

// more logs
[root@computer660-63 ~]# sudo grep "instance-0000109a" /var/log/containers/libvirt/virtqemud.log
2026-01-11 16:50:57.629+0000: 93655: debug : qemuProcessHandleShutdown:590 : Transitioned guest instance-0000109a to shutdown state
2026-01-11 16:50:57.630+0000: 93655: debug : qemuProcessKill:8247 : vm=0x7fe088022af0 name=instance-0000109a pid=93649 flags=0x2
2026-01-11 16:50:57.680+0000: 93655: debug : qemuMonitorIO:541 : Error on monitor <null> mon=0x7fe0a40285d0 vm=0x7fe088022af0 name=instance-0000109a
2026-01-11 16:50:57.680+0000: 93655: debug : qemuMonitorIO:563 : Triggering EOF callback mon=0x7fe0a40285d0 vm=0x7fe088022af0 name=instance-0000109a
2026-01-11 16:50:57.680+0000: 93655: debug : qemuProcessHandleMonitorEOF:310 : Received EOF on 0x7fe088022af0 'instance-0000109a'
2026-01-11 16:50:57.680+0000: 920512: debug : qemuProcessKill:8247 : vm=0x7fe088022af0 name=instance-0000109a pid=93649 flags=0x1
2026-01-11 16:50:57.881+0000: 920512: debug : qemuProcessStop:8331 : Shutting down vm=0x7fe088022af0 name=instance-0000109a id=6 pid=93649, reason=shutdown, asyncJob=none, flags=0x0
2026-01-11 16:50:57.881+0000: 920512: debug : qemuDomainLogAppendMessage:7108 : Append log message (vm='instance-0000109a' message='2026-01-11 16:50:57.881+0000: shutting down, reason=shutdown
2026-01-11 16:50:57.883+0000: 920512: debug : qemuProcessKill:8247 : vm=0x7fe088022af0 name=instance-0000109a pid=93649 flags=0x5
2026-01-11 16:50:57.883+0000: 920512: debug : qemuDomainCleanupRun:7558 : driver=0x7fe0680212d0, vm=instance-0000109a
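To check the remaining compute nodes for the same pattern, the manual correlation above (signal sender PID in the qemu log, then the SELinux context recorded for that PID in the audit log) could be scripted. The following is only a rough triage sketch, not part of any adoption tooling: the function names and regular expressions are hypothetical and assume the exact line formats shown in the logs above. It also assumes, as the manual analysis does, that the tcontext of an AVC record on /proc/<pid> reflects the context of the sending process.

```python
import re
from typing import Iterable, List, NamedTuple, Set

# Line written by QEMU on a terminating signal, e.g.
# "2026-01-11T16:50:57.629336Z qemu-kvm: terminating on signal 15 from pid 93929 (...)"
SIGNAL_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z) qemu-kvm: terminating on signal "
    r"(?P<sig>\d+) from pid (?P<pid>\d+)"
)
# Fields of an audit AVC record: the /proc entry name and the target context.
AVC_NAME_RE = re.compile(r'\bname="(?P<name>\d+)"')
AVC_TCONTEXT_RE = re.compile(r"\btcontext=(?P<tcontext>\S+)")


class SignalEvent(NamedTuple):
    timestamp: str
    signal: int
    sender_pid: int


def find_signal_events(qemu_log_text: str) -> List[SignalEvent]:
    """Return every 'terminating on signal' event found in a qemu instance log."""
    return [
        SignalEvent(m["ts"], int(m["sig"]), int(m["pid"]))
        for m in SIGNAL_RE.finditer(qemu_log_text)
    ]


def sender_contexts(audit_lines: Iterable[str], sender_pid: int) -> Set[str]:
    """Collect SELinux target contexts from AVC records whose name= is the sender PID."""
    contexts = set()
    for line in audit_lines:
        if "type=AVC" not in line:
            continue
        name = AVC_NAME_RE.search(line)
        tcontext = AVC_TCONTEXT_RE.search(line)
        if name and tcontext and int(name["name"]) == sender_pid:
            contexts.add(tcontext["tcontext"])
    return contexts


if __name__ == "__main__":
    # Sample lines copied from the logs in this report.
    qemu_log = (
        "2026-01-11T16:50:57.629336Z qemu-kvm: terminating on signal 15 "
        "from pid 93929 (<unknown process>)\n"
        "2026-01-11 16:50:57.881+0000: shutting down, reason=shutdown\n"
    )
    audit_log = [
        'type=AVC msg=audit(1768150257.628:1250469): avc: denied { search } '
        'for pid=93649 comm="qemu-kvm" name="93929" dev="proc" ino=489139156 '
        "scontext=system_u:system_r:svirt_t:s0:c83,c933 "
        "tcontext=unconfined_u:unconfined_r:container_runtime_t:s0 "
        "tclass=dir permissive=0"
    ]
    for event in find_signal_events(qemu_log):
        print(event, sender_contexts(audit_log, event.sender_pid))
```

In practice the inputs would be read from the files under /var/log/containers/libvirt/qemu/ and from /var/log/audit/audit.log on each compute node. An empty context set for a sender PID only means that no matching AVC record was logged, not that the sender was unconfined.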
Versions:
[root@computer660-63 ~]# rpm -qa podman
podman-4.4.1-22.el9_2.4.x86_64
[root@computer660-63 ~]# rpm -qa conmon
conmon-2.1.7-1.el9_2.x86_64
[root@computer660-63 ~]# rpm -qa crun
crun-1.8.4-1.el9_2.x86_64
[tripleo-admin@computer660-63 ~]$ uname -r
5.14.0-284.144.1.el9_2.x86_64
Actual results:
Intermittent VM terminations during tripleo-cleanup service execution
Expected results:
No disruption to the workload.