View Issue Details

ID: 0003533
Project: Rocky-Linux-8
Category: kernel
View Status: public
Last Update: 2023-06-02 20:36
Reporter: Stuart Gathman
Assigned To: Louis Abel
Priority: urgent
Severity: block
Reproducibility: sometimes
Status: needinfo
Resolution: open
Platform: X86_64
OS: Rocky Linux
OS Version: 8.8
Summary: 0003533: kernel-4.18.0-477.10.1.el8_8.x86_64 locks up qemu-kvm with heavy virtio
Description: qemu-kvm becomes frozen and unkillable under heavy virtio I/O with multiple devices. kill -9 is ineffective. ps -ef hangs; ps -el does not (because it doesn't need to access process memory). atop on the host hangs and requires kill -9.
Naturally, virsh destroy doesn't work either.
All other VMs and the host continue to work normally (unless you manage to lock up another VM).
top on the host shows state 'D' and 0.0 CPU for the wedged qemu-kvm process.
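
The symptom above (a process stuck in state 'D' that tools reading process memory cannot inspect) can be checked for without running ps -ef. The following is a minimal sketch, not part of the original report: it assumes that reading only /proc/<pid>/stat is safe in this situation, in the same way ps -el was observed not to hang. The function names parse_stat_line and find_uninterruptible are hypothetical.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for tasks in state 'D', reading only the stat file.

    Assumption: /proc/<pid>/cmdline is deliberately never opened, because
    it requires access to the target process's memory.
    """
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling find_uninterruptible() on the host returns a list of (pid, command name) pairs; a wedged guest would be expected to show up with the name qemu-kvm. Short-lived 'D' states are normal during ordinary disk I/O, so only a process that stays in the list across repeated calls is of interest.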

To reset:
1. Shut down all working VMs.
2. systemctl poweroff
3. There will be a long wait while systemd tries to kill qemu-kvm. After maybe 10 minutes it will time out and report on the console that it can't unmount /oldroot (because qemu-kvm has it open for its process image). There should be no more disk activity (hopefully your server has an activity light). At this point, you can hold the power button down to force power off.
4. Power on again.

Mitigation:
Install kernel-4.18.0-425.19.2.el8_7.x86_64 and make it the default boot entry. This kernel is reported as stable; I am testing it now.

From a Fedora 36 guest:
May 29 20:35:20 matrix.gathman.org kernel: Sending NMI from CPU 1 to CPUs 0:
May 29 20:38:19 matrix.gathman.org kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
May 29 20:38:19 matrix.gathman.org kernel: rcu: 0-...0: (6 ticks this GP) idle=85fc/1/0x4000000000000000 softirq=274821528/274821529 fqs=1289617
May 29 20:38:19 matrix.gathman.org kernel: (detected by 1, t=5460152 jiffies, g=584688457, q=2153 ncpus=2)
Steps To Reproduce: Boot the host from kernel-4.18.0-477.10.1.el8_8.x86_64 and run a qemu-kvm guest OS with 2 or more virtio disks.
Do something like running a backup of 10+ GB from one disk to the other.
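
The disk-to-disk load described above could be approximated inside the guest with something like the sketch below. It is an illustration only, not the backup job from the original report: the function names and the mount points in the usage note are hypothetical, and whether this particular I/O pattern triggers the lockup has not been verified.

```python
import os
import shutil


def write_test_file(path, size_bytes, chunk=4 * 1024 * 1024):
    """Create a file of size_bytes filled with a repeated random block."""
    block = os.urandom(chunk)
    remaining = size_bytes
    with open(path, "wb") as fh:
        while remaining > 0:
            n = min(chunk, remaining)
            fh.write(block[:n])
            remaining -= n
        fh.flush()
        os.fsync(fh.fileno())  # push the data out to the virtio disk


def copy_between_disks(src, dst, chunk=4 * 1024 * 1024):
    """Copy src to dst in large chunks and return the size of the copy."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk)
        fout.flush()
        os.fsync(fout.fileno())
    return os.path.getsize(dst)
```

Assuming the two virtio disks are mounted at /mnt/disk1 and /mnt/disk2 (hypothetical paths), write_test_file("/mnt/disk1/payload.bin", 10 * 1024**3) followed by copy_between_disks("/mnt/disk1/payload.bin", "/mnt/disk2/payload.bin") moves roughly 10 GB from one disk to the other. Run it only on a guest you can afford to lose, since recovering from the lockup requires a forced host power-off.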
Tags: kernel, qemu-kvm

Activities

Louis Abel

2023-06-02 20:31

administrator   ~0003637

This issue is occurring on an old version of the -477 kernel. Please attempt to reproduce it using kernel 4.18.0-477.13.1.el8_8.
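
To tell whether a host is still on a build older than the one requested here, the release strings can be compared numerically. This is a rough sketch, assuming release strings of the form shown in this issue (for example the output of uname -r); kernel_key and is_older are hypothetical helper names, and the comparison says nothing about whether the newer build actually fixes the lockup.

```python
import re


def kernel_key(release):
    """Turn '4.18.0-477.10.1.el8_8.x86_64' into ((4, 18, 0), (477, 10, 1))."""
    m = re.search(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", release)
    if m is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    version = tuple(int(p) for p in m.group(1).split("."))
    build = tuple(int(p) for p in m.group(2).split("."))
    return version, build


def is_older(release, reference):
    """True if release sorts strictly before reference."""
    return kernel_key(release) < kernel_key(reference)
```

For example, is_older("4.18.0-477.10.1.el8_8.x86_64", "4.18.0-477.13.1.el8_8") is True, so a host booted from the kernel in the summary would still need the update before retesting.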

Setting to need info.

Stuart Gathman

2023-06-02 20:32

reporter   ~0003638

The journal from the F36 guest was obtained after resetting the host and restarting the VM, then running journalctl -b-3 to see what the last journal entries were. (There were a few more reboots of the VM to fsck filesystems, etc.)

The host had a stack trace in dmesg from about the time the VM froze, but I forgot to save it. :-( (I was in a hurry to get services back up.)

Issue History

Date Modified | Username | Field | Change
2023-06-02 20:25 | Stuart Gathman | New Issue |
2023-06-02 20:25 | Stuart Gathman | Tag Attached: kernel |
2023-06-02 20:25 | Stuart Gathman | Tag Attached: qemu-kvm |
2023-06-02 20:31 | Louis Abel | Assigned To | => Louis Abel
2023-06-02 20:31 | Louis Abel | Status | new => needinfo
2023-06-02 20:31 | Louis Abel | Note Added: 0003637 |
2023-06-02 20:32 | Stuart Gathman | Note Added: 0003638 |