I got extremely confused and decided to re-write the question from scratch, so if some of the comments do not make sense, that is why. Apoplogies to anyone whose time I wasted.
I am having a problem with my entire system freezing when using KVM virtual machines (both with PCI passthrough). The host is Ubuntu 20.04, and this is running on a Threadripper 1950X with an ASRock x399 Taichi motherboard.
It seems that there are a lot of ways to trigger it, but one of the most reliable is to use my Kubuntu VM while playing audio in my Windows VM. It rarely if ever happens when using just the Kubuntu VM. When using the Windows VM alone, it usually happens more slowly, but happens eventually.
Probably not important, but, this entire problem started while watching a video in my Windows VM last night. The guest froze, so I forced it off, and then subsequently the guest brought down the whole system.
In any case, my strategy is to insert descriptions of what I am doing into journalctl, flagged as "###Admin note:" by using
echo '###Admin note: test' | systemd-cat
The output as viewed with journalctl -b -1
after reboot: https://pastebin.com/fi1iR8Zx
The last message I sent to journalctl does not seem to have been saved, but did appear in the terminal:
And the output of journalctl -k -b -1
https://pastebin.com/1t38WsWh
Are there any other logs I should look in? What additional information would be helpful? I was thinking to post info about my PCI devices and my virsh XML file, but I don't want to over-clutter this post with irrelevant stuff.
Edit: I notice that I have a high temperature on something called SMBUSMASTER
. So far I have found only speculation about what exactly that is. If the maximum safe temperature is the same as for CPUTIN
(68C), then I guess I have a problem.
Also, other than PCI passthrough of GPUs, one unusual feature of my system is that Windows' USB devices are all connected to a PCI USB-C card, which is passed through. I don't see anything to implicate it in the log, but seemed worth mentioning.
Possible solution?
I think I may have figuered it out.
In looking over my logs, I noticed many entries like this:
Jul 10 23:43:20 virtland kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Googling that led me to this Ryzen bug: https://bugzilla.kernel.org/show_bug.cgi?id=196683
which led me to this thread: http://forum.asrock.com/forum_posts.asp?TID=11690&title=x370-taichi-c-states
Now I could have sworn I turned off suspend to memory
long ago, but sure enough it was set to auto
. In any case, I turned it off and now I am listening to music from Windows and working in Kubuntu.
Perhaps the setting got changed back to default when I did a firmware update before I set up the host, but I still don't get why it would be a problem. Could it be related to power? We are having a heatwave, and the lights sometimes flicker. I am behind an APC UPS for exactly this reason, but it has been making a lot of clicking noises when I have my air conditioning is on. I certainly was not trying to suspend to RAM at any point, but could a power fluctuation that the UPS did not correct fast enough have somehow triggered something related to this.
In any case, I won't be confident that this is fixed until I have gone a couple days without it, but this is very promising.
Indeed I did speak too soon, here are the logs from that session. I was using Windows and Kubuntu for a few hours without issue, which was a record. The MWAIT errors are still there:
journalctl -b -1
: https://pastebin.com/1feLq19U
Edit: I noticed that the last log seems to cut off well before the crash actually happens, so I reproduced the crash one more time. The output of journalctl -b -1
after rebooting cuts off more than ten minutes before the crash. Fortunately, I was running `journalctl -f, and that actually captured everything until my last comment (the crash happened less than a minute later). Unfortunately, I don't see any errors right at the time of the crash.
I am puzzled as to why the journalctl as saved to the system logs and viewable later cut off ten minutes before the output as sent to a file live. A few seconds I can understand, but ten minutes?
Anyway, for comparison, here are the logs plus my picture of the crashed screen.
Output of
journalctl -b -1
: Cuts off 10 minutes early, but included for comparison. https://pastebin.com/mKu4bBzLText saved from
journalctl -f > file.txt
, which is complete: https://pastebin.com/kSXDRkBp
Edit: Another detail that might matter is that the Windows VM was created with a previous host installation (it was created with Arch, but with an earlier version of QEMU, so that shouldn't be a problem, right?)
Edit: Disabled C-states on my motherboard, no luck
Edit: I set up Kernel Crash Dumps as described here: https://ubuntu.com/server/docs/kernel-crash-dump There is a lot there, but one thing that caught my eye is this:
[ 463.983070] vfio-pci 0000:09:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[ 463.983308] vfio-pci 0000:09:00.0: vfio_ecap_init: hiding ecap 0x1e@0x370
[ 465.131213] vfio-pci 0000:0a:00.0: vfio_ecap_init: hiding ecap 0x19@0x280
...
[ 1047.680144] vfio-pci 0000:43:00.0: enabling device (0000 -> 0003)
[ 1047.680422] vfio-pci 0000:43:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Those are the PCI ids of all the cards I am passing through (43
is to Kubuntu, 09
and 0a
to Windows). Googling hiding ecap
did find threads about problems with similar hardware to mine (AMD CPU + AMD GPU passthrough):
https://forums.gentoo.org/viewtopic-t-1070816-start-0.html
https://forum.level1techs.com/t/rx-470-stuck-in-d3-single-gpu-passthrough/147401
I passed through my RX470 to Windows and was able to shut down or start again without issue for months. I gave up on trying to make it work with Linux long ago due to the AMD reset bug. Is it possible that this bug is showing itself again in a different form?
0 Answers