Question

Rick Koshi

Asked: 2012-02-21 17:16:42 +0800 CST

qemu-kvm virtual machine virtio network freeze under load

The network on my virtual machines freezes under heavy load. I'm using CentOS 6.2 as both host and guest. I'm not using libvirt, just running qemu-kvm directly as follows:

/usr/libexec/qemu-kvm \
   -drive file=/data2/vm/rb-dev2-www1-vm.img,index=0,media=disk,cache=none,if=virtio \
   -boot order=c \
   -m 2G \
   -smp cores=1,threads=2 \
   -vga std \
   -name rb-dev2-www1-vm \
   -vnc :84,password \
   -net nic,vlan=0,macaddr=52:54:20:00:00:54,model=virtio \
   -net tap,vlan=0,ifname=tap84,script=/etc/qemu-ifup \
   -monitor unix:/var/run/vm/rb-dev2-www1-vm.mon,server,nowait \
   -rtc base=utc \
   -device piix3-usb-uhci \
   -device usb-tablet

/etc/qemu-ifup (used by the above command) is a very simple script, containing the following:

#!/bin/sh

sudo /sbin/ifconfig $1 0.0.0.0 promisc up
sudo /usr/sbin/brctl addif br0 $1
sleep 2

And here's the info on br0 and other interfaces:

avl-host3 14# brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.180373f5521a       no              bond0
                                                        tap84
virbr0          8000.525400858961       yes             virbr0-nic
avl-host3 15# ip addr show 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 18:03:73:f5:52:1a brd ff:ff:ff:ff:ff:ff
3: em2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 18:03:73:f5:52:1a brd ff:ff:ff:ff:ff:ff
4: em3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 18:03:73:f5:52:1e brd ff:ff:ff:ff:ff:ff
5: em4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 18:03:73:f5:52:20 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether 18:03:73:f5:52:1a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1a03:73ff:fef5:521a/64 scope link 
       valid_lft forever preferred_lft forever
7: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether 18:03:73:f5:52:1a brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.46/24 brd 172.16.1.255 scope global br0
    inet6 fe80::1a03:73ff:fef5:521a/64 scope link 
       valid_lft forever preferred_lft forever
8: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether 52:54:00:85:89:61 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
9: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 500
    link/ether 52:54:00:85:89:61 brd ff:ff:ff:ff:ff:ff
12: tap84: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 500
    link/ether ba:e8:9b:2a:ff:48 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::b8e8:9bff:fe2a:ff48/64 scope link 
       valid_lft forever preferred_lft forever

bond0 is a bond of em1 and em2.

virbr0 and virbr0-nic are vestigial interfaces left over from CentOS's default installation. They are unused (as far as I know).

The guest runs perfectly until I run a large 'rsync'; the network then freezes after a seemingly random time (usually under a minute). Once it freezes, there is no network activity in or out of the guest. I can still connect to the guest's console via VNC, but the guest cannot send anything out of its network interface. Any attempt to 'ping' from the guest gives a "Destination Host Unreachable" error for three out of every four packets and no reply at all for the fourth.

Sometimes (perhaps two thirds of the time), I can bring the interface back to life by running "service network restart" from the guest's console. If this works (and if I do it before the rsync times out), the rsync resumes. Usually it freezes again within a minute or two. If I keep repeating the restart, the rsync eventually finishes, and I presume the machine goes back to waiting for the next period of heavy load.

Throughout the whole process, there are no console errors and no relevant syslog messages (that I can see) on either the guest or the host.

If the "service network restart" doesn't work the first time, trying again (and again and again) never seems to work. The command completes normally, with normal output, but the interface stays frozen. However, a soft reboot of the guest machine (without restarting qemu-kvm) always seems to bring it back.

I am aware of the "lowest MAC address" assignment problem, where the bridge takes on the MAC address of whichever attached interface has the lowest one. This causes temporary network freezes, but it is definitely not what's happening to me. My freezes are permanent until manual intervention, and you can see from the 'ip addr show' output above that the MAC address used by br0 is that of the physical Ethernet interface.
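
That comparison can be sketched in shell. As far as I understand it, the bridge adopts the numerically lowest MAC among its ports, so checking that value against br0's own address shows whether a tap device could have taken it over. `lowest_mac` is a helper name made up for this example; it expects one lowercase, colon-separated MAC per line, and the values are copied from the 'ip addr show' output above.

```shell
#!/bin/sh
# Hypothetical helper: print the lowest MAC address read from stdin.
# Assumes one lowercase, colon-separated MAC per line; with that fixed
# format, byte-wise (C locale) sort order matches numeric order.
lowest_mac() {
    LC_ALL=C sort | head -n 1
}

# MACs of the two bridge ports listed by 'brctl show': bond0 and tap84.
port_macs='18:03:73:f5:52:1a
ba:e8:9b:2a:ff:48'
bridge_mac='18:03:73:f5:52:1a'   # br0

lowest=$(printf '%s\n' "$port_macs" | lowest_mac)
if [ "$lowest" = "$bridge_mac" ]; then
    echo "br0 MAC matches the lowest port MAC ($lowest)"
else
    echo "br0 MAC ($bridge_mac) differs from the lowest port MAC ($lowest)"
fi
```

On a live host the port list should be readable from sysfs instead of being typed in, for example by looping over the entries in `/sys/class/net/br0/brif/` and reading `/sys/class/net/<port>/address` for each.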

There are no other virtual machines running on the host. I've verified that each virtual machine on the subnet has its own unique MAC address.

I have rebuilt the guest machine several times, and I have tried this on three different host machines (identical hardware, built identically). Oddly, I do have one virtual machine (the second of this series) that never seemed to have the problem: its network never froze while it ran the same rsync during its build. The first build, on a different host, did have the freezing problem, so at the time I assumed I had done something wrong with the first build and that the problem was resolved. Unfortunately, the problem reappeared when I built the third VM. Also unfortunately, I can't do many tests with the working VM, as it's now in production use, and I'm hoping to find the cause of this issue before that machine starts having problems. It's possible that I just got really lucky while running the rsync on the working machine, and that was the one time it didn't freeze.

Of course it's possible that I somehow changed the build scripts without realizing it and re-broke something, but I can't find any such thing.

In any case, I'm hoping someone has some idea what could cause this.

Addendum: Preliminary tests suggest that I don't have the problem if I substitute e1000 for virtio in the first -net flag to qemu-kvm. I don't consider this a solution, but it will do as a stopgap. Has anyone else had (or better yet, solved) this problem with the virtio network driver?
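
For anyone wanting to try the same stopgap, here is a sketch of how the NIC model could be made switchable. Only the `model=` value changes relative to the command above; the `NIC_MODEL` and `NIC_ARGS` variable names are invented for this example, and the rest of the qemu-kvm arguments are left out.

```shell
#!/bin/sh
# NIC_MODEL is a hypothetical variable for this sketch: e1000 is the stopgap,
# virtio is the original setting that freezes under load.
NIC_MODEL="${NIC_MODEL:-e1000}"

# Same first -net flag as in the original command, with only the model swapped.
NIC_ARGS="-net nic,vlan=0,macaddr=52:54:20:00:00:54,model=${NIC_MODEL}"

# Print the flag; a real launcher would pass it (unquoted) to qemu-kvm
# together with the other arguments, which are omitted here.
echo "$NIC_ARGS"
```

Running it with `NIC_MODEL=virtio` in the environment switches back to the original flag, which makes it easy to compare the two models on the same guest.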

1 Answer

inaki · Answer 1 · 2012-02-22T04:50:58+08:00

I'm experiencing a similar problem running qemu-kvm on a Debian machine (though I am running it through libvirt). I triggered the NIC freeze by cloning a disk over FTP to one of the three VMs running on this host; only that VM seems to be affected. The other two VMs and the host keep working fine. To me it also looks like virtio is causing the freeze.

host kernel (Debian Lenny 5.0.6):
Linux host_machine_1 2.6.32-bpo.5-amd64 #1 SMP Thu Oct 21 10:02:18 UTC 2010 x86_64 GNU/Linux

guest kernel (Ubuntu Hardy Heron 8.04 LTS):
Linux virtual_machine_1 2.6.24-26-server #1 SMP Tue Dec 1 18:26:43 UTC 2009 x86_64 GNU/Linux

syslog guest:

Feb 21 09:00:22 virtual_machine_1 kernel: [63114.151904] swapper: page allocation failure. order:1, mode:0x4020
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.151919] Pid: 0, comm: swapper Not tainted 2.6.24-26-server #1
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.151920] 
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.151921] Call Trace:
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.151925]    [__alloc_pages+0x2fd/0x3d0] __alloc_pages+0x2fd/0x3d0
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152256]  [new_slab+0x220/0x260] new_slab+0x220/0x260
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152260]  [__slab_alloc+0x2f5/0x410] __slab_alloc+0x2f5/0x410
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152281]  [virtio_net:__netdev_alloc_skb+0x2b/0x2eb0] __netdev_alloc_skb+0x2b/0x50
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152285]  [virtio_net:__netdev_alloc_skb+0x2b/0x2eb0] __netdev_alloc_skb+0x2b/0x50
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152287]  [__kmalloc_node_track_caller+0x121/0x130] __kmalloc_node_track_caller+0x121/0x130
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152290]  [ipv6:__alloc_skb+0x7b/0x4f0] __alloc_skb+0x7b/0x160
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152293]  [virtio_net:__netdev_alloc_skb+0x2b/0x2eb0] __netdev_alloc_skb+0x2b/0x50
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152312]  [virtio_net:try_fill_recv+0x61/0x1b0] :virtio_net:try_fill_recv+0x61/0x1b0
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152336]  [ktime_get_ts+0x1b/0x50] ktime_get_ts+0x1b/0x50
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152341]  [virtio_net:virtnet_poll+0x18c/0x350] :virtio_net:virtnet_poll+0x18c/0x350
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152346]  [tick_program_event+0x35/0x60] tick_program_event+0x35/0x60
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152355]  [net_rx_action+0x128/0x230] net_rx_action+0x128/0x230
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152358]  [virtio_net:skb_recv_done+0x2c/0x40] :virtio_net:skb_recv_done+0x2c/0x40
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152369]  [__do_softirq+0x75/0xe0] __do_softirq+0x75/0xe0
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152379]  [call_softirq+0x1c/0x30] call_softirq+0x1c/0x30
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152386]  [do_softirq+0x35/0x90] do_softirq+0x35/0x90
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152389]  [irq_exit+0x88/0x90] irq_exit+0x88/0x90
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152391]  [do_IRQ+0x80/0x100] do_IRQ+0x80/0x100
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152393]  [default_idle+0x0/0x40] default_idle+0x0/0x40
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152395]  [default_idle+0x0/0x40] default_idle+0x0/0x40
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152396]  [ret_from_intr+0x0/0x0a] ret_from_intr+0x0/0xa
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152398]    [default_idle+0x29/0x40] default_idle+0x29/0x40
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152404]  [cpu_idle+0x48/0xe0] cpu_idle+0x48/0xe0
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152471]  [start_kernel+0x2c5/0x350] start_kernel+0x2c5/0x350
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152475]  [x86_64_start_kernel+0x12e/0x140] _sinittext+0x12e/0x140
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152482] 
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152483] Mem-info:
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152484] Node 0 DMA per-cpu:
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152486] CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152487] Node 0 DMA32 per-cpu:
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152489] CPU    0: Hot: hi:  186, btch:  31 usd: 122   Cold: hi:   62, btch:  15 usd:  55
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152492] Active:35252 inactive:200609 dirty:11290 writeback:193 unstable:0
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152492]  free:1597 slab:11996 mapped:2986 pagetables:3395 bounce:0
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152494] Node 0 DMA free:3988kB min:40kB low:48kB high:60kB active:1320kB inactive:4128kB present:10476kB pages_scanned:0 all_unreclaimable? no
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152497] lowmem_reserve[]: 0 994 994 994
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152499] Node 0 DMA32 free:2400kB min:4012kB low:5012kB high:6016kB active:139688kB inactive:798308kB present:1018064kB pages_scanned:0 all_unreclaimable? no
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152502] lowmem_reserve[]: 0 0 0 0
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152504] Node 0 DMA: 3*4kB 1*8kB 0*16kB 0*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3988kB
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152509] Node 0 DMA32: 412*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2400kB
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152514] Swap cache: add 188, delete 187, find 68/105, race 0+0
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152516] Free swap  = 3084140kB
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152517] Total swap = 3084280kB
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.152517] Free swap:       3084140kB
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.158388] 262139 pages of RAM
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.158390] 4954 reserved pages
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.158391] 269600 pages shared
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.158392] 1 pages swap cached
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.158461] swapper: page allocation failure. order:1, mode:0x4020
Feb 21 09:00:22 virtual_machine_1 kernel: [63114.158464] Pid: 0, comm: swapper Not tainted 2.6.24-26-server #1
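
To check whether another guest is hitting the same thing, one could count these allocation failures in its log. This is only a sketch: `count_alloc_failures` is a name made up here, and the log path is an assumption (Ubuntu normally writes kernel messages to /var/log/syslog, CentOS to /var/log/messages).

```shell
#!/bin/sh
# Hypothetical helper: count kernel "page allocation failure" lines in a log file.
count_alloc_failures() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, which "|| true" keeps from aborting under "set -e".
    grep -c 'page allocation failure' "$1" || true
}

# The log path is an assumption; pass a different file as the first argument.
LOG="${1:-/var/log/syslog}"
if [ -r "$LOG" ]; then
    echo "page allocation failures in $LOG: $(count_alloc_failures "$LOG")"
else
    echo "cannot read $LOG" >&2
fi
```

A count of zero on a guest that still freezes would suggest its symptoms have a different trigger than the trace shown here.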

Guest config (libvirt domain XML):

<domain type='kvm'>  
  <name>virtual_machine_1</name>  
  <uuid>41c1bf76-2aaa-3b32-8868-f28748db750a</uuid>  
  <memory>2097152</memory>  
  <currentMemory>2097152</currentMemory>  
  <vcpu>1</vcpu>  
  <os>  
    <type arch='x86_64' machine='pc'>hvm</type>  
    <boot dev='hd'/>  
  </os>  
  <features>  
    <acpi/>  
    <apic/>  
    <pae/>  
  </features>  
  <clock offset='utc'/>  
  <on_poweroff>destroy</on_poweroff>  
  <on_reboot>restart</on_reboot>  
  <on_crash>restart</on_crash>  
  <devices>  
    <emulator>/usr/bin/kvm</emulator>  
    <disk type='block' device='disk'>  
      <driver name='qemu'/>  
      <source dev='/dev/drbd1'/>  
      <target dev='hda' bus='ide'/>  
      <address type='drive' controller='0' bus='0' unit='0'/>  
    </disk>  
    <disk type='block' device='cdrom'>  
      <driver name='qemu'/>  
      <target dev='hdc' bus='ide'/>  
      <readonly/>  
      <address type='drive' controller='0' bus='1' unit='0'/>  
    </disk>  
    <controller type='ide' index='0'/>  
    <interface type='bridge'>  
      <mac address='52:54:00:2d:95:e5'/>  
      <source bridge='br0'/>  
      <model type='virtio'/>  
    </interface>  
    <serial type='pty'>  
      <target port='0'/>  
    </serial>  
    <console type='pty'>  
      <target port='0'/>  
    </console>  
    <input type='mouse' bus='ps2'/>  
    <graphics type='vnc' port='-1' autoport='yes' keymap='en-us'/>  
    <video>  
      <model type='cirrus' vram='9216' heads='1'/>  
    </video>  
  </devices>  
</domain>

kvm command:

/usr/bin/kvm -S -M pc  
-enable-kvm  
-m 2048  
-smp 1,sockets=1,cores=1,threads=1  
-name virtual_machine_1  
-uuid 41c1bf76-2aaa-3b32-8868-f28748db750a  
-nodefaults  
-chardev socket,id=monitor,path=/var/lib/libvirt/qemu/virtual_machine_1.monitor,server,nowait  
-mon chardev=monitor,mode=readline -rtc base=utc  
-boot c -drive file=/dev/drbd1,if=none,id=drive-ide0-0-0,boot=on,format=raw  
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0  
-drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0  
-device virtio-net-pci,vlan=0,id=net0,mac=52:54:00:2d:95:e5,bus=pci.0,addr=0x3  
-net tap,fd=17,vlan=0,name=hostnet0  
-chardev pty,id=serial0  
-device isa-serial,chardev=serial0 -usb  
-vnc 0.0.0.0:1  
-k en-us  
-vga cirrus  
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4

This post seems to be related:
http://www.mail-archive.com/kvm@vger.kernel.org/msg26033.html

This patch is also mentioned (I haven't tested it yet but it should solve the problem):
http://www.mail-archive.com/kvm@vger.kernel.org/msg26279.html
