How much overhead does x86/x64 virtualization (I'll probably be using VirtualBox, possibly VMware, definitely not paravirtualization) have for each of the following operations, on a Win64 host with a Linux64 guest, using Intel hardware virtualization?
Purely CPU-bound, user mode 64-bit code
Purely CPU-bound, user mode 32-bit code
File I/O to the hard drive (I care mostly about throughput, not latency)
Network I/O
Thread synchronization primitives (mutexes, semaphores, condition variables)
Thread context switches
Atomic operations (using the lock prefix, things like compare-and-swap; a rough sketch of what I mean is below)
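To make that last item concrete, here is roughly the kind of operation I have in mind: a minimal compare-and-swap retry loop (illustrative only, not code from any particular benchmark) that GCC and Clang compile to a lock cmpxchg instruction on x86-64:

```c
/* Illustrative compare-and-swap increment; the CAS below becomes a
 * `lock cmpxchg` instruction on x86-64. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long counter;

static void cas_increment(void)
{
    long old = atomic_load_explicit(&counter, memory_order_relaxed);
    /* Retry until the compare-and-swap succeeds; on failure, `old` is
     * refreshed with the value currently stored in `counter`. */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;
}

int main(void)
{
    for (int i = 0; i < 1000000; i++)
        cas_increment();
    printf("counter = %ld\n", atomic_load(&counter));
    return 0;
}
```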
I'm primarily interested in the hardware assisted x64 case (both Intel and AMD) but wouldn't mind hearing about the unassisted binary translation and x86 (i.e. 32-bit host and guest) cases, too. I'm not interested in paravirtualization.
I found that there isn't a simple and absolute answer for questions like yours. Each virtualization solution behaves differently on specific performance tests. Also, tests like disk I/O throughput can be split into many different tests (read, write, rewrite, ...), and the results will vary from solution to solution and from scenario to scenario. This is why it is not trivial to point to one solution as the fastest for disk I/O, and why there is no absolute answer for labels like "overhead for disk I/O".
It gets more complex when trying to find a relation between the different benchmark tests. None of the solutions I've tested had good performance on micro-operation tests. For example: inside a VM, a single call to gettimeofday() took, on average, 11.5 times more clock cycles to complete than on hardware. The hypervisors are optimized for real-world applications and do not perform well on micro-operations. This may not be a problem for your application, which may behave more like a real-world application. By micro-operation I mean any operation that takes fewer than 1,000 clock cycles to finish (for a 2.6 GHz CPU, 1,000 clock cycles are spent in 385 nanoseconds, or 3.85e-7 seconds).
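For reference, here is a minimal sketch of the kind of micro-operation timing I am describing (not the actual rdtscbench code, just an illustrative loop; it assumes x86-64 and GCC/Clang with __rdtsc() from x86intrin.h):

```c
/* Rough cycles-per-call estimate for gettimeofday(), in the spirit of the
 * micro-operation tests described above. Illustrative only: rdtsc is not a
 * serializing instruction, TSC behavior differs between CPUs and
 * hypervisors, and on Linux gettimeofday() is usually a vDSO call. */
#include <stdio.h>
#include <sys/time.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    enum { ITERS = 1000000 };
    struct timeval tv;

    unsigned long long start = __rdtsc();
    for (int i = 0; i < ITERS; i++)
        gettimeofday(&tv, NULL);
    unsigned long long end = __rdtsc();

    printf("~%.1f cycles per gettimeofday() call\n",
           (double)(end - start) / ITERS);
    return 0;
}
```

Run the same binary on bare metal and inside the VM and compare the two numbers.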
I did extensive benchmark testing on the four main solutions for data center consolidation on the x86 architecture. I ran almost 3,000 tests comparing performance inside VMs with performance on hardware. I called 'overhead' the difference between the maximum performance measured inside the VM(s) and the maximum performance measured on hardware.
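In other words (this is how I read that definition; the article may state the formula slightly differently), for each category:

overhead (%) = (max_hardware - max_VM) / max_hardware * 100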
The solutions:
The guest OSs:
Test Info:
Benchmark Software:
CPU and Memory: Linpack benchmark, for both 32 and 64 bits. This is CPU- and memory-intensive.
Disk I/O and Latency: Bonnie++
Network I/O: Netperf: TCP_STREAM, TCP_RR, TCP_CRR, UDP_RR and UDP_STREAM
Micro-operations: rdtscbench: system calls, inter-process pipe communication
The averages are calculated with the parameters:
CPU and Memory: AVERAGE(HPL32, HPL64)
Disk I/O: AVERAGE(put_block, rewrite, get_block)
Network I/O: AVERAGE(tcp_crr, tcp_rr, tcp_stream, udp_rr, udp_stream)
Micro-operations: AVERAGE(getpid(), sysconf(), gettimeofday(), malloc[1M], malloc[1G], 2pipes[], simplemath[])
For my test scenario, using my metrics, the averages of the results of the four virtualization solutions are:
VM layer overhead, Linux guest:
CPU and Memory: 14.36%
Network I/O: 24.46%
Disk I/O: 8.84%
Disk latency for reading: 2.41 times slower
Micro-operations execution time: 10.84 times slower
VM layer overhead, Windows guest:
CPU and Memory average for both 32 and 64 bits: 13.06%
Network I/O: 35.27%
Disk I/O: 15.20%
Please note that those values are generic and do not reflect specific-case scenarios.
Please take a look at the full article: http://petersenna.com/en/projects/81-performance-overhead-and-comparative-performance-of-4-virtualization-solutions
There are too many variables in your question; however, I can try to narrow it down. Let's assume that you go with VMware ESX and do everything right: the latest CPU with support for virtualization, VMware Tools with paravirtualized storage and network drivers, plenty of memory. Now let's assume that you run a single virtual machine on this setup. From my experience, you should get ~90% of CPU speed for a CPU-bound workload. I cannot tell you much about network speeds, since we are using 1 Gbps links and I can saturate them without a problem; it may be different with a 10 Gbps link, but we do not have any of those. Storage throughput depends on the type of storage: I can get around ~80% of throughput with local storage, but for 1 Gbps NFS it is close to 100%, since networking is the bottleneck there. I cannot tell you about the other metrics; you will need to experiment with your own code.
These numbers are very approximate, and they depend heavily on your load type, your hardware, and your networking. It gets even fuzzier when you run multiple workloads on the same server. But what I'm trying to say is that under ideal conditions you should be able to get as close as 90% of native performance.
Also, from my experience, the much bigger problem for high-performance applications is latency, and this is especially true for client-server applications. We have a computation engine that receives requests from 30+ clients, performs short computations, and returns the results. On bare metal it usually pushes the CPU to 100%, but the same server on VMware can only load the CPU to 60-80%, and this is primarily because of the latency in handling requests/replies.
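To give a sense of what that request/reply pattern looks like in miniature, here is a hypothetical ping-pong latency probe (not our actual engine; the message size and iteration count are made up for illustration). The parent process sends a small "request", a forked child echoes a "reply", and the parent times the round trips. Running it on bare metal and inside the VM shows the kind of per-request latency difference I am talking about:

```c
/* Hypothetical request/reply latency probe: the parent sends a 64-byte
 * "request" over a UNIX socket pair, the child echoes it back, and the
 * parent times the round trips with CLOCK_MONOTONIC. Short reads/writes
 * are ignored for brevity; small messages over a socketpair are delivered
 * whole in practice. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ITERS    100000
#define MSG_SIZE 64

int main(void)
{
    int sv[2];
    char buf[MSG_SIZE] = {0};

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("socketpair");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {                       /* child: echo "server" */
        close(sv[0]);
        for (int i = 0; i < ITERS; i++) {
            if (read(sv[1], buf, MSG_SIZE) <= 0) break;
            if (write(sv[1], buf, MSG_SIZE) <= 0) break;
        }
        _exit(0);
    }

    close(sv[1]);                         /* parent: "client" */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        write(sv[0], buf, MSG_SIZE);
        read(sv[0], buf, MSG_SIZE);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    waitpid(pid, NULL, 0);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg round trip: %.0f ns\n", ns / ITERS);
    return 0;
}
```

It is a crude stand-in for the real client/server path (no network, no computation), but the relative difference between host and guest is what matters.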
I haven't dug down to the performance of the basic primitives like context switching and atomic operations, but here are my results from a brute-force test I carried out recently with different hypervisors. It should be indicative of what you might expect if you are mostly CPU- and RAM-bandwidth-limited.
https://altechnative.net/virtual-performance-or-lack-thereof/