(I have reformulated the question; I think it needed to be more structured.)
We run Proxmox VE on a Dell PowerEdge R610 (gen 8) system. The platform is old, but we use it for particular software which is well known to gain nothing from additional CPU cores, while its performance scales linearly with CPU clock frequency; 3.3 GHz accomplishes that goal well. A performance analysis showed that disk I/O is a serious bottleneck, while nothing else is.
HW config is:
- Dell PowerEdge R610 gen 8, BIOS v6.6.0 of 05/22/2018 (most recent), dual PSU - both seem to be OK. Server boots in UEFI.
- CPU: 2x Xeon X5680 (Westmere-EP, 12 cores total, 3.33 GHz, turbo up to 3.6 GHz)
- RAM: 96 GiB - 6x Samsung M393B2K70DM0-YH9 (DDR3, 16GiB, 1333MT/s)
- Storage controller: LSI MegaRAID SAS 9240-4i, JBOD mode (SAS-MFI BIOS, FW v20.10.1-0107 - not the latest one)
- Storage: 2x new Samsung SSD 860 EVO 1TB, firmware RVT03B6Q
The MegaRAID card we use is not the built-in PERC. The built-in controller is capable of only 1.5 Gbit/s SATA, which is far too slow, and its JBOD and HBA modes are disabled. The add-on 9240-4i, by contrast, runs the SSDs at their full interface speed of 6 Gbit/s and allows JBOD mode.
The card has no battery and no cache, so its performance was predictably poor when RAID was built on the card itself; therefore both disks are configured as JBOD and mirrored with software RAID. The theoretical maximum for a 6 Gbit/s interface is 600 MB/s (accounting for 8b/10b line encoding), which is what to expect from a single drive in a sequential test.
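As a quick sanity check, the arithmetic behind that 600 MB/s expectation (assuming standard 8b/10b SATA line encoding):

```shell
# SATA 6 Gbit/s uses 8b/10b line encoding: 10 bits travel on the wire
# for every 8 data bits, i.e. 10 line bits per data byte.
line_rate_mbit=6000                    # 6 Gbit/s link rate
payload_mb=$(( line_rate_mbit / 10 ))  # Mbit/s / 10 bits-per-byte -> MB/s
echo "${payload_mb} MB/s"              # 600 MB/s per-drive ceiling
```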
We ran extensive I/O tests under both Linux and Windows, using fio with the same config. The only differences in the config were the AIO library (windowsaio on Windows, libaio on Linux) and the test device specifications. The fio config was adapted from this post: https://forum.proxmox.com/threads/pve-6-0-slow-ssd-raid1-performance-in-windows-vm.58559/#post-270657 . I can't show the full fio outputs because that would exceed ServerFault's 30k-character limit; I can share them elsewhere if somebody wants to see them. Here I'll show only the summary lines. Linux (Proxmox VE) was configured with MD RAID1 and "thick" LVM.
Caches inside SSDs are enabled:
# hdparm -W /dev/sd[ab]
/dev/sda:
write-caching = 1 (on)
/dev/sdb:
write-caching = 1 (on)
Devices run at full 6 Gb/s interface speed:
# smartctl -i /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.3.10-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 860 EVO 1TB
Serial Number: S4FMNE0MBxxxxxx
LU WWN Device Id: x xxxxxx xxxxxxxxx
Firmware Version: RVT03B6Q
User Capacity: 1 000 204 886 016 bytes [1,00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Feb 7 15:25:45 2020 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
# smartctl -i /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.3.10-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 860 EVO 1TB
Serial Number: S4FMNE0MBxxxxxx
LU WWN Device Id: x xxxxxx xxxxxxxxx
Firmware Version: RVT03B6Q
User Capacity: 1 000 204 886 016 bytes [1,00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Feb 7 15:25:47 2020 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Partitions were carefully aligned to 1 MiB, and the "main" large partition, which is the LVM PV and where all tests were done, starts exactly at 512 MiB:
# fdisk -l /dev/sd[ab]
Disk /dev/sda: 931,5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: Samsung SSD 860
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 1DDCF7A0-D894-8C43-8975-C609D4C3C742
Device Start End Sectors Size Type
/dev/sda1 2048 524287 522240 255M EFI System
/dev/sda2 524288 526335 2048 1M BIOS boot
/dev/sda3 526336 1048575 522240 255M Linux RAID
/dev/sda4 1048576 1953525134 1952476559 931G Linux RAID
Disk /dev/sdb: 931,5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: Samsung SSD 860
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 63217472-3D2E-9444-917C-4776100B2D87
Device Start End Sectors Size Type
/dev/sdb1 2048 524287 522240 255M EFI System
/dev/sdb2 524288 526335 2048 1M BIOS boot
/dev/sdb3 526336 1048575 522240 255M Linux RAID
/dev/sdb4 1048576 1953525134 1952476559 931G Linux RAID
There is no bitmap:
# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md126 : active raid1 sda4[2] sdb4[0]
976106176 blocks super 1.2 [2/2] [UU]
md127 : active raid1 sda3[2] sdb3[0]
261056 blocks super 1.0 [2/2] [UU]
unused devices: <none>
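For reference, the bitmap state can be verified and toggled with mdadm (a write-intent bitmap costs some write performance but makes resync after an unclean shutdown much faster); a command sketch, using the array name from the mdstat output above:

```shell
# Confirm whether a write-intent bitmap is present ("Intent Bitmap" field):
mdadm --detail /dev/md126 | grep -i bitmap

# Remove the bitmap (slightly faster writes, full resync after a crash):
mdadm --grow --bitmap=none /dev/md126

# Re-add an internal bitmap later, if desired:
mdadm --grow --bitmap=internal /dev/md126
```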
LVM was created with a 32 MiB PE size, so everything inside it is aligned to 32 MiB.
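The PV/VG setup was along these lines (a sketch; the VG name vh0 is taken from the fio config used later, and the LV size is an assumption):

```shell
# PV on the large RAID1 array, VG with 32 MiB physical extents:
pvcreate /dev/md126
vgcreate -s 32M vh0 /dev/md126

# A "thick" (non-thin) test LV; the size here is only an example:
lvcreate -L 100G -n testvol vh0
```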
lsblk --discard
shows that no device supports any form of TRIM (even non-queued). This is probably because the LSI SAS2008 chip does not pass this command through. Queued TRIM is blacklisted on these SSDs anyway: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-core.c?id=9a9324d3969678d44b330e1230ad2c8ae67acf81 . In any case, Windows sees exactly the same situation, so the comparison is fair.
The I/O scheduler was "none" on both disks. I also tried "mq-deadline" (the default); it generally showed worse results.
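For completeness, the scheduler was switched per disk through sysfs; a sketch (the setting does not persist across reboots unless applied via a udev rule or kernel command line):

```shell
# Show available schedulers; the active one is displayed in brackets:
cat /sys/block/sda/queue/scheduler

# Select "none" on both SSDs for the current boot:
echo none > /sys/block/sda/queue/scheduler
echo none > /sys/block/sdb/queue/scheduler
```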
Under that configuration, fio showed following results:
PVEHost-128K-Q32T1-Seq-Read bw=515MiB/s (540MB/s), 515MiB/s-515MiB/s (540MB/s-540MB/s), io=97.5GiB (105GB), run=194047-194047msec
PVEHost-128K-Q32T1-Seq-Write bw=239MiB/s (250MB/s), 239MiB/s-239MiB/s (250MB/s-250MB/s), io=97.7GiB (105GB), run=419273-419273msec
PVEHost-4K-Q8T8-Rand-Read bw=265MiB/s (278MB/s), 265MiB/s-265MiB/s (278MB/s-278MB/s), io=799GiB (858GB), run=3089818-3089818msec
PVEHost-4K-Q8T8-Rand-Write bw=132MiB/s (138MB/s), 132MiB/s-132MiB/s (138MB/s-138MB/s), io=799GiB (858GB), run=6214084-6214084msec
PVEHost-4K-Q32T1-Rand-Read bw=265MiB/s (278MB/s), 265MiB/s-265MiB/s (278MB/s-278MB/s), io=98.7GiB (106GB), run=380721-380721msec
PVEHost-4K-Q32T1-Rand-Write bw=132MiB/s (139MB/s), 132MiB/s-132MiB/s (139MB/s-139MB/s), io=99.4GiB (107GB), run=768521-768521msec
PVEHost-4K-Q1T1-Rand-Read bw=16.8MiB/s (17.6MB/s), 16.8MiB/s-16.8MiB/s (17.6MB/s-17.6MB/s), io=99.9GiB (107GB), run=6102415-6102415msec
PVEHost-4K-Q1T1-Rand-Write bw=36.4MiB/s (38.1MB/s), 36.4MiB/s-36.4MiB/s (38.1MB/s-38.1MB/s), io=99.8GiB (107GB), run=2811085-2811085msec
On exactly the same hardware configuration, Windows was configured with Logical Disk Manager mirroring. The results:
WS2019-128K-Q32T1-Seq-Read bw=1009MiB/s (1058MB/s), 1009MiB/s-1009MiB/s (1058MB/s-1058MB/s), io=100GiB (107GB), run=101535-101535msec
WS2019-128K-Q32T1-Seq-Write bw=473MiB/s (496MB/s), 473MiB/s-473MiB/s (496MB/s-496MB/s), io=97.8GiB (105GB), run=211768-211768msec
WS2019-4K-Q8T8-Rand-Read bw=265MiB/s (278MB/s), 265MiB/s-265MiB/s (278MB/s-278MB/s), io=799GiB (858GB), run=3088236-3088236msec
WS2019-4K-Q8T8-Rand-Write bw=130MiB/s (137MB/s), 130MiB/s-130MiB/s (137MB/s-137MB/s), io=799GiB (858GB), run=6272968-6272968msec
WS2019-4K-Q32T1-Rand-Read bw=189MiB/s (198MB/s), 189MiB/s-189MiB/s (198MB/s-198MB/s), io=99.1GiB (106GB), run=536262-536262msec
WS2019-4K-Q32T1-Rand-Write bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=99.4GiB (107GB), run=823544-823544msec
WS2019-4K-Q1T1-Rand-Read bw=22.9MiB/s (24.0MB/s), 22.9MiB/s-22.9MiB/s (24.0MB/s-24.0MB/s), io=99.9GiB (107GB), run=4466576-4466576msec
WS2019-4K-Q1T1-Rand-Write bw=41.4MiB/s (43.4MB/s), 41.4MiB/s-41.4MiB/s (43.4MB/s-43.4MB/s), io=99.8GiB (107GB), run=2466593-2466593msec
The comparison:
Test                  Windows    none       mq-deadline  Comment
128K-Q32T1-Seq-Read   1058MB/s   540MB/s    539MB/s      ~50% less than Windows, but this is expected
128K-Q32T1-Seq-Write  496MB/s    250MB/s    295MB/s      40-50% less than Windows!
4K-Q8T8-Rand-Read     278MB/s    278MB/s    278MB/s      same as Windows
4K-Q8T8-Rand-Write    137MB/s    138MB/s    127MB/s      almost the same as Windows
4K-Q32T1-Rand-Read    198MB/s    278MB/s    276MB/s      40% more than Windows
4K-Q32T1-Rand-Write   130MB/s    139MB/s    130MB/s      similar to Windows
4K-Q1T1-Rand-Read     24.0MB/s   17.6MB/s   17.3MB/s     26% less than Windows
4K-Q1T1-Rand-Write    43.4MB/s   38.1MB/s   28.3MB/s     12-34% less than Windows
Linux MD RAID1 reads from both drives only when there are at least two threads. The first test is single-threaded, so Linux reads from a single drive and achieves single-drive performance. That is justifiable, and the first test result is fine. But the others...
These are host-only tests. When we ran the same tests inside VMs, the last lines got even worse: a Windows VM under PVE (no ballooning, fixed memory, fixed CPU frequency, virtio-scsi v171, writeback with barriers) showed 70% less than Windows under Hyper-V. Even a Linux VM under PVE shows much worse results than Windows under Hyper-V:
                     Windows/   Windows/   Linux/
                     Hyper-V    PVE        PVE
128K-Q32T1-Seq-Read 1058MB/s 856MB/s 554MB/s
128K-Q32T1-Seq-Write 461MB/s 375MB/s 514MB/s
4K-Q8T8-Rand-Read 273MB/s 327MB/s 254MB/s
4K-Q8T8-Rand-Write 135MB/s 139MB/s 138MB/s
4K-Q32T1-Rand-Read 220MB/s 198MB/s 210MB/s
4K-Q32T1-Rand-Write 131MB/s 146MB/s 140MB/s
4K-Q1T1-Rand-Read 18.2MB/s 5452kB/s 8701kB/s
4K-Q1T1-Rand-Write 26.7MB/s 7772kB/s 10.7MB/s
During these tests Windows under Hyper-V was quite responsive despite the large I/O load, and so was Linux under PVE. But when Windows ran under PVE, its GUI slowed to a crawl, the RDP session tended to disconnect due to packet drops, and the load average on the host climbed to 48, mostly due to huge I/O wait!
During the tests we also saw a quite large load on a single core, which happened to be servicing the "megasas" interrupt. This card exposes only a single interrupt source, so there is no way to spread this load "in hardware". Windows didn't show such a single-core hotspot during the test, so it apparently uses some kind of interrupt steering (spreading the handling across cores). Overall CPU load also seemed lower during the Windows host test than during the Linux one, though this could not be compared directly.
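With a single interrupt vector the load cannot be spread, but it can at least be pinned to a dedicated core so it stops competing with the benchmark threads. A hedged sketch (the IRQ number 24 and the CPU mask are placeholders; take the real number from /proc/interrupts, and stop irqbalance first or it may override the setting):

```shell
# Find the megasas interrupt line and its per-CPU counts:
grep megasas /proc/interrupts

# Pin IRQ 24 (placeholder) to CPU2 (bitmask 0x4 = binary 100):
echo 4 > /proc/irq/24/smp_affinity
```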
The question is: why is this so bad? Am I missing something? Is it possible to get performance comparable to that of Windows? (I am writing this with shaking hands and lost for words; it is very unpleasant to be playing catch-up with Windows.)
Additional tests, as @shodanshok suggested:
[global]
ioengine=libaio
group_reporting
filename=/dev/vh0/testvol
direct=1
size=5G
[128K-Q1T32-Seq-Read]
rw=read
bs=128K
numjobs=32
stonewall
[128K-Q1T32-Seq-Write]
rw=write
bs=128K
numjobs=32
stonewall
[4K-Q1T32-Seq-Read]
rw=read
bs=4K
numjobs=32
stonewall
[4K-Q1T32-Seq-Write]
rw=write
bs=4K
numjobs=32
stonewall
[128K-Q1T2-Seq-Read]
rw=read
bs=128K
numjobs=2
stonewall
[128K-Q1T2-Seq-Write]
rw=write
bs=128K
numjobs=2
stonewall
The result:
128K-Q1T32-Seq-Read bw=924MiB/s (969MB/s), 924MiB/s-924MiB/s (969MB/s-969MB/s), io=160GiB (172GB), run=177328-177328msec
128K-Q1T32-Seq-Write bw=441MiB/s (462MB/s), 441MiB/s-441MiB/s (462MB/s-462MB/s), io=160GiB (172GB), run=371784-371784msec
4K-Q1T32-Seq-Read bw=261MiB/s (274MB/s), 261MiB/s-261MiB/s (274MB/s-274MB/s), io=160GiB (172GB), run=627761-627761msec
4K-Q1T32-Seq-Write bw=132MiB/s (138MB/s), 132MiB/s-132MiB/s (138MB/s-138MB/s), io=160GiB (172GB), run=1240437-1240437msec
128K-Q1T2-Seq-Read bw=427MiB/s (448MB/s), 427MiB/s-427MiB/s (448MB/s-448MB/s), io=10.0GiB (10.7GB), run=23969-23969msec
128K-Q1T2-Seq-Write bw=455MiB/s (477MB/s), 455MiB/s-455MiB/s (477MB/s-477MB/s), io=10.0GiB (10.7GB), run=22498-22498msec
Something is strange here: why was 128K-Q1T2-Seq-Read so bad? (The ideal value would be 1200 MB/s, two drives reading in parallel.) Is 5 GiB per job too small for things to settle? Everything else seems to be OK.
Answer (by @shodanshok):
It is quite unlikely that you are limited by IRQ service time when using only two SATA disks. Rather, it is very probable that the slow I/O speed you see is a direct result of the MegaRAID controller disabling the disks' own private DRAM caches which, for SSDs, are critical to obtaining good performance.
If you are using a PERC-branded MegaRAID card, you can enable the disk's private cache via
omconfig storage vdisk controller=0 vdisk=0 diskcachepolicy=enabled
(I wrote that from memory and only as an example; please check the omconfig CLI reference.) Anyway, be sure to understand what this means: if the disk cache is enabled while using consumer (i.e. non-power-protected) SSDs, any power outage can lead to data loss. If you host critical data, do not enable the disk cache; rather, buy enterprise-grade SSDs which come with a power-loss-protected writeback cache (e.g. Intel S4510).
If, and only if, your data are expendable, then feel free to enable the disk's internal cache.
Some more reference: https://notesbytom.wordpress.com/2016/10/21/dell-perc-megaraid-disk-cache-policy/
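The 9240-4i in the question is not PERC-branded, so omconfig does not apply there; on LSI-branded cards the same disk-cache knob is exposed through storcli or the older MegaCli. A sketch assuming controller index 0; note that with JBOD (pass-through) disks the logical-disk cache policy may not apply at all, so inspect the current state first:

```shell
# storcli: inspect and enable the drives' own DRAM cache on all VDs:
storcli /c0/vall show all | grep -i cache
storcli /c0/vall set pdcache=on

# MegaCli equivalent:
MegaCli -LDGetProp -DskCache -LAll -aALL
MegaCli -LDSetProp -EnDskCache -LAll -aALL
```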