A pretty basic system running mirror+stripe on 7.2k rpm SAS disks, not particularly loaded. No dedup; compression enabled on all datasets. The scrub has been running for 15 days at the speed of a dead snail. Is there some optimization that needs to be done, or can this be due to some faulty hardware?
- Dell R510 with MD1200 enclosure.
- 2x Xeon E5620
- 48GB
- NexentaStor 3.1.3, Community edition
Some information:
scan: scrub in progress since Mon Apr 1 19:00:05 2013
171G scanned out of 747G at 141K/s, 1187h40m to go
0 repaired, 22.84% done
config:
NAME                         STATE     READ WRITE CKSUM
tank                         ONLINE       0     0     0
  mirror-0                   ONLINE       0     0     0
    c7t5000C500414FB2CFd0    ONLINE       0     0     0
    c7t5000C500414FCA57d0    ONLINE       0     0     0
  mirror-1                   ONLINE       0     0     0
    c7t5000C500415C3B1Bd0    ONLINE       0     0     0
    c7t5000C500415C5E4Fd0    ONLINE       0     0     0
  mirror-2                   ONLINE       0     0     0
    c7t5000C500415DC797d0    ONLINE       0     0     0
    c7t5000C500415DC933d0    ONLINE       0     0     0
logs
  c7t5000A7203006D81Ed0      ONLINE       0     0     0
cache
  c7t5000A72030068545d0      ONLINE       0     0     0
# iostat -en
---- errors ---
s/w h/w trn tot device
0 8887 0 8887 c2t0d0
0 0 0 0 c0t395301D6B0C8069Ad0
0 0 0 0 c7t5000C500415DC933d0
0 0 0 0 c7t5000A72030068545d0
0 0 0 0 c7t5000C500415DC797d0
0 0 0 0 c7t5000C500414FCA57d0
0 0 0 0 c7t5000C500415C3B1Bd0
0 0 0 0 c7t5000C500415C5E4Fd0
0 0 0 0 c7t5000C500414FB2CFd0
0 0 0 0 c7t5000A7203006D81Ed0
spa_last_io changes every time I run this:
# echo "::walk spa | ::print spa_t spa_name spa_last_io spa_scrub_inflight" | mdb -k
spa_name = [ "syspool" ]
spa_last_io = 0x25661402
spa_scrub_inflight = 0
spa_name = [ "tank" ]
spa_last_io = 0x25661f84
spa_scrub_inflight = 0x21
Every 5 seconds, about 20-25 MB/s is written. Between those bursts there are basically no reads or writes.
                               capacity     operations    bandwidth      latency
pool                        alloc   free   read  write   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----  -----  -----
syspool                      427G   501G      0      0      0      0   0.00   0.00
  c0t395301D6B0C8069Ad0s0    427G   501G      0      0      0      0   0.00   0.00
-------------------------  -----  -----  -----  -----  -----  -----  -----  -----
tank                         903G  1.84T    810  5.21K  1.50M  20.8M   9.42   4.71
  mirror                     301G   627G     22  1.00K  53.0K  3.96M   8.96   3.93
    c7t5000C500414FB2CFd0       -      -     20    244  50.1K  3.97M   6.70   1.14
    c7t5000C500414FCA57d0       -      -     19    242  48.2K  3.97M   7.60   1.12
  mirror                     301G   627G     25   1016  46.8K  4.10M  16.11   5.28
    c7t5000C500415C3B1Bd0       -      -     21    257  41.6K  4.11M   4.63   1.24
    c7t5000C500415C5E4Fd0       -      -     21    255  43.0K  4.11M  16.54   1.15
  mirror                     301G   627G     62    754   119K  3.03M  19.72   3.78
    c7t5000C500415DC797d0       -      -     57    219   114K  3.03M   9.99   1.15
    c7t5000C500415DC933d0       -      -     56    220   119K  3.03M  13.20   1.22
  c7t5000A7203006D81Ed0      260K  46.5G      0      0      0      0   0.00   0.00
cache                           -      -      -      -      -      -
  c7t5000A72030068545d0     93.1G     8M      0      0      0      0   0.00   0.00
-------------------------  -----  -----  -----  -----  -----  -----  -----  -----
Are the iostats telling me that I am spending more time waiting for disk than I should have to? Specifically the %b column:
# iostat -xe
device r/s w/s kr/s kw/s wait actv svc_t %w %b s/w h/w trn tot
sd3 5.1 43.9 20.6 643.8 0.0 0.1 2.9 0 5 0 0 0 0
sd4 9.4 1.8 141.1 169.6 0.0 0.0 0.5 0 0 0 0 0 0
sd5 3.1 43.8 15.8 643.8 0.0 0.1 1.4 0 3 0 0 0 0
sd6 5.2 38.1 14.3 494.4 0.0 0.1 3.0 0 7 0 0 0 0
sd7 4.2 40.2 11.1 623.2 0.0 0.1 2.7 0 7 0 0 0 0
sd8 3.6 44.3 9.7 623.2 0.0 0.1 1.5 0 4 0 0 0 0
sd9 2.9 37.4 7.0 494.4 0.0 0.1 1.3 0 2 0 0 0 0
sd10 0.7 0.4 3.4 0.0 0.0 0.0 0.0 0 0 0 0 0 0
Is the latency a tad on the high side?
# zpool iostat 10 10
capacity operations bandwidth latency
pool alloc free read write read write read write
tank 909G 1.83T 86 2.82K 208K 12.7M 22.68 13.63
---------- ----- ----- ----- ----- ----- ----- ----- -----
tank 909G 1.83T 29 857 42.4K 3.50M 17.86 4.47
---------- ----- ----- ----- ----- ----- ----- ----- -----
tank 909G 1.83T 30 947 46.1K 3.54M 15.55 5.67
Applied some tweaking, which made little difference: zfs_top_maxinflight set to 127, zfs_scrub_delay to 0, and zfs_scan_idle to 0.
# echo zfs_top_maxinflight | mdb -k
zfs_top_maxinflight:
zfs_top_maxinflight: 127
# echo zfs_scrub_delay/D |mdb -k
zfs_scrub_delay:
zfs_scrub_delay:0
# echo zfs_scan_idle/D |mdb -k
zfs_scan_idle:
zfs_scan_idle: 0
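(For the record, changing these on the fly is done with mdb's write syntax, roughly as follows; the 0t prefix marks the value as decimal and -kw opens the live kernel for writing.)
# echo zfs_top_maxinflight/W0t127 | mdb -kw
# echo zfs_scrub_delay/W0t0 | mdb -kw
# echo zfs_scan_idle/W0t0 | mdb -kw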
scan: scrub in progress since Wed Apr 17 20:47:23 2013
1.85G scanned out of 918G at 1.14M/s, 229h36m to go
0 repaired, 0.20% done
Pre-mdb-tweak, notice the rather high %b column:
$ iostat -nx -M 5
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c2t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t395301D6B0C8069Ad0
35.2 44.2 0.3 0.7 0.0 0.4 0.0 5.3 0 32 c7t5000C500415DC933d0
19.8 3.2 0.2 0.0 0.0 0.0 0.0 0.1 0 0 c7t5000A72030068545d0
31.2 46.2 0.2 0.7 0.0 0.3 0.0 4.4 0 27 c7t5000C500415DC797d0
30.6 46.8 0.2 0.8 0.0 0.4 0.0 4.6 0 28 c7t5000C500414FCA57d0
37.6 53.0 0.3 0.8 0.0 0.4 0.0 4.7 0 33 c7t5000C500415C3B1Bd0
37.6 53.6 0.3 0.8 0.0 0.5 0.0 5.6 0 39 c7t5000C500415C5E4Fd0
33.2 46.8 0.3 0.8 0.0 0.5 0.0 6.1 0 33 c7t5000C500414FB2CFd0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c7t5000A7203006D81Ed0
Post-mdb-tweak, notice the %b column: 80-85% of the time in busy wait.
$ iostat -nx -M 5
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c2t0d0
0.2 27.2 0.0 0.3 0.0 1.0 0.0 35.4 0 18 c0t395301D6B0C8069Ad0
129.6 20.2 0.9 0.4 0.0 2.9 0.0 19.5 0 85 c7t5000C500415DC933d0
48.4 4.0 0.4 0.0 0.0 0.0 0.0 0.1 0 1 c7t5000A72030068545d0
130.4 19.8 0.9 0.4 0.0 3.0 0.0 20.2 0 84 c7t5000C500415DC797d0
125.8 25.8 0.9 0.5 0.0 2.9 0.0 19.2 0 80 c7t5000C500414FCA57d0
131.2 24.2 0.9 0.5 0.0 3.1 0.0 20.3 0 83 c7t5000C500415C3B1Bd0
130.6 25.8 0.9 0.5 0.0 3.5 0.0 22.5 0 88 c7t5000C500415C5E4Fd0
126.8 28.0 0.9 0.5 0.0 2.8 0.0 18.0 0 79 c7t5000C500414FB2CFd0
0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0 0 c7t5000A7203006D81Ed0
ZFS scrub operations operate on some fairly brain-dead principles. Most notably, a scrub only spends time scrubbing when there's nothing else going on. If you poke a pool with even a little data access on a fairly constant basis, the scrub will effectively starve itself and do nearly nothing.
There are a handful of tunables worth exploring here (the ones I go through below are zfs_scan_idle, zfs_scrub_delay, zfs_scan_min_time_ms, zfs_no_scrub_prefetch, zfs_txg_timeout and zfs_txg_synctime_ms), though I last looked into this a while ago.
All of them can be changed on the fly with mdb, using "echo [tunable]/W0t[number]" to change a value and "echo [tunable]/D" to view the current setting (which I recommend doing before changing anything).
So in theory, and in general practice: if you were to, say, drop zfs_scan_idle down to 10 (or 1, or 0 if the code supports that; you'd need to check) and zfs_scrub_delay down to 1 (or 0, if it supports that), and, if your zfs_txg_synctime_ms setting is 5000 or more, perhaps raise zfs_scan_min_time_ms a bit, the scrub should become a lot more aggressive about actually doing its work even with some level of user I/O occurring.
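As a rough sketch, checking and then changing those two on the fly would look something like this (the values are just the illustrative ones from the paragraph above):
# echo zfs_scan_idle/D | mdb -k
# echo zfs_scan_idle/W0t10 | mdb -kw
# echo zfs_scrub_delay/D | mdb -k
# echo zfs_scrub_delay/W0t1 | mdb -kw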
In your specific case, the %b and asvc_t reported imply a very, very random read workload (spinning disks should do better than that if it were truly sequential), and you've already done the "easy" stuff as explained above. So, first I'd turn on zfs_no_scrub_prefetch to disable prefetch on scrub operations, just to see if that helps. If no joy, then depending on the version of Nexenta you're on, you may be running 30/5, 5/1 or 10/5 (that's shorthand we use for the settings of zfs_txg_timeout & (zfs_txg_synctime_ms*1000)). Change zfs_txg_timeout to 10 and zfs_txg_synctime_ms to 5000, then try upping zfs_scan_min_time_ms to 3000 or 4000. This tells ZFS it can spend a lot longer on scrubs compared to the default settings on older NexentaStor installs that use 5/1 as the defaults. But be careful: this may starve normal I/O if the delay settings have also been set to basically 0!
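Again as a sketch, that second round of changes would go something like this (values as suggested above; treat them as starting points, not gospel):
# echo zfs_no_scrub_prefetch/W0t1 | mdb -kw
# echo zfs_txg_timeout/W0t10 | mdb -kw
# echo zfs_txg_synctime_ms/W0t5000 | mdb -kw
# echo zfs_scan_min_time_ms/W0t3000 | mdb -kw
If the scrub then starts hurting normal I/O too much, back off zfs_scrub_delay and zfs_scan_idle again.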
Hope this helps. Good luck!
I suspect hardware...
Why would you let this run for 15 days? That's not normal. Stop the scrub -
zpool scrub -s tank
and check the system out.
My answer comes a little late, but if this kind of thing happens to anyone else, here is my take on it: simply try "dmesg". In my case I wasn't running a scrub, but I was copying files to the disks, and I could clearly hear the disks being active for a few seconds, then all stopping for a longer time, then working again, and so on. This was due to the failure of one SATA controller, and dmesg gave me all the errors. I thought it was a failing disk at first, but then I realized it was actually the controller.
Scrub uses available system downtime, even on an unloaded server; it's about availability. RAM and processor are the keys to scrub utilization, not disk: the more of these available, the better your scrub performance will be. However, certainly in this case, the better your disks are laid out in terms of zpools, the better your scrub performance will be as well.
So, if your performance has been slow, and that does appear to be the case, I would look to these as potential reasons.