[Sorry, I tried to stay short but it just was not possible]
I'm running Linux 2.6.22.19 with linux-vserver 2.3.0.34 on a Fujitsu Siemens Primergy RX200 S3 for development purposes. It has an Intel(R) Xeon(R) 5140 @ 2.33GHz with 4GB RAM (of which around 500MB are usually still free). The server has two hot-plug 250GB disks used in a mirrored RAID configuration:
dev:~# mpt-status --newstyle
ioc:0 vol_id:0 type:IM raidlevel:RAID-1 num_disks:2 size(GB):231 state: OPTIMAL flags: ENABLED
ioc:0 phys_id:0 scsi_id:1 vendor:ATA product_id:WDC WD2500JS-55N revision:2E01 size(GB):232 state: ONLINE flags: NONE sync_state: 100 ASC/ASCQ:0xff/0xff SMART ASC/ASCQ:0xff/0xff
ioc:0 phys_id:1 scsi_id:8 vendor:ATA product_id:WDC WD2500JS-55N revision:2E01 size(GB):232 state: ONLINE flags: NONE sync_state: 100 ASC/ASCQ:0xff/0xff SMART ASC/ASCQ:0xff/0xff
scsi_id:0 100%
scsi_id:1 100%
I'm running LVM and ext3.
We've been using this machine since around Jul/Aug 2007, and apart from a broken RAM module which was replaced the same day, there have been no problems. Basically, everything was better than the machine we had before.
However, performance problems were first noticed around Aug 2008, and it was not until recently that I gained at least some confidence about where the problem comes from. On average, seven vservers are running nowadays (three MySQL machines, two Tomcats, three Apaches, Hudson, CruiseControl, MediaWiki, Samba, etc.). But to avoid giving a false impression: we're a small company in terms of developers and other people accessing the server, so there's not that much going on (browsing MediaWiki, Hudson automation runs in the evening, most Apache/PHP apps serve mostly static content).
Once I installed munin I started to see interesting things, especially during the night. Since a find job was running within every vserver (at the same time, everywhere), the load shot up to unreal numbers like 15, 17 or 20.
But ultimately the problem didn't just occur during the night (I started disabling the find jobs, they weren't needed anyway) but also during the day, especially when we recently started on a project where we had to work with a database and a 500MB MySQL dump file.
The first time I imported the dump file during working hours (mysql < dump.sql; within one of our vservers, where the MySQL instance was running), the timed output was:
real 71m28.630s
user 0m15.477s
sys 0m10.185s
Since I wasn't paying attention and was in a meeting, it was my co-workers who asked me what was going on with the server, because it was terribly slow.
I repeated the test at night, installed a vanilla Debian MySQL on the host (not inside a guest; I shut all of them down) and got the following numbers:
real 48m33.067s
user 0m15.397s
sys 0m13.517s
and I was still like: yeah, fine, it's a 500MB dump file; dumping it into the InnoDB tablespace takes up around 1GB, that's quite some data. During such an import I did some tests like writing a single line to a file with vim and capturing it with strace:
0.000048 write(1, "\33[?25l\"foo\" \33[78;7H\33[K", 22) = 22
0.000033 stat64("foo", 0xbfda1f04) = -1 ENOENT (No such file or directory)
0.000033 open("foo", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
0.000035 write(4, "thios isthis ia a testa\n", 24) = 24
0.000144 fsync(4) = 0
7737.631089 stat64("foo", {st_mode=S_IFREG|0664, st_size=24, ...}) = 0
0.000084 close(4) = 0
0.000117 write(1, "\33[78;7H[New] 1L, 24C written", 28) = 28
0.000051 lseek(3, 0, SEEK_SET) = 0
0.000022 write(3, "b0VIM 7.0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0!\21\0\0mark"..., 4096) = 4096
0.000052 select(1, [0], NULL, [0], {0, 0}) = 1 (in [0], left {0, 0})
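(For reference, a trace like this can be captured with something along the lines of the command below; I didn't note the exact flags, so treat them as an assumption. -r prints the time relative to the previous syscall, which is what the first column shows.)
strace -r -o vim.trace vim foo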
That number was unbelievable, yet a fact, for me. It seems the stat64 syscall was forced to wait for the dump operation to finish. Not to mention that serving pages in MediaWiki during such a dump takes minutes, too.
Anyway, I arranged a test window with our hosting company to run the test on our production server during the night, and I got a completely different picture:
real 7m4.063s
user 0m2.690s
sys 0m30.500s
I was blown away. I also got an Amazon EC2 account approved for testing and got even lower numbers (around 5 minutes on an m1.large instance where the MySQL data was written to an EBS volume to stay persistent).
Additionally, while importing the data, every other operation gets so slow that things just become unusable, and the load quickly rises to 5 or 7 (although not much really seems to be going on; it looks like the processes start to block each other for no reason).
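A quick way to see whether such load really is processes blocked on I/O rather than CPU work is to watch the blocked and iowait columns, for example:
vmstat 1 5
# 'b' counts processes blocked waiting for I/O; a high 'wa' percentage
# means the CPUs are mostly sitting idle waiting for the disks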
I then started doing bonnie++ tests, which looked like this (I actually had a test run from last year which I had almost forgotten about). So here is Oct 2008:
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
vserver-host 8G 12332 21 12600 3 10290 2 48519 74 52431 6 235.8 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
vserver-host,8G,12332,21,12600,3,10290,2,48519,74,52431,6,235.8,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
and here's the current one from Apr 2009:
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
vserver-host 8G 9705 16 7268 1 4823 1 29620 45 41336 5 73.1 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 11678 15 +++++ +++ +++++ +++ +++++ +++ +++++ +++ 27739 31
vserver-host,8G,9705,16,7268,1,4823,1,29620,45,41336,5,73.1,0,16,11678,15,+++++,+++,+++++,+++,+++++,+++,+++++,+++,27739,31
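For reproducibility, the runs above were started with something along these lines; the target directory and user are assumptions, -s 8G uses a test file twice the size of RAM, and -m only sets the label that shows up in the results:
bonnie++ -d /tmp/bonnie -s 8G -u nobody -m vserver-host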
I don't know where or what to tune to get rid of the problems, as I'm not yet completely sure where or what the real problem is. I guess I've started failing to see the wood for the trees.
Update 1:
Thanks for the feedback so far. I've first picked the things I can easily test without scheduling a maintenance window.
First I tested the MySQL dump against /dev/null:
dev01:~$ time mysqldump -u root -h db01 database -p >/dev/null
Enter password:
real 0m18.871s
user 0m12.713s
sys 0m0.512s
The system load was barely noticeable:
10:27:18 up 77 days, 9:10, 7 users, load average: 0.16, 0.75, 0.67
[...]
10:27:33 up 77 days, 9:10, 7 users, load average: 0.52, 0.79, 0.69
Also, the sar output during this test didn't reveal anything special:
12:11:45 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
12:11:46 PM dev8-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:11:46 PM dev254-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:11:46 PM dev254-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:11:46 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
12:11:47 PM dev8-0 5.00 0.00 200.00 40.00 0.18 36.00 20.00 10.00
12:11:47 PM dev254-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:11:47 PM dev254-1 25.00 0.00 200.00 8.00 0.74 29.60 4.00 10.00
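(These samples come from sar's per-device statistics, collected with something like the line below; the interval and count are assumptions. dev8-0 is the physical disk, while the dev254-* entries are the LVM device-mapper volumes.)
sar -d 1 10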
Then I ran the test again, this time dumping to a file:
dev01:~$ time mysqldump -u root -h db01 database -p >foo.sql
real 1m7.527s
user 0m13.497s
sys 0m2.724s
The time it took doesn't look that unusual to me; the dump ended up as a 570MB file.
However, the load was quite interesting ...
10:30:49 up 77 days, 9:14, 7 users, load average: 0.76, 0.89, 0.75
10:30:57 up 77 days, 9:14, 7 users, load average: 1.34, 1.01, 0.79
10:31:05 up 77 days, 9:14, 7 users, load average: 2.13, 1.19, 0.85
10:31:13 up 77 days, 9:14, 7 users, load average: 2.68, 1.32, 0.89
10:31:21 up 77 days, 9:14, 7 users, load average: 3.79, 1.60, 0.99
10:31:29 up 77 days, 9:14, 7 users, load average: 4.05, 1.69, 1.02
10:31:37 up 77 days, 9:14, 7 users, load average: 4.82, 1.93, 1.10
10:31:45 up 77 days, 9:15, 7 users, load average: 4.54, 1.97, 1.12
10:31:53 up 77 days, 9:15, 7 users, load average: 4.89, 2.08, 1.16
10:31:57 up 77 days, 9:15, 7 users, load average: 5.30, 2.22, 1.21
10:32:01 up 77 days, 9:15, 7 users, load average: 5.12, 2.23, 1.22
10:32:05 up 77 days, 9:15, 7 users, load average: 5.03, 2.26, 1.24
10:32:13 up 77 days, 9:15, 7 users, load average: 4.62, 2.22, 1.23
... as was the sar output ...
12:16:47 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
12:16:48 PM dev8-0 116.00 0.00 21712.00 187.17 134.04 559.31 8.62 100.00
12:16:48 PM dev254-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:16:48 PM dev254-1 3369.00 0.00 26952.00 8.00 3271.74 481.27 0.30 100.00
12:16:48 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
12:16:49 PM dev8-0 130.00 0.00 17544.00 134.95 143.78 752.89 7.69 100.00
12:16:49 PM dev254-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:16:49 PM dev254-1 1462.00 0.00 11696.00 8.00 2749.12 1407.78 0.68 100.00
12:16:49 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
12:16:50 PM dev8-0 128.00 0.00 18400.00 143.75 143.68 1593.91 7.84 100.40
12:16:50 PM dev254-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:16:50 PM dev254-1 810.00 0.00 6480.00 8.00 1598.99 4911.94 1.24 100.40
During these tests I quickly fired up vim to do a simple write test. Loading vim itself was sluggish too, but I wasn't able to distill that properly from the strace output. Only the pending stat64 call was clearly visible again:
0.000050 stat64("bla", 0xbf98caf4) = -1 ENOENT (No such file or directory)
0.000038 open("bla", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
0.000053 write(4, "fooo\n", 5) = 5
0.000051 fsync(4) = 0
5.385200 stat64("bla", {st_mode=S_IFREG|0664, st_size=5, ...}) = 0
0.000073 close(4) = 0
I'll append further file system tests ASAP.
Update 2 / Solution:
Based on the feedback from Mihai about MPT devices and the RAID configuration, I dug into that topic a bit more. Through lspci I retrieved the RAID controller information:
dev:~$ sudo lspci|grep -i lsi
05:05.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)
Out of the blue I gave http://www.ask.com/web?q=linux+problem+performance+SAS1068 a shot and found a report of a write performance problem with the LSI Logic SAS1068. The bug description further linked to a blog post about a DELL PE860 performance problem, which also uses the SAS1068.
The article mentions the use of LSIUtil, which is available from lsi.com, and includes an extensive walk-through of the lsiutil menus describing how to turn on write caching.
Not surprisingly, it turns out this has a major effect on performance. Before activating the write cache:
dev:~# time (dd if=/dev/zero of=bigfile count=1024 bs=1M; sync)
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 180.902 seconds, 5.9 MB/s
real 3m8.472s
user 0m0.004s
sys 0m3.340s
After enabling the write cache:
dev:~# time (dd if=/dev/zero of=bigfile count=1024 bs=1M; sync)
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 43.7899 seconds, 24.5 MB/s
real 0m46.413s
user 0m0.000s
sys 0m3.124s
This is like day and night, black and white. Sometimes feeling stupid is OK. I probably should have known that the RAID controller ships with write caching disabled by default.
Update 3:
Rather by accident I found a munin plugin called diskstat which gives detailed IO, read/write/wait, etc. graphs for all block devices. There's also a very nice and detailed blog post from the author, titled Graphing Linux Disk I/O statistics with Munin.
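Enabling it follows the usual munin wildcard-plugin convention, roughly as below; the plugin path and the name suffix are assumptions based on that convention, so check the plugin's own documentation for the exact names:
ln -s /usr/share/munin/plugins/diskstat_ /etc/munin/plugins/diskstat_iops_sda
/etc/init.d/munin-node restart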
That number was unbelievable, yet a fact, for me. It seems the stat64 syscall was forced to wait for the dump operation to finish.
That's the clue you're looking for. My bet is that the culprit here is ext3. By default, under ext3, any fsync issued by any application under any user account forces the entire journal to disk. This is especially annoying if you're running databases, which often have to issue fsync() calls. It is even more annoying when running on top of RAID devices which don't support write barriers (I don't know if MPT devices support them, sorry).
If you're reasonably sure about your UPS and server stability, try mounting that file system using the data=writeback mount option instead of the default data=ordered. You will be sacrificing reliability in the face of a server crash, but if your performance gets hit this badly anyway, it's not a big loss.
The long-term solution might be switching to a better file system, though. Currently I would recommend XFS. I've used it for years without major problems. The only drawback is that it can't be shrunk the way ext2/3 can be shrunk, though it can be expanded.
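A sketch of what the corresponding /etc/fstab entry could look like; the device and mount point are placeholders and need to match your LVM layout. Note that ext3 usually refuses to switch the data journalling mode on a simple remount, so plan for an unmount/mount (or, for the root file system, passing rootflags=data=writeback on the kernel command line):
/dev/mapper/vg00-var   /var   ext3   defaults,data=writeback   0   2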
- Try doing the mysql dump to /dev/null. That will show you whether the writing or the reading part is responsible.
- Try writing the dump to a separate logical volume/file system or to a RAM disk.
- Check the journalling options on the file system.
- Check the noatime/atime status on the file system.
- Is the file system filling up? Typically performance decreases when the amount of data on it reaches a certain threshold.
(A few example commands for these checks are sketched below.)
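A rough sketch of how those checks could look; the device and mount-point names below are placeholders, not taken from the post:
mysqldump -u root -p database >/dev/null           # exercises only the reading side
mount -t tmpfs -o size=1g tmpfs /mnt/ramdump       # gives a RAM-backed dump target
cat /proc/mounts                                   # shows atime/noatime and any non-default data= mode
tune2fs -l /dev/mapper/vg00-var | grep -i journal  # journal-related features of the ext3 fs
df -h                                              # how full each file system is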
Some good suggestions here from other posters about ruling out the software, and tweaking your RAID performance. It's worth mentioning that if your workload is write-heavy, then going for hardware with a battery-backed write cache is likely to be the right thing to do if you're considering replacing your kit.
It smells like you're simply running out of I/O capacity. First check to make sure that one of your disks isn't dying (sector remapping can do bad things to I/O), and run sar -d 1 0 and look at the %util numbers for the physical disks. If that's up around 100%, then I'd just throw more spindles at it, putting them in a RAID-10. You won't get double the performance, but it'll give you some more oomph. 250GB disks aren't exactly mind-bendingly expensive these days.
Oh, and to pimp my work some more, and to put a wall of text into battle against your wall of text: a guide I wrote on tracking down performance problems, http://www.anchor.com.au/hosting/development/HuntingThePerformanceWumpus, covers everything you should need to know about tracking down performance problems.
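For the "disk isn't dying" check, something like smartctl is the usual tool; note that the SAS1068 presents only the RAID volume to the OS, so smartctl may need to be pointed at a different device node or given a -d option to reach the physical disks behind it:
smartctl -a /dev/sda | grep -i -e reallocated -e pending -e uncorrectable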