Problem: I recently revamped one of my servers. It was tested prior to use and functions well; however, a few days ago I noticed approximately 4 times the usual volume of writes to the root volume. This is not a performance issue - the server runs fine.
My revamping was fairly extensive (a full rebuild), so I don't have a lot to go on in terms of cause. Briefly, my changes included:
- Upgrading Amazon's Linux (from 2011.02 to 2011.09) - which also resulted in a change from ext3 to ext4 for the root volume
- Moving from php-fcgi to php-fpm (currently using tcp)
- Moving from a reverse-proxy setup (nginx -> apache) to nginx only
- Replacing vsftpd with pure-ftpd
- Replacing dkim-proxy with opendkim
- Replacing webmin with ispconfig
- Adding varnish as a caching layer for dynamic files (overkill for the volume of hits these sites get, but it's an experiment)
- Adding a swap partition
Basic Setup:
- My swap space is mounted on its own EBS volume - the writes to the swap volume are negligible - I have essentially discounted this as the cause (there is ample free memory, and both free and iostat show minimal swap usage).
- My data - mysql database, user files (websites), all logs (from /var/log), mail, and varnish files - is on its own EBS volume (using mount --bind). The underlying EBS volume is mounted at /mnt/data.
- My remaining files - operating system and core server applications (e.g. nginx, postfix, dovecot, etc.) - are the only thing on the root volume - a total of 1.2GB.
The new setup runs 'smoother' (faster, less memory, etc.) than the old system, and has been stable for 20 days (mid-October) - as far as I can tell, the elevated writes have existed for all this time.
Contrary to what I would expect, I have a low read volume (my reads are about 1.5% of my writes, both in terms of blocks and bytes on my root volume). I have not changed anything on the root volume (e.g. new installations, etc) in the past few days, yet write volume continues to be much higher than expected.
Objective: determine the cause of the increased writes to the root volume - essentially, figure out whether it is a specific process (and which one), the different file system (ext4), or another issue (e.g. memory).
System Information:
- Platform: Amazon's EC2 (t1.micro)
- O/S: Amazon's Linux 2011.09 (CentOS/RHEL derived)
- Linux kernel: 2.6.35.14-97.44.amzn1.i686
- Architecture: 32-bit/i686
- Disks: 3 EBS volumes:
- xvdap1, root, ext4 filesystem (mounted with noatime)
- xvdf, data, xfs filesystem (mounted with noatime,usrquota,grpquota)
- xvdg, swap
The root and data volumes are snapshotted once a day - however, this should be a 'read' operation, not a write one. (Additionally, the same practice was used on the previous server - and the previous server was also a t1.micro.)
The data that caused me to look into the I/O was in the details of my last AWS bill (which showed above-normal I/O - not unexpected, since I was setting up this server and installing a lot of things at the start of the month) and, subsequently, in the CloudWatch metrics for the attached EBS volumes. I arrived at the '4 times normal' figure by extrapolating the I/O activity from November (when I hadn't altered the server) to estimate a monthly value and comparing that with the I/O from past months when I was not working on my previous server. (I do not have exact iostat data from my previous server.) The same quantity of writes has persisted through November: 170-330MB/hr.
Diagnostic information (uptime for the following outputs is 20.6 days):
Cloudwatch metrics:
- root volume (write): 5.5GB/day
- root volume (read): 60MB/day
- data volume (write): 400MB/day
- data volume (read): 85MB/day
- swap volume (write): 3MB/day
- swap volume (read): 2MB/day
Output of: df -h
(for root volume only)
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 4.0G 1.2G 2.8G 31% /
The used space has not increased noticeably since this system was launched (which to me suggests that files are being updated, not created/appended).
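(As a rough cross-check of that theory, a find along these lines should list everything on the root volume modified in the last day - the -xdev flag stops it from descending into the bind-mounted data volume and the virtual filesystems:)
find / -xdev -type f -mmin -1440 2>/dev/null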
Output of: iostat -x (with Blk_read, Blk_wrtn added in):
Linux 2.6.35.14-95.38.amzn1.i686 11/05/2011 _i686_
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s Blk_read Blk_wrtn avgrq-sz avgqu-sz await svctm %util
xvdap1 0.00 3.42 0.03 2.85 0.72 50.19 2534636 177222312 17.68 0.18 60.93 0.77 0.22
xvdf 0.00 0.03 0.04 0.35 1.09 8.48 3853710 29942167 24.55 0.01 24.28 2.95 0.12
xvdg 0.00 0.00 0.00 0.00 0.02 0.04 70808 138160 31.09 0.00 48.98 4.45 0.00
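(For reference, Blk_wrtn is measured in 512-byte sectors, so the iostat total is roughly consistent with the CloudWatch figure over the 20.6 day uptime:)
177222312 sectors x 512 bytes = ~90.7 GB written to xvdap1, or ~4.4 GB/day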
Output of: iotop -d 600 -a -o -b
Total DISK READ: 6.55 K/s | Total DISK WRITE: 117.07 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
852 be/4 root 0.00 B 26.04 M 0.00 % 0.42 % [flush-202:1]
539 be/3 root 0.00 B 528.00 K 0.00 % 0.08 % [jbd2/xvda1-8]
24881 be/4 nginx 56.00 K 120.00 K 0.00 % 0.01 % nginx: worker process
19754 be/4 mysql 180.00 K 24.00 K 0.00 % 0.01 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
3106 be/4 mysql 0.00 B 176.00 K 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
19751 be/4 mysql 4.00 K 0.00 B 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
3194 be/4 mysql 8.00 K 40.00 K 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
3156 be/4 mysql 4.00 K 12.00 K 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
3099 be/4 mysql 0.00 B 4.00 K 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
24216 be/4 web14 8.00 K 10.43 M 0.00 % 0.00 % php-fpm: pool web14
24465 be/4 web19 0.00 B 7.08 M 0.00 % 0.00 % php-fpm: pool web19
3110 be/4 mysql 0.00 B 100.00 K 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
579 be/4 varnish 0.00 B 76.00 K 0.00 % 0.00 % varnishd -P /var/run/varnish.pid -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 3600 -w 1,1000,120 -u varnish -g varnish
582 be/4 varnish 0.00 B 144.00 K 0.00 % 0.00 % varnishd -P /var/run/varnish.pid -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 3600 -w 1,1000,120 -u varnish -g varnish
586 be/4 varnish 0.00 B 4.00 K 0.00 % 0.00 % varnishd -P /var/run/varnish.pid -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 3600 -w 1,1000,120 -u varnish -g varnish
587 be/4 varnish 0.00 B 40.00 K 0.00 % 0.00 % varnishd -P /var/run/varnish.pid -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 3600 -w 1,1000,120 -u varnish -g varnish
1648 be/4 nobody 0.00 B 8.00 K 0.00 % 0.00 % in.imapproxyd
18072 be/4 varnish 128.00 K 128.00 K 0.00 % 0.00 % varnishd -P /var/run/varnish.pid -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 3600 -w 1,1000,120 -u varnish -g varnish
3101 be/4 mysql 0.00 B 176.00 K 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
19749 be/4 mysql 0.00 B 32.00 K 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
19750 be/4 mysql 0.00 B 0.00 B 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
19752 be/4 mysql 0.00 B 108.00 K 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
19788 be/4 mysql 0.00 B 12.00 K 0.00 % 0.00 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
853 be/4 root 4.00 K 0.00 B 0.00 % 0.00 % [flush-202:80]
22011 be/4 varnish 0.00 B 188.00 K 0.00 % 0.00 % varnishd -P /var/run/varnish.pid -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 3600 -w 1,1000,120 -u varnish -g varnish
To summarize the above (and extrapolate to daily values) it looks like, over the 10 minute period:
- [flush-202] wrote 26MB = 3.6GB/day
- php-fpm wrote 17.5MB = 2.4GB/day
- MySQL wrote 684KB = 96MB/day
- Varnish wrote 580KB = 82MB/day
- [jbd2] wrote 528KB = 74MB/day
- Nginx wrote 120KB = 17MB/day
- IMAP Proxy wrote 8KB = 1.1MB/day
Interestingly enough, it would appear that between [flush-202] and php-fpm it is possible to account for the daily volume of writes.
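(That is, taking just the two largest writers: 3.6 GB/day from [flush-202] plus 2.4 GB/day from php-fpm gives roughly 6 GB/day, which is in the same range as the ~5.5 GB/day that CloudWatch reports for the root volume.)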
Using ftop, I am unable to track down either the flush or php-fpm writes (e.g. ftop -p php-fpm).
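(One alternative I may try: since iotop works on this kernel, per-process I/O accounting is available, so /proc/<pid>/io should show the cumulative write_bytes for each php-fpm worker directly - the pgrep pattern below is just illustrative:)
for pid in $(pgrep -f 'php-fpm: pool'); do printf '%s ' $pid; grep '^write_bytes' /proc/$pid/io; done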
At least part of my problem stems from identifying which processes are writing to the root volume. Of those listed above, I would expect all of them to be writing to the data volume, since the relevant directories are symlinked there (e.g. the nginx, mysql, php-fpm, and varnish directories all point to a different EBS volume).
I believe jbd2 is the journaling block device for ext4, and flush-202 is the background flushing of dirty pages. The dirty_ratio is 20 and the dirty_background_ratio is 10. Dirty memory (from /proc/meminfo) was typically between 50-150kB. Page size (getconf PAGESIZE) is the system default (4096).
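(For anyone wanting to reproduce those checks, the usual interfaces are along these lines:)
sysctl vm.dirty_ratio vm.dirty_background_ratio
grep -E '^(Dirty|Writeback):' /proc/meminfo
getconf PAGESIZE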
Output of: vmstat -s | grep paged
3248858 pages paged in
104625313 pages paged out
Output of: sar -B | grep Average
pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
Average: 1.38 39.57 113.79 0.03 36.03 0.38 0.02 0.29 73.66
The above appears to suggest a high number of pages paged out - however, I would expect that pages would be written to my swap partition if necessary, not to my root volume. Of the total memory, the system typically has 35% in use, 10% in buffers, 40% cached, and 15% unused (i.e. 65% free).
Output of: vmstat -d
disk- ------------reads------------ ------------writes----------- -----IO------
total merged sectors ms total merged sectors ms cur sec
xvda1 105376 14592 2548092 824418 10193989 12264020 179666824 626582671 0 7872
xvdf 126457 579 3882950 871785 1260822 91395 30081792 32634413 0 4101
xvdg 4827 4048 71000 21358 1897 15373 138160 307865 0 29
vmstat consistently displays si and so values of 0.
Output of: swapon -s
Filename Type Size Used Priority
/dev/xvdg partition 1048572 9252 -1
On the hunch that the I/O writes might be memory related, I disabled varnish and restarted the server. This changed my memory profile to 10% in use, 2% in buffers, 20% cached, and 68% unused (i.e. 90% free). However, over a 10 minute run, iotop gave similar results to before:
- [flush-202] wrote 19MB
- php-fpm wrote 10MB
In the hour since the restart, 330MB have already been written to the root volume, with 370K pages paged out.
Output of: inotifywatch -v -e modify -t 600 -r /[^mnt]*
Establishing watches...
Setting up watch(es) on /bin /boot /cgroup /dev /etc /home /lib /local /lost+found /opt /proc /root /sbin /selinux /src /sys /usr /var
OK, /bin /boot /cgroup /dev /etc /home /lib /local /lost+found /opt /proc /root /sbin /selinux /src /sys /usr /var is now being watched.
Total of 6753 watches.
Finished establishing watches, now collecting statistics.
Will listen for events for 600 seconds.
total modify filename
23 23 /var/log/
20 20 /usr/local/ispconfig/server/temp/
18 18 /dev/
15 15 /var/log/sa/
11 11 /var/spool/postfix/public/
5 5 /var/log/nginx/
2 2 /var/run/pure-ftpd/
1 1 /dev/pts/
Looking a bit further into the above, almost all of the writes can be attributed to an (unknown) process that runs every 5 minutes and checks the status of a variety of services (like chkservd on cPanel - but I don't use cPanel and didn't install this). It amounts to 4 log files (cron, maillog, ftp, imapproxy) being updated during the 10 minutes, plus a few related items (postfix sockets, pure-ftpd connections). The remaining items are primarily modified ispconfig sessions, system accounting updates, and invalid (non-existent server_name) web access attempts (logged to /var/log/nginx).
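(To track down the 5-minute process, my plan is to go through the cron entries - ISPConfig, as far as I know, installs its own monitoring jobs, so that is my first suspect:)
cat /etc/crontab; ls /etc/cron.d/
for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u $u 2>/dev/null; done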
Conclusions and Questions:
Let me start by saying I am a bit perplexed - I am usually fairly thorough, but I feel that I am missing something obvious on this one. Clearly, flush and php-fpm account for the bulk of the writes; however, I don't know why this might be the case. Firstly, let's take php-fpm - it shouldn't even be writing to the root volume. Its directories (both files and logs) are symlinked to another EBS volume. Moreover, the primary things that php-fpm should be writing are sessions and page caches - which are both few and small - certainly not on the order of 1MB/min (more like 1KB/min, if that). Most of the sites are read-only, with only occasional updates. The total size of all web files modified in the last day is 2.6MB.
Secondly, considering flush - the significant writes from it suggest to me that dirty pages are frequently being flushed to disk - but given that I typically have 65% free memory and a separate EBS volume for swap space, I can't explain why this would affect the writes on my root volume, especially to the extent that is occurring. I realize that some processes will write dirty pages to their own swap space (instead of using system swap space), but surely, immediately following a restart with the vast majority of my memory being free, I should not be running into any substantial amount of dirty pages. If you do believe this to be the cause, please let me know how I might identify which processes are writing to their own swap space.
It is entirely possible that the whole dirty pages idea is simply a red herring and is completely unrelated to my problem (I hope it is, actually). If that is the case, my only other idea is that there is something relating to ext4 journaling that wasn't present in ext3. Beyond that, I am currently out of ideas.
Update(s):
Nov 6, 2011:
Set dirty_ratio = 10 and dirty_background_ratio = 5; updated with sysctl -p (confirmed via /proc); reran the 10 minute iotop test with similar results (flush wrote 17MB, php-fpm wrote 16MB, MySQL wrote 1MB, and JBD2 wrote 0.7MB).
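(i.e. something along these lines in /etc/sysctl.conf, then applied with sysctl -p:)
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5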
I have changed all the symlinks I set up to use mount --bind instead. Re-enabled varnish and restarted the server; reran the 10 minute iotop test with similar results (flush wrote 12.5MB, php-fpm wrote 11.5MB, Varnish wrote 0.5MB, JBD2 wrote 0.5MB, and MySQL wrote 0.3MB).
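(The bind mounts themselves are nothing special - in /etc/fstab they look something like the following, with the source paths under /mnt/data being illustrative:)
/mnt/data/mysql   /var/lib/mysql   none   bind   0 0
/mnt/data/log     /var/log         none   bind   0 0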
As of the above run, my memory profile was 20% in use, 2% in buffers, 58% cached, and 20% unused (i.e. 80% free). Just in case my interpretation of free memory in this context is flawed, here is the output of free -m (this is a t1.micro).
total used free shared buffers cached
Mem: 602 478 124 0 14 347
-/+ buffers/cache: 116 486
Swap: 1023 0 1023
Some additional information:
Output of: dmesg | grep EXT4
[ 0.517070] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
[ 0.531043] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
[ 2.469810] EXT4-fs (xvda1): re-mounted. Opts: (null)
I also ran ftop and iotop simultaneously, and was surprised to notice that entries showing up in iotop did not appear in ftop. The ftop list was filtered to php-fpm, since I could trigger writes from that process fairly reliably. I noted about 2MB of writes per page view for php-fpm - and I have yet to figure out what it could possibly be writing - any ideas on ascertaining what is being written would be appreciated.
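(One thing I intend to try in order to see what php-fpm is actually writing: attach strace to a worker while requesting a page - the pool name below is just one of mine from the iotop output:)
strace -f -e trace=open,write -s 80 -p $(pgrep -f 'php-fpm: pool web14' | head -1)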
I will try to turn off journaling in the next few days and see if that improves things. For the moment, though, I find myself wondering if I have an I/O problem or a memory problem (or both) - but I am having a hard time seeing the memory problem, if there is one.
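(If I do go the journaling route, my understanding is that the journal can only be removed from an unmounted filesystem, so it would have to be done with the volume detached and attached to another instance - roughly:)
tune2fs -O ^has_journal /dev/xvdX   (xvdX being wherever the detached root volume shows up)
e2fsck -f /dev/xvdX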
Nov 13, 2011:
As the file system uses extents, it was not possible to mount it as ext3; additionally, attempts to mount it read-only simply resulted in it being remounted read-write.
The file-system does indeed have journaling enabled (128MB journal), as is evident from the following:
Output of: tune2fs -l /dev/sda1 | grep features
has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
As per the following, about 140GB has been written to this volume in a bit under a month - just about 5GB/day.
Output of: dumpe2fs -h /dev/sda1
Filesystem volume name: /
Last mounted on: /
Filesystem UUID: af5a3469-6c36-4491-87b1-xxxxxxxxxxxx
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 262144
Block count: 1048576
Reserved block count: 10478
Free blocks: 734563
Free inodes: 210677
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 511
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
RAID stride: 32582
Flex block group size: 16
Filesystem created: Wed Sep 21 21:28:43 2011
Last mount time: Sun Nov 13 16:10:11 2011
Last write time: Sun Oct 16 16:12:35 2011
Mount count: 13
Maximum mount count: 28
Last checked: Mon Oct 10 03:04:13 2011
Check interval: 0 (<none>)
Lifetime writes: 139 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 18610
Default directory hash: half_md4
Directory Hash Seed: 6c36b2cc-b230-45e2-847e-xxxxxxxxxxx
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 128M
Journal length: 32768
Journal sequence: 0x0002d91c
Journal start: 1
Continuing to look for open files, I tried using fuser on the root volume:
Output of: fuser -vm / 2>&1 | awk '$3 ~ /f|F/'
root 1111 Frce. dhclient
root 1322 frce. mysqld_safe
mysql 1486 Fr.e. mysqld
root 1508 Frce. dovecot
root 1589 Frce. master
postfix 1600 Frce. qmgr
root 1616 Frce. crond
root 1626 Frce. atd
nobody 1648 Frce. in.imapproxyd
postfix 1935 Frce. tlsmgr
root 2808 Frce. varnishncsa
root 25818 frce. sudo
root 26346 Fr.e. varnishd
postfix 26925 Frce. pickup
postfix 28057 Frce. smtpd
postfix 28070 Frce. showq
Nothing unexpected, unfortunately. On the off-chance that it was due to the underlying hardware, I restored yesterday's snapshot of the root volume (nothing had changed in the last day) and replaced the instance's root volume with the new one. As expected, this had no effect on the problem.
My next step would have been to remove journalling; however, I stumbled across the solution before getting to that.
The problem lay in APC using file-backed mmap. Fixing this dropped disk I/O by about 35x - to (an estimated) 150MB/day instead of 5GB/day. I might still consider removing journalling, as it appears to be the major remaining contributor to this value; however, this number is quite acceptable for the time being. The steps taken to arrive at the APC conclusion are posted in an answer, below.
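(For context - the full steps are in the answer below - APC's shared memory can be file-backed or anonymous depending on apc.mmap_file_mask: a mktemp-style mask such as /tmp/apc.XXXXXX produces a file-backed mmap, while leaving the setting undefined or pointing it at /dev/zero gives an anonymous mmap. A setting along these lines, in whatever file holds the APC configuration (e.g. /etc/php.d/apc.ini), avoids the file-backed variant:)
apc.mmap_file_mask=/dev/zero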