I have a server running CentOS 6 with two Crucial M500 SSDs in an mdadm RAID1 array. The server is virtualized with Xen.
Recently, I started seeing iowait percentages creep up in the top -c stats of our production VM. I decided to investigate and ran iostat on the dom0 so I could inspect activity on the physical disks (e.g., /dev/sda and /dev/sdb). This is the command I used: iostat -d -x 3 3
Here's an example of the output I received (scroll to the right for the %util numbers):
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.33 0.00 38.67 0.00 337.33 8.72 0.09 2.22 0.00 2.22 1.90 7.33
sdb 0.00 0.33 0.00 38.67 0.00 338.00 8.74 1.08 27.27 0.00 27.27 23.96 92.63
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 1.00 0.00 8.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 0.00 29.33 0.00 312.00 10.64 0.00 0.00 0.00 0.00 0.00 0.00
drbd5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
drbd3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
drbd4 0.00 0.00 0.00 8.67 0.00 77.33 8.92 2.03 230.96 0.00 230.96 26.12 22.63
dm-0 0.00 0.00 0.00 29.67 0.00 317.33 10.70 5.11 171.56 0.00 171.56 23.91 70.93
dm-1 0.00 0.00 0.00 8.67 0.00 77.33 8.92 2.03 230.96 0.00 230.96 26.12 22.63
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 20.00 0.00 240.00 12.00 3.03 151.55 0.00 151.55 31.33 62.67
dm-7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
To my alarm, I noticed a significant difference between /dev/sda and /dev/sdb in await (2ms vs 27ms) and %util (7% vs 92%). These drives are mirrors of one another and are the same model of Crucial M500 SSD, so I don't understand how this could be. There is no activity on /dev/sda that should not also occur on /dev/sdb.
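To rule out an obvious cause such as a resync or a dropped mirror member, the RAID state can be checked from the dom0 with something like the following (the md device name is taken from the iostat output above; repeat for each array):

cat /proc/mdstat              # every array should show [UU], i.e. both members active and in sync
mdadm --detail /dev/md127     # look for "State : clean" and "Failed Devices : 0"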
I've been regularly checking the SMART values for both of these disks, and I've noticed that Percent_Lifetime_Used for /dev/sda indicates 66% used while /dev/sdb reports a nonsensical value (454% used). I hadn't been too concerned up until this point because the Reallocated_Event_Count has remained relatively low for both drives and hasn't changed quickly.
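Those attributes can be read with smartctl from the smartmontools package, for example:

smartctl -A /dev/sda | egrep 'Percent_Lifetime_Used|Reallocated_Event_Count'
smartctl -A /dev/sdb | egrep 'Percent_Lifetime_Used|Reallocated_Event_Count'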
Could there be a hardware issue with our /dev/sdb disk? Are there any other possible explanations?
I eventually discovered that this system was not being TRIMmed properly and was also partitioned with insufficient overprovisioning (even though the Crucial M500 has 7% second-level overprovisioning built in). The combination of the two led to a severe case of write amplification.
Furthermore, this system houses a database with very high write activity, which produces a very large number of small random writes. That kind of I/O pattern fares especially poorly once write amplification sets in.
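If the stack passes discards all the way down to the SSDs (not a given with the md + LVM + DRBD layers here, and stock CentOS 6 may not ship fstrim at all), periodic TRIM can be scheduled with a sketch like this; the script path and mount points are only examples:

#!/bin/sh
# /etc/cron.weekly/fstrim (example path) -- verify first that every layer
# (filesystem, LVM, md, DRBD) actually supports and forwards discards
fstrim -v /      # prints how much space was trimmed on the root filesystem
fstrim -v /var   # repeat for each filesystem with heavy write activity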
I'm still not 100% certain why /dev/sda was performing better than /dev/sdb in iostat -- perhaps it was something akin to the silicon lottery, where /dev/sda was marginally superior to /dev/sdb, so /dev/sdb bottlenecked first.
For us, the two major takeaways are: