First a quick overview of the environment:
NetBackup running on Windows Servers (6.5.4 if you care) with LTO3 drives.
The backup target used to be a Solaris 9 server, on Sun hardware, with Veritas Volume Manager.
It was rebuilt as a RHEL5 box using LVM to manage the volumes, now on a Xiotech SAN, with a large number of volumes.
The nature of the data and the application the box runs (Optix) is such that it used to write to a volume until it reached a certain size, and then that volume was locked forever more. Hence we have /u01 /u02 /u03 ... /u50. A while back (still on the Solaris build) we expanded and opened those volumes back up for writing, so on any given day any or all of them might change. Backup throughput used to average 40MB/sec.
In the new Linux build we're averaging something closer to 8MB/sec. Given that there is 2.1TB of data here, that's sort of wildly unacceptable; even running 4 streams it is taking 48+ hours to complete. I/O on the server is pegged. I am pretty sure it's not the SAN, because other clients using the same class of storage and similar server hardware are backing up at a pokey but tolerable 20MB/sec.
I'm looking for ideas on improving throughput. The Solaris guys in the office next door are blaming LVM on Linux. Nobody thinks it's the backup environment, because that's still performing as expected everywhere else. The admin of the now very slow box says "I don't know, it's not me, the users say it's fine." Which is probably true, because it's a document management system and they're reading and writing small files.
Troubleshooting ideas? Has anybody seen LVM trash backup or other I/O performance? Especially given a largeish number of volumes holding a very large number (10 million, maybe) of small files?
Edited to correct units.
Edited to add:
NIC is at 1000/Full (as checked from both the OS and the switch; quick check shown below).
Filesystem is EXT3.
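For reference, the OS-side check was something along these lines (eth0 is just a guess at the interface name here):

    # confirm negotiated speed/duplex from the OS side
    ethtool eth0 | grep -iE 'speed|duplex'
    # and check for interface errors while a backup stream is running
    ifconfig eth0 | grep -i errors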
More new information....
The performance hit appears to be happening on several boxes running LVM and EXT3. Basically all the new RHEL5 boxes we built this summer.
Have you used sar or iostat to monitor disk performance during the backup, to see what Linux itself thinks the disks are doing?
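For example (just a sketch; both tools come from the sysstat package), run one of these while a backup stream is going and watch %util, await, and the per-device throughput:

    # extended per-device stats every 5 seconds
    iostat -x 5
    # or the sar equivalent
    sar -d 5 100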
What about using some sort of benchmark utility to test raw read performance of files on the system? I just came up with this, so it's probably a terrible way to do it, and it really only covers sequential reading, but:
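Something along these lines, maybe (the path is just a stand-in for one of your data volumes, and you'd want to drop the page cache first so you're timing the disk rather than RAM):

    # flush cached data so the read test actually hits the disk (needs root)
    sync; echo 3 > /proc/sys/vm/drop_caches
    # time a sequential read of every file on one volume
    time find /u01 -type f -print0 | xargs -0 cat > /dev/null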
Basically, if you use a benchmarking utility to duplicate reading all of those small files, you can verify whether it is the disk, and go from there.
The following is no longer relevant:
If by 20 kb/s you mean kilobits, then unless I'm messing this up because it is too early in the morning, your numbers don't add up. You said you have 2.1 terabytes at 20 kb/s:
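Roughly, taking 1 TB as 10^12 bytes:

    2.1 TB = 2.1 × 10^12 bytes ≈ 1.68 × 10^13 bits
    1.68 × 10^13 bits ÷ 20,000 bits/sec ≈ 8.4 × 10^8 seconds ≈ 26 years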
Even if it were just 1 terabyte:
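    1 TB = 8 × 10^12 bits
    8 × 10^12 bits ÷ 20,000 bits/sec ≈ 4 × 10^8 seconds, or roughly 12.7 years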
Or if you meant kilobytes:
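    2.1 × 10^12 bytes ÷ 20,000 bytes/sec ≈ 1.05 × 10^8 seconds, or roughly 3.3 years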
Am I messing up these calculations? (I will have to delete my post in shame if I am :-) )
The problem turns out to have been a NetBackup client version problem, more than it was a Linux/LVM problem. When the box got rebuilt as a Linux box the 6.5 client was installed. Today, in response to another issue, we upgraded the client version to 6.5.4. I am back to pulling data off the box at 25-27MB/sec.
How I could have forgotten the number one rule of NetBackup, or probably any backup software, MAKE SURE YOUR CLIENT AND SERVER VERSIONS MATCH if you are having a problem, I don't know. Maybe I need a tattoo.
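For what it's worth, a quick way to sanity-check this in the future (paths are the default Unix client install location; adjust if yours differs):

    # on the Linux client: show the installed NetBackup client version
    cat /usr/openv/netbackup/bin/version
    # from the master (Windows here), bpgetconfig in the admincmd directory
    # reports the version and platform the master sees for a given client:
    #   bpgetconfig -g <client_name> -L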
What file system are you using on the LVM volume(s)?
And how are the 10 million small files stored: all in one directory (or a small number of directories), or spread across many directories and subdirectories? ("many" being an arbitrarily large number.)
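If you don't already know, something quick and dirty like this will give you a rough picture (the path is a placeholder, and it will take a while over 10 million files):

    # count entries per directory and show the ten fattest directories
    find /u01 -xdev -type d | while read d; do
        printf '%d %s\n' "$(ls -f "$d" | wc -l)" "$d"
    done | sort -rn | head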
The reason I ask is that some file systems have severe performance problems when you have thousands of files in a single directory. This is one possible cause of your slowdown.
For example, ext2 or ext3 without the dir_index feature turned on (IIRC, dir_index has been the default on ext3 for several years now; it helps a lot but it doesn't eliminate the problem entirely).
You can use tune2fs to query and/or set the dir_index feature for ext3, e.g. to query:
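Something like this, substituting whichever logical volume the filesystem actually lives on (the device path here is made up):

    tune2fs -l /dev/VolGroup00/u01_lv | grep -i features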
If you don't see dir_index in that list, then you need to turn it on, like so:
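Again, with a made-up device path:

    tune2fs -O dir_index /dev/VolGroup00/u01_lv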
(Yes, tune2fs only responds here by printing its version number; it doesn't bother telling you whether the operation succeeded or failed. Not good, but presumably it would print an error if it failed.)
Finally: if this does turn out to be the problem, and enabling dir_index doesn't help, then you probably need to consider using a different filesystem. XFS is a good general-purpose filesystem, and AFAIK ext4 doesn't have this problem. Either would be a reasonable choice for a replacement fs (although ext4 is quite new, and even though many people are using it without problems, I'm not sure I'd trust it on production servers just yet).
LVM itself really shouldn't be impacting this. To my knowledge the LVM bits are not referenced on every metadata operation, which is where a delay would come into play; it's at a different layer in the kernel. LVM would affect mount/unmount more than it would affect file open/close.
Far more likely is what Craig pointed out: large directories impairing performance. Linux is somewhat notorious for not handling the large-directory problem well. VxFS can handle up to 100K files/directory quickly, where ext2/ext3/reiserfs generally start slowing down well before then. This is one area where a poor choice of filesystem for the migration target can seriously impair your backup performance.
That said, if this is your problem, just plain old access into and out of those directories should also be impaired. It may be the difference between 80ms and 210ms to open a file, which is barely perceptible to end users, but it should be there.
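One crude way to look for that from the shell, outside of the backup window (the directory path is a placeholder for one of the big directories):

    # time a metadata-heavy pass over one of the big directories
    time ls -f /u01/docs | wc -l
    time find /u01/docs -type f -print0 | xargs -0 stat > /dev/null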