First a quick overview of the environment:
NetBackup running on Windows Servers (6.5.4 if you care) with LTO3 drives.
The backup target used to be a Solaris 9 server, on Sun hardware, with Veritas Volume Manager.
It was rebuilt as a RHEL5 box using LVM to manage the volumes, now on a Xiotech SAN, with a large number of volumes.
The nature of the data and the application the box runs (Optix) is such that it used to write to a volume until it reached a certain size, and then that volume was locked forever more. Hence we have /u01 /u02 /u03 ... /u50. A while back (still on the Solaris build) we expanded and opened those volumes back up for writing, so on any given day any or all of them might change. Backup throughput used to average 40MB/sec.
In the new Linux build we're averaging something closer to 8MB/sec. Given that there is 2.1TB of data here, that's sort of wildly unacceptable; even running 4 streams it is taking 48+ hours to complete. I/O on the server is pegged. I am pretty sure it's not the SAN, because other clients using the same class of storage and similar server hardware are backing up at a pokey but tolerable 20MB/sec.
I'm looking for ideas on improving throughput. The Solaris guys in the office next door are blaming LVM on Linux. Nobody thinks it's the backup environment, because that's still performing as expected everywhere else. The admin of the now very slow box says "I don't know, it's not me, the users say it's fine." Which is probably true, because it's a document management system and they're reading and writing small files.
Troubleshooting ideas? Has anybody seen LVM trash backup or other I/O performance? Especially given a largeish number of volumes holding a very large number (10 million, maybe) of small files?
Edited to correct units.
Edited to add:
NIC is at 1000/Full (as checked from both the OS and the switch; quick check shown below).
Filesystem is EXT3.
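For reference, the OS-side check was something along these lines (eth0 is just a guess at the interface name here):

    # confirm negotiated speed/duplex from the OS side
    ethtool eth0 | grep -iE 'speed|duplex'
    # and check for interface errors while a backup stream is running
    ifconfig eth0 | grep -i errors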
More new information....
The performance hit appears to be happening on several boxes running LVM and EXT3. Basically all the new RHEL5 boxes we built this summer.
Have you used sar or iostat to monitor disk performance during the backup, to see what Linux itself thinks the disks are doing?
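For example (just a sketch; both tools come from the sysstat package), run one of these while a backup stream is going and watch %util, await, and the per-device throughput:

    # extended per-device stats every 5 seconds
    iostat -x 5
    # or the sar equivalent
    sar -d 5 100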
What about using some sort of benchmark utility to test raw read performance of files on the system? I just came up with this, so it's probably a terrible way to do it, and it really only covers sequential reading, but:
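Something along these lines, maybe (the path is just a stand-in for one of your data volumes, and you'd want to drop the page cache first so you're timing the disk rather than RAM):

    # flush cached data so the read test actually hits the disk (needs root)
    sync; echo 3 > /proc/sys/vm/drop_caches
    # time a sequential read of every file on one volume
    time find /u01 -type f -print0 | xargs -0 cat > /dev/null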
Basically, if you use a benchmarking utility to duplicate reading all of those small files, you can verify whether it is the disk, and go from there.
The following is no longer relevant:
If by 20 kb/s you mean kilobits, then unless I'm messing this up because it is too early in the morning, your numbers don't add up. You said you have 2.1 terabytes at 20 kb/s:
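Roughly, taking 1 TB as 10^12 bytes:

    2.1 TB = 2.1 × 10^12 bytes ≈ 1.68 × 10^13 bits
    1.68 × 10^13 bits ÷ 20,000 bits/sec ≈ 8.4 × 10^8 seconds ≈ 26 years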
Even if it were just 1 terabyte:
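    1 TB = 8 × 10^12 bits
    8 × 10^12 bits ÷ 20,000 bits/sec ≈ 4 × 10^8 seconds, or roughly 12.7 years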
Or if you meant kilobytes:
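    2.1 × 10^12 bytes ÷ 20,000 bytes/sec ≈ 1.05 × 10^8 seconds, or roughly 3.3 years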
Am I messing up these calculations? (I will have to delete my post in shame if I am :-) )
The problem turns out to have been a NetBackup client version problem, more than it was a Linux/LVM problem. When the box got rebuilt as a Linux box the 6.5 client was installed. Today, in response to another issue, we upgraded the client version to 6.5.4. I am back to pulling data off the box at 25-27MB/sec.
How I could have forgotten the number one rule of NetBackup, or probably any backup software, MAKE SURE YOUR CLIENT AND SERVER VERSIONS MATCH if you are having a problem, I don't know. Maybe I need a tattoo.
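For what it's worth, a quick way to sanity-check this in the future (paths are the default Unix client install location; adjust if yours differs):

    # on the Linux client: show the installed NetBackup client version
    cat /usr/openv/netbackup/bin/version
    # from the master (Windows here), bpgetconfig in the admincmd directory
    # reports the version and platform the master sees for a given client:
    #   bpgetconfig -g <client_name> -L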
What file system are you using on the LVM volume(s)?
And how are the 10 million small files stored: all in one directory (or a small number of directories), or spread across many directories and subdirectories? ("many" being an arbitrarily large number.)
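If you don't already know, something quick and dirty like this will give you a rough picture (the path is a placeholder, and it will take a while over 10 million files):

    # count entries per directory and show the ten fattest directories
    find /u01 -xdev -type d | while read d; do
        printf '%d %s\n' "$(ls -f "$d" | wc -l)" "$d"
    done | sort -rn | head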
The reason I ask is that some file systems have severe performance problems when you have thousands of files in a single directory. This is one possible cause of your slowdown.
For example, ext2 or ext3 without the dir_index feature turned on (IIRC, dir_index has been the default on ext3 for several years now; it helps a lot but it doesn't eliminate the problem entirely).
You can use tune2fs to query and/or set the dir_index feature for ext3, e.g. to query:
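Something like this, substituting whichever logical volume the filesystem actually lives on (the device path here is made up):

    tune2fs -l /dev/VolGroup00/u01_lv | grep -i features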
If you don't see dir_index in that list, then you need to turn it on, like so:
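Again, with a made-up device path:

    tune2fs -O dir_index /dev/VolGroup00/u01_lv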
(Yes, tune2fs only responds here by printing its version number; it doesn't bother telling you whether the operation succeeded or failed. Not good, but presumably it would print an error if it failed.)
Finally: if this does turn out to be the problem, and enabling dir_index doesn't help, then you probably need to consider using a different filesystem. XFS is a good general-purpose filesystem, and AFAIK ext4 doesn't have this problem. Either would be a reasonable choice for a replacement fs (although ext4 is quite new, and even though many people are using it without problems, I'm not sure I'd trust it on production servers just yet).
LVM itself really shouldn't be impacting this. To my knowledge the LVM bits are not referenced on every metadata operation, which is where a delay would come into play; it's at a different layer in the kernel. LVM would affect mount/unmount more than it would affect file open/close.
Far more likely is what Craig pointed out: large directories impairing performance. Linux is somewhat notorious for not handling the large-directory problem well. VxFS can handle up to 100K files/directory quickly, where ext2/ext3/reiserfs generally start slowing down well before then. This is one area where a poor choice of filesystem for the migration target can seriously impair your backup performance.
That said, if this is your problem, just plain old access into and out of those directories should also be impaired. It may be the difference between 80ms and 210ms to open a file, which is barely perceptible to end users, but it should be there.
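One crude way to look for that from the shell, outside of the backup window (the directory path is a placeholder for one of the big directories):

    # time a metadata-heavy pass over one of the big directories
    time ls -f /u01/docs | wc -l
    time find /u01/docs -type f -print0 | xargs -0 stat > /dev/null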