First of all, thanks for reading, and sorry for asking something related to my job. I understand that this is something I should solve by myself, but as you will see, it's a bit difficult.
A small description:
Now:
Storage => 1 PB using DDN S2A9900 storage for the OSTs, 4 OSS, 10GigE network (Lustre 1.6)
100 compute nodes with 2x InfiniBand
1 InfiniBand switch with 36 ports
After:
Storage => previous storage + another 1 PB using DDN S2A 990 or LSI E5400 (still to decide) (Lustre 2.0)
8 OSS, 10GigE network
100 compute nodes with 2x InfiniBand
Previous experience: transferred 120 TB in less than 3 days using the following command:
tar -C /old --record-size 2048 -b 2048 -cf - dir | \
    tar -C /new --record-size 2048 -b 2048 -xvf - 2>&1 | tee /tmp/dir.log
So, here is the big problem: some back-of-the-envelope math tells me we are going to need about a month to move the data from the old storage to the new one. During that time the researchers would have to stop working, and I'm personally not happy with that.
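Roughly, the numbers behind that estimate (just the 120 TB / 3 days figure from above extrapolated to the ~1 PB that has to move):

# Sustained rate of the earlier copy, and that rate applied to ~1 PB:
echo "scale=2; 120 * 10^12 / (3 * 86400) / 10^9" | bc    # ≈ 0.46 GB/s sustained
echo "scale=1; 10^15 / (0.46 * 10^9) / 86400" | bc       # ≈ 25 days, i.e. about a month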
I mention the InfiniBand connections because I think there may be a chance to use them for the transfer, with 18 compute nodes (18 * 2 IB ports = 36 ports) copying the data from one storage system to the other. I'm trying to figure out whether the IB switch will handle all that traffic, but even if it ends up saturated, it should still be faster than going over 10GigE.
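As a rough sketch of what I mean (the NIDs, filesystem names and mount points below are made up), each of those migration nodes would mount both filesystems over the IB fabric instead of 10GigE and run the copy locally:

# On each of the 18 migration nodes (hypothetical MGS NIDs and fs names):
mount -t lustre 10.10.0.1@o2ib:/oldfs /mnt/old    # old Lustre 1.6 filesystem
mount -t lustre 10.10.0.2@o2ib:/newfs /mnt/new    # new Lustre 2.0 filesystem
# then run the same tar pipeline as above against /mnt/old and /mnt/new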
Also, having Lustre 1.6 and 2.0 agents on the same server works quite well, so there is no need to go through 1.8 and upgrade the metadata servers in two steps.
Any ideas?
Many thanks
Note 1: Zoredache, we can divide the data into two blocks, (A) 600 TB and (B) 400 TB. The idea is to move (A) to the new storage, which is formatted with Lustre 2.0, then reformat the space where (A) was with Lustre 2.0, move (B) onto that block, and extend it with the space where (B) was.
This way we will end up with (A) and (B) on separate filesystems, with 1 PB each.
The goal is to make sure that every layer between the old storage and the new storage goes faster than the maximum read speed you can get from the old machine. Its specs claim 6 GB/s sequential (which this workload should be), so the minimum possible time to move the data would be in the realm of 46 hours, if you are able to get the advertised speed.
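Just to show where that figure comes from (taking the advertised 6 GB/s at face value):

# Reading 1 PB at the claimed 6 GB/s sequential rate:
echo "scale=1; 10^15 / (6 * 10^9) / 3600" | bc    # ≈ 46 hours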
When you were using tar to move 120 TB in 3 days, you must have averaged just shy of half a GB per second, and that's considerably less than the 6 GB/s the specs claim. The true number will likely be somewhere in the middle.
First, tar itself might be your problem. I'm a storage guy, not a Unix guy, but as far as I know a single tar stream can be limited by processor speed. If you stick with this methodology, you can shrink the migration window by increasing the number of nodes running the migration and having them work on different parts of the dataset. Keep adding nodes until the old machine is incapable of serving files any faster.
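A minimal sketch of that scale-out idea, assuming each migration node already mounts the old and new filesystems at /mnt/old and /mnt/new and that you can ssh between them (node names are placeholders, and in practice you would throttle how many copies run on each node at once):

#!/bin/bash
# Hand out top-level directories round-robin to a pool of migration nodes,
# each running the same tar pipeline, so reads from the old storage happen
# in parallel. Purely illustrative: no throttling, no error handling.
NODES=(node01 node02 node03 node04)
i=0
for dir in /mnt/old/*/; do
    node=${NODES[$((i % ${#NODES[@]}))]}
    name=$(basename "$dir")
    ssh "$node" "tar -C /mnt/old -b 2048 -cf - '$name' | tar -C /mnt/new -b 2048 -xf -" &
    i=$((i + 1))
done
wait
# Keep adding nodes until the old storage can no longer serve reads any faster.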
Second, make sure that you're able to write from your migration node to your new storage as fast as you can read off the old storage. This might mean tweaking some settings on the new storage (especially if it has an old-fashioned mirrored write cache) as well as ensuring there are no network bottlenecks.
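One quick way to check that (paths and sizes here are only examples) is to stripe a test file across all of the new OSTs and stream a large write at it from one client:

# Create a test file striped across every OST of the new filesystem,
# then measure a big streaming write and read with direct I/O.
lfs setstripe -c -1 /mnt/new/throughput-test
dd if=/dev/zero of=/mnt/new/throughput-test bs=1M count=100000 oflag=direct
dd if=/mnt/new/throughput-test of=/dev/null bs=1M iflag=direct
rm /mnt/new/throughput-test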
Lastly, and this might be a bit far-fetched, if you can take the downtime and this box is serving LUNs over FC, you could insert a storage virtualization device into the data path that would allow you to continue using the storage, albeit slower, while you do the migration. IBM's SAN Volume Controller, FalconStor's virtualization appliance, or an HDS storage array are all capable of automating data migration in the background without interrupting host access. None of them will be as fast as what you're used to, but they would let you keep working while you migrate, after the brief interruption needed to get the nodes working from the new storage heads.
It's probably not worth buying one since you won't be using it after you finish the migration, but you might be able to borrow or rent one.