So, a few months ago my boss threw a "what happens if we needed to scale to X clients in a reasonably short period of time?" sort of scenario at me. It's somewhat of an engineering challenge more than anything else, but of course more clients means more income.
Right now we have a setup similar to the one below: redundant load balancers serving data from two webservers (sessions shared via memcache), and two fully redundant backend file servers. Each file server has the whole shebang: RAID6 array across the drives, hot-spare drive, dual RAID controllers, dual NICs, multipathing, yada yada yada.
The design is one I feel is solid: high availability through no single point of failure, high performance by splitting the load across multiple web servers, and high scalability in the sense that we should be able to just keep adding machines horizontally to serve more and more clients. The primary bottleneck is how much storage each file server can hold and how much space is allotted to each client; that part would scale out faster than the rest of the system. The file-server/client "pools" route was chosen because it scales horizontally and is cheaper than saying "okay, we need to buy an even BIGGER SAN."
This is all very easy: more and more online storage means more and more NFS mounts, and that's where my second-guessing comes into play. I want to make sure I've ironed out all the potential issues in this design before putting anything in place. As long as each piece of this puzzle is properly monitored, I think it's controllable, but I wanted to get some other opinions first, maybe from people who've been down this road before.
I know of a few key issues that will have to be watched for.
- Hotspots on the file servers, or particular machines working harder than the rest of the pool.
- Bandwidth & Backend Switching. being there will be a lot of talking between webservers to NFS devices, A high quality switch will have to be in place with a high switch fabric capacity.
Unknown issues...
- NFS mounts on webservers. Will there be any issues with each webserver having 2... 5... 10... 100 NFS mounts open at any given time? Is there any way to make this easier or more friendly? Maybe an NFS proxy of some kind? (That would create an internal bottleneck, which makes me frown.) I thought about a distributed file system and having the webservers just mount that; however, it seems that non-proprietary, POSIX-compliant, internally redundant file systems that can be expanded without downtime are either too immature for production work, outrageously expensive, or really good at hiding from Google.
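On the monitoring side of that mount question, something along these lines is roughly what I had in mind -- probe every mount with a timeout, since a hung hard mount will happily block a plain stat() forever. The paths are made up and this is only a sketch, not our actual setup:

    # Rough sketch, not our real setup: probe a list of made-up NFS mount points
    # with a timeout, because a hung hard mount will block a plain stat() forever.
    import os
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    MOUNTS = ["/mnt/nfs01", "/mnt/nfs02", "/mnt/nfs03"]   # hypothetical mount points

    def probe(path):
        # statvfs actually touches the filesystem, so it hangs if the server is gone
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize                  # free bytes, handy for hotspot tracking too

    def check_all(timeout=5):
        pool = ThreadPoolExecutor(max_workers=len(MOUNTS))
        futures = {path: pool.submit(probe, path) for path in MOUNTS}
        results = {}
        for path, fut in futures.items():
            try:
                results[path] = fut.result(timeout=timeout)
            except (TimeoutError, OSError):
                results[path] = None                      # hung, unreachable, or not mounted
        pool.shutdown(wait=False)                         # don't wait on threads stuck in a dead mount
        return results

    if __name__ == "__main__":
        for path, free in sorted(check_all().items()):
            if free is None:
                print("%s: UNRESPONSIVE" % path)
            else:
                print("%s: OK, %d bytes free" % (path, free))

Run from cron or wrapped in a Nagios check, that would at least tell me a mount has gone away before the webservers start piling up processes stuck in D state.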
Let me know what you guys think; any suggestions and optimizations would be greatly appreciated.
((Due to it being an open-ended question where there is no one specific "right answer," this question is a community wiki.))
In 90% of the hosting situations you will run into, you'll have far more webservers than storage servers, which alters your network design quite a bit. You'll run storage servers in pairs, as many primary/primary fileservers don't support more than mirrored replication. A large NFS server will likely handle a dozen or so webservers if you have a 10G backbone. Your connecting lines will be virtual connections: the web LAN on one VLAN and the storage LAN on a separate VLAN, gigE for the web LAN and 10G for the storage LAN, depending on budget. You mention dual-primary storage servers and then mention NFS mounts, which are somewhat exclusionary. Are they truly running shared-nothing, or is it a dual-head/dual-shelf/single-FC-AL setup?
Run your load balancers in dual primary to reduce the cutover time and increase potential throughput. Don't use more than 40% of their capacity, so that when one fails the survivor is only carrying around 80% and still has some headroom.
You also need to account for MySQL/PostgreSQL/Cassandra/etc. clusters -- they don't particularly like NFS mounts.
Lefthand Networks has a distributed filesystem product. GlusterFS is somewhat mature and would work depending on your workload. MogileFS/Hadoop is another possibility. Also, take a look at Ceph, GFS and OCFS.
As for the number of mounts, you'll mount each NFS server once or possibly twice per webserver. Dozens of mounts wouldn't be unheard of, but you might end up segregating your web clusters to avoid having hundreds of mounts. Hadoop can present itself to your network as a single global filesystem across a distributed network.
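To make the segregation idea concrete, here's a toy sketch (all names invented, nothing from the actual setup): split the web tier into clusters and deal the NFS servers out between them, so each web server mounts only its own pool's servers instead of every NFS server on the network.

    # Toy sketch, all names invented: each web cluster gets its own subset of
    # NFS servers, so per-webserver mount counts stay bounded.
    NFS_SERVERS = ["nfs%02d" % i for i in range(1, 13)]    # 12 hypothetical NFS servers
    WEB_CLUSTERS = {
        "web-cluster-a": ["web01", "web02", "web03"],
        "web-cluster-b": ["web04", "web05", "web06"],
    }

    def assign_pools(nfs_servers, clusters):
        """Round-robin the NFS servers across the web clusters."""
        names = sorted(clusters)
        pools = {name: [] for name in names}
        for i, nfs in enumerate(nfs_servers):
            pools[names[i % len(names)]].append(nfs)
        return pools

    if __name__ == "__main__":
        pools = assign_pools(NFS_SERVERS, WEB_CLUSTERS)
        for cluster in sorted(WEB_CLUSTERS):
            for host in WEB_CLUSTERS[cluster]:
                # each web server carries len(pools[cluster]) mounts, not len(NFS_SERVERS)
                print("%s mounts: %s" % (host, ", ".join(pools[cluster])))

The tradeoff is that a given web cluster can only serve the clients whose data lives in its pool.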
Like karmawhore said, having more file servers than web servers sounds like a pretty non-standard setup. Could you explain usage a little more?
From your Visio drawings and your description I can't figure out where your no-single-point-of-failure is when it comes to files. To me it looks like each NFS server has its own subset of files?
You should probably look into building "fat" (that is, very heavily specced) file servers with a ton of RAM and using some sort of clustered or mirroring file system on them, starting with 2 NFS servers and then scaling in increments of 2. Perhaps http://www.drbd.org/ could work for you?
Example:
NFSA_1 <--> NFSA_2
NFSB_1 <--> NFSB_2
To explain why there are more file servers than web servers in the design above, the system Grufftech (I work with him) is speaking of is an online backup service for both consumers and businesses (backs up desktop computers and servers).
You might consider writing your own directory system to hash the files to different filesystem clusters and using http/https to get the files off the individual servers. That way, scaling up becomes a matter of adding more backend 'webservers', and you don't have to worry about each server maintaining a connection to every possible storage server. It is a little less efficient, but it greatly simplifies your setup. Since you maintain a map of the filesystem locations and can store the data on two or more separate fileserver pairs, you can simply do a backend HTTP request to obtain the file(s) when requested. Your pairs would be separated both physically and in network topology, so that losing an entire rack wouldn't take out both units of a pair. This would allow you to use systems that don't require as much software configuration out of the box, but it would require you to write more code to manage your distributed filesystem.
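A toy sketch of that directory idea (hostnames, keys and layout are invented, not taken from the actual setup): hash the file key to a fileserver pair, remember the mapping, and fetch over plain HTTP from either half of the pair so the web tier never needs an NFS mount at all.

    # Toy sketch of a hash-based directory; hostnames and keys are invented.
    import hashlib
    import urllib.request

    # Hypothetical mirrored pairs (e.g. DRBD), each half in a different rack.
    PAIRS = [
        ("http://fsa1.internal", "http://fsa2.internal"),
        ("http://fsb1.internal", "http://fsb2.internal"),
    ]

    def pair_for(key):
        """Stable hash of the file key -> index of the pair that stores it."""
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(PAIRS)

    def fetch(key, timeout=10):
        """Backend HTTP request for the file, falling back to the pair's mirror."""
        primary, mirror = PAIRS[pair_for(key)]
        for base in (primary, mirror):
            try:
                with urllib.request.urlopen("%s/%s" % (base, key), timeout=timeout) as resp:
                    return resp.read()
            except OSError:
                continue                                  # try the other half of the pair
        raise IOError("both halves of the pair are unreachable for %r" % key)

    if __name__ == "__main__":
        key = "client42/2010-06-01.tar.gz"                # made-up object key
        print("%s lives on pair %d" % (key, pair_for(key)))

In practice you'd persist that key-to-pair mapping (the "map of the filesystem locations" above) rather than relying on the modulo alone, so adding a third pair later doesn't reshuffle where existing files live.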
SuperMicro and a few other manufacturers make 4U chassis (SC847) that handle 36 3.5" External/Hotswap drives. Coupled with FreeBSD/OpenSolaris you could use ZFS (which would be very beneficial if you're going to use commodity drives).
Since you are backing up data, is it really necessary to have a POSIX-compliant backend that supports cluster locking? Are clients likely to be competing for access to the same file? If not, any distributed filesystem that presents things as a single mount would work as well. While our backups internally rsync from the source machines, clients' requests for copies of files from backups are served over HTTP.
GlusterFS, MogileFS or HDFS would also give you a single mount point for each of the webservers. This would allow you to access the data more conventionally, but you lose some of the POSIX 'features'.
How are you replicating your data across your storage nodes? I'm assuming that you have Apache serving your web content from the same directory on the storage nodes...
We're working on setting up something similar, but I'm rather frustrated trying to figure out a way to replicate the data across the storage nodes without lag.
Ideas?