I need to store 25M photos in 4 sizes each, for a total of 100M files. File sizes will vary between 3 KB and 200 KB per file, and the storage used at the beginning will be about 14-15 TB.
Our goal is to have the data available on 2-4 servers and to serve it with a fast local web server (nginx or lighttpd); we need to serve as many requests/sec as possible.
My plan is to use a 2U server case from Intel with 12x2TB drives (WD RE4) in RAID 6 (or a filesystem with redundancy??) for the data and 2x60GB SSDs for the OS. Is that a good way to go? I also found the Adaptec 5805ZQ, which can use SLC SSDs as a cache for the most frequently used files. Any suggestions on it?
What read cache size should I choose?
What is the best way to handle redundancy and load balancing if I plan to have 2-4 of these servers?
What are the pros/cons of a cluster FS versus a distributed FS with regard to our goal?
If this is greenfield development, then I would absolutely use the cloud for this. 100M files is a lot of data; it would be a major improvement to offload the redundant storage of that to, e.g., Amazon S3.
Given that we're talking about 100M files, I believe we can safely say some parts of the data set will be 'hot' (frequently requested) and most parts will be cold. Hence we really want caching.
An overview of how this could be done on Amazon Web Services: keep the files in S3 for redundant storage, run high-memory EC2 instances as a caching layer in front of it, and put load-balancer instances in front of those.
Since this setup is loosely coupled, scaling it out horizontally is easy (as scaling problems go).
How fast it is will depend greatly on the ratio between hot and cold data. If your workload is mostly hot, then I wouldn't be surprised to see well above 10,000 req/s from just 2 small load balancer EC2s and 2 high-mem cache EC2 instances.
SSDs for the OS itself are overkill, unless you're really, really interested in booting 30 seconds faster. Just get a pair of small SAS drives; that should be more than enough.
As for the cache, its usefulness depends on the working set: are requests expected to be spread evenly across all the images, or will most requests hit a small subset? In the latter case a cache might be useful; in the former, not so much. Note that cache on the disk controller is useful mostly for caching writes (if the cache is non-volatile), which helps fsync()-intensive applications such as databases. For image serving I suspect the benefit won't be that big.
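To get a feel for whether the working set is cacheable at all, a rough back-of-envelope helps. The hot fraction and average file size below are purely illustrative assumptions, not numbers from the question:

```python
# Rough back-of-envelope: does the "hot" working set fit in a cache?
# The 5% hot fraction and 50 KB average size are illustrative assumptions.
total_files = 100_000_000
avg_file_kb = 50        # assumed average of the 3-200 KB range
hot_fraction = 0.05     # assumed share of files that are frequently requested

hot_set_gb = total_files * hot_fraction * avg_file_kb / 1024 / 1024
print(f"hot working set: ~{hot_set_gb:.0f} GB")   # ~238 GB with these numbers
```

Whether a set of that size fits in the controller's SSD cache, in the servers' page cache, or in neither is what decides how much the cache actually buys you.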
A problem with cluster and distributed filesystems is that they are more complicated to set up, and distributed filesystems in particular are less mature than "normal" single-node filesystems. A cluster FS typically means shared storage, which means a relatively expensive SAN if you want to avoid single points of failure.
An alternative would be to set up a cluster running some kind of Amazon S3 lookalike that provides HTTP-accessible, distributed, replicated key-value storage, e.g. OpenStack Storage.
A lot depends on how frequently those items will be used. If you can expect a small percentage of them to be very active at a time, then you might want to consider Varnish for your front-end handling, load balancing across your nginx/lighttpd backends. Since frequently used images would be cached, disk speed is a little less important.
However, if the images are not requested repeatedly and caching wouldn't provide a huge boost, nginx/lighttpd on a server or two would do it. You also need to consider the amount of bandwidth you're going to deliver: 800 Mb/s drawn from a small subset of your dataset will easily be served from the OS cache, but 800 Mb/s spread across a huge subset of your dataset will likely hit an I/O bottleneck, because you can't get the data off the disks fast enough. In that case you need to split your system into enough parts to have the I/O bandwidth.
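To make that I/O bottleneck concrete, here is a back-of-envelope sketch; the 800 Mb/s figure comes from the paragraph above, while the average file size and per-disk IOPS are assumptions for illustration:

```python
# Back-of-envelope for the "huge subset" case: how many spindles' worth of
# random reads does 800 Mb/s of uncached traffic need? Average file size and
# per-disk IOPS below are assumptions, not measured values.
bandwidth_mbit = 800
avg_file_kb = 50        # assumed average file size
iops_per_disk = 150     # rough random-read IOPS of a 7200 rpm SATA disk

bytes_per_sec = bandwidth_mbit * 1_000_000 / 8
reads_per_sec = bytes_per_sec / (avg_file_kb * 1024)
disks_needed = reads_per_sec / iops_per_disk
print(f"~{reads_per_sec:.0f} random reads/s -> ~{disks_needed:.0f} disks of IOPS")
# ~1953 random reads/s -> ~13 disks, before any RAID or hot-spot overhead
```

If the working set is cacheable, almost none of those reads ever reach the disks, which is exactly the difference described above.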
Even though you're running RAID 6, that is still no substitute for backups, so budget for a similar machine to do backups, or possibly to act as a failover storage server.
I'd choose a custom cluster instead of a distributed FS, because it is simpler to understand and troubleshoot while still getting the job done. That is, the reliability tradeoffs of your own cluster are obvious, while figuring out how a distributed FS reacts to a dead server or a failed switch is a task of its own.
A possible solution to your type of problem is to split the whole photo archive into parts (say, 2 parts) and make the part id explicit in the URL (e.g., a subdomain or a GET parameter that is easy to extract with regular expressions). Then you'll have 4 storage servers holding photos (2 servers for each part). Use a fifth server as a reverse proxy that distributes and balances the load. All five servers can run lighttpd. In other words, I propose a very dumb but working solution; at the company I worked for it handled a total load of ~5000 requests per second, files 3-10 KB in size, and 8 TB of unique files in total, served from 24 backends that, however, ran a custom HTTP daemon instead of lighttpd.
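A minimal sketch of that routing idea, in Python rather than a lighttpd/nginx config; the URL layout (/p1/..., /p2/...) and the backend hostnames are hypothetical:

```python
# Minimal sketch of the reverse proxy's routing decision, assuming the part id
# is a path prefix like /p1/abc.jpg. URL layout and hostnames are hypothetical;
# a real setup would do this inside lighttpd/nginx or a custom daemon.
import itertools
import re

# Two parts, two storage servers per part (replicas of each other).
BACKENDS = {
    "1": itertools.cycle(["photo1a.example.com", "photo1b.example.com"]),
    "2": itertools.cycle(["photo2a.example.com", "photo2b.example.com"]),
}
PART_RE = re.compile(r"^/p(\d+)/")

def pick_backend(path: str) -> str:
    """Extract the part id from the URL path and round-robin within that part."""
    m = PART_RE.match(path)
    if not m or m.group(1) not in BACKENDS:
        raise ValueError(f"no part id in {path!r}")
    return next(BACKENDS[m.group(1)])

print(pick_backend("/p1/2017/cat.jpg"))   # photo1a.example.com
print(pick_backend("/p1/2017/dog.jpg"))   # photo1b.example.com
print(pick_backend("/p2/2018/fox.jpg"))   # photo2a.example.com
```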
As for the disks and RAM: we used a software RAID 0 made of four fast but cheap SATA disks on each server (if a disk fails, all its data can be copied back from a replica on a different server anyway), plus a custom solution that takes the whole server offline after a single read error. RAID 5 and RAID 6 are very bad speed-wise when even one disk fails; please don't use them.

On the content servers, a lot of RAM is essential (as a disk cache); look for 24 GB or more. Even then, be prepared for a 30-minute warm-up time. On the reverse proxy, if you use lighttpd, take into account that it buffers the whole upstream response into RAM as fast as possible and can then spend a long time pushing the cached photo to someone on dial-up or GPRS (and needs that buffer in RAM the whole time). We also put in 24 GB there just to have identical configurations, but I am not sure whether that is overkill. A memory-based HTTP cache on the reverse proxy is not essential (even if there are hot images!), because the OS-provided disk cache on the backends works just as well.
Making sure that all backends serving the same part of your archive have the same data is easy: when publishing photos, just copy them to all of those servers. Then run rsync over the older parts of the archive to correct any discrepancies, making one copy the master.
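A minimal sketch of that reconciliation step, wrapping rsync from Python; the directory layout and hostnames are hypothetical, and --delete assumes the local copy really is the master:

```python
# Minimal sketch of the rsync reconciliation step. Paths and hostnames are
# hypothetical; -a preserves attributes, --delete makes the replica an exact
# mirror of the master copy, so only run it in the master -> replica direction.
import subprocess

MASTER_DIR = "/srv/photos/part1/"          # trailing slash: sync directory contents
REPLICAS = ["photo1b.example.com"]         # other servers holding part 1

for host in REPLICAS:
    subprocess.run(
        ["rsync", "-a", "--delete", MASTER_DIR, f"{host}:{MASTER_DIR}"],
        check=True,
    )
```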