I need to build a system to store and maintain a huge amount (about 20 TB) of time-series data for many different instruments, supporting the following requirements:
(1) fast appends of new data as it arrives
(2) fast retrievals of already-stored data
There are 10,000 instruments, each with 1,000 data fields updated every minute. Once data is written to disk, it never changes (so there are no concurrent read/write issues).
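As a quick sanity check on the write load, here is a back-of-the-envelope calculation (the 8-bytes-per-value figure is my assumption, not a given):

    # Rough ingest rate, assuming one 8-byte float64 per field value
    instruments = 10_000
    fields = 1_000
    bytes_per_value = 8                           # assumption
    per_minute = instruments * fields * bytes_per_value
    print(per_minute / 2**20, "MiB/min")          # ~76 MiB appended per minute
    print(per_minute * 1440 / 2**30, "GiB/day")   # ~107 GiB per day
    # At that rate, 20 TB corresponds to roughly six months of retention.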
Since there would be no need for any joins whatsoever (the typical query is: give me all instruments for field 'X' on interval 'Y'), I intend to store the data in flat binary files named fieldName.timeStamp.bin. This way I can store all the data in flat binary files (no need for the huge expense of a giant server / commercial database) and queries will still be fast.
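To make the file layout concrete, here is a minimal sketch of what one fieldName.timeStamp.bin file could hold (the record format, one float64 per instrument in a fixed instrument order, is my assumption):

    import struct

    N_INSTRUMENTS = 10_000
    SNAPSHOT = struct.Struct("<%dd" % N_INSTRUMENTS)  # one float64 per instrument

    def write_snapshot(field, timestamp, values):
        # One file per (field, timestamp); values ordered by instrument id.
        with open("%s.%s.bin" % (field, timestamp), "wb") as f:
            f.write(SNAPSHOT.pack(*values))

    def read_snapshot(field, timestamp):
        # The typical query: all instruments for one field at one timestamp;
        # an interval query just loops this over the timestamps in 'Y'.
        with open("%s.%s.bin" % (field, timestamp), "rb") as f:
            return SNAPSHOT.unpack(f.read(SNAPSHOT.size))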
Since it's a lot of data (circa 20 TB), I figured I would need some logic to distribute the files (fieldName.timeStamp.bin) across all of my machines. Here's what I had in mind: there will be a central machine to which all queries are sent. Based on the field and timestamp requested, this central machine routes each query to the machine holding that data, which in turn returns the requested data.
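In code, the routing step would be something like this (a sketch; the table contents and host names are made up for illustration):

    # Central router: map (field, timestamp) -> the data machine holding it.
    # A real routing table might also partition each field by timestamp range.
    ROUTING = {
        "fieldA": "datanode-01.internal",
        "fieldB": "datanode-02.internal",
    }

    def route(field, timestamp):
        host = ROUTING[field]                      # which machine has this field
        path = "%s.%s.bin" % (field, timestamp)    # which file on that machine
        return host, path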
My questions are:
(1) Is this design as scalable as I think it is? Are there any drawbacks?
(2) Is there anything I am missing here that might hurt performance?
(3) Is it really best to send all queries to a central machine that then routes each query to the right machine? Or would it be better to access the right machine directly (assuming I know which one it is) over NFS?
(4) Is there a faster way than NFS to read data from the right machine? Are there other methods for sharing the data that sits on the data machines with the client machines?
All of my machines run Ubuntu Linux. As can be understood, there will be many client machines that access the various data machines and read (only read, never write) data from them. My goal is to have the data read as fast as possible.
Tahoe-LAFS (the Least Authority File System) may solve many of these problems automatically, especially if you can work with its tools for retrieving the data. At least, I'd give it a look before building my own system. Without data on what the real bandwidth and latency requirements are, I can't say much more.
You may also want to take a look at OpenTSDB, a Hadoop-based system for storing and retrieving massive amounts of time-series data. I've never used it, but it sounds handy and at least close to your use case.
A few notes:
1) Using a centralized server seems unnecessary here. Why not hash the filename and map the hash onto a sorted list of servers to decide which server stores / serves each file (see the sketch after this list)? That way you don't need a central server to store / fetch the files.
2) Given the scale of the system you are talking about, I would look into using Lustre or Gluster to handle the file system layer for you instead of NFS. Let them do the hard work. Both are used for systems much larger than this and have a solid track record.
3) If you do decide to roll your own setup, I would take a strong look at OpenSolaris / Nexenta with ZFS. For file systems this large, some of ZFS's strengths become really helpful:
a) ZFS does intelligent RAID rebuilds. I can rebuild 16 TB of data on a 10x2TB RAID-50 setup in 30 hours, which is much faster than the same rebuild with a hardware RAID card.
b) ZFS never needs fsck; even with ext3/4, running fsck on partitions that large is going to be really painful.
c) ZFS's write I/O scheduler is very strong. You can add a single SSD to hold the ZIL log / L2ARC cache and get most of the gains of an SSD-based storage system with the large capacity of hard drives.
d) ZFS has a very robust NFSv4 server built in; sharing is easy to configure.
e) ZFS has built-in file-system-level deduplication, which might be a huge win for you if the instrument readings often return similar results.
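To illustrate point 1, here is a minimal sketch of hash-based placement (the host names are made up; plain hash-mod-N is the simplest variant):

    import hashlib

    SERVERS = sorted(["data01", "data02", "data03"])  # fixed, sorted host list

    def server_for(filename):
        # The same filename always hashes to the same server, so writers
        # and readers can both locate any file with no central router.
        h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
        return SERVERS[h % len(SERVERS)]

    print(server_for("fieldA.20120101.bin"))

One caveat: with plain hash-mod-N, adding or removing a machine remaps most files; consistent hashing avoids that if the cluster is expected to grow.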