We have to store basic information about 3 million products. Currently the info is a single 180 MB CSV file which gets updated quarterly.
There will be about 30,000 queries per day, but the queries are just simple key-value lookups. We only need to look up the product ID and display the rest of the information (which would all be in one record).
This is for the web, so fast performance is critical.
Should we use MySQL, even though we really don't need a relational database? Should we just generate 3 million static HTML files every quarter? Should we store a one-line CSV for each product on something like Amazon S3 or Rackspace Cloud Files? What is the best way to do this?
Because MySQL is so widely supported, and this is really quite a trivial thing to do, I would suggest going with it. Unless the server has at least a few GB of memory, I would also stick with MySQL rather than using an in-memory system.
Once you start putting your data into a database, whether it's MySQL or something else, you will quite likely find more uses for it. Right now you're only talking about key-value pairs, but the rest of the data relating to your products must be stored somewhere. If that's not in a database, I can't imagine the data storage being very efficient.
Whatever you do, do not create those three million files. We have seen a number of questions here already resulting from the problems so many files create.
You can use a dedicated key-value NoSQL database, which is optimized for exactly this kind of task. Have a look at:
Of course you may use MySQL or any other relational database, but solutions specifically designed for key-value data are supposed to be better (otherwise what would be the point of designing them in the first place, other than possibly being a much smaller solution in terms of RAM and disk?).
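To give a feel for the lookup pattern, here is a minimal sketch using Redis as one representative key-value store (Redis is just my illustrative pick here; the key naming and the JSON payload are made up):

```python
import json
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

# Quarterly load: one key per product, the value is the whole record.
r.set("product:1234567", json.dumps({"name": "Example widget", "price": "9.99"}))

# Per-request lookup: a single GET by product ID.
raw = r.get("product:1234567")
product = json.loads(raw) if raw else None
print(product)
```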
And now for something completely different:
Given:
Outside of the box solution:
Dump each product as a TXT resource record and store it in the DNS, e.g.:
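A purely hypothetical sketch of what that could look like (the products.example.com zone, the packed name=value record format, and the product ID 1234567 are all made up for illustration):

```
; hypothetical zone entry: one TXT record per product ID
1234567.products.example.com. 86400 IN TXT "name=Example widget|price=9.99|sku=EW-001"

$ dig +short TXT 1234567.products.example.com
"name=Example widget|price=9.99|sku=EW-001"
```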
Benefits:
Reasons why this might be a bad idea:
MySQL with MyISAM and some good indexes sounds perfect for this. There are plenty of other options, of course, but MySQL is very widely (if not universally) supported on any commercial web host. Depending on the speed you require, memcached might also be worth looking at, but without knowing the size of each key/value pair, storing 3 million of them in memory might be an even worse idea than a 180 MB CSV file (oh wait, it's a 180 MB CSV file, so we do know how big they are: about 60 bytes per record on average. They must be pretty small pairs, so memcached could be even better).
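If you do put memcached in front of MySQL, the usual pattern is cache-aside: check the cache first and fall back to the database on a miss. A rough sketch, assuming the python-memcached client and a hypothetical fetch_product_from_mysql() helper:

```python
import memcache  # python-memcached client

mc = memcache.Client(["127.0.0.1:11211"])

def get_product(product_id):
    key = "product:%d" % product_id
    product = mc.get(key)                                # try the cache first
    if product is None:                                  # miss: fall back to MySQL
        product = fetch_product_from_mysql(product_id)   # hypothetical helper
        mc.set(key, product, time=3600)                  # cache for an hour
    return product
```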
You do not want 3 million static HTML files; they will hurt your file system badly. A one-line CSV per product, even on S3, is going to have the same problem. Nobody wants 3 million files in a folder.
You could use Berkeley DB, which does exactly this sort of thing, even if it hasn't been hip since the dawn of Perl 5. Berkeley DB only supports key-value pairs, and you tie the whole database to a hash and access it as such.
Using Berkeley DB is well documented in many of the older Perl references sitting on your shelf, or try the perldoc for the BerkeleyDB CPAN module. I generally avoid Berkeley DB (although my employer has plenty of ancient code in which it plays prominently, and some of those databases are as large as yours), because it is no fun when your data gets more complex.
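If Perl isn't your thing, the same "tie it to a hash" idea exists elsewhere too; a minimal sketch using the Python bsddb3 bindings (the file name and record format are just placeholders):

```python
import bsddb3  # Python bindings for Berkeley DB

# Open (or create) a hash-format database file; it behaves like a dict of bytes.
db = bsddb3.hashopen("products.db", "c")

db[b"1234567"] = b"name=Example widget|price=9.99"   # quarterly load
print(db[b"1234567"])                                # per-request lookup
db.close()
```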
You've tagged your question with Amazon S3.
I'd like to draw your attention to one of their other related products called Amazon SimpleDB.
It sounds like the SimpleDB data model would fit well with your type of application.
This is not a plug for it, but it is worth looking at, especially if you're planning on using Amazon's cloud services.
The SDB data model resembles a spreadsheet.
See here for more info on it: http://aws.amazon.com/simpledb/ and here for the data model: http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/
Even though 180 MB of data can be easily handled by any relational database, I would highly recommend MongoDB (http://www.mongodb.org/) over MySQL, Redis, MemcacheDB and other simpler key-value stores or relational databases. The reason is that for this sort of problem, MongoDB is the fastest, most expressive system to use, allowing super fast dynamic updates with no schema restrictions, so your documents can have different formats if you like.

I was at a presentation from guardian.co.uk the other day, and they have made a policy decision to ban all relational databases and use MongoDB exclusively for serving their news. You can get a feel for how fast their website is, and it has been online since 1995 (the oldest online newspaper in the UK). They have also gone through all sorts of bottlenecks in the past because of relational databases. For 180 MB, MongoDB is going to serve everything from memory, so sub-millisecond loading times are likely.
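A minimal sketch of what that lookup looks like with pymongo (the database/collection names and document fields are just examples):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client.shop.products   # "shop" db and "products" collection are arbitrary names

# Quarterly import: one document per product, no fixed schema required.
products.insert_one({"_id": 1234567, "name": "Example widget", "price": 9.99})

# Per-request lookup by product ID (the _id field is indexed by default).
print(products.find_one({"_id": 1234567}))
```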
You said that your queries are just simple key lookups. With binary search you need at most about 22 comparisons in the worst case (log2(3,000,000) ≈ 21.5), and with hashed keys your lookups are even faster. Three million records is small as long as you avoid joins (or other Cartesian-product-type operations) and linear searches.
I'd dare say pretty much anything would do fine. A load of 30,000 queries/day means that (assuming the load is constant throughout the day) you get roughly one query every 3 seconds (86,400 s / 30,000 ≈ 2.9 s); that's not too bad.
I'd recommend implementing it in the technology you are most familiar with first, and then measuring whether this is really the bottleneck of the system.
The best way to do this really depends on the quality and nature of your data and queries. For starters, 180 MB of data in a single table for products is not a problem, whichever way you look at it. And 30k queries per day is even less of a problem. With a properly configured database, any old desktop can handle this load.
Others have already pointed out your two major options: MySQL or a NoSQL database.
If you have a certain set of attributes that exist for every single product (such as manufacturer, price, warehouse number, etc.), then your best option is to have columns for these attributes and convert your key/value pairs into a flat table format, with the product ID as the primary key for that table. This will work very well even if some columns are only used by half of the rows, since for most products you will only need to run one query to retrieve all their attributes. Considering that this is data about products, I would guess that this is quite likely the structure of your data.
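A minimal sketch of that kind of flat table (the column names and types are just examples):

```sql
CREATE TABLE products (
    product_id   INT UNSIGNED NOT NULL,
    manufacturer VARCHAR(100),
    price        DECIMAL(10,2),
    warehouse_no INT,            -- NULL for products that lack this attribute
    description  TEXT,
    PRIMARY KEY (product_id)
);

-- One query per page view, straight off the primary key:
SELECT * FROM products WHERE product_id = 1234567;
```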
If the attributes vary widely in presence and data type, then you might be better off using a NoSQL database, which handles this scenario more efficiently than traditional SQL databases.
Regarding performance: I previously worked for an e-commerce company where, for a long time, the website was served with data from a MySQL server. This server had 2 GB of RAM, the database was approximately 5 GB in total, and under peak load the server handled several thousand queries per second. Yes, we had done a lot of query optimization, but this is definitely doable.