We have to store basic information about 3 million products. Currently the info is a single 180 MB CSV file which gets updated quarterly.
There will be about 30,000 queries per day, but the queries are just simple key-value lookups. We only need to look up the product ID and display the rest of the information (which would all be in one record).
This is for the web, so fast performance is critical.
Should we use MySQL, even though we really don't need a relational database? Should we just generate 3 million static HTML files every quarter? Should we store a one-line CSV for each product on something like Amazon S3 or Rackspace Cloud Files? What is the best way to do this?
Because MySQL is so widely supported, and this is really quite a trivial thing to do, I would suggest going with it. Unless the server has at least a few GB of memory, I would also stick with MySQL rather than using an in-memory system.
Once you start putting your data into a database, whether it's MySQL or something else, you will quite likely find more uses for it. Right now you're only talking about key-value pairs, but the rest of the data relating to your products must be stored somewhere. If that's not in a database, I can't imagine the data storage being very efficient.
Whatever you do, do not create those three million files. We have seen a number of questions here already resulting from the problems so many files create.
You can use a dedicated key-value NoSQL database, which is optimized for exactly this kind of task. Have a look at:
Of course you may use MySQL or any other relational database, but solutions specifically designed for key-value data are supposed to be better (otherwise what would be the point of designing them in the first place, other than possibly being a much smaller solution in terms of RAM and disk?).
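To give a feel for the lookup pattern, here is a minimal sketch using Redis as one representative key-value store (Redis is just my illustrative pick here; the key naming and the JSON payload are made up):

```python
import json
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

# Quarterly load: one key per product, the value is the whole record.
r.set("product:1234567", json.dumps({"name": "Example widget", "price": "9.99"}))

# Per-request lookup: a single GET by product ID.
raw = r.get("product:1234567")
product = json.loads(raw) if raw else None
print(product)
```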
And now for something completely different:
Given:
Outside of the box solution:
Dump each product as a TXT resource record and store it in the DNS, e.g.:
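A purely hypothetical sketch of what that could look like (the products.example.com zone, the packed name=value record format, and the product ID 1234567 are all made up for illustration):

```
; hypothetical zone entry: one TXT record per product ID
1234567.products.example.com. 86400 IN TXT "name=Example widget|price=9.99|sku=EW-001"

$ dig +short TXT 1234567.products.example.com
"name=Example widget|price=9.99|sku=EW-001"
```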
Benefits:
Reasons why this might be a bad idea:
MySQL with MyISAM and some good indexes sounds perfect for this. There are plenty of other options, of course, but MySQL is very widely (if not universally) supported on any commercial web host. Depending on the speed you require, memcached might also be worth looking at, but without knowing the size of each key/value pair, storing 3 million of them in memory might be an even worse idea than a 180 MB CSV file (oh wait, it's a 180 MB CSV file, so we do know how big they are: about 60 bytes per record on average. They must be pretty small pairs, so memcached could be even better).
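If you do put memcached in front of MySQL, the usual pattern is cache-aside: check the cache first and fall back to the database on a miss. A rough sketch, assuming the python-memcached client and a hypothetical fetch_product_from_mysql() helper:

```python
import memcache  # python-memcached client

mc = memcache.Client(["127.0.0.1:11211"])

def get_product(product_id):
    key = "product:%d" % product_id
    product = mc.get(key)                                # try the cache first
    if product is None:                                  # miss: fall back to MySQL
        product = fetch_product_from_mysql(product_id)   # hypothetical helper
        mc.set(key, product, time=3600)                  # cache for an hour
    return product
```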
You do not want 3 million static HTML files; they will hurt your file system badly. A one-line CSV per product, even on S3, is going to have the same problem. Nobody wants 3 million files in a folder.
You could use Berkeley DB, which does exactly this sort of thing, even if it hasn't been hip since the dawn of Perl 5. Berkeley DB only supports key-value pairs, and you tie the whole database to a hash and access it as such.
Using Berkeley DB is well documented in many of the older Perl references sitting on your shelf, or try the perldoc for the BerkeleyDB CPAN module. I generally avoid Berkeley DB (although my employer has plenty of ancient code in which it plays prominently, and some of those databases are as large as yours), because it is no fun when your data gets more complex.
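If Perl isn't your thing, the same "tie it to a hash" idea exists elsewhere too; a minimal sketch using the Python bsddb3 bindings (the file name and record format are just placeholders):

```python
import bsddb3  # Python bindings for Berkeley DB

# Open (or create) a hash-format database file; it behaves like a dict of bytes.
db = bsddb3.hashopen("products.db", "c")

db[b"1234567"] = b"name=Example widget|price=9.99"   # quarterly load
print(db[b"1234567"])                                # per-request lookup
db.close()
```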
You've tagged your question with Amazon S3.
I'd like to draw your attention to one of their other related products called Amazon SimpleDB.
It sounds like the SimpleDB data model would fit well with your type of application.
This is not a plug for it, but it is worth looking at, especially if you're planning on using Amazon's cloud services.
The SDB data model resembles a spreadsheet.
See here for more info on it: http://aws.amazon.com/simpledb/ and here for the data model: http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/
Even though 180 MB of data can be easily handled by any relational database, I would highly recommend MongoDB (http://www.mongodb.org/) over MySQL, Redis, MemcacheDB and other simpler key-value stores or relational databases. The reason is that for this sort of problem, MongoDB is the fastest, most expressive system to use, allowing super fast dynamic updates with no schema restrictions, so your documents can have different formats if you like.

I was at a presentation from guardian.co.uk the other day, and they have made a policy decision to ban all relational databases and use MongoDB exclusively for serving their news. You can get a feel for how fast their website is, and it has been online since 1995 (the oldest online newspaper in the UK). They have also gone through all sorts of bottlenecks in the past because of relational databases. For 180 MB, MongoDB is going to serve everything from memory, so sub-millisecond loading times are likely.
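A minimal sketch of what that lookup looks like with pymongo (the database/collection names and document fields are just examples):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client.shop.products   # "shop" db and "products" collection are arbitrary names

# Quarterly import: one document per product, no fixed schema required.
products.insert_one({"_id": 1234567, "name": "Example widget", "price": 9.99})

# Per-request lookup by product ID (the _id field is indexed by default).
print(products.find_one({"_id": 1234567}))
```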
You said that your queries are just simple key lookups. With binary search you need at most about 22 comparisons in the worst case (log2(3,000,000) ≈ 21.5), and with hashed keys your lookups are even faster. Three million records is small as long as you avoid joins (or other Cartesian-product-type operations) and linear searches.
I'd dare say pretty much anything would do fine. A load of 30,000 queries/day means that (assuming the load is constant throughout the day) you get roughly one query every 3 seconds (86,400 s / 30,000 ≈ 2.9 s); that's not too bad.
I'd recommend implementing it in the technology you are most familiar with first, and then measuring whether this is really the bottleneck of the system.
The best way to do this really depends on the quality and nature of your data and queries. For starters, 180 MB of data in a single table for products is not a problem, whichever way you look at it. And 30k queries per day is even less of a problem. With a properly configured database, any old desktop can handle this load.
Others have already pointed out your two major options: MySQL or a NoSQL database.
If you have a certain set of attributes that exist for every single product (such as manufacturer, price, warehouse number, etc.), then your best option is to have columns for these attributes and convert your key/value pairs into a flat table format, with the product ID as the primary key for that table. This will work very well even if some columns are only used by half of the rows, since for most products you will only need to run one query to retrieve all their attributes. Considering that this is data about products, I would guess that this is quite likely the structure of your data.
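A minimal sketch of that kind of flat table (the column names and types are just examples):

```sql
CREATE TABLE products (
    product_id   INT UNSIGNED NOT NULL,
    manufacturer VARCHAR(100),
    price        DECIMAL(10,2),
    warehouse_no INT,            -- NULL for products that lack this attribute
    description  TEXT,
    PRIMARY KEY (product_id)
);

-- One query per page view, straight off the primary key:
SELECT * FROM products WHERE product_id = 1234567;
```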
If the attributes vary widely in presence and data type, then you might be better off using a NoSQL database, which handles this scenario more efficiently than traditional SQL databases.
Regarding performance: I previously worked for an e-commerce company where, for a long time, the website was served with data from a MySQL server. This server had 2 GB of RAM, the database was approximately 5 GB in total, and under peak load the server handled several thousand queries per second. Yes, we had done a lot of query optimization, but this is definitely doable.