I have a write-heavy application. It is best compared to surveys: a customer creates a custom questionnaire, which is saved to the database, and most of the requests are from that customer's users submitting these forms. Later on, our customers run complex reports and graphs over those submissions.
Making sure our application server (PHP) and web server (Nginx) scale is quite easy; the trouble is scaling the database onto multiple servers.
A lot of applications are more read-heavy, so the typical answer is a master-slave replication setup where all writes go to a single master and reads are distributed across the slaves. That doesn't work for us, because we're writing most of the time.
I've seen mention of a master-master setup, but that typically hits a snag with auto-incremented primary keys. The usual solution is to have one server generate odd numbers and the other evens, which I want to avoid.
On some similar questions I've seen mention of the Tungsten Replicator and how it gives you a lot more flexibility with replication. Would it help me at all? What benefits would it give me that MySQL's built-in replication cannot provide?
There is also MySQL Cluster, but this typically hits a snag with very large databases and complex queries (joins). I need to be able to run complex reports, so this probably won't work for me.
I'm looking for redundancy, automatic failover, request distribution, and data integrity.
Are there other RDBMSs that provide better solutions suited to the web?
There's no such thing as a Grand Unified Database Layout. If there are custom questionnaires, there really need to be custom tables. Otherwise you are on a quick path to a 200-columns-of-VARCHAR(128)-with-no-primary-key monstrosity straight out of thedailywtf.com, which is inefficient, unsupportable, and will hurt you in the future.
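To make that concrete, here is a rough sketch (not from the answer above; the function, table, and column names are all made up) of how the PHP side could generate one real, typed table when a customer builds a questionnaire, instead of stuffing everything into a generic catch-all table:

```php
<?php
// Sketch only: create one typed table per customer questionnaire.
// All identifiers below are illustrative, not from the original post.
function createSurveyTable(PDO $db, int $surveyId, array $questionTypes): void
{
    $columns = [];
    foreach ($questionTypes as $name => $sqlType) {
        // $sqlType is whatever the questionnaire builder chose: INT, DATE, TEXT, ...
        $columns[] = sprintf('`q_%s` %s', $name, $sqlType);
    }

    $sql = sprintf(
        'CREATE TABLE `survey_%d_responses` (
            `response_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            `submitted_at` DATETIME NOT NULL,
            %s
        ) ENGINE=InnoDB',
        $surveyId,
        implode(",\n            ", $columns)
    );

    $db->exec($sql);
}

// e.g. createSurveyTable($pdo, 42, ['age' => 'INT', 'email' => 'VARCHAR(255)']);
```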
Sharding, as recommended by toppledwagon, may be worth considering, but first double-check that your database is rationally designed. If it is not normalized, have a very good reason (preferably backed by testing) why it is not. If it has hundreds of tables, it's probably wrong. If it has a single table, it is definitely wrong. Look at the ways you can divide your problem into independent sets. You will spend more effort up front, but the system will be better for it.
A million rows at, let's say, 2 kB of data per row (which seems like a lot of characters for a survey) is 2 GB of memory. If you can throw a bit more hardware at the problem, maybe you'll be able to keep your data set in RAM?
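If that's the route you take and you're on InnoDB (an assumption on my part), the main knob is the buffer pool; something along these lines in my.cnf, sized to hold the working set plus headroom (the 4G is only an example):

```
[mysqld]
# Keep the hot data and indexes in memory; size to the working set plus growth.
innodb_buffer_pool_size = 4G
```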
Which leads to the next question: what is your load in absolute numbers? Customer requests per second, translated into I/Os per second and split into reads and writes; how many gigabytes of data, and at what growth rate? How does your load scale with the number of requests: linearly? exponentially? You don't have to publish your data, just write it down and think about it - what it is today, and how you expect it to look in a year or two.
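If you don't have those numbers yet, a rough sketch of one way to get the database-side rates: sample MySQL's global statement counters a minute apart (the status variable names are standard; the host and credentials below are placeholders).

```php
<?php
// Rough sketch for turning "requests" into database reads/writes per second.
function counter(PDO $db, string $name): int
{
    $row = $db->query("SHOW GLOBAL STATUS LIKE '$name'")->fetch(PDO::FETCH_ASSOC);
    return (int) $row['Value'];
}

$db = new PDO('mysql:host=127.0.0.1', 'monitor', 'secret');

$r0 = counter($db, 'Com_select');
$w0 = counter($db, 'Com_insert') + counter($db, 'Com_update');
sleep(60);
$r1 = counter($db, 'Com_select');
$w1 = counter($db, 'Com_insert') + counter($db, 'Com_update');

printf("%.1f reads/s, %.1f writes/s\n", ($r1 - $r0) / 60, ($w1 - $w0) / 60);
```

iostat on the database box will give you the physical I/O side of the same picture.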
Wikipedia says a 15k rpm SAS drive will give you 175-210 IOPS. How many do you need in RAID 10 to satisfy your current and projected load? How big is your data set, and how many drives do you need just to fit it (probably a lot fewer than to meet the IOPS requirement)? Would buying a pair (or a dozen) of SSDs be justifiable? Is local storage going to be just fine, or are you going to saturate two 8 Gb fiber links to a high-end storage subsystem?
If you currently need 1k IOPS but have three 10k rpm HDDs in RAID 5, there is no way your hardware can satisfy your requirements. On the other hand, if your app sees one user request per second and still brings a 32-core, 256 GB RAM beast backed by enterprise-class storage to its knees, then chances are the problem does not lie in hardware capabilities.
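To illustrate the arithmetic (every number below is a made-up placeholder, not the OP's real load): RAID 10 turns each logical write into two physical ones, so the spindle count falls out of the combined back-end IOPS.

```php
<?php
// Back-of-envelope spindle count; all inputs are placeholders.
$readsPerSec  = 200;   // measured logical read IOPS
$writesPerSec = 800;   // measured logical write IOPS (write-heavy workload)

$raid10WritePenalty = 2;    // each logical write hits both halves of a mirror
$iopsPer15kSas      = 180;  // middle of the 175-210 range quoted above

$backendIops  = $readsPerSec + $writesPerSec * $raid10WritePenalty;
$drivesNeeded = (int) ceil($backendIops / $iopsPer15kSas);
$drivesNeeded += $drivesNeeded % 2;   // RAID 10 needs an even number of drives

printf("%d back-end IOPS -> at least %d x 15k SAS drives in RAID 10\n",
       $backendIops, $drivesNeeded);
```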
No - you just set auto_increment_increment and auto_increment_offset to avoid collisions.
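For reference, with two masters the usual settings look roughly like this in each server's my.cnf: every node steps by the node count and starts at its own offset, so the generated keys interleave instead of colliding.

```
# master A
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 1

# master B
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 2
```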
Why? Surrogate keys, by their very nature, are unrelated to the data they index; assigning meaning to such values is very dangerous.
A quick look at the Tungsten link you provided does not reveal much about what it does, but it does contain a number of inaccuracies (e.g. "you can do multiple masters replication, which is more than what you can do with MySQL native replication") and, in the same paragraph, says it cannot handle conflicts. I'm not filled with confidence about the usefulness of this product.
Assuming that master-master replication (with or without federation to limit what is replicated) does not meet your requirements (though you should re-examine your thinking about auto-increment fields), you could shard the data between native clusters using mysqlproxy, or use a NoSQL database.
This sounds like a good case for sharding. If the data in one survey doesn't need immediate access to the data in another survey, then sharding your data will be easy. You'll set up a directory database that basically maps a user ID to a Survey DB, and then set up multiple Survey DBs behind it. Hopefully you'll also choose to set each of those up with replication. Your application will need a bit of re-working.
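A rough sketch of the lookup described above (the directory table, column names, and credentials are all made up for illustration): the application asks the small directory database which shard holds a customer's surveys and opens its connection from that.

```php
<?php
// Rough sketch of shard routing via a directory database.
function getShardConnection(PDO $directory, int $customerId): PDO
{
    $stmt = $directory->prepare(
        'SELECT dsn, db_user, db_pass FROM customer_shard WHERE customer_id = ?'
    );
    $stmt->execute([$customerId]);
    $shard = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($shard === false) {
        throw new RuntimeException("No shard mapped for customer $customerId");
    }

    // All reads and writes for this customer's surveys go to this one shard.
    return new PDO($shard['dsn'], $shard['db_user'], $shard['db_pass']);
}
```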
Your reports will then have to query each shard and do the joins in software. If that's an option for you, sharding is the way to go.