Background: I'm writing a web application that will be available to the world at large on a software-as-a-service basis. As input to my choosing a database platform, I've been reading up on 'NoSQL' databases like Cassandra, Riak, MongoDB, and Redis. Replication, sharding, and partitioning are prominent in the feature sets of all of these databases.
I'll end up choosing one of them and getting on with development, but what troubles me is whether I'll realistically be able to take advantage of all of that database sharding/distribution goodness in production.
I'm a one-man start-up with a production technology infrastructure consisting of one SliceHost slice running CentOS. I don't ever expect to run my own server hardware. I just want to have a view now of the kind of server architecture I could move towards when my product becomes overwhelmingly successful. :-)
How do people set up clusters of servers to support their nifty NoSQL database distribution schemes in a VPS environment? In most (all?) cases, the database products expect to run on a private LAN, and therefore lack the authentication mechanisms you'd need to support clusters spanning the wild woolly Internet world. How does one establish a private LAN in 'the cloud'? Do VPS providers offer that kind of thing? SliceHost seems to offer private IP addresses, but those addresses are accessible by all of their customers; they are not restricted to one particular customer.
First off, don't add highly scalable NoSQL datastores before you need that level of performance. During initial development, and while you're getting your first many thousand customers, you can probably run just fine off one beefy SQL database. Stay with the proven solution until you need more, especially if your application development framework has some sort of Object-Relational Mapper layer that expects an SQL database.
That depends on the provider; there is no standard solution. Some providers may not offer this at all (you mention Slicehost's "private" IPs being accessible by all Slicehost customers).
Amazon EC2 has built-in firewalling called a "security group". By default, only servers under your control can send IP packets to each other, and you get fine-grained management of these rules. Rackspace Cloud Servers' support chat says their private network connection is private for each customer account (presumably that means some sort of per-customer VLAN is used). Other good PaaS providers should have something similar; if not, you might consider switching providers.
With providers who don't offer truly private subnetworks, you could:
This would work, but it increases complexity and, IMHO, negates much of the simple provisioning of additional hosts (because when you're adding a new server, you would also need to change the firewall rules on the existing hosts to permit the new "private" IP). Simple provisioning is really a key part of what one wants from a PaaS provider.
You would not want the database exposed to the wider world no matter what type it is, so you need to limit connections to only those hosts that are authorised.
I have no knowledge of NoSQL systems, but as you are planning to use Slicehost servers, it would be straightforward to limit connections using iptables. Set the rules on each server so that database traffic is accepted only from your other servers' private IP addresses; this sets up what amounts to your own private LAN.
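As a rough illustration of the iptables approach, here is a minimal Python sketch that generates the rules for one host. Everything specific in it is an assumption made for the example: the helper name build_iptables_rules is hypothetical, eth1 stands in for whatever your private interface is called, the 10.0.0.x addresses are placeholders for your own slices, and 7000 is just an example database port. It only prints commands; review them before running anything as root.

```python
import ipaddress


def build_iptables_rules(peers, port, interface="eth1"):
    """Return iptables commands that accept TCP traffic on `port` only
    from the listed peer addresses and drop it from everyone else.

    `interface` is assumed to be the private-network interface.
    """
    rules = []
    for peer in peers:
        # Raises ValueError if the address is malformed, so a typo
        # never ends up in a firewall rule.
        ipaddress.ip_address(peer)
        rules.append(
            f"iptables -A INPUT -i {interface} -p tcp -s {peer} "
            f"--dport {port} -j ACCEPT"
        )
    # Anything else aimed at the database port is dropped, whichever
    # interface it arrives on.
    rules.append(f"iptables -A INPUT -p tcp --dport {port} -j DROP")
    return rules


if __name__ == "__main__":
    # Placeholder peer addresses and port, for illustration only.
    for rule in build_iptables_rules(["10.0.0.2", "10.0.0.3"], 7000):
        print(rule)
```

Every time you add a server, the peer list on all existing hosts has to be regenerated and reapplied, which is the provisioning overhead mentioned in the earlier answer.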
I am investigating whether IPsec can be used for the same scenario in our startup. Essentially, IPsec has an Authentication Header (AH) sub-protocol which authenticates a given IP packet by checking that its MD5/SHA-1 checksum matches the payload and IP header. Now the twist is that we can specify a shared secret which is also used when computing the checksum on the packet. If the sender does not know the secret, it will not be able to compute a checksum that the recipient can verify, and so the packet will be dropped as corrupt.
Effectively, we can group nodes on a LAN based on a shared secret, and only nodes within the group can send IP packets to each other, which solves the problem.
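To make the shared-secret idea concrete, here is a small Python sketch of a keyed checksum (an HMAC). This is not IPsec and not how AH is implemented in the kernel; it only illustrates the principle that a packet's checksum can be verified solely by nodes that hold the same secret. The function names sign and accept, the secrets, and the payload are all made up for the example, and SHA-256 is used instead of MD5/SHA-1 purely as a choice for the sketch.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> bytes:
    """Compute a keyed checksum over the payload using the shared secret."""
    return hmac.new(secret, payload, hashlib.sha256).digest()


def accept(secret: bytes, payload: bytes, tag: bytes) -> bool:
    """Accept the payload only if the tag was made with the same secret.

    compare_digest does a constant-time comparison to avoid leaking
    how many leading bytes matched.
    """
    return hmac.compare_digest(sign(secret, payload), tag)


if __name__ == "__main__":
    group_secret = b"example-group-secret"  # placeholder value
    packet = b"replicate key=42"            # placeholder payload

    good_tag = sign(group_secret, packet)
    bad_tag = sign(b"some-other-secret", packet)

    print(accept(group_secret, packet, good_tag))       # sender in the group
    print(accept(group_secret, packet, bad_tag))        # sender outside the group
    print(accept(group_secret, packet + b"!", good_tag))  # payload altered in transit
```

A node outside the group can still put packets on the wire, but without the secret it cannot produce a tag the receiver will accept, so those packets are discarded.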