I've been reading up on OpenStack and how we can re-create an EC2/S3-style cloud for our internal development and I'm having a hard time finding information on how the OpenStack cloud controller provides redundancy of the cloud management services.
I know I can set up multiple Swift and Nova nodes, but not a single document/article/howto/wiki contains information on:
a) what happens if the cloud controller node dies; and b) how to set up redundant cloud controllers.
It seems to me that, although it is massively scalable, there is a big single-point-of-failure built into OpenStack.
Can anyone with more experience with OpenStack please shed some light on how it all works with regard to high availability?
OpenStack does offer some high-availability configuration options. Two potential single points of failure are the API service (nova-api) and the network service, both of which traditionally run only on a single ("cloud controller") node.
For nova-api, I believe you can just run multiple instances on different physical nodes, since the state is maintained in the external database.
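To illustrate why stateless API instances are easy to make redundant, here is a minimal sketch of a round-robin pool that skips endpoints marked as down. It is not part of OpenStack: the `ApiPool` class and the endpoint URLs are hypothetical, and in practice a load balancer in front of the nova-api instances would usually play this role.

```python
from itertools import cycle


class ApiPool:
    """Round-robin over API endpoints, skipping ones marked as down.

    Hypothetical helper for illustration only; not an OpenStack API.
    """

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.down = set()
        self._ring = cycle(self.endpoints)

    def mark_down(self, endpoint):
        self.down.add(endpoint)

    def mark_up(self, endpoint):
        self.down.discard(endpoint)

    def next_endpoint(self):
        # Try each endpoint at most once per call.
        for _ in range(len(self.endpoints)):
            candidate = next(self._ring)
            if candidate not in self.down:
                return candidate
        raise RuntimeError("no healthy API endpoint available")


# Hypothetical hostnames; any instance can serve any request because
# the state lives in the shared external database, not in the API process.
pool = ApiPool([
    "http://api1.example.internal:8774",
    "http://api2.example.internal:8774",
])
first = pool.next_endpoint()
pool.mark_down("http://api1.example.internal:8774")
after_failure = [pool.next_endpoint() for _ in range(3)]
```

Because no request depends on which instance handled the previous one, losing a node only shrinks the pool; it does not lose any state.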
To run the network service in high-availability mode, set the --multi_host configuration option in your nova config file. See the OpenStack documentation on Existing High Availability Options for Networking.

I haven't worked with OpenStack myself, but if the cloud controller truly is a single point of failure, one way to guard against it would be to dedicate two servers to that role and set up Heartbeat v2 (or Corosync/Pacemaker, its present-day equivalent) between them in active-passive mode.
That way if the primary server dies for whatever reason, the other one picks up its workload in (milli)seconds.
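The active-passive idea can be sketched as a toy model: the standby takes over once the primary's heartbeat has been silent for longer than a timeout. This is only an illustration of the concept, not how Heartbeat or Pacemaker are implemented or configured; the `FailoverMonitor` class, the timeout value, and the timestamps are all made up for the example.

```python
class FailoverMonitor:
    """Toy active-passive failover: promote the standby when the
    primary's heartbeat is older than `timeout` seconds.

    Hypothetical model for illustration; real cluster managers also
    handle fencing, quorum, and resource migration.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = 0.0
        self.active = "primary"

    def heartbeat(self, now):
        # Record the time of the most recent heartbeat from the primary.
        self.last_heartbeat = now

    def check(self, now):
        # Promote the standby once the primary has been silent too long.
        # There is deliberately no automatic failback in this sketch.
        if self.active == "primary" and now - self.last_heartbeat > self.timeout:
            self.active = "standby"
        return self.active


monitor = FailoverMonitor(timeout=2.0)
monitor.heartbeat(now=10.0)
state_while_healthy = monitor.check(now=11.0)
state_after_silence = monitor.check(now=13.5)
```

How quickly the standby takes over is bounded by the timeout: a shorter timeout means faster failover but a higher risk of promoting the standby during a brief network hiccup.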
"...Even better: The controller node now hosts only platform components that are not OpenStack internal services (MySQL and RabbitMQ are standard Linux daemons). So the cloud administrator can afford to pass the administration of them to an external entity, Database Team, a dedicated RabbitMQ cluster. This way, the central controller node disappears and we end up with a bunch of compute/API nodes, which we can scale almost linearly."
http://www.mirantis.com/blog/117072/