I'm interested in building two fault-tolerant, redundant NFS servers with failover on Amazon EC2. I'm familiar with tools and technologies like DRBD, Heartbeat, etc. Does Amazon provide any specific way of achieving this through their platform?
A suitable example might be that files are kept on a separate, redundant EBS volume; if a failure occurs, a new instance is automatically launched from a pre-built AMI, the EBS volume is mounted, and the IP address is transitioned seamlessly.
Is this possible? Are there better platforms than Amazon? Can you give me a broad idea of the underlying architecture we're talking about to pull this off?
On AWS, using GlusterFS with an Elastic Load Balancer and auto scaling EC2 instances should achieve what you want. I can't comment about any other IaaS.
Amazon does provide some of what you need to achieve your objective - and allows you to implement the rest.
Amazon's EC2 servers are essentially VPSes - you can set up Heartbeat/Corosync/Pacemaker, etc. on them (although last time I checked, you cannot use broadcast on their network - you can use unicast though, e.g. Corosync's udpu transport).
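To give a rough idea of what the unicast setup looks like, here is a minimal sketch that renders a corosync.conf using the udpu transport. It assumes the Corosync 2.x config layout (a `nodelist` section and `corosync_votequorum`); the function name, the cluster name and the private IPs are made up for illustration.

```python
def render_corosync_conf(node_addrs, cluster_name="nfs-ha"):
    """Render a corosync.conf (2.x style, assumed) that uses unicast UDP."""
    # One node block per cluster member; nodeid values start at 1.
    nodes = "\n".join(
        f"    node {{\n        ring0_addr: {addr}\n        nodeid: {i}\n    }}"
        for i, addr in enumerate(node_addrs, start=1)
    )
    # two_node relaxes quorum rules for a two-member cluster.
    two_node = "    two_node: 1\n" if len(node_addrs) == 2 else ""
    return (
        "totem {\n"
        "    version: 2\n"
        f"    cluster_name: {cluster_name}\n"
        "    transport: udpu\n"
        "}\n"
        "nodelist {\n"
        f"{nodes}\n"
        "}\n"
        "quorum {\n"
        "    provider: corosync_votequorum\n"
        f"{two_node}"
        "}\n"
    )


if __name__ == "__main__":
    # Hypothetical private addresses of the two NFS servers.
    print(render_corosync_conf(["10.0.0.11", "10.0.0.12"]))
```

Since there is no multicast or broadcast discovery, every member has to be listed explicitly, so the file needs regenerating whenever an instance is replaced.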
You mention two ideas which Amazon addresses (somewhat) separately: fault tolerance and redundancy.
There is no built-in mechanism for redundancy on EC2, although depending on what you are looking for, there are some ways to achieve it.
Fault tolerance, on the other hand, is better provided for by the Amazon platform:
In addition to the above, you can pass custom parameters to your newly launched instances, or retrieve information about your currently running instances fairly easily, which may allow you to script some of the setup. And, of course, AWS has an API that lets you script all the actions they offer, including remapping an Elastic IP address, launching new instances, detaching/attaching EBS volumes, etc.
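As a rough sketch of what scripting those API actions could look like, the function below moves a data volume and an Elastic IP over to a standby instance. It assumes `ec2` is a boto3 EC2 client (e.g. `boto3.client("ec2")`) and that the Elastic IP is identified by an allocation ID; the function name, the IDs and the device name are placeholders, and a real script would need error handling and fencing of the failed node.

```python
def fail_over(ec2, volume_id, standby_instance_id, allocation_id,
              device="/dev/sdf"):
    """Move the data volume and the Elastic IP to the standby instance.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    # Detach only if the volume is still attached somewhere.
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    if volume["Attachments"]:
        # Force the detach, since a crashed instance may not release it.
        ec2.detach_volume(VolumeId=volume_id, Force=True)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the volume to the standby and wait until it is in use.
    ec2.attach_volume(VolumeId=volume_id,
                      InstanceId=standby_instance_id,
                      Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])

    # Remap the Elastic IP so clients reach the standby.
    ec2.associate_address(AllocationId=allocation_id,
                          InstanceId=standby_instance_id,
                          AllowReassociation=True)
```

Something like Heartbeat or Pacemaker on the standby would call this when it decides the primary is dead; the standby still has to mount the filesystem and start the NFS service afterwards.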
You described 'files are kept on a separate, redundant EBS...[which is then] mounted'. Firstly, on EC2, an EBS volume can only be attached to one instance at a time (so a volume has to be attached to an instance before you can copy data to it). It is up to you to maintain redundancy (you can set up RAID arrays of EBS devices, or do pretty much anything else). The problem, though, is that sometimes EBS volumes are not detached when an instance actually crashes. You can force detach them (which has a better, but not perfect, success rate), and you can snapshot an EBS volume, even while it is in use (you could then create a new EBS volume from the snapshot and use it with an instance launched from your AMI). It is better (lower time to recover, more flexible, etc.), though, to maintain replicas of your data across multiple instances, as opposed to across multiple EBS volumes on the same instance.
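For the snapshot route, a minimal sketch might look like the following. Again `ec2` is assumed to be a boto3 EC2 client, and the function name and description string are just illustrative. Note that a snapshot of a mounted, in-use volume is only crash-consistent unless you quiesce the filesystem first.

```python
def clone_volume_via_snapshot(ec2, volume_id, availability_zone):
    """Snapshot a (possibly in-use) volume and build a new volume from it.

    `ec2` is assumed to be a boto3 EC2 client. Returns the new volume ID.
    """
    # Take the snapshot; this works even while the volume is attached.
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="NFS data volume recovery snapshot",
    )
    snapshot_id = snapshot["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Create the replacement volume in the zone of the replacement instance;
    # a volume can only be attached to an instance in the same zone.
    new_volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    new_volume_id = new_volume["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume_id])
    return new_volume_id
```

This is noticeably slower than failing over to a live replica, since the snapshot has to complete first, which is part of why replicating across instances tends to recover faster.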
Another option is to use Zadara Storage, which offers NFS "as a service". Because it is a service, you don't need to manage the NFS server stack, and it is HA by default. You don't even need to pay for the NFS server instances. You can connect all the EC2 machines to your shares using standard NFS.
Disclosure: I'm with Zadara Storage.