As a follow-up to this very popular question: Why is DNS failover not recommended?, I think it was agreed that DNS failover is not 100% reliable due to caching.
However, the highest-voted answer did not really discuss what the better solution is for achieving failover between two different data centers. The only solution presented was local load balancing (within a single data center).
So my question is quite simply: what is the real solution for cross-data-center failover?
This started off as a comment...but it's getting too long.
Sadly, most of the answers to the previous question are wrong: they assume that failover has something to do with the TTL. The top-voted answer is SPECTACULARLY wrong, and notably cites no sources. The TTL applies to the zone record as a whole and has nothing to do with Round Robin.
From RFC 1794 (which is all about Round Robin DNS serving)
(IME it's nearer to 3 hours before you get full propagation).
From RFC 1035
RFC 1034 sets out the requirements for negative caching - a method for indicating that all requests must be served fresh from the authoritative DNS server (in which case the TTL does control failover) - in my experience, support for this varies.
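To see the TTL a caching resolver is actually handing back (and how it counts down between queries), a quick sketch like the one below works. It assumes the third-party dnspython library, and example.com is just a placeholder name.

```python
# Rough sketch: inspect the TTL and addresses returned for an A record set.
# Assumes the third-party dnspython library (pip install dnspython);
# example.com is only a placeholder.
import dns.resolver

answer = dns.resolver.resolve("example.com", "A")
print("TTL on this answer:", answer.rrset.ttl)
for rr in answer:
    print("A record:", rr.address)

# Run this repeatedly against a caching resolver and you will see the TTL
# counting down; only once it hits zero does the resolver go back to the
# authoritative server - which is why record changes take time to be seen.
```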
Since any failover would have to be implemented high in the client stack, it's arguably not part of TCP/IP or DNS - indeed, SIP, SMTP, RADIUS and other protocols running on top of TCP/IP define how the client should work with Round Robin - RFC 2616 (HTTP/1.1) is notable for not specifying how a client should behave.
However, in my experience, every browser and most other HTTP clients written in the last 10 years will transparently check additional A RRs if the connection appears to be taking longer than expected. And it's not just me:
Failover times vary by implementation but are in the region of seconds. It's not an ideal solution, since (due to the limits of DNS) publishing the removal of a failed node takes up to the DNS TTL to propagate - in the meantime you have to rely on client-side detection.
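To make the client-side behaviour concrete, here is a rough sketch of the fallback logic: resolve every A record for the name and try each address in turn with a short connect timeout, skipping any node that doesn't answer. The hostname, port and timeout are placeholders, and real browsers use more elaborate logic than this.

```python
# Minimal sketch of client-side failover across Round Robin A records:
# resolve all addresses for the name, then try each with a short timeout.
# www.example.com, port 80 and the 3-second timeout are placeholder values.
import socket

def connect_with_failover(host, port, timeout=3.0):
    last_error = None
    # getaddrinfo returns every A (and AAAA) record the resolver knows about
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, proto=socket.IPPROTO_TCP):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock  # first address that answers wins
        except OSError as err:
            last_error = err  # dead or unreachable node - try the next RR
    raise last_error or OSError("no addresses returned for %s" % host)

if __name__ == "__main__":
    sock = connect_with_failover("www.example.com", 80)
    print("connected to", sock.getpeername())
    sock.close()
```

The failover time the user sees is roughly the connect timeout multiplied by the number of dead addresses tried before a live one answers, which is why it ends up in the region of seconds rather than the TTL.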
Round Robin is not a substitute for other HA mechanisms within a site, but it does complement them (the guys who wrote HAProxy recommend using a pair of installations accessed via Round Robin DNS). It is the best-supported mechanism for implementing HA across multiple sites: indeed, as far as I can determine, it is the only supported mechanism for failover available on standard clients.
A whole data center would need to go down or be unreachable for this to apply. Your backup at another data center would then be reached by routing the same IP addresses to the other data center: when the BGP route announcements from the primary data center are no longer provided, the announcements from the secondary data center are used instead.
Smaller businesses are generally not large enough to justify the expense of portable IP address allocations and their own autonomous system number to announce BGP routes with. In that case, a provider with multiple locations is the way to go.
You either have to be reachable via your original IP addresses, or via a change of IP address published through DNS. Since DNS is not designed to do this in the way "failover" requires (users can be cut off for at least as long as your TTL, or the TTL imposed by some caching resolvers), failing over to the backup site with the same IPs is the best solution.
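The announce/withdraw decision is normally driven by a health check rather than done by hand. Purely as an illustration, below is the kind of watchdog a software BGP speaker such as ExaBGP can run as a helper process: it keeps a service prefix announced while a local check passes and withdraws it when the check fails, so the secondary site's announcement takes over. The prefix, check URL and exact command strings are assumptions and depend on your daemon and setup.

```python
# Sketch of a health-check helper for a software BGP speaker (ExaBGP-style):
# announce the service prefix while the local service answers, withdraw it
# otherwise so traffic shifts to the other data center's announcement.
# 192.0.2.0/24, the URL and the command strings are placeholder assumptions.
import time
import urllib.request

PREFIX = "192.0.2.0/24"
CHECK_URL = "http://127.0.0.1:8080/health"

def service_healthy():
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

announced = False
while True:
    healthy = service_healthy()
    if healthy and not announced:
        print(f"announce route {PREFIX} next-hop self", flush=True)
        announced = True
    elif not healthy and announced:
        print(f"withdraw route {PREFIX} next-hop self", flush=True)
        announced = False
    time.sleep(5)
```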
The simplest approach to dual-DC redundancy would be an L2 MPLS VPN between the two sites, along with maintaining BGP sessions between the two.
You can then essentially just have a physical IP per server and a virtual IP that floats between the two sites (HSRP/VRRP/CARP etc.). Your DNS would point at this virtual IP, and traffic would be directed accordingly.
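Whichever of HSRP/VRRP/CARP holds the virtual IP, you usually also want a hook that fires when the IP actually moves. As a sketch, keepalived (one common VRRP implementation) can call a notify script with the instance name and new state; something along these lines could log or act on the transition. The argument layout follows keepalived's convention as I understand it, while the syslog tag and per-state actions are placeholders.

```python
#!/usr/bin/env python3
# Sketch of a VRRP state-transition hook of the kind keepalived can call
# with arguments like: <INSTANCE|GROUP> <name> <MASTER|BACKUP|FAULT>.
# The syslog tag and the per-state actions below are placeholders.
import sys
import syslog

def main(argv):
    if len(argv) < 4:
        sys.exit("usage: notify.py TYPE NAME STATE")
    _, vrrp_type, name, state = argv[:4]
    syslog.openlog("vip-failover")
    syslog.syslog(f"{vrrp_type} {name} transitioned to {state}")
    if state == "MASTER":
        pass  # this node now holds the virtual IP: start/verify services here
    elif state in ("BACKUP", "FAULT"):
        pass  # the virtual IP has moved (or this node is unhealthy): stand down

if __name__ == "__main__":
    main(sys.argv)
```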
The next consideration would be split brain - but that's another question for another time.
Juniper wrote a good white paper on dual-DC management with MPLS; you can grab the PDF here: http://www.juniper.net/us/en/local/pdf/whitepapers/2000407-en.pdf