We have a set of shared, static content that we serve between our websites at http://sstatic.net. Unfortunately, this content is not currently load balanced at all -- it's served from a single server. If that server has problems, all the sites that rely on it are effectively down, because the shared resources are essential JavaScript libraries and images.
We are looking at ways to load balance the static content on this server, to avoid the single server dependency.
I realize that round-robin DNS is, at best, a low-end (some might even say ghetto) solution, but I can't help wondering -- is round robin DNS a "good enough" solution for basic load balancing of static content?
There is some discussion of this in the [dns] [load-balancing] tags, and I've read through some great posts on the topic.
I am aware of the common downsides of DNS load balancing through multiple round-robin A records (a dig illustration of such a setup follows this list):
- there's typically no heartbeat or failure detection with DNS records, so if a given server in the rotation goes down, its A record must be removed from the DNS entries manually
- the time to live (TTL) must necessarily be set quite low for this to work at all, since DNS entries are cached aggressively throughout the internet
- the client computers are responsible for seeing that there are multiple A records and picking the correct one
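For illustration, here's (roughly) what such a setup looks like on the wire, using a hypothetical hostname and documentation addresses -- multiple A records, each with a short TTL:

    $ dig +noall +answer static.example.com A
    static.example.com.  300  IN  A  192.0.2.11
    static.example.com.  300  IN  A  192.0.2.12
    static.example.com.  300  IN  A  192.0.2.13

Every resolver that asks gets all three records; which one a client tries first is up to the resolver and the client.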
But, is round robin DNS good enough as a starter, better than nothing, "while we research and implement better alternatives" form of load balancing for our static content? Or is DNS round robin pretty much worthless under any circumstances?
Jeff, I disagree: load balancing does not imply redundancy; in fact, it's quite the opposite. The more servers you have, the more likely you are to have a failure at a given instant. That's why redundancy IS mandatory when doing load balancing, but unfortunately there are a lot of solutions which only provide load balancing without performing any health check, resulting in a less reliable service.
DNS round robin is excellent for increasing capacity, by distributing the load across multiple points (potentially geographically distributed). But it does not provide fail-over. You must first describe what type of failure you are trying to cover. A server failure must be covered locally, using a standard IP address takeover mechanism (VRRP, CARP, ...). A switch failure is covered by resilient links from the server to two switches. A WAN link failure can be covered by a multi-link setup between you and your provider, using either a routing protocol or a layer-2 solution (e.g. multi-link PPP). A site failure should be covered by BGP: your IP addresses are replicated over multiple sites, and you announce them to the net only where they are available.
From your question, it seems that you only need to provide a server fail-over solution, which is the easiest solution since it does not involve any hardware nor any contract with an ISP. You just have to set up the appropriate software on your servers for that, and it's by far the cheapest and most reliable solution.
You asked, "What if an haproxy machine fails?" The answer is the same. All the people I know who use haproxy for load balancing and high availability have two machines and run either ucarp, keepalived, or heartbeat on them to ensure that one of them is always available.
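For the curious, a minimal sketch of that pattern with ucarp -- the interface name, addresses, and password here are all made up. The two haproxy machines share a virtual IP, and whichever one is elected master brings it up:

    # run on both haproxy machines (documentation addresses; use each box's own --srcip)
    ucarp --interface=eth0 --srcip=192.0.2.11 --vhid=1 --pass=secret \
          --addr=192.0.2.10 \
          --upscript=/etc/vip-up.sh --downscript=/etc/vip-down.sh
    # /etc/vip-up.sh:   ip addr add 192.0.2.10/24 dev eth0
    # /etc/vip-down.sh: ip addr del 192.0.2.10/24 dev eth0

Clients only ever talk to 192.0.2.10, so a failed balancer is invisible to them.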
Hoping this helps!
As load-balancing, it's ghetto but more-or-less effective. If you had one server that was falling over from the load, and wanted to spread it to multiple servers, that might be a good reason to do this, at least temporarily.
There are a number of valid criticisms of round-robin DNS as load "balancing," and I wouldn't recommend doing it for that other than as a short-term band-aid.
But you say your primary motivation is to avoid a single-server dependency. Without some automated way of taking dead servers out of rotation, it's not very valuable as a way of preventing downtime. (With an automated way of pulling servers from rotation and a short TTL, it becomes ghetto failover. Manually, it's not even that.)
If one of your two round-robined servers goes down, then 50% of your customers will get a failure. This is better than 100% failure with only one server, but almost any other solution that did real failover would be better than this.
If the probability of failure of one server is N, then with two servers the probability that at least one has failed is roughly 2N (precisely 1 - (1 - N)^2 = 2N - N^2). Without automated, fast failover, this scheme increases the probability that some of your users will experience a failure.
If you plan to take the dead server out of rotation manually, you're limited by the speed with which you can do that and the DNS TTL. What if the server dies at 4 AM? The best part of true failover is getting to sleep through the night. You already use HAProxy, so you should be familiar with it. I strongly suggest using it, as HAProxy is designed for exactly this situation.
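As a sketch of what that looks like in HAProxy (hypothetical names and documentation addresses; a real config needs tuning), the health check takes a dead server out of rotation automatically:

    frontend static_in
        bind *:80
        default_backend static_servers

    backend static_servers
        balance roundrobin
        option httpchk GET /ping
        server web1 192.0.2.11:80 check
        server web2 192.0.2.12:80 check

That's the whole point: the 4 AM failure is handled without anyone touching DNS.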
Round robin DNS is not what people think it is. As an author of DNS server software (namely, BIND), I regularly hear from users who wonder why their round robin stops working as planned. They don't understand that even with a TTL of 0 seconds there will be some amount of caching out there, since some caches impose a minimum time (often 30-300 seconds) no matter what.
Also, while your authoritative servers may do round robin, there is no guarantee the ones you care about -- the caches your users talk to -- will. In short, round robin doesn't guarantee any ordering from the client's point of view, only what your authoritative servers hand to a cache.
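You can see both effects from the client side with dig -- hypothetical name, with 192.0.2.53 standing in for your recursive resolver. The second query is answered from cache (note the decremented TTL), and the ordering is whatever the cache decides to hand out:

    $ dig +noall +answer static.example.com A @192.0.2.53
    static.example.com.  300  IN  A  192.0.2.12
    static.example.com.  300  IN  A  192.0.2.11
    $ dig +noall +answer static.example.com A @192.0.2.53
    static.example.com.  287  IN  A  192.0.2.11
    static.example.com.  287  IN  A  192.0.2.12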
If you want real failover, DNS is but one step. It's not a bad idea to list more than one IP address for two different clusters, but I'd use other technology there (such as simple anycast) to do the actual load balancing. I personally despise load-balancing hardware that mucks with DNS, as it usually gets it wrong. And don't forget DNSSEC is coming, so if you do choose something in this area, ask your vendor what happens when you sign your zone.
I've said it several times before, and I'll say it again - if resiliency is the problem then DNS tricks are not the answer.
The best HA systems will allow your clients to keep using the exact same IP address for every request. This is the only way to ensure that clients don't even notice the failure.
So the fundamental rule is that true resilience requires IP routing level trickery. Use a load-balancer appliance, or OSPF "equal cost multi-path", or even VRRP.
DNS on the other hand is an addressing technology. It exists solely to map from one namespace to another. It was not designed to permit very short term dynamic changes to that mapping, and hence when you try to make such changes many clients will either not notice them, or at best will take a long time to notice them.
I would also say that since load isn't a problem for you, you might just as well have another server ready to run as a hot standby. If you use dumb round robin, you have to go and change your DNS records when something breaks anyway, so you might just as well flip the hot standby server into action instead and leave your DNS alone.
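A sketch of that flip, assuming the standby already has the content and 192.0.2.10 is the (hypothetical) service address -- you move the IP, not the DNS:

    # on the hot standby, take over the service IP
    ip addr add 192.0.2.10/24 dev eth0
    # gratuitous ARP so neighbours update their caches right away
    arping -U -I eth0 -c 3 192.0.2.10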
I've read through all the answers, and one thing I didn't see mentioned is that most modern web browsers will try one of the alternative IP addresses if a server is not responding. If I remember correctly, Chrome will even try multiple IP addresses and continue with the server that responds first. So in my opinion, DNS round robin load balancing is always better than nothing.
BTW: I see DNS round robin more as a simple load-distribution solution.
It's remarkable how many of the contributors are helping spread disinformation about DNS round robin as a load-spreading and resilience mechanism. It usually does work, but you do need to understand how it works, and to avoid the mistakes caused by all that disinformation.
1) The TTL on DNS records used for Round robin should be short - but NOT ZERO. Having the TTL at zero breaks the main way that resilience is provided.
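For the sake of a concrete (entirely hypothetical) example, a round-robin zone fragment with a short but non-zero TTL looks like this:

    ; fragment of the example.com zone - 300-second TTL, one A record per server
    $TTL 300
    static    IN  A  192.0.2.11
    static    IN  A  192.0.2.12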
2) DNS RR spreads load, but does not balance it. It spreads it because, over a large client base, clients tend to query the DNS server independently and so end up with different first-choice DNS entries. Those different first choices mean the clients are serviced by different servers, and the load is spread out. But it all depends on which device is doing the DNS query and how long it holds the result. A common example: all the clients behind a corporate proxy (which performs the DNS query for them) will end up targeting a single server. Load is spread - but it isn't balanced evenly.
3) DNS RR provides resilience as long as the client software implements it properly (and neither the TTL nor the user's attention span is too short). This is because DNS round robin provides an ordered list of server IP addresses, and the client software should try to contact each one of them in turn, until it finds a server that accepts the connection.
So if the first-choice server is down, the client's TCP/IP connection attempt times out and, provided neither the TTL nor the attention span has expired, the client software makes another connection attempt to the second entry in the list - and so on until the TTL expires, or it gets to the end of the list (or the user gives up in disgust).
A long list of broken servers (your fault) and large TCP/IP connect-retry limits (a client configuration misfeature) can make for a long period before the client actually finds a working server. Too short a TTL means the client never gets to work its way to the end of the list; instead it issues a new DNS query and gets served a new list (hopefully in a different order).
Sometimes the client gets unlucky and the new list still starts with broken servers. To give the system the best chance of providing client resilience, you should ensure the TTL is longer than the typical attention span plus the time the client needs to get to the bottom of the list.
Once the client has found a working server it should remember it, and when it needs to make the next connection it should not repeat the search (unless the TTL has expired). A longer TTL reduces the frequency with which users experience a delay while the client searches for a working server - giving a better experience.
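You can watch a well-behaved client do this with curl against a hypothetical round-robined name; when the first address refuses the connection, curl moves down the list within the same request (output abridged):

    $ curl -v --connect-timeout 3 http://static.example.com/
    *   Trying 192.0.2.11:80...
    * connect to 192.0.2.11 port 80 failed: Connection refused
    *   Trying 192.0.2.12:80...
    * Connected to static.example.com (192.0.2.12) port 80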
4) DNS TTL comes into its own when you want to manually change the DNS records (e.g. to remove a long-term broken server): a short TTL allows that change to propagate quickly (once you have got around to making it). So consider the balance between how long it will take before you know about the issue and make that manual change, and the fact that normal clients will only do a new search for a working server once the TTL expires.
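If your DNS service accepts dynamic updates, that manual change can at least be scripted; a sketch with nsupdate, using a hypothetical zone, key file, and address:

    # pull the dead server's A record (assumes RFC 2136 dynamic updates are enabled)
    nsupdate -k /etc/bind/update.key <<'EOF'
    server ns1.example.com
    update delete static.example.com. A 192.0.2.11
    send
    EOF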
DNS round robin has two outstanding features that make it very cost-effective in a wide range of scenarios: first, it's free; second, it is almost as geographically dispersed as your client base.
It does not introduce a new 'unit of failure', which all the other 'clever' systems do. There are no added components that could experience a common and simultaneous failure across a whole load of inter-linked elements.
The 'clever' systems are great and introduce wonderful mechanisms to coordinate and provide seamless balancing and failover, but ultimately the very methods they use to provide that seamless experience are their Achilles heel - the additional complicated thing that can go wrong, and when it does, it will provide a seamless experience of system-wide failure.
So YES, DNS round robin is definitely "good enough" for your first step beyond a single server hosting all your static content in one place.
I'm late to this thread, so my answer will probably just hover alone at the bottom, neglected, sniff sniff.
First off, the right answer to the question is not to answer the question as asked, but to point to Windows Network Load Balancing instead. NLB is mature, well suited to the task, and pretty easy to set up. Cloud solutions come with their own pros and cons, which are outside the scope of this question.
Question
Is DNS round robin "good enough" between, say, 2 or 3 static web servers? Yes, it is better than nothing, because there are DNS providers who will integrate DNS round robin with server health checks, and who will temporarily remove dead servers from the DNS records. So in this way you get decent load distribution and some high availability; and it all takes less than 5 minutes to set up.
But the caveats outlined by others in this thread do apply.
Other solutions
HAProxy is fantastic, but since Stack Overflow is on the Microsoft technology stack, using the Microsoft load-balancing and high-availability tools may mean less admin overhead. Network Load Balancing takes care of one part of the problem, and Microsoft now also has an L7 HTTP reverse proxy / load balancer, Application Request Routing (ARR).
I have never used ARR myself, but given that it's on its second major release and comes from Microsoft, I assume it has been tested well enough. The docs are easy to follow: there is one piece on how they see the distribution of static and dynamic content across web nodes, and another on how to use ARR with NLB to achieve both load distribution and high availability.
I've always used round-robin DNS, with a long TTL, as a load balancer. It works really well for HTTP/HTTPS services with browsers.
I really do stress "with browsers", as most browsers implement some sort of «retry on another IP» behaviour, but I don't know how other libraries or software would handle the multiple-IP situation.
When the browser doesn't get a reply from one server, it will automatically try the next IP, and then stick with it (until that one is down... and then it tries another one).
Back in 2007, I did the following test:
http://roundrobin.test:10080/ping.php
I let it run for an hour and collected a lot of data. The results: when socket A was disabled, for 99.5% of the hits that would have gone to it I got a hit on either socket B or C instead (I didn't disable both of those at the same time, of course). Browsers tested: iPhone, Chrome, Opera, MSIE 6/7/8, BlackBerry, Firefox 3/3.5... So even the not-that-compliant browsers were handling it right!
To this day, I have never repeated the test, but perhaps I'll set up a new one some day, or release the code on GitHub so that others can test it.
Important note: even though it works most of the time, that doesn't change the fact that some requests will fail. I use it for POST requests too, as my application returns an error message if the request doesn't get through, so the user can send the data again - and in most cases the browser will use another IP and the save will work. And for static content, it works really well.
So if you're working with browsers, do use round-robin DNS, for either static or dynamic content; you'll mostly be fine. Servers can also go down in the middle of a transaction, and even with the best load balancer you can't handle such a case. For dynamic content you have to keep your sessions/database/files synchronized, otherwise you won't be able to handle this (but that's also true with a real load balancer).
Additional note: you can test the behaviour on your own IP using iptables. For example, before your firewall rule for HTTP traffic, add:

    iptables -A INPUT -p tcp --dport 80 --source 12.34.56.78 -j REJECT

(where 12.34.56.78 is obviously your IP). Don't use DROP, as that leaves the port filtered and your browser will wait until timeout.

So now, you can enable or disable one server or the other. The most obvious test is to disable server A, load the page, then re-enable server A and disable server B. When you load the page again, you'll see a little wait from the browser, then it will load from server A again. In Chrome, you can confirm the server's IP by looking at the request in the network panel: in the General tab of Headers, you'll see a fake header named Remote Address: - this is the IP the answer came from.

So, if you need to go into maintenance mode on one server, just disable the HTTP/HTTPS traffic with one iptables REJECT rule, and all requests will go to the other servers (with one little wait, almost not noticeable for users).

Windows Vista & Windows 7 implement client support for round robin differently, as they backported the IPv6 address selection rules (RFC 3484) to IPv4.
So, if you have significant numbers of Vista, Windows 7, and Windows 2008 users, you're likely to see behavior that doesn't match what you planned in your ersatz load-balancing solution.
If you were using RR DNS for load balancing, it would be fine, but you aren't. You're using it to enable a redundant server, in which case it is not fine.
As a previous post said, you need something to detect failure via heartbeat and stop sending traffic to a dead server until it comes back.
The good news is heartbeat is available really cheaply, either in switches or in Windows.
Dunno about other OSs but I assume it's there as well.