I'm about to pull out my hair on this one. I have a rails app hosted on Heroku, using Zerigo Free for DNS. The app is available at:
- smartvark.com (preferred)
- www.smartvark.com (some users insist on typing www, this redirects to remove the www)
- smartvark.heroku.com (not given to users but perhaps useful for comparison in troubleshooting)
Users are intermittently (1 in 50 or so requests) experiencing extremely long load times (~2min) and when I try to triage by watching my server logs, their requests don't seem to hit until the end of the wait period. Typical load times for the site are fast around 200-400ms. I am using NewRelic and it isn't indicating any server load issues, although it picks up the end-user issue with its beacon and charts this time as "Network".
Using Firebug and Chrome devtools I am able to see the timeline when this happens on my machine, and they both show a long wait time before any response, which Firebug classifies as "DNS lookup" and Chrome doesn't seem to classify. After the first response happens, the rest of the site loads very fast.
I'm going to guess that this is because your DNS provider (zerigo.net) is publishing IPV6 records for their DNS servers. Your Windows and MAC clients are using a DNS server that has IPV6 enabled, but doesn't have IPV6 connectivity. This causes a DNS timeout trying to access the DNS servers via IPV6 before failing back to IPV4. Trying turning off IPV6 on the DNS resolver and client machines and see if you get better results.
It turns out the problem was that one of the three round-robin IP addresses that Heroku assigned was bad. They figured this out in response to my support request. Thanks to everyone for your suggestions of methods and tools for chasing this down!
For others encountering this problem, this helped explain it:
http://tiwatson.com/blog/2011-2-17-heroku-no-longer-using-a-global-request-queue
Heroku doesn't use a global request queue, so a single long running request can backlock fast running requests.