This problem, from what I can tell, is isolated to PowerDNS. The servers are running two packages, pdns-static-3.0.1-1.i386.rpm and pdns-recursor-3.3-1.i386.rpm, on the most recent version of Amazon Linux.
The Amazon EC2 load balancers are assigned a CNAME that resolves to multiple addresses. Below is an example of the actual behavior. Notice how the addresses are always returned in the same order.
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
The expected behavior is round robin across the hosts:
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
The addresses do eventually swap, but it seems to be on a 30-minute cache timer; changing the TTL of the record doesn't appear to affect anything. It appears as though the resolver is caching the response. This adversely affects my application because all of the load is sent to only one of the load balancers (Availability Zones), so if I have servers in two zones, only one zone is under load at a time.
Do you know how I can fix this so that the order of the addresses alternates each time the host is resolved?
DIG OUTPUT
; <<>> DiG 9.7.6-P1-RedHat-9.7.6-1.P1.18.amzn1 <<>> cache.domain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54610
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
cache.domain.com. IN A

;; ANSWER SECTION:
cache.domain.com. 100 IN CNAME xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com. 3 IN A aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com. 3 IN A bbb.bbb.bbb.bbb

;; Query time: 0 msec
;; SERVER: ccc.ccc.ccc.ccc#53(ccc.ccc.ccc.ccc)
;; WHEN: Mon Jul 2 15:09:27 2012
;; MSG SIZE rcvd: 130
Recursor config
allow-from=0.0.0.0/0
dont-query=
local-address=127.0.0.1
# Port should be changed to 530 because it's not good to run on the same port as the DNS server
local-port=530
quiet=yes
setgid=pdns
setuid=pdns
disable-packetcache=
packetcache-ttl=0
# Forward the two zones we care about back to the local DNS server
forward-zones=domain.local=LOCALIP,domain.cloud=LOCALIP
# Forward queries for Amazon's domains to the Amazon resolver
forward-zones-recurse=amazonaws.com=172.16.0.23,compute-1.internal=172.16.0.23
SOLUTION
Add the following lines to recursor.conf:
disable-packetcache=
packetcache-ttl=0
Add the following line to pdns.conf:
recursive-cache-ttl=0
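
To check whether the order actually rotates after the change, something like the following Python sketch could be used in place of running host by hand. It is only an illustration: the function names are made up for this example, cache.domain.com is the placeholder name from the question, and it assumes the machine's system resolver points at the PowerDNS server and that no other local cache (nscd, for example) sits in between.

```python
import socket
import time
from collections import Counter


def first_address_counts(answers):
    """Count how often each address appears first across a list of answers."""
    return Counter(addrs[0] for addrs in answers if addrs)


def is_rotating(answers):
    """True if more than one distinct address ever appeared in first position."""
    return len(first_address_counts(answers)) > 1


def resolve_repeatedly(hostname, attempts=10, delay=0.0):
    """Resolve hostname several times, keeping the address order of each answer.

    Uses the system resolver, so the result reflects whatever caching is
    configured on this machine (an assumption of this sketch).
    """
    answers = []
    for _ in range(attempts):
        # gethostbyname_ex returns (canonical_name, aliases, ipv4_addresses)
        _name, _aliases, addrs = socket.gethostbyname_ex(hostname)
        answers.append(addrs)
        time.sleep(delay)
    return answers


# Example usage (needs network access and a resolvable name):
# answers = resolve_repeatedly("cache.domain.com", attempts=10)
# print(first_address_counts(answers))
# print("rotating" if is_rotating(answers) else "fixed order")
```

If the same address comes first in every answer, a cache somewhere along the path is most likely still serving the stored response.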