Unfortunately, the Linux DNS resolver has no direct support for detecting failed DNS servers and failing over. It keeps sending requests to your primary resolving nameserver, waits for the configured timeout, tries again, and only then moves on to the second nameserver.
This typically means a delay of nearly 30s for every request as long as your primary nameserver is unreachable. The resolver never learns to go straight to your secondary nameserver while the trouble lasts.
Even with the best possible configuration, the delay is still measured in seconds per request. For many requests, that's many more seconds.
I wanted to solve this.
Mainly because over at Transloadit our Amazon EC2 resolving nameserver (172.16.0.23) is unreachable too often. When that happens, it causes big delays and queues in some processes, and even downtime, as we rely on domain-to-IP translation. For instance, a customer could tell us to download 1000 images from different URLs, watermark them, and upload them to an SFTP server.
I wanted solid failover to Google / Level3's nameservers in case Amazon's went down again. And then failback as soon as possible, because Amazon can resolve server33.transloadit.com hostnames to LAN IPs where applicable, resulting in lower latency for instance-to-instance communication when encoding machines need to work together.
But whatever the use case, there's a need for a better way to fail over.
Ideally one that does not involve more local proxy daemons, external services, keepalived VRRP IPs, etc., as that would just introduce more complexity and single points of failure. It should be transparent, archaic and at most rely on
So I wrote nsfailover in bash, and have it replace the resolver configuration when needed. It's rugged, easy to debug, hard to break, and has been working really well for us so far.
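To give a feel for the approach, here is a minimal sketch of the idea. It is not the actual nsfailover.sh source: the function names (probe_ns, pick_ns, write_resolv) are made up for illustration, the probe assumes dig is installed, and the test hostname is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the failover idea; NOT the real nsfailover.sh.

# Probe one nameserver: succeeds if it answers a query within 3 seconds.
# Assumes `dig` is available; it exits non-zero when the server gives no reply.
probe_ns() {
  dig +time=3 +tries=1 "@$1" google.com A >/dev/null 2>&1
}

# Print the first nameserver for which the probe command succeeds.
# $1 is the probe command, the remaining arguments are candidates in
# priority order. Returns non-zero if none of them respond.
pick_ns() {
  local probe="$1" ns
  shift
  for ns in "$@"; do
    if "$probe" "$ns"; then
      printf '%s\n' "$ns"
      return 0
    fi
  done
  return 1
}

# Rewrite the resolver config ($1) only if its first nameserver differs
# from the chosen one ($2). Existing search/domain lines are carried over.
write_resolv() {
  local file="$1" ns="$2" current tmp
  current="$(awk '$1 == "nameserver" { print $2; exit }' "$file" 2>/dev/null)"
  [ "$current" = "$ns" ] && return 0
  # Build the new file next to the old one so the final mv is atomic.
  tmp="$(mktemp "${file}.XXXXXX")" || return 1
  {
    printf 'nameserver %s\n' "$ns"
    printf 'options timeout:3 attempts:1\n'
    grep -E '^(search|domain)[[:space:]]' "$file" 2>/dev/null
  } > "$tmp"
  chmod 644 "$tmp"
  mv "$tmp" "$file"
}

# Example call (as root), with Google's and Level3's public resolvers
# as assumed fallbacks:
#   ns="$(pick_ns probe_ns 172.16.0.23 8.8.8.8 4.2.2.2)" \
#     && write_resolv /etc/resolv.conf "$ns"
```

Because the candidates are walked in priority order on every run, failback needs no extra logic in this sketch: the first run after the primary recovers picks it again.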
Running it is pretty simple, too. Configuration such as NS_1 is done via environment variables. Here's an example where I set it globally using export:
    $ export NS_1=172.16.0.23
    $ sudo nsfailover.sh
    2013-03-27 14:18:22 UTC [ info] Best nameserver is primary (172.16.0.23)
    2013-03-27 14:18:22 UTC [ info] No need to change /etc/resolv.conf
Or maybe you also want to define the backup nameserver (defaults to Google's), and just pass the config to this process:
    $ NS_1=126.96.36.199 NS_2=188.8.131.52 sudo -E nsfailover.sh
    2013-03-27 15:01:53 UTC [ info] Best nameserver is primary (184.108.40.206)
    # Written by /srv/current/stack/bin/nsfailover.sh @ 20130327150153
    nameserver 220.127.116.11
    options timeout:3 attempts:1
    search compute-1.internal
    2013-03-27 15:01:53 UTC [emergency] I changed /etc/resolv.conf to primary (18.104.22.168)
Now if you save this in crontab with a timeout:
    $ crontab -e
    # By default, NS_2 is Google, NS_3 is Level3, so only your NS_1 is required:
    * * * * * NS_1=172.16.0.23 timeout -s9 50s nsfailover.sh 2>&1 |logger -t cron-nsfailover
it turns out, Bob indeed is your uncle :)
The logger pipe sends all output to syslog, so to get notified when a failover happens, just scan for [emergency] in the cron-nsfailover tag. In our case, Papertrail receives our syslog and I made it report to Campfire when this happens.
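If you don't ship your logs anywhere, scanning for those events by hand could look roughly like this. It's a small sketch: failover_events is a made-up helper name, and the log path in the usage comment is an assumption that varies per distribution.

```shell
# Hypothetical helper: print only the failover events logged under the
# cron-nsfailover tag. Reads syslog-formatted lines on stdin.
failover_events() {
  grep -F 'cron-nsfailover' | grep -F '[emergency]'
}

# Example call (path is an assumption, e.g. on Debian/Ubuntu):
#   failover_events < /var/log/syslog
```

Both patterns are matched as fixed strings, so the square brackets in [emergency] are not treated as a regex character class.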
There's more documentation available on GitHub. Let me know if you have any improvements or send me a pull request :)