Let's Make DNS Outage Suck Less
By Kevin van Zonneveld (@kvz)
Unfortunately, the Linux DNS resolver has no direct support for detecting failed DNS servers and failing over. It keeps sending requests to your primary resolving nameserver, waits for the configured timeout, attempts again, and only then tries the second nameserver.
This typically means close to 30 seconds of delay for every request, for as long as your primary nameserver is unreachable. The resolver never learns to go straight to your secondary nameserver while the primary is in trouble.
Even with the most optimal configuration, the delay is still measured in seconds per request. Across many requests, that adds up fast.
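For context, the only knobs the stock resolver exposes live in /etc/resolv.conf, and they can shrink the penalty but never remove it. A sketch of that tuning (values are illustrative, not a recommendation):
# /etc/resolv.conf -- illustrative values only
# primary: tried first, on every single query
nameserver 172.16.0.23
# secondary: only consulted after the primary times out
nameserver 8.8.8.8
# wait at most 1 second per server, make a single pass over the list
options timeout:1 attempts:1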
I wanted to solve this.
Mainly because over at Transloadit our Amazon EC2 resolving nameserver (172.16.0.23) is unreachable too often.
When it happens it causes big delays and queues in some processes, and even downtime, as we rely on domain-to-IP translation. For instance, customers could tell us to download 1000 images from different URLs, watermark them, and upload them to an SFTP server.
I wanted solid failover to Google's / Level3's nameservers in case Amazon's went down again. And then a failback as soon as possible, because Amazon can resolve server33.transloadit.com hostnames to LAN IPs where applicable, resulting in lower latency for instance-to-instance communication when encoding machines need to work together.
But whatever the use case, there's a need for a better way to fail over.
Ideally one that does not involve more local proxy daemons, external services, keepalived VRRP IPs, etc., as those would just introduce more complexity and more single points of failure.
It should be transparent, archaic and at most rely on crontab.
So I wrote nsfailover in bash, and have it replace the resolver configuration when needed. It's rugged, easy to debug, hard to break, and has been working really well for us so far.
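To make the approach concrete, here's a stripped-down sketch of the idea (an illustration only, not the actual nsfailover script; the probe domain, the defaults, and the Level3 address are assumptions):
#!/usr/bin/env bash
# Sketch: pick the first nameserver that answers, rewrite /etc/resolv.conf only if it changed
set -u
candidates="${NS_1:-172.16.0.23} ${NS_2:-8.8.8.8} ${NS_3:-4.2.2.2}"
best=""
for ns in ${candidates}; do
  # one probe query with a short timeout; the first responsive server wins
  if dig +time=2 +tries=1 @"${ns}" transloadit.com > /dev/null 2>&1; then
    best="${ns}"
    break
  fi
done
current="$(awk '/^nameserver/ { print $2; exit }' /etc/resolv.conf)"
if [ -n "${best}" ] && [ "${best}" != "${current}" ]; then
  printf 'nameserver %s\noptions timeout:3 attempts:1\n' "${best}" > /etc/resolv.conf
fi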
Running it is pretty simple, too. Configuration such as NS_1 is done via environment variables; here's an example where I set it globally using export:
$ export NS_1=172.16.0.23
$ sudo nsfailover.sh
2013-03-27 14:18:22 UTC [ info] Best nameserver is primary (172.16.0.23)
2013-03-27 14:18:22 UTC [ info] No need to change /etc/resolv.conf
Or maybe you also want to define the backup nameserver (defaults to Google's), and just pass the config to this process:
$ NS_1=8.8.8.8 NS_2=8.8.4.4 sudo -E nsfailover.sh
2013-03-27 15:01:53 UTC [ info] Best nameserver is primary (8.8.8.8)
# Written by /srv/current/stack/bin/nsfailover.sh @ 20130327150153
nameserver 8.8.8.8
options timeout:3 attempts:1
search compute-1.internal
2013-03-27 15:01:53 UTC [emergency] I changed /etc/resolv.conf to primary (8.8.8.8)
Tight!
Now if you save this in crontab with a timeout:
$ crontab -e
# By default, NS_2 is Google, NS_3 is Level3, so only your NS_1 is required:
* * * * * timeout -s9 50s NS_1=172.16.0.23 nsfailover.sh 2>&1 |logger -t cron-nsfailover
it turns out, Bob indeed is your uncle :)
The logger pipe sends all output to syslog, so to get notified when a failover happens, just scan for [emergency] under the cron-nsfailover tag.
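If you don't have centralized logging, a crude local check works too; assuming your distro writes the user facility to /var/log/syslog, something like:
$ grep 'cron-nsfailover' /var/log/syslog | grep '\[emergency\]'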
In our case, Papertrail receives our syslog and I made it report to Campfire when this happens.
There's more documentation available on GitHub. Let me know if you have any improvements, or send me a pull request :)
Legacy Comments (24)
These comments were imported from the previous blog system (Disqus).
We also use nice hostnames for things like our database servers, but we configure those through /etc/hosts to avoid issues. Is there a reason you want to do a full lookup instead?
Hello Paul, glad you asked. Yes, a few reasons:
- Our platform is highly volatile, so that means rewriting our /etc/hosts a lot. Using what data source? Is it centralized? A SPOF? Are we now just building a poor man's DNS system? Incorrect information could end up in there.
- More importantly: some failover mechanisms rely on CNAMEs. E.g. after an Amazon RDS Multi-AZ failover our database server record would point to an invalid local IP if we kept that in /etc/hosts (see the example after this list). Yes, I have been there : )
- Not in our case, but there's also the possibility you're relying on external hostnames (e.g. APIs without fixed IPs)
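To illustrate the CNAME point: an RDS endpoint is a CNAME whose target (and the address behind it) moves on a Multi-AZ failover, so pinning it in /etc/hosts goes stale. A hypothetical lookup from inside EC2 (the hostname and addresses here are made up):
$ dig +short mydb.example123.us-east-1.rds.amazonaws.com
ec2-54-12-34-56.compute-1.amazonaws.com.
10.0.12.34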
Why are you relying on DNS for the DB? That's rule #1: don't use DNS for DB communication. You add an unnecessary layer of complexity to the mix.
Using IPs for communication to Amazon RDS instances is inadvisable. You'd have to use WAN addresses outside of EC2, and LAN addresses inside of EC2. Addressing them by domain resolves this. Additionally I don't think Amazon guarantees your box will be accessible on the same IP addresses after Multi-AZ failovers. I've had problems with this.
If you have to use different IPs for different access paths, you still have that issue with DNS, since the records have to point to an IP. Assuming you are already doing outside.db1.etc.com, that does not matter much.
The same applies to the IP assignment issue: your DNS will still point to a wrong IP if AWS switches your server without notification.
No, we're talking about the Amazon nameserver here, 172.16.0.23. It gives out LAN IPs to LAN requesters and WAN IPs to WAN requesters. Furthermore, as said, Amazon could change the IP after a failover, even just locally. So mydb.cxnb1emspje3.us-east-1... first points to 10.118.132.81, then to 10.218.131.72 inside the LAN, while outside it switches from 184.71.171.81 to 182.71.161.1. Their nameserver is goofy like that (for good reason).
How about the timeout & attempts options in resolv.conf? Or am I missing something?
Unfortunately it will keep trying the primary server with every request, meaning every request pays the attempts + timeout penalty. Even if you drop the timeout to 1 second (which could result in false positives), if you make 5 requests, that's 5 seconds. For as long as your nameserver is unreachable.
Maybe you should look into anycast DNS
This little program is for DNS consumers (even though my servers are servers, they are DNS consumers).
What you're suggesting is something a DNS provider should do. Google and Level3 have it, so those are my default fallback servers now (it's still possible you cannot reach one of their networks though, if they e.g. mess up their BGP config).
As for Amazon, their resolver must be addressed locally here as it will decide to hand you 10.0.0.1 or 123.123.123.123 for their instance hostnames, based on who's asking.
Obviously there are many more cases where you want to rely on a local nameserver if it's available. For instance I know that Spotify actually uses DNS as a distributed datastore for service discovery & configuration management http://labs.spotify.com/201...
ncsd is your friend :)
And so is spellcheck, I meant to say "nscd" hehe.
Caches still don't solve request timeout penalties for hostnames that are not yet known, right? E.g. what happens on a freshly launched machine when the nameserver goes down?
What are you doing? Best practices exists for a reason.
I am not sure if you want a local resolver or if nscd would suffice, but don't do crazy stuff like this.
First understand why things work the way they do. _Then_ invent your own stuff. Not the other way around.
If you know a better way, please share and we'll discuss! You mention a local resolver. Why install a daemon to solve this problem? A daemon can crash. It must be maintained and monitored. How can I check its reasoning? Will it cache Google's records when in fact Amazon's nameserver may be online again and ready to provide the preferred LAN addresses? By adding another service I'm inviting a ton of extra problems to deal with. As for nscd, it was briefly discussed in another comment already, but when I researched it, it only did caching. What about new hosts being introduced during the outage?
I know as engineers we have the tendency to fill our landscapes with more & higher tech, but stepping back I found that lower tech actually solves more problems than it creates here. It's transparent and robust. It's easy to follow/debug. Hard to crash. I only have to trust 2 pieces of technology: cron, and 30 lines of bash. And this gives me the best resolving the internet can offer at any given minute.
This sounds like a problem dnsmasq could solve. You just have to configure it to check all DNS servers and it'll return the answer provided by whoever answers fastest.
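(For reference, the behavior described here maps to dnsmasq's all-servers setting; a rough sketch of the config, assuming /etc/resolv.conf then points at 127.0.0.1:)
# /etc/dnsmasq.conf -- sketch only
# all-servers: query every upstream in parallel, return the fastest answer
all-servers
server=172.16.0.23
server=8.8.8.8
server=8.8.4.4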
Thanks for the suggestion. That does require proxying your requests through another piece of technology though. Another component that can fail or get confused. In that sense it's less robust / transparent imho.
Also, can you set preferences? In the case of Amazon's nameservers I always want theirs, unless it's down (not merely slower). (Amazon can return LAN addresses where applicable for known hostnames, which lowers latency for instance-to-instance communication.)
Feels like a bit of a complex solution, but I see the problem. I actually installed pdnsd for this issue. Just one package in your default install and it should cover your issue. `apt-cache show pdnsd`
It's a cool package, thanks. And at first sight it seems like the proper/elegant solution, I agree.
But it might be throwing more code at this problem than needed. Mo code, mo problems (http://www.codinghorror.com... just check out its recent updates: http://members.home.nl/p.a.....). It's a good thing that it's maintained, but after 11 years there are still bugs being found. So, what if it crashes (we'd need more software to keep it running), makes mistakes, or keeps around records provided by the secondary nameserver even though the primary is back online and has better records that we prefer for lower latency (in the case of Amazon)?
It's slightly simpler to install but architecturally, proxying requests through another component is not, and in fact is introducing yet another single point of failure. Adding links to a chain means introducing new risk of weakening it.
All software has bugs, probably even your solution; you just haven't found them yet.
It's not much more than adding a DNS resolver to your server, but this time with extra caching capabilities. That is more like moving an essential service closer into your own zone of control.
For servers that do many DNS lookups it makes sense to run a local DNS resolver. For servers that can't do without, adding caching to this step seems a pretty straightforward solution.
I actually don't use the Amazon stuff, and it seems you would need further testing to prove your worries right or wrong. The hesitation to add another service seems like a 180-degree turn compared to deploying everything in Amazon and depending on their resolvers.
Yes, I'm just saying: more code = more bugs. And moving something in between Linux & the resolvers that can crash (for instance because of the dangling pointer bug they had) adds additional risk. On top of that, it also caches records from less favorable resolvers.
The reason we chose Amazon is that sometimes we need to scale up to 100 servers. That's hard using other vendors. I don't see, however, how that's a reason to add more points of failure when I don't strictly have to.
As someone who has administered systems for 15+ years, I agree this is the right approach. It's the least complex and if crontab fails then resolution continues to work as well as it always has.
It requires me to make almost no change to my existing configuration and install no software, which is helpful in heterogeneous environments with multiple distros, software vendors, departments, politics and so forth.
The only thing better than your solution would be if the Linux DNS client code were enhanced to operate more like your script causes it to operate. :)
It's interesting to note that Windows does not suffer the same shortcomings as Linux in this regard. Your script actually causes the Linux resolver to operate in a manner more similar to the approach MS took, which is to lower the priority of servers that are not answering queries.
A hosts file is never the right approach, unless you check the calendar, see the year is 1969, and you're administering systems connected to ARPANET. There are many scenarios where A, PTR, CNAME, MX, TXT, SRV and other records are the right solution for name resolution. DNS is used in host name resolution as well as in high-availability failover scenarios. Having that central database and using it any time two systems need to talk is definitely the right way to set up systems with regard to best practices. Caching is always used to resolve any performance issues; nscd is the most common example of that. Again, nscd is for performance, not failover mitigation. Anyone who manually edits their hosts file for name-to-IP mappings probably just doesn't understand DNS, or has a very small environment to manage.
Another thing I wanted to mention: I saw in the nscd or nscd.conf man page that if you make changes to resolv.conf, you need to restart nscd for those changes to be picked up. See the man page for details; I do not recall the specifics. Basically, this may be important info for anyone using this script with nscd. This script solves the failover problem, but we still need something like nscd or sssd to cache lookups for performance reasons. That being said, caching lookups might be undesirable in certain environments.
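For anyone combining the two: the restart the man page calls for would be something along these lines right after the script rewrites resolv.conf (the exact service command varies by distro; this is an assumption, not part of nsfailover):
$ sudo service nscd restart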
I really love the idea of your script, and I believe this is the most non-disruptive way to fall back to secondary DNS servers. But how do you get it to work with a resolvconf-based infrastructure (as opposed to the traditional resolv.conf)?