So my company has a number of domains with a large registrar that shall go unnamed. We are making some changes to our DNS infrastructure, and the first of those is moving our secondary DNS from one on-site server to four off-site servers. We updated the name servers for each domain at the registrar by removing the entry for the old secondary name server and adding the four new ones. I monitored the old secondary server for requests, and when I saw that no new requests had been made for 24 hours, I shut it down. That was this morning. I assumed at this point everything was good. Unfortunately, this was my mistake: I should have checked that name servers at large were returning the correct NS records.
This afternoon we shut down our primary DNS server to perform maintenance on it. That is when I started getting alerts from our external monitoring. I checked, and sure enough, the DNS server used by the monitoring service reported that the only NS record for our primary domain was the primary name server. The new secondary servers were not listed, and neither was the old secondary.
Is it unreasonable of me to have assumed that because the update was from
ns1.mydomain.com
ns2.mydomain.com
to
ns1.mydomain.com
ns1.backupdns.com
ns2.backupdns.com
ns3.backupdns.com
ns4.backupdns.com
in one step at the registrar, there should be no intermediate state where the only NS record was for ns1.mydomain.com?
Going forward, to be safe, I will obviously always leave the old name servers alone until I'm 100% sure the new ones have propagated, and only then remove the old name servers at the registrar. However, I'd still like to know whether my registrar screwed up or my expectation was unreasonable.
YES.
Generally speaking, it is unreasonable for you to make ANY assumption about ANY change performed through control panel software (except the standard assumption that it's going to screw up somehow).
That includes DNS registrar management interfaces (which are usually pretty awful on the back-end).
The changes you made were probably processed as two separate transactions (one removing the old server, one adding the new ones), and someone got your DNS information after the first transaction, but before the second.
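To make that failure mode concrete, here is a minimal sketch of what a resolver could observe if the back end commits the removal and the addition separately. It is only a model: the function name is hypothetical, and nothing here claims to describe how any particular registrar actually processes updates.

```python
def apply_as_two_transactions(current, remove, add):
    """Return every NS set visible if removal and addition commit separately."""
    state = set(current)
    seen = [frozenset(state)]   # what was published before the change
    state -= set(remove)        # transaction 1: drop the old secondary
    seen.append(frozenset(state))
    state |= set(add)           # transaction 2: add the new secondaries
    seen.append(frozenset(state))
    return seen


old_list = {"ns1.mydomain.com", "ns2.mydomain.com"}
new_secondaries = {f"ns{i}.backupdns.com" for i in range(1, 5)}

states = apply_as_two_transactions(
    old_list, remove={"ns2.mydomain.com"}, add=new_secondaries
)
for state in states:
    print(sorted(state))
```

The middle state contains only ns1.mydomain.com, which matches what your monitoring saw. Anyone who cached that state keeps it until the TTL expires.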
You got bit here because you kind-of Did It Wrong - though in a way that many of us do.
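For the future, when decommissioning DNS servers or replacing them with new ones, the safe workflow is:
1. Bring up the new DNS servers and confirm they are serving your zone correctly.
2. Add the new DNS servers to the list at your registrar.
3. Wait for the change to propagate.
   TTL-dependent, but usually 24-48 hours is a good rule.
4. Remove the old DNS servers from the list at your registrar.
5. Wait for the change to propagate.
   You should stop seeing queries going to the decommissioned server. As in (3), 24-48 hours is a good rule to go with.
6. Shut down the old DNS servers.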
That workflow guarantees that the worst-case scenario is that someone will have an extra (lame) NS listed because they're using the "Step 2" information, but they will always have all your new secondaries, so they should always be able to find at least one working name server for your domain.
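As a rough sanity check of that claim, the sketch below models each stage of the add-first, remove-later ordering and asks whether a resolver holding either the current list or the previous (stale) list can still reach a live server while the primary is down for maintenance. The stage table and helper names are illustrative assumptions, not a description of real resolver behavior.

```python
PRIMARY = "ns1.mydomain.com"
OLD = "ns2.mydomain.com"
NEW = {f"ns{i}.backupdns.com" for i in range(1, 5)}
ALL = {PRIMARY, OLD} | NEW

L0 = {PRIMARY, OLD}          # list before any registrar change
L1 = {PRIMARY, OLD} | NEW    # list after the addition (step 2)
L2 = {PRIMARY} | NEW         # list after the removal (step 4)

# (label, currently published list, stale list a cache may hold, live servers)
STAGES = [
    ("new servers up", L0, L0, ALL),
    ("after step 2", L1, L0, ALL),
    ("after step 4", L2, L1, ALL),
    ("old server shut down", L2, L1, ALL - {OLD}),
]


def usable(ns_list, live, down=frozenset()):
    """Servers in ns_list that are running and not taken down for maintenance."""
    return (set(ns_list) & set(live)) - set(down)


def check(stages, down):
    """Map each stage to True if both possible lists still reach a server."""
    return {
        label: all(usable(lst, live, down) for lst in (published, stale))
        for label, published, stale, live in stages
    }


report = check(STAGES, down={PRIMARY})
print(report)

# Worst case at the end: the stale "Step 2" list has one lame entry.
print(L1 - (ALL - {OLD}))

# The combined one-step change: a cached list of only the primary fails.
print(usable({PRIMARY}, ALL - {OLD}, down={PRIMARY}))
```

Every stage stays reachable with the primary down, and the only cost is the single lame entry for the old secondary. The last line shows the contrast: a cached list containing only the primary leaves nothing to fall back on.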
You combined steps 2, 3, 4, and 5 into one step, and on the back end the removal (4) happened before the addition (2).
Chances are that would never have caused a problem except for your maintenance happening before everyone caught up with the "addition" part of the changes. It's a classic edge case and you landed on it.
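One way to confirm that everyone has caught up before doing maintenance is to collect the NS answers from several outside resolvers (for example by running `dig +short NS mydomain.com @<resolver>` against each) and compare them with the list you expect. The helper below is a minimal sketch of the comparison only. The resolver labels and the observed answers are made-up sample data, and the function names are hypothetical.

```python
def normalize(name):
    """Lower-case a host name and drop the trailing dot DNS tools often print."""
    return name.strip().rstrip(".").lower()


def missing_nameservers(expected, observed_by_resolver):
    """Map each resolver to the expected name servers it did not return."""
    want = {normalize(n) for n in expected}
    return {
        resolver: want - {normalize(n) for n in answers}
        for resolver, answers in observed_by_resolver.items()
    }


expected = ["ns1.mydomain.com"] + [f"ns{i}.backupdns.com" for i in range(1, 5)]

# Hypothetical sample answers; real ones would come from live queries.
observed = {
    "resolver-a": ["ns1.mydomain.com."],
    "resolver-b": [
        "NS1.mydomain.com.",
        "ns1.backupdns.com.",
        "ns2.backupdns.com.",
        "ns3.backupdns.com.",
        "ns4.backupdns.com.",
    ],
}

gaps = missing_nameservers(expected, observed)
safe_to_proceed = all(not missing for missing in gaps.values())

for resolver, missing in gaps.items():
    print(resolver, sorted(missing))
print("safe to proceed:", safe_to_proceed)
```

If any resolver still reports a gap, it is worth waiting longer before touching the primary.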
Now you know, and knowing is 7/16ths of the battle.