Over the holiday weekend, one of our clients experienced a power outage. When everything came back online, most devices seemed to be fine, but a few (one of our ESXi hosts and a number of VDIs) could not get a proper IP address. They were falling back to 169.254.x.x APIPA addresses from Windows. I looked at the DHCP logs, and from the time the power outage occurred there had been 0 leases or renewals. It was as if DHCP had sat idle the entire weekend.
I bounced the server and, all of a sudden, the leases started pouring in. Everything that had been on APIPA got a normal address and things went back to normal.
My question is: Is there some sort of setting with DHCP that's causing it to act like this? I feel like a hard outage shouldn't break DHCP, especially if it's getting a fresh reboot.
I'd like to figure out what happened so that if another outage occurs, we don't run into the same issues.
Log timeline:
11/25 11:15 PM, server started after outage:
00,11/25/20,23:12:23,Started,,,,,0,6,,,,,,,,,0
64,11/25/20,23:12:23,No static IP address bound to DHCP server,,,,,0,6,,,,,,,,,0
Around an hour later, devices start losing their addresses:
24,11/26/20,00:00:19,Database Cleanup Begin,,,,,0,6,,,,,,,,,0
18,11/26/20,00:00:19,Expired,10.x.x.16,,,,0,6,,,,,,,,,0
18,11/26/20,00:00:19,Expired,10.x.x.18,,,,0,6,,,,,,,,,0
18,11/26/20,00:00:19,Expired,10.x.x.19,,,,0,6,,,,,,,,,0
etc...
A couple of hours after that, the entries start being deleted:
24,11/26/20,03:12:24,Database Cleanup Begin,,,,,0,6,,,,,,,,,0
16,11/26/20,03:12:24,Deleted,10.x.x.16,,,,0,6,,,,,,,,,0
16,11/26/20,03:12:24,Deleted,10.x.x.18,,,,0,6,,,,,,,,,0
16,11/26/20,03:12:24,Deleted,10.x.x.19,,,,0,6,,,,,,,,,0
etc...
After that, no activity at all outside of database cleanup:
24,11/26/20,21:12:29,Database Cleanup Begin,,,,,0,6,,,,,,,,,0
25,11/26/20,21:12:29,0 leases expired and 0 leases deleted,,,,,0,6,,,,,,,,,0
25,11/26/20,21:12:29,0 leases expired and 0 leases deleted,,,,,0,6,,,,,,,,,0
24,11/26/20,22:12:29,Database Cleanup Begin,,,,,0,6,,,,,,,,,0
etc... (until reboot)
Today, when I rebooted, everything started getting addresses again:
01,11/30/20,05:17:21,Stopped,,,,,0,6,,,,,,,,,0
00,11/30/20,05:17:26,Started,,,,,0,6,,,,,,,,,0
55,11/30/20,05:17:26,Authorized(servicing),,<redacted>.net,,,0,6,,,,,,,,,0
10,11/30/20,05:17:26,Assign,10.x.x.16,<redacted>
10,11/30/20,05:17:26,Assign,10.x.x.18,<redacted>
10,11/30/20,05:17:26,Assign,10.x.x.74,<redacted>
etc...
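For anyone comparing the two startups in a log like this, here is a rough Python sketch that splits an audit log into service runs (each beginning at event 00) and notes whether each run logged event 64 ("No static IP address bound to DHCP server"), event 55 ("Authorized(servicing)"), or any leases. The sample text is copied from the excerpts above; the function name and the dictionary keys are made up for illustration, and counting event 11 as a renewal is based on the standard audit log legend rather than anything shown in this log.

```python
import csv
from io import StringIO

# A few lines copied from the excerpts above (trimmed for brevity).
SAMPLE_LOG = """\
00,11/25/20,23:12:23,Started,,,,,0,6,,,,,,,,,0
64,11/25/20,23:12:23,No static IP address bound to DHCP server,,,,,0,6,,,,,,,,,0
24,11/26/20,00:00:19,Database Cleanup Begin,,,,,0,6,,,,,,,,,0
18,11/26/20,00:00:19,Expired,10.x.x.16,,,,0,6,,,,,,,,,0
01,11/30/20,05:17:21,Stopped,,,,,0,6,,,,,,,,,0
00,11/30/20,05:17:26,Started,,,,,0,6,,,,,,,,,0
55,11/30/20,05:17:26,Authorized(servicing),,<redacted>.net,,,0,6,,,,,,,,,0
10,11/30/20,05:17:26,Assign,10.x.x.16,<redacted>
"""


def summarize_runs(log_text):
    """Group log rows into service runs, one per 'Started' (00) event."""
    runs = []
    for row in csv.reader(StringIO(log_text)):
        # Skip blank lines, headers, and anything without a numeric event ID.
        if len(row) < 4 or not row[0].strip().isdigit():
            continue
        event_id = row[0].strip()
        if event_id == "00":
            runs.append({
                "started": f"{row[1]} {row[2]}",
                "no_static_ip": False,  # event 64 seen in this run
                "authorized": False,    # event 55 seen in this run
                "leases": 0,            # events 10 (assign) and 11 (renew)
            })
        elif runs:
            run = runs[-1]
            if event_id == "64":
                run["no_static_ip"] = True
            elif event_id == "55":
                run["authorized"] = True
            elif event_id in ("10", "11"):
                run["leases"] += 1
    return runs


if __name__ == "__main__":
    for run in summarize_runs(SAMPLE_LOG):
        print(run)
```

Run against the excerpts, the first startup shows event 64 with no authorization and no leases, while the startup after the reboot shows authorization followed by assignments. That difference is the main clue in the logs.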
This is not really an answer to your question, as I feel there is not enough information to determine what happened.
To avoid depending on DHCP, many system administrators prefer to give their servers fixed IP addresses. That being said, I prefer giving as many devices as possible DHCP leases, so there is a central database of IP address information.
For laptops a short lease time (e.g. 2-4 hours) is okay. The DHCP client will refresh its lease after half the lease time (i.e. 1-2 hours) which is ideal for people who do not work in the same spot 8 hours. You can also set a longer lease time, e.g. 8 hours.
However, for servers and printers, and in general any device with a DHCP reservation, you can increase the lease time greatly, as they will never get a different IP address anyway. If you set it to, e.g., 30 days, the client will ask for a renewal after 15 days, and if your DHCP server remains down for more than 15 days you have bigger problems than your other servers not getting an IP address.
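To make the arithmetic concrete, here is a small Python sketch of the client timers. It assumes the RFC 2131 defaults, where the client first tries to renew at 50% of the lease (T1) and starts rebinding at 87.5% (T2); a server can override both with DHCP options, so treat these as defaults only. The function name is invented for the example.

```python
from datetime import timedelta


def renewal_timers(lease):
    """Return (T1, T2) for a lease, using the RFC 2131 default fractions."""
    t1 = lease * 0.5    # client starts unicast renewal attempts
    t2 = lease * 0.875  # client starts broadcasting for any server
    return t1, t2


if __name__ == "__main__":
    for lease in (timedelta(hours=4), timedelta(hours=8), timedelta(days=30)):
        t1, t2 = renewal_timers(lease)
        print(f"lease {lease}: renew at {t1}, rebind at {t2}")
```

A client keeps its address until the lease actually expires, so with a 30 day lease the server could be down for up to roughly 15 days after the last renewal before that client loses its address.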
I'm not calling this a "Solution" at this point, as we haven't seen another full outage since the last one, but we're testing the idea that (for whatever reason) the NIC hasn't finished starting when the DHCP service starts. That would fit the "No static IP address bound to DHCP server" entry logged right after the first startup. We've set the DHCP service to a delayed start.
It's the only thing we can think of that might explain the odd behaviour.
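For reference, this is roughly how the delayed start could be scripted in Python instead of set through the Services console. It is only a sketch: it assumes the Windows DHCP Server service is registered under the name DHCPServer (check yours first) and shells out to sc.exe, which has to be run from an elevated session. The helper names are made up for the example.

```python
import platform
import subprocess

# Assumption: the DHCP Server role's service name. Verify on your own server.
SERVICE_NAME = "DHCPServer"


def delayed_start_command(service=SERVICE_NAME):
    """Build the sc.exe command that sets a service to Automatic (Delayed Start)."""
    # sc.exe expects "start=" and its value as separate arguments.
    return ["sc.exe", "config", service, "start=", "delayed-auto"]


def apply_delayed_start(service=SERVICE_NAME):
    """Run the command; only meaningful on Windows with admin rights."""
    if platform.system() != "Windows":
        raise RuntimeError("sc.exe is only available on Windows")
    subprocess.run(delayed_start_command(service), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so nothing changes by accident.
    print(" ".join(delayed_start_command()))
```

Calling apply_delayed_start() on the server itself would make the change; the script above only prints the command so it can be reviewed first.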
I'll update this if it turns out that worked, but it may be a while, since outages of that magnitude don't occur very often.