Background
I have a Windows DHCP server (Server 2008 R2) handing out addresses for several scopes. One of those scopes is for some Mitel IP Phones. The phones are configured to use dhcp option 125 to get configuration information. When a phone starts up, it doesn't know what vlan to use, and so it just gets the default (untagged) vlan of whatever port it's connected to. The dhcp server gives it a response that includes option 125 information, and the phone is able to read what vlan it should use from this response. The phone then releases its original address and requests a new dhcp lease using the correct vlan tag. The phones also usually have computers connected to a pass-through port. The packets from the computers are never tagged, and so the PCs will stay on the original (untagged) vlan for the port. This has worked for us for years.
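To make the option 125 step concrete, here is a rough Python sketch of how a client could pull the vlan out of the vendor string. It assumes the string is an `id:` prefix followed by semicolon-separated `key=value` pairs, which is the layout commonly documented for Mitel phones. The sample values and the function name are hypothetical, not taken from this server.

```python
def parse_vendor_string(text):
    """Split an option 125 style vendor string into a dict of settings.

    Assumes 'id:<vendor>;key=value;key=value' (assumed layout, see above).
    """
    settings = {}
    for part in text.strip().split(";"):
        if not part:
            continue
        if part.startswith("id:"):
            settings["id"] = part[3:]
        elif "=" in part:
            key, value = part.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical sample string; real values depend on the site.
sample = ("id:ipphone.mitel.com;sw_tftp=192.168.16.10;"
          "call_srv=192.168.16.10;vlan=16;l2p=6;dscp=46")
settings = parse_vendor_string(sample)
voice_vlan = int(settings["vlan"])  # the tag the phone would switch to
print(voice_vlan)
```

The only field that matters for the two-stage boot described above is `vlan`: once the phone has it, it releases the untagged lease and starts over with tagged frames.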
Problem and Symptoms
Somewhere in the last several weeks, something changed, and I'm not sure what. The phones will continue to work as long as they do not restart, meaning dhcp renew requests must be processed correctly. Phones connected to certain switches can even survive a restart. Phones connected to other switches, however, will fail to complete the process when they reboot. All of our phones are using PoE that is backed up by a UPS, so it's been a long time since any have restarted. This means I have no idea when the problem first appeared. What I do know is that one phone failed when it restarted yesterday, and in troubleshooting it today we reset that switch closet. Now none of the phones on that switch are working (thankfully it's still a small number). I also know that things were working near the end of January, when we moved a phone for an injured user to a temporary workspace on the ground floor.
As I watch a phone boot up, I can see it successfully get the first address. It then successfully reads the option 125 information, sets the correct vlan tag, and releases the original IP lease. It is even able to receive and accept an offer on the correct vlan from the server. However, that's where things stop. The phone has a message on the screen that says, "DHCP: Offer 2 ACC", but the Windows DHCP server has not recorded the lease and the phone never moves on. I can only guess that the DHCP REQUEST packet never reaches the Windows server, and so the phone is waiting for the final ACK from Windows that it's okay to continue.
Workaround
I was finally able to get a phone working again. To do it, I had to first disconnect the computer. Then I set the phone's switch port to be untagged on the phone vlan, with no membership on the PC vlan. The phone will now reboot correctly. At this point, I can put the switch port configuration back where it should be, and as long as no one tries to call that number as I'm resetting the port, the phone never misses a beat. Then I can reconnect the computer. Obviously, that's not an ideal process, though since phones reboot so rarely I will be able to use it to get people working again until I can find the root cause. Offices are closed now for the week, and so this issue will actually be allowed to sit over the weekend (I don't have keys for individual offices where the phones are).
This phone I fixed is the service phone in the server room, connected directly to our core switch. It is possible the problem is an issue with routing or processing tags on the core switch, such that the workaround will not be effective on the remote offices where packets are first passed through (tagged by) other switches, but I'll be very surprised if that happens, given that I know it must be processing dhcp renewals and actual phone conversations correctly.
A twist is that leaving the port tagged on the PC vlan means that phone instead fails with the message "DHCP: Offer 1 ACC". I need to remove that vlan entirely for this to succeed.
Note: I have now confirmed that the work-around is effective in remote buildings. This leads me to suspect that my devices are somehow not assigned to the correct vlan. The fact that I experienced the problem on my core switch, and that it happened in several places on the network at about the same time, indicates that the core switch may be the problem. With nothing specific to look at, I'm scheduling a maintenance window near the end of the week to reboot the switch. I may also update the firmware.
Environment
Our core switch is an HP 5406zl. This switch handles inter-vlan routing. The Windows DHCP server is connected directly to the switch. Endpoint switches are connected to the core switch via fiber SFPs, and these ports are tagged for all vlans on both ends. The core switch configures each vlan with an ip helper-address setting that points it to our DHCP server, and a dhcp relay-option 82 replace line so that the dhcp server will know what scope to use. These configurations, and the port configurations on the endpoint switches, have not changed in at least 16 months. We have had other switch and phone resets in that time.
Most of our endpoint switches are HP 2530 series. These switches seem to work correctly (phones on 3 different 2530's have restarted correctly today). It's older switches that have problems. We have one old 3Com 4200 and one 4210 that will not work. The service phone connected directly to the core switch mentioned earlier also would not work.
Question
At this point my best guess is that a Windows update on the dhcp server changed the behavior, but I can't see how. Or possibly the core switch is not handling that REQUEST packet correctly, but I'm sure that nothing changed there, and it doesn't explain why only certain endpoint switches are affected. How can I resolve this issue?
Update:
Here is a dhcp log excerpt from a failed phone:
10,03/06/15,12:40:40,Assign,10.1.2.158,,08000F197844,,3189088995,0,,,
11,03/06/15,12:40:40,Renew,10.1.2.158,,08000F197844,,3189088995,0,,,
12,03/06/15,12:40:41,Release,10.1.2.158,,08000F197844,,3189088995,0,,,
15,03/06/15,12:40:45,NACK,10.1.2.154,,08000F197844,,0,6,,,
15,03/06/15,12:40:45,NACK,10.1.2.154,,08000F197844,,0,6,,,
The 10.x.x.x addresses are the PC vlan (that choice pre-dates me at this place). Phones should get that kind of address at first, so that's expected. However, after the release message I also expect to find an offer for an address in the 192.168.16.x range, because I can see on the phone that an offer was accepted (unless I'm misinterpreting "ACC"). It's interesting that I never see the server try to issue an address like that, even though the phone thinks it received one.
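The same check can be done mechanically rather than by eye. Below is a small Python sketch that reads audit log lines, keeps the events for one MAC address, and counts how many fall inside the phone range. The column names follow the header Windows writes at the top of its DHCP audit logs; the /24 prefix length for the phone range and the helper name are assumptions for illustration.

```python
import csv
import ipaddress

# First ten columns of the Windows DHCP audit log; later columns are ignored.
FIELDS = ["id", "date", "time", "description", "ip", "host", "mac",
          "user", "transaction_id", "qresult"]

LOG = """\
10,03/06/15,12:40:40,Assign,10.1.2.158,,08000F197844,,3189088995,0,,,
11,03/06/15,12:40:40,Renew,10.1.2.158,,08000F197844,,3189088995,0,,,
12,03/06/15,12:40:41,Release,10.1.2.158,,08000F197844,,3189088995,0,,,
15,03/06/15,12:40:45,NACK,10.1.2.154,,08000F197844,,0,6,,,
15,03/06/15,12:40:45,NACK,10.1.2.154,,08000F197844,,0,6,,,
"""

# Assumption: the phone range 192.168.16.x is a /24.
PHONE_NET = ipaddress.ip_network("192.168.16.0/24")


def events_for(mac, text):
    """Return the log events for one MAC address as a list of dicts."""
    rows = csv.reader(text.splitlines())
    events = [dict(zip(FIELDS, row)) for row in rows if row]
    return [e for e in events if e["mac"].upper() == mac.upper()]


events = events_for("08000F197844", LOG)
in_phone_net = [e for e in events
                if ipaddress.ip_address(e["ip"]) in PHONE_NET]

for e in events:
    print(e["time"], e["description"], e["ip"])
print("events in phone range:", len(in_phone_net))
```

Run against the excerpt above, this lists five events for the phone and none in the phone range, which matches what I see by reading the log.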
I considered the idea there's a rogue dhcp server on the network (it hands out an address before the Windows server, but without the dhcp options needed by the phone to continue), but that doesn't explain why the phones work if and only if I completely remove any path to the PC vlan. I'll test for it anyway in the morning by connecting my laptop to a port set for the phone vlan, but if anyone else has a better explanation in the meantime, I'd love to hear it.
Here's a copy of the switch config:
I fixed the problem today by removing the vlan tag for the phone vlan on the port connecting to our dhcp server. It's very strange to me that this worked, as other systems that use a similar scheme (e.g., Wifi SSIDs using 802.1q) require the tag or clients cannot get addresses. It worked, so I won't look too hard, but I would be interested in seeing answers with theories for why this is the way it is.
You should consider running a packet capture on either side of the problematic switch(es) and then reviewing it in Wireshark. This will be able to tell you 1) if traffic is being intercepted by a rogue DHCP server (based on MAC address) and 2) if something is getting mangled or dropped (e.g., maybe you need DHCP relay). This may require port mirroring, or the 3com may support capturing directly on the switch.
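If you would rather script the review than click through it, the fields that matter are easy to pull out of a raw DHCP payload. Here is a minimal Python sketch, using only the standard library, that decodes the fixed BOOTP header and the option list as laid out in RFC 2131 and RFC 2132. The function names and the sample packet (addresses included) are hypothetical; in practice you would feed it the UDP payloads on ports 67/68 from your capture.

```python
import struct

MAGIC_COOKIE = b"\x63\x82\x53\x63"
MESSAGE_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
                 5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM"}


def dotted(raw):
    """Format raw address bytes as dotted decimal."""
    return ".".join(str(b) for b in raw)


def parse_dhcp(payload):
    """Decode the parts of a DHCP payload useful for this kind of debugging."""
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        raise ValueError("not a DHCP payload")
    hlen = payload[2]
    options = {}
    i = 240
    while i + 1 < len(payload):
        code = payload[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option has no length byte
            i += 1
            continue
        length = payload[i + 1]
        options[code] = payload[i + 2:i + 2 + length]
        i += 2 + length
    msg = options.get(53)        # option 53: DHCP message type
    sid = options.get(54)        # option 54: server identifier
    return {
        "message_type": MESSAGE_TYPES.get(msg[0]) if msg else None,
        "client_mac": payload[28:28 + hlen].hex(),
        "relay": dotted(payload[24:28]),   # giaddr, set by the relay agent
        "server_id": dotted(sid) if sid else None,
    }


def build_sample():
    """Build a hypothetical relayed REQUEST for demonstration only."""
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1, 1, 6, 0, 0x12345678, 0, 0,
        bytes(4), bytes(4), bytes(4), bytes([192, 168, 16, 1]),
        bytes.fromhex("08000f197844"), b"", b"")
    opts = bytes([53, 1, 3]) + bytes([54, 4, 10, 1, 2, 10]) + bytes([255])
    return header + MAGIC_COOKIE + opts


info = parse_dhcp(build_sample())
print(info)
```

The server identifier in the phone's REQUEST tells you which server's offer it accepted, which settles the rogue server question, and the relay address shows which vlan interface forwarded the packet. Comparing those two values on each side of the switch should show where the REQUEST goes missing.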
If you find that this issue pops up again, you may want to check the size of your DHCP scope and how many leases are in use. If old DHCP leases are not being destroyed, your server may think that there are no addresses left in the pool and be unable to assign new addresses. This is true even if there are no devices responding in the vlan. If the lease duration on your scope is 7 days, it could be up to 7 days before you are able to get a new lease. Likewise, changing your configuration around would resolve the issue because there would be a new range of addresses which could be dished out, or it may flush the leases depending on the configuration changes. If this is the case, I would suggest setting the lease lifetime for that scope to something very low, like an hour. You can confirm this by manually removing a lease and seeing if a phone is now able to pick up a new address if the issue pops up again.