The other day, we noticed a terrible burning smell coming out of the server room. Long story short, it ended up being one of the battery modules that was burning up in the UPS unit, but it took a good couple of hours before we were able to figure it out. The main reason we were able to figure it out is that the UPS display finally showed that the module needed to be replaced.
Here was the problem: the whole room was filled with the smell. Doing a sniff test was very difficult because the smell had infiltrated everything (not to mention it made us light-headed). We almost mistakenly took our production database server down because it's where the smell was the strongest. The vitals appeared to be OK (CPU temps showed 60 degrees C, and fan speeds were OK), but we weren't sure. It just so happened that the battery module that burnt up was about the same height as the server on the rack and only 3 ft away. Had this been a real emergency, we would have failed miserably.
Realistically, actual server hardware burning up is a fairly rare occurrence, and most of the time we'll be looking at the UPS as the culprit. But with several racks, each holding several pieces of equipment, it can quickly become a guessing game. How does one quickly and accurately determine what piece of equipment is actually burning up? I realize this question is highly dependent on environment variables such as room size, ventilation, location, etc., but any input would be appreciated.
The general consensus seems to be that the answer to your question comes in two parts:
How do we find the source of the funny burning smell?
You've got the "How" pretty well nailed down:
You can improve your chances of finding the problem quickly in a number of ways - improved monitoring is often the easiest. Some questions to ask:
When should we troubleshoot versus hitting the Big Red Switch?
This is a more interesting question.
Hitting the big red switch can cost your company a huge amount of money in a hurry: clean-agent releases can run into the tens of thousands of dollars, and the outage / recovery costs after an emergency power off (EPO, "dropping the room") can be devastating.
You do not want to drop a datacenter because a capacitor in a power supply popped and made the room smell.
Conversely, a fire in a server room can cost your company its data/equipment, and more importantly your staff's lives.
Troubleshooting "that funny burning smell" should never take precedence over safety, so it's important to have some clear rules about troubleshooting "pre-fire" conditions.
The guidelines that follow are my personal limitations that I apply in the absence of (or in addition to) any other clearly defined procedure/rules. They've served me well and they may help you, but they could just as easily get me killed or fired tomorrow, so apply them at your own risk.
If you see smoke or fire, drop the room
This should go without saying but let's say it anyway: If there is an active fire (or smoke indicating that there soon will be) you evacuate the room, cut the power, and discharge the fire suppression system.
Exceptions may exist (exercise some common sense), but this is almost always the correct action.
If you're proceeding to troubleshoot, always have at least one other person involved
This is for two reasons. First, you do not want to be wandering around in a datacenter and all of a sudden have a rack go up in the row you're walking down and nobody knows you're there. Second, the other person is your sanity check on troubleshooting versus dropping the room, and should you make the call to hit the Big Red Switch you have the benefit of having a second person concur with the decision (helps to avoid the career-limiting aspects of such a decision if someone questions it later).
Exercise prudent safety measures while troubleshooting
Make sure you always have an escape path (an open end of a row and a clear path to an exit).
Keep someone stationed at the EPO / fire suppression release.
Carry a fire extinguisher with you (Halon or other clean-agent, please).
Remember rule #1 above.
When in doubt, leave the room. Protect your breathing: use a respirator or an oxygen mask. This might save your health in case of a chemical fire.
Set a limit and stick to it
More accurately, set two limits:
The limits you set can also be used to let your team begin an orderly shutdown of the affected area, so when you DO pull power you're not crashing a bunch of active machines, and your recovery time will be much shorter, but remember that if the orderly shutdown is taking too long you may have to let a few systems crash in the name of safety.
Trust your gut
If you are concerned about safety at any time, call the troubleshooting off and clear the room.
You may or may not drop the room based on a gut feeling, but regrouping outside the room in (relative) safety is prudent.
If there isn't imminent danger you may elect to bring in the local fire department before taking any drastic actions like an EPO or clean-agent release. (They may tell you to do so anyway: their mandate is to protect people, then property, but they're obviously the experts in dealing with fires, so you should do what they say!)
A thermal imaging camera (TIC) could do the job and let you identify where the overheating is. A device like this would also let you identify the origin of a fire or burning in a smoke-filled room.
You do none of these things that have been said. You leave the hazardous environment because whatever is being pumped through the entire room is dangerous to your health and may really mess up your lungs. If there is an acrid smell of something burning in the room that you can't find, call (911|112|999|whatever emergency number fits your jurisdiction) and let the fire (company|department|brigade) sort it out while they're on bottled air.
Computer parts contain all sorts of interesting chemicals including mercury, cadmium, lead, and lots of plastics in casings. Notice that all the links I made explain how low level exposures can cause lasting damage or even quick death. This is an environment that can be immediately dangerous to life and health.
... so really, if something is burning, don't spend hours sniffing the fumes. If you can't identify it and immediately act to contain it, get out.
If you had proper monitoring on the UPS (usually via SNMP), the unit itself should have rung the bells on your monitoring system. If it didn't, talk to your vendor about that. It either malfunctioned or your monitoring system isn't properly configured.
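To make "rung the bells" a bit more concrete, here's a minimal sketch of the kind of check a monitoring system might run on values it has already polled from the UPS. It assumes the UPS exposes the standard UPS-MIB (RFC 1628) objects `upsBatteryStatus` and `upsBatteryTemperature`; the `UpsReading` class, the `ups_alerts` function and the 40 C threshold are made up for illustration, and the SNMP polling itself is left out. Many units report per-module faults only through vendor-specific MIBs, so treat this as a starting point rather than what your UPS actually supports.

```python
from dataclasses import dataclass
from typing import List, Optional

# upsBatteryStatus values as defined in the standard UPS-MIB (RFC 1628).
BATTERY_STATUS = {
    1: "unknown",
    2: "batteryNormal",
    3: "batteryLow",
    4: "batteryDepleted",
}


@dataclass
class UpsReading:
    """Hypothetical container for values already polled over SNMP."""
    name: str
    battery_status: Optional[int]  # upsBatteryStatus, None if the poll failed
    battery_temp_c: Optional[int]  # upsBatteryTemperature, degrees Celsius


def ups_alerts(reading: UpsReading, max_temp_c: int = 40) -> List[str]:
    """Return alert messages for one UPS (empty list means all clear).

    max_temp_c is an assumed threshold; use your vendor's guidance.
    """
    alerts = []
    if reading.battery_status is None:
        # A UPS that stops answering is itself worth an alarm.
        return [f"{reading.name}: no SNMP response"]
    if reading.battery_status != 2:
        label = BATTERY_STATUS.get(reading.battery_status, "unrecognized")
        alerts.append(f"{reading.name}: battery status is {label}")
    if reading.battery_temp_c is not None and reading.battery_temp_c > max_temp_c:
        alerts.append(
            f"{reading.name}: battery temperature {reading.battery_temp_c} C "
            f"exceeds {max_temp_c} C"
        )
    return alerts
```

Feeding it something like `UpsReading("ups1", 3, 55)` would produce two alerts, one for the battery status and one for the temperature, which is roughly what you'd have wanted to see hours before the front-panel display caught up.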
If something active is actually burning, it should be complaining about it in some way, or simply be off the network, which should also cause an alarm.
If it's something like an actual power rail burning through insulation, and it's not on a smart PDU, then we're back to your original question, which is "how do I find a burning thing?" And I think the proper answer is "Hit the EPO and figure it out. Your production servers are probably not important enough to go risking lives."
This is one of those situations where
doesn't apply, you should call a professional
Anything else is just plain stupid.
As someone whose former career was as an electronic tech, I have experience with "burning smells" that were not fires. This isn't uncommon.
I wouldn't shut down a data center for a smell. Smoke is another matter: something is really burning (usually, though a pea-sized tantalum capacitor can fill a room with smoke too). It's amazing how much smell a fried component in a power supply can make.
A TIC or IR thermometer (a useful tool and a lot cheaper than a TIC) would not necessarily show it, as the component doesn't generate much heat at all and it's inside a case. But check for devices not working, and use your monitoring tools. For a smell like that, 95% of the time it'll be a power supply affecting the performance of the whole device.
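As a rough illustration of "check for devices not working", here's a small sketch that turns a monitoring snapshot into an ordered list of suspects. Everything in it is hypothetical: the `DeviceState` fields stand in for whatever your monitoring tool reports, and the ordering (unreachable devices first, then devices reporting a bad power supply) is just my assumption about what's worth walking up to first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceState:
    """Hypothetical snapshot of one device from your monitoring system."""
    name: str
    reachable: bool  # does it still answer on the network?
    psu_ok: bool     # does it report all power supplies healthy?


def suspects(devices: List[DeviceState]) -> List[str]:
    """Return names of devices worth inspecting, most suspicious first."""
    def rank(d: DeviceState) -> int:
        # 0 = fell off the network, 1 = up but reporting a PSU fault.
        return 0 if not d.reachable else 1

    flagged = [d for d in devices if not d.reachable or not d.psu_ok]
    # sorted() is stable, so devices with equal rank keep their input order.
    return [d.name for d in sorted(flagged, key=rank)]
```

With a snapshot containing a healthy `db1`, a reachable `sw2` with a failed PSU and an unreachable `ups1`, this returns `["ups1", "sw2"]`, which at least narrows down where to point your nose.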
I like the IR imaging or thermometer answers but maybe what would also help is a real "odor detector". After all what triggered your caution was the smell. Smoke, heat, IR etc. are all surrogates.
Something like one of these: http://www.sca-shinyei.com/odormeter or http://www.intopsys.com/products/cyranose.html?gclid=CNXXzOrLs7YCFUws6wodViYApQ. I've personally never used them or even seen them used in a datacenter. But at least theoretically it should be a neat tool. If you have the money to spend on this gizmo, that is.
It gives you an odor strength as well as a classification, so homing in on the odor should be possible. The devil's in the details, of course: how sensitive it is, masking out spurious background odors, etc.
One advantage over purely temperature-based measurements is that odor often occurs at a far earlier point or threshold. And if the overheated component is hidden by a body, concealed wiring, etc., it is easier to detect escaping molecules than a line-of-sight hot spot.
Another situation is a non-heat related smell. We've had a cooling circuit leak before and the coolant smells were peculiar too. I won't even go into the now ancient case of a rodent dead in the ducts. :)
I was surprised how sensitive these sensors are. Apparently H2S, mercaptans, etc. (the usual culprits) are detectable at sub-ppm levels.