I have a dual Opteron server running Linux with libvirt to host several VMs. The VMs work fine and the server handles its workload OK, but I notice one CPU always runs at about 69C (it throttles at 70C) while the other runs at about 15C.
This doesn't seem normal to me? Shouldn't they both be a little closer in temperature?
I'm not sure how to diagnose this any further. Maybe there isn't enough thermal paste on one of the CPUs?
Edit: The motherboard is ASUS KGPE-D16 and cooled by dual Noctua NH-U9DO fans.
Note that I think the temperatures might be degrees above ambient, rather than absolute values. When the server is idling, the CPU temperatures drop to 2C and 13C. I am using the lmsensors configuration from here
The problem ended up being a poorly fit heatsink. Maybe poorly fit isn't the right description. Turns out, you have to put thermal paste on the heatsink, not the plastic cover that goes over the heatsink.
After removing the plastic cover, the CPU is nice and cool, thanks everyone!
In my experience, it is normal for paired components in a case to run at different temperatures, because airflow is not the same everywhere. Here's a graph of HDD temperature from my colo box. The drives are mirrored, so the workloads on them are near to identical.
As you can see, they track each other, but they're not the same; they're also, on average, only 6C apart. Whether your sensors report absolute temperature or overtemperature, a difference of 55C under load seems very badly wrong. If you have confidence the data are right, then given the quiescent difference drops to 10C, which is the sort of difference I see due to airflow, I'd suspect a poorly-fitted heatsink.
No, it is not normal, unless you have some serious airflow issues or one of the coolers is bad. Temperatures WILL vary, but not that much (70 vs. 15 degrees Celsius).
Given how low 15 degrees is, I would assume your sensor is off (do you really keep the server in a room that cool?).
I would also assume that one of the CPUs is simply doing no work at all, for whatever reason.
Small differences are normal. Somewhat larger ones may be too (airflow comes to mind), but here we are talking about one CPU being COLD.
This could be either cooling or uneven loading (given the temperature difference, your situation is probably uneven loading). You should use something like prime95 to load all the cores evenly and see if the temperatures still vary. If they don't, then you need to balance the VMs and check that your apps are multithreaded and busy. How to do that depends on your software and individual workload, so it is really beyond the scope of the question. Bear in mind there is no real advantage to doing this if you don't have enough load to top out a single CPU/core; in fact, your VM may deliberately avoid using a second CPU so that it can go into power-saving modes on multi-CPU systems.
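To check whether the load really is uneven before touching the hardware, you can compare per-core utilization. Below is a minimal sketch, assuming a Linux host where `/proc/stat` is readable; the function names are my own, not part of any tool. It takes two snapshots and reports the percentage of time each core was busy in between.

```python
import time


def parse_proc_stat(text):
    """Return {core_name: (busy_ticks, total_ticks)} for each 'cpuN' line."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Fields: user, nice, system, idle, iowait, irq, softirq, steal.
        # Guest time is already included in user, so later fields are ignored.
        fields = [int(x) for x in parts[1:9]]
        total = sum(fields)
        idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
        result[parts[0]] = (total - idle, total)
    return result


def utilization(before, after):
    """Percent busy per core between two parsed snapshots."""
    usage = {}
    for name, (busy1, total1) in after.items():
        busy0, total0 = before.get(name, (0, 0))
        delta_total = total1 - total0
        usage[name] = (
            100.0 * (busy1 - busy0) / delta_total if delta_total > 0 else 0.0
        )
    return usage


def sample_utilization(interval=1.0, path="/proc/stat"):
    """Read the stat file twice, `interval` seconds apart, and compare."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    return utilization(first, second)
```

Calling `sample_utilization()` while the VMs are busy gives a dictionary such as `{"cpu0": ..., "cpu1": ...}`. If all the busy cores belong to the hot socket and the rest sit near zero, uneven loading is the more likely explanation. Which core numbers belong to which socket depends on the machine; on Linux it can usually be read from `/sys/devices/system/cpu/cpuN/topology/physical_package_id`.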
If you have narrowed it down to cooling: a small difference of up to 10C could be too little (or too much!) thermal paste. A bigger difference indicates a significant problem with, or difference between, the CPU coolers. It could be that one has blocked airflow, a heatsink has been knocked loose, etc.
I would have to concur with the defective temperature sensor theory, as 15C is only 59F! Unless the computer is in an extremely frigid datacenter, I would expect the ambient air temperature to be higher than 59F. Try assigning the VMs to the low-temperature CPU and see if there is any change; if not, I would strongly suspect a faulty sensor.
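Since the host uses libvirt, one way to move a VM onto the cool CPU (socket) is to pin its vCPUs with `virsh vcpupin`. Here is a rough sketch, assuming the usual Linux sysfs topology files are present; the domain name `myvm` and the helper names are hypothetical. It only builds the commands and does not run anything, so you can review them first.

```python
import glob
import os
import re


def cores_by_socket(sysfs_root="/sys/devices/system/cpu"):
    """Map each physical package (socket) id to its sorted host core numbers."""
    sockets = {}
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "topology", "physical_package_id"
    )
    for path in glob.glob(pattern):
        # Extract N from ".../cpuN/topology/..." (ignoring the root prefix).
        core = int(re.search(r"cpu(\d+)", path[len(sysfs_root):]).group(1))
        with open(path) as f:
            package = int(f.read().strip())
        sockets.setdefault(package, []).append(core)
    return {pkg: sorted(cores) for pkg, cores in sockets.items()}


def vcpupin_commands(domain, vcpu_count, host_cores):
    """Build one 'virsh vcpupin <domain> <vcpu> <cpulist>' command per vCPU."""
    cpulist = ",".join(str(c) for c in host_cores)
    return [
        ["virsh", "vcpupin", domain, str(vcpu), cpulist]
        for vcpu in range(vcpu_count)
    ]
```

For example, `vcpupin_commands("myvm", 2, cores_by_socket()[1])` returns the argument lists for pinning both vCPUs of a two-vCPU guest to the cores of socket 1, which can then be passed to `subprocess.run` one at a time. Which package id corresponds to the cool CPU is something you have to work out on your own machine. If its reported temperature stays flat once the load has moved there, that points toward the sensor.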
You may also want to look at the output of
dmesg
(boot messages) and see if there is anything out of the ordinary there.