CPU consumption goes very high (which it obviously shouldn't) when watching video in Chrome. This is due to a regression that appeared recently (it didn't happen before), but that is not the point here. Let's just assume that something is consuming a lot of CPU.
I disabled intel_powerclamp and intel_rapl, so when CPU consumption gets high, I would expect the cooling fan to speed up as fast as it has to in order to cool down the CPU.
Instead, what happens is that the fan NEVER reaches its top speed, or even gets close to it. The whole system slows down and becomes unresponsive, much as it used to when intel_powerclamp was enabled, except that no "kidle_inject" processes show up in top.
The PROOF that the cooling fan is not spinning as fast as it should is that if I restart the computer, before the system starts to boot (at the boot menu), the fan spins at crazy speed. That proves (1) that it is capable of going faster than it does while the OS is running and (2) that the hardware "thinks" the temperature does require it, until the OS kicks in and thinks it knows better.
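One way to put numbers on this, assuming the fan is exposed through the kernel's hwmon interface at all (on many laptops it is driven by the embedded controller and never shows up there), is to read the fan speed from sysfs while the system is under load. This is only a sketch; the function name `list_fans` and its optional root-directory argument are made up for illustration.

```shell
#!/bin/sh
# Print every fan speed the kernel exposes through hwmon.
# The optional first argument (an assumption, added so the function can be
# pointed at a different directory) defaults to the standard sysfs location.
list_fans() {
    root="${1:-/sys/class/hwmon}"
    for f in "$root"/hwmon*/fan*_input; do
        # Skip when the glob did not match anything or the file is unreadable.
        [ -r "$f" ] || continue
        dir=$(dirname "$f")
        # The "name" file identifies the driver; fall back to "unknown".
        name=$(cat "$dir/name" 2>/dev/null || echo unknown)
        echo "$name $(basename "$f" _input): $(cat "$f") RPM"
    done
}

list_fans
```

If this prints nothing, the fan is not visible to hwmon on this machine and the comparison has to be made by ear, as described above.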
I can only see two explanations for it:
A) some software configuration is limiting the fan speed (perhaps because it was configured under the assumption that intel_powerclamp would be there too and do half of the job). So, because the fan at its current speed isn't sufficient to cool down the CPU, and since intel_powerclamp is not there limiting CPU use, some internal hardware protection of the CPU kicks in and throttles it, to prevent it from burning (or from reaching the hard limit that would cause it to turn off abruptly), OR
B) besides intel_powerclamp and intel_rapl which I have disabled (I understand how intel_powerclamp works, I have no idea how intel_rapl does), there is some other driver that works in some similar way, slowing down the CPU to make it consume less power.
In case (A) I would need to fix the configuration so that the fan is allowed to run at its maximum capacity, and see if that is enough to keep the temperature down and prevent the hardware protection from kicking in.
In case (B) I probably want to try either
[B1] disable whatever other software or semi-software based CPU throttling mechanism is doing a poor job, and see if the fan alone is enough, or
[B2] somehow tweak some weight factor or threshold or priority configuration so that the fan is allowed to run faster BEFORE the CPU throttling kicks in (and hopefully even prevents it from happening).
Does anyone know how to figure out whether it's (A) or (B) and how to fix this?
I read about configuring Thermald at https://wiki.ubuntu.com/Kernel/PowerManagement/ThermalIssues but it's terribly unclear and there are no practical examples.
I just went through this exercise with thermald. First, I should point out that thermald was broken, even in fresh versions of 16.04, as it would not read its configuration file. This has now been fixed, so make sure that you have all of your updates installed.
Also, the page you reference is pretty good, and it contains an example thermal-conf.xml configuration file.
Next, assuming that you've got thermald installed, stop the process, and restart it in --no-daemon mode, and carefully watch the output. It'll give you most of the answers that you need to configure your own thermal-conf.xml file. Watch for cdev (cooling device), etc.
sudo service thermald stop
sudo thermald --no-daemon --loglevel=debug
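While thermald is running in the foreground, it can help to compare its cdev lines with what the kernel itself exposes under /sys/class/thermal. The following is a rough sketch rather than anything thermald ships; the function names `list_cdevs` and `list_zones` and their optional root-directory argument are my own, and which cooling devices appear (and whether the fan is among them) depends entirely on your hardware and drivers.

```shell
#!/bin/sh
# List kernel cooling devices with their current and maximum state.
# The optional first argument (an assumption, for illustration) defaults
# to the standard sysfs location.
list_cdevs() {
    root="${1:-/sys/class/thermal}"
    for d in "$root"/cooling_device*; do
        [ -r "$d/type" ] || continue      # glob did not match, or unreadable
        cur=$(cat "$d/cur_state" 2>/dev/null) || continue
        max=$(cat "$d/max_state" 2>/dev/null) || continue
        echo "$(basename "$d") $(cat "$d/type"): state $cur/$max"
    done
}

# List thermal zones with their temperature in whole degrees Celsius.
# sysfs reports the value in millidegrees, hence the division by 1000.
list_zones() {
    root="${1:-/sys/class/thermal}"
    for z in "$root"/thermal_zone*; do
        [ -r "$z/temp" ] || continue
        t=$(cat "$z/temp" 2>/dev/null) || continue   # some zones fail to read
        echo "$(basename "$z") $(cat "$z/type"): $((t / 1000)) C"
    done
}

list_cdevs
list_zones
```

A cooling device whose state is above 0 is actively being used to cool the system. If a processor-type device is throttling while a fan-type device sits below its maximum state, that would point toward the software side rather than a hardware limit, though I would treat that as a hint and not as proof.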
Here's a copy of my custom thermal-conf.xml file for you to look at: