I have encountered a performance regression in the Python interpreter after upgrading from Ubuntu 18.04 to Ubuntu 24.04. What do you think causes this? Is there a fix or workaround?
I have some evidence suggesting that changes to the Linux kernel, rather than anything in userland, are responsible for the problem.
I can reproduce the performance discrepancy with a very simple test:
python3 -c "import timeit; print(timeit.Timer('for _ in range(0,1000): pass').timeit())"
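For anyone who wants to rerun this with a shorter runtime, here is a minimal sketch of the same benchmark as a function. The name `run_benchmark` and the `repeat` parameter are my own additions for illustration; the timed statement is the one from the command above, and `timeit`'s default of 1,000,000 executions is what the one-liner uses.

```python
import timeit


def run_benchmark(number=1_000_000, repeat=3):
    """Time the empty 1000-iteration loop and return the best total in seconds.

    number: how many times the statement is executed per round
            (1,000,000 is timeit's default, as used by the one-liner).
    repeat: how many rounds to run; the minimum is the least noisy figure.
    """
    timer = timeit.Timer("for _ in range(0, 1000): pass")
    return min(timer.repeat(repeat=repeat, number=number))


# Example: a quick run with 1/100 of the default work.
# print(run_benchmark(number=10_000))
```

Lowering `number` scales the runtime down roughly linearly, which is handy when bisecting kernels and rebooting often.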
I am running this test by booting the Ubuntu 18 and Ubuntu 24 live server ISOs on the very same Cisco UCS C220 M5SX rack system. You can find the live server ISOs on Canonical's site and reproduce the experiment.
I have a set of Python 3.11 binaries, built from source on a Debian squeeze system, that run on a huge variety of Linux distros. This lets me test the very same Python binaries on 18 and 24. I will call these binaries python_pegged, and the python3 installed by 'apt-get' python_sys.
I have also tried running the ubuntu:18.04 docker container on an Ubuntu 24 host. It shows the bad performance of Ubuntu 24 rather than behaving like Ubuntu 18, which leads me to believe userland is not responsible.
Experimental results:
System | Test | Result |
---|---|---|
Ubuntu 18.04.6 bare metal | python_sys | 13 seconds |
Ubuntu 18.04.6 bare metal | python_pegged | 13 seconds |
Ubuntu 18.04.6 bare metal | sysbench --test=cpu run | 1288 events/s |
Ubuntu 24.04 bare metal | python_sys | 83 seconds |
Ubuntu 24.04 bare metal | python_pegged | 112 seconds |
Ubuntu 24.04 bare metal | sysbench --test=cpu run | 925 events/s |
ubuntu:18.04 docker container hosted by Ubuntu 24.04 | python_sys | 82 seconds |
ubuntu:18.04 docker container hosted by Ubuntu 24.04 | python_pegged | 112 seconds |
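To put the numbers above on one scale, here is a small sketch that turns them into slowdown factors. The figures are the bare-metal results listed above; the variable and function names are just for illustration.

```python
# Bare-metal timings in seconds, taken from the results above.
ubuntu18 = {"python_sys": 13, "python_pegged": 13}
ubuntu24 = {"python_sys": 83, "python_pegged": 112}


def slowdown(old_seconds, new_seconds):
    """Return how many times longer the new run takes than the old one."""
    return new_seconds / old_seconds


factors = {name: slowdown(ubuntu18[name], ubuntu24[name]) for name in ubuntu18}

# sysbench reports throughput (events/s), so the ratio goes the other way.
sysbench_ratio = 925 / 1288

for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x slower")       # 6.4x and 8.6x
print(f"sysbench: {sysbench_ratio:.0%} of the Ubuntu 18 throughput")  # 72%
```

So Python slows down by a factor of 6 to 9, while sysbench loses only about 28% of its throughput.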
On Ubuntu 24, none of the following actions had any effect:
- set scaling_governor to performance
- tuned-adm profile throughput-performance
- tuned-adm profile virtual-host
- tuned-adm profile balanced
The system has 40 physical cores (80 logical CPUs with hyperthreading). I tried running various numbers of concurrent instances.
Concurrent instances | Seconds |
---|---|
40 | 82..83 |
80 | 53..53 |
120 | 87..115 |
I am surprised that 80 instances are faster than 40. I ran the experiment a couple of times and the results never changed. I tried different tuned profiles with no effect. 53 seconds is still a far cry from Ubuntu 18's 13 seconds.
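The concurrency experiment can be reproduced with something like the sketch below. This is not the exact harness I used; `run_concurrent` and its parameters are illustrative, and it assumes the fastest..slowest per-instance time is the figure of interest.

```python
import subprocess
import sys
import time

# Same statement as the one-liner; {n} is the number of timeit executions.
BENCH = (
    "import timeit; "
    "print(timeit.Timer('for _ in range(0,1000): pass').timeit(number={n}))"
)


def run_concurrent(instances, number=1_000_000):
    """Start `instances` interpreters at once and collect their timings.

    Returns (fastest, slowest, wall), all in seconds, where fastest and
    slowest are the per-instance timeit results and wall is the elapsed
    time for the whole batch.
    """
    cmd = [sys.executable, "-c", BENCH.format(n=number)]
    start = time.perf_counter()
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for _ in range(instances)
    ]
    # Each child prints a single float, so reading them in order is safe.
    times = [float(p.communicate()[0]) for p in procs]
    wall = time.perf_counter() - start
    return min(times), max(times), wall


# Example: fastest, slowest, wall = run_concurrent(80)
```

Note that the scheduler decides where each instance lands, so with 40 instances on 80 logical CPUs some may share a physical core while other cores sit idle.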
Python is tripping over something expensive on Ubuntu 24 that sysbench is not, or at least not to the same degree. I may start using a profiler in the near future to dig deeper.
Testing various intermediate versions indicates that the regression appeared in the 21 -> 22 upgrade.
perf stat -a results on the python3 command:
Stat | Ubuntu 20 | Ubuntu 22 |
---|---|---|
cpu-clock | 80% | 80% |
context-switches | 0.003 K / sec (3829) | 6.656 / sec (45238) |
cpu-migrations | 0 K / sec (3) | 0.039 / sec (268) |
page-faults | 0.001 K / sec (620) | 0.151 / sec (1029) |
cycles | 0.047 GHz | 0.034 GHz |
instructions | 3.34 insn / cycle | 0.88 insn / cycle |
branches | 30.703 M / sec | 5.329 M / sec |
branch-misses | 0.03% | 13.18% |
Some big differences there. I'm not convinced 6 context switches per second really moves the needle. Instructions per cycle is presumably the most relevant data point here.
TL;DR: Boot with mitigations=off on the kernel command line to cut the slowdown from about 600% to about 40%. Only do this if you are comfortable with the negative security implications. I am still interested in reclaiming the remaining 40% if anyone has ideas.
Long version:
I isolated the regression to a change introduced between linux-5.15.0-43-generic and linux-5.15.0-46-generic. When I reviewed the patch file showing the difference between those two kernels, what stood out was a batch of CPU bug mitigations. Although I thought my colleague had already tested whether disabling mitigations helped, I went back and retested. Mitigations were indeed the cause of the problem, so there had been some miscommunication.
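To see what a given kernel is actually doing, the per-vulnerability status is exposed under /sys/devices/system/cpu/vulnerabilities and the boot parameters under /proc/cmdline. Below is a minimal sketch that reads both; the function names are my own, and the path arguments exist only so the code can be pointed at a copy of those files.

```python
from pathlib import Path


def mitigation_status(vuln_dir="/sys/devices/system/cpu/vulnerabilities"):
    """Map each vulnerability name to the kernel's one-line status string.

    Returns an empty dict if the directory does not exist (older kernel
    or a non-Linux system).
    """
    root = Path(vuln_dir)
    if not root.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(root.iterdir())
        if entry.is_file()
    }


def mitigations_off_requested(cmdline_path="/proc/cmdline"):
    """Return True if mitigations=off is on the kernel command line."""
    try:
        params = Path(cmdline_path).read_text().split()
    except OSError:
        return False
    return "mitigations=off" in params


if __name__ == "__main__":
    print("mitigations=off on cmdline:", mitigations_off_requested())
    for name, status in mitigation_status().items():
        print(f"{name}: {status}")
```

Comparing this output between the -43 and -46 kernels, and again after booting with mitigations=off, should show which entries changed.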
For historical reference, another way to reduce the slowdown by about the same amount is to disable hyperthreading in the BIOS. However, this will obviously eliminate any benefits that hyperthreading provides.
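Whether hyperthreading is currently in effect can be checked from userland on kernels that expose /sys/devices/system/cpu/smt. This is a minimal sketch under that assumption; `smt_state` is an illustrative name, and the path argument is only there to make the function easy to try against a copy of the files.

```python
from pathlib import Path


def smt_state(smt_dir="/sys/devices/system/cpu/smt"):
    """Return the contents of the kernel's SMT 'active' and 'control' files.

    'active' is "1" when sibling threads are in use and "0" otherwise;
    'control' reports the policy (for example "on" or "off").
    A value of None means the file could not be read.
    """
    root = Path(smt_dir)
    state = {}
    for name in ("active", "control"):
        try:
            state[name] = (root / name).read_text().strip()
        except OSError:
            state[name] = None
    return state


if __name__ == "__main__":
    print(smt_state())
```

Running this before and after the BIOS change confirms that the setting actually took effect on the booted kernel.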