Investigating some call quality issues (0.5 – 1 second dead spots in calls) I took a packet capture of a phone call between two extensions on the same PBX. Since I was capturing from the PBX, I was rather surprised to see Wireshark reporting a huge spike in jitter that synced up with a dead spot in the call:
My understanding was that jitter is caused by packet loss and/or latency in transit, and that the RTP stream leaving the PBX should be relatively pristine. But this spike showed up in all four RTP streams (office 1 to PBX, office 2 to PBX, PBX to office 1, PBX to office 2) so it seems like the packets are already in poor shape by the time they leave the server.
The PBX is Asterisk 13 on Scientific Linux (RHEL) 6.9 (running on a VMWare ESXi 5.5 guest with newly updated tools and VMXNET3 adapters.) The CPU sits pretty steadily around 5-15% usage, and network traffic is minimal. Where can I look to troubleshoot this issue? Are there any common causes for this sort of problem? I'm assuming since the problems are there on the server that I can rule out problems on the external network side?
Finally figured this out! TLDR: disable power management on the host.
Despite the low CPU usage, we still figured this was something to do with CPU load. So, we were experimenting with loading down the CPU, expecting this problem with the dead spots in the calls to get worse. Instead, it went away completely. So, after looking at CPU usage stats in vCenter many many times, I finally looked into the other line on that graph.
This is probably not news to many, but I found out that CPU ready time is the amount of time that a VM is ready to use the CPU, but the physical resources can't be allocated by the host. Most sources I found say that anything less than 5% isn't a problem, but it certainly seemed to be having an impact on our voice streams. We were seeing the cutouts every minute, and the graph also showed a spike in ready time every minute.
So I got to wondering why this would go away during high CPU load and figured it must be some kind of power management. When the host sees the increased usage it makes CPU resources consistently available to the VM. So I disabled power management in the BIOS of the host, et voila:
The slight increase in ready time near the end of the graph corresponds to a half-dozen other VMs migrating back to this host.
Call traces now show negligible amounts of jitter, and the cutouts have disappeared from calls. Further research showed this is a somewhat common issue with workloads that are both latency-sensitive and CPU non-intensive. The power management sees the very low CPU usage and assumes it can throttle the processor, even though it should not!
I had a similar issue, but much worse, with many spikes in the Wireshark RTP graph, hisses and choppy audio.
In the course of many experiments, I dumped the CDR database, that had grown to 1.5GB. I had noticed the size, but was putting off the pruning until I had fixed the audio problems. B-)
This apparently immediately improved audio quality, including the transcoding of the IVR messages to G729.
The delays were also visible from a SmokePing trace to the VPS.