We have a Windows 2003 R2 Enterprise 64-bit server running a custom workload that is suffering from an odd performance problem. The pared-down version below produces smaller humps, but it's qualitatively the same problem.
We've reduced it to a trivial app that does nothing more than:
- listening on a socket
- joining a multicast group
- listening for packets coming in on that group
- reading and discarding the packets
The test application itself is a slightly modified version of the Boost ASIO multicast receiver example so there's not really much that ought to be going wrong. Actual code(!) below…
Every so often while running this program under load, the CPU for this process will ramp up with all of the processing happening in kernel code:
(only CPU 6 is shown here. For the duration of this test (3h17m) all other processors are idle)
As you can see from the graph, when the load spikes hit, all of the processing time is spent in kernel code. That kernel time is mostly spent in Deferred Procedure Calls (max 16.8%) and handling interrupts (max 8.5%). It looks like there's some sort of deferred cleanup happening, but we have no idea what it could be.
As far as we can tell it's happening on W2K3E-64 only.
It is happening on different hardware (HS21, HS22, HS22V, HP DL380).
Running the test application on Windows 2008 demonstrates the problem to a much smaller extent (more often but smaller humps).
How can we fix this or where should we look next?
Actual code from the example:
void handle_receive_from(const boost::system::error_code& error,
                         size_t bytes_recvd)
{
    if (!error)
    {
        ++m_receivedPackets;
        m_receivedBytes += bytes_recvd;
        m_last64TotalBytes += bytes_recvd;
        if ( ( m_receivedPackets & 0x3F ) == 0 )
        {
            printf( "Received %u bytes in %u packets. The average size of the last 64 packets was %u bytes, and the last byte received was %x.\n",
                    m_receivedBytes, m_receivedPackets, m_last64TotalBytes / 64, m_buffer[ bytes_recvd - 1 ] );
            m_last64TotalBytes = 0;
        }
        m_socket.async_receive_from(
            boost::asio::buffer(m_buffer, max_length), m_senderEndpoint,
            boost::bind(&receiver::handle_receive_from, this,
                boost::asio::placeholders::error,
                boost::asio::placeholders::bytes_transferred));
    }
    else
    {
        std::cerr << "An error occurred when performing an asynchronous read." << std::endl;
        m_socket.get_io_service().stop();
    }
}
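The constructor isn't shown above; in the stock Boost ASIO multicast receiver example it looks roughly like the sketch below. Member names have been adapted to match the handler, and the listen address, multicast address and port are assumed to be supplied by the caller, so treat this as an approximation of the actual setup rather than the exact code:

// Sketch of the receiver's constructor, following the Boost ASIO multicast
// receiver example (requires <boost/asio.hpp> and <boost/bind.hpp>).
// listen_address, multicast_address and multicast_port are assumed inputs.
receiver(boost::asio::io_service& io_service,
         const boost::asio::ip::address& listen_address,
         const boost::asio::ip::address& multicast_address,
         unsigned short multicast_port)
    : m_socket(io_service),
      m_receivedPackets(0),
      m_receivedBytes(0),
      m_last64TotalBytes(0)
{
    // Open the socket and allow multiple receivers to bind the same address.
    boost::asio::ip::udp::endpoint listen_endpoint(listen_address, multicast_port);
    m_socket.open(listen_endpoint.protocol());
    m_socket.set_option(boost::asio::ip::udp::socket::reuse_address(true));
    m_socket.bind(listen_endpoint);

    // Join the multicast group.
    m_socket.set_option(boost::asio::ip::multicast::join_group(multicast_address));

    // Post the first asynchronous receive; handle_receive_from re-arms it.
    m_socket.async_receive_from(
        boost::asio::buffer(m_buffer, max_length), m_senderEndpoint,
        boost::bind(&receiver::handle_receive_from, this,
            boost::asio::placeholders::error,
            boost::asio::placeholders::bytes_transferred));
}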
"It looks like there's some sort of deferred cleanup happening, but we have no idea what it could be."
This could be garbage collection, but I am not sure whether garbage collection shows up as privileged time. If this is a .NET application, you can look at the .NET CLR Memory performance counters (Gen 2 collections in particular are expensive). In the end, guessing at possible issues seems a bit backwards. Your best bet would be to profile the application during one of these spikes and see what calls it is making. You might be able to get away with just using Process Monitor to watch the syscalls.
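A quick way to correlate the spikes with those counters (and with the DPC/interrupt time mentioned in the question) is to log them from the command line with typeperf, which ships with Server 2003. The counter paths below are the standard ones; the process instance name "YourProcess" is a placeholder, and the CLR counters only exist if the process is managed:

rem Sample DPC/interrupt time on CPU 6 and (if managed) the GC counters, once per second.
typeperf "\Processor(6)\% DPC Time" "\Processor(6)\% Interrupt Time" ^
         "\.NET CLR Memory(YourProcess)\% Time in GC" ^
         "\.NET CLR Memory(YourProcess)\# Gen 2 Collections" ^
         -si 1 -o counters.csv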
I assume that the system is receiving multicast packets. Can you try and prevent it from receiving the packets and see if you see the same problem?
What about joining the multicast group but then not listening for packets?
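In terms of the constructor sketched above, that experiment is just the same setup with the initial receive left out, for example:

// Hypothetical variant for this experiment: join the group but never post
// a receive. The NIC and driver still process every incoming datagram, but
// the application never touches the data.
m_socket.set_option(boost::asio::ip::multicast::join_group(multicast_address));
// m_socket.async_receive_from(...);   // intentionally omitted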
You say it's happening on different systems, but what about the actual NIC hardware? It's possible that the NIC is the same across those different systems.
Update: If all of the systems are using Broadcom NICs, it's possible that the problem is with the NIC. In particular, the Microsoft-supplied Broadcom drivers are lousy; the ones at Broadcom's web site are much better.
You could look at two things: your thread quantum, and what's causing your DPCs (deferred procedure calls).
The thread quantum is very easy to address (probably a red herring, but you may as well check it out):
Most likely Background Services is selected; try selecting Programs instead. This shortens each thread's quantum, allowing more threads to be run in the same amount of time on the processor: you get more context switches, but less processing time per quantum.
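That setting lives under System Properties > Advanced > Performance Settings > Advanced, and it is backed by the Win32PrioritySeparation registry value. The values below are the ones the GUI normally writes (0x18 for the server default "Background services", 0x26 for "Programs"), but verify them on your build before relying on this:

rem Switch processor scheduling from "Background services" (0x18, server default)
rem to "Programs" (0x26). Verify the value for your OS before applying.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\PriorityControl" ^
    /v Win32PrioritySeparation /t REG_DWORD /d 0x26 /f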
Deferred Procedure Calls are a little more difficult to diagnose:
As stated by @wfaulk, this usually points to a driver issue. There's a handy tool called DPC Latency Checker that will help you diagnose these issues. Even though this is happening across multiple hardware platforms, they may all still share a common driver. Run DPC Latency Checker and follow the instructions on its site.
Three follow-up questions:
Are you using teamed NICs? They use the TCP/IP stack to communicate with each other and can cause serious DPC issues.
Do your NICs support TCP Offload? Is it enabled? (A quick way to turn it off on W2K3 is sketched after this list.)
(A complete shot in the dark) Is your test server part of a domain? GPOs get refreshed every 90 minutes by default ...
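On the offload question: on Windows Server 2003 SP2 the global TCP Chimney offload state can be switched off from the command line to see whether the DPC spikes go away. The syntax below is the 2003-era one; on Windows 2008 the equivalent command is different:

rem Disable TCP Chimney offload globally (Windows Server 2003 SP2 syntax;
rem on Windows 2008 use "netsh int tcp set global chimney=disabled" instead).
netsh int ip set chimney DISABLED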