We've got a pretty large MSMQ environment setup which today decided to grind to a halt.
(Everything is a VM under vSphere 4.0 Update 1)
There are 8 Web Servers which receive data from clients on the net. These machines all have MSMQ installed and simply send the MSMQ message to the main MSMQ server. Messages are currently piled up in the outbound queue. These machines are Windows 2008 Web Edition with 2 Gigs of RAM and 2 vCPUs.
We have a Clustered MSMQ server (Windows Cluster Server) which gets the messages from the 8 web servers. There is no limit on the amount of data that can be in the queues. The hard drive is 50 Gigs, and there is 46 Gigs of free space. These machines are Windows 2008 Enterprise Edition with 8 Gigs of RAM and 4 vCPUs. The cluster used to have 2 vCPUs but the CPU load was hitting 100%, so I increased both nodes of the Windows cluster to 4 vCPUs.
There are 4 app servers which read the messages from the queues and process them.
Normally this all works perfectly, but not today.
This morning everything is running very slowly. The 8 web servers are currently showing up to 300k messages sitting in the outbound queues. The clustered server currently shows over a million messages in the queues (some are as low as 200k).
If I look at perfmon on the 8 web servers, it shows that I'm averaging 2 messages sent per second. If I look at perfmon on the cluster, it shows ~7 messages per second coming into the cluster.
The machines which are doing the reading aren't getting many messages each. The fastest services are getting 10-12 messages per second, the slowest are showing 0 or 1.
The only recent change is that we increased the number of front-end web servers from 4 to 8. We did this about 2 weeks ago without issue. On Tuesday we powered the four newer machines down to see how the remaining 4 could handle the load. On Wednesday we turned the four newer machines back on.
The disk on the cluster shows very low IO and no queueing.
To be safe I've updated PowerPath to the newest version but that hasn't helped any.
The 8 web servers are on one VLAN, and the clustered servers and the app servers are on a second VLAN. There are no firewalls between the VLANs.
And there is nothing useful in the application or system logs on any of the machines.
Whenever someone says they have over a million messages, the alarm klaxons go off! Messages require kernel (paged pool) memory to be managed. If you have such a vast number of messages, you may be exhausting what is available on the clustered server. The optimal number of messages in a queue is zero: basically, make sure you can normally process messages faster than they arrive.
I would recommend shutting down the web servers and completely processing the backlog of messages before bringing them back online again.
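To put rough numbers on that suggestion, here is a minimal Python sketch of how long a backlog takes to drain. The backlog of 1,000,000 messages and the 12 messages per second figure come from the question above; the function name `drain_time_hours`, the other rates, and the idea of treating the read rate as one aggregate number are assumptions made purely for illustration.

```python
def drain_time_hours(backlog, consume_rate, arrival_rate=0.0):
    """Estimate hours needed to empty a queue.

    backlog       -- messages currently sitting in the queue
    consume_rate  -- messages per second the readers remove (aggregate)
    arrival_rate  -- messages per second still arriving (0 if senders are stopped)

    Returns None when the readers cannot keep up, i.e. the queue never empties.
    """
    if backlog <= 0:
        return 0.0
    net_rate = consume_rate - arrival_rate  # net messages removed per second
    if net_rate <= 0:
        return None
    return backlog / net_rate / 3600.0


if __name__ == "__main__":
    backlog = 1_000_000  # figure reported for the clustered server
    # 12 msg/s is the fastest read rate reported; the larger rates are hypothetical.
    for rate in (12, 100, 1000):
        hours = drain_time_hours(backlog, rate)
        print(f"{rate:>5} msg/s with senders stopped: {hours:.1f} hours")

    # With senders still running at 7 msg/s and readers at 7 msg/s, nothing drains.
    print(drain_time_hours(backlog, consume_rate=7, arrival_rate=7))
```

At 12 messages per second with the web servers stopped, a million messages would take roughly 23 hours to clear, so the read rate itself needs fixing, not just the arrival rate.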
Reference Item 4 of this blog post: http://blogs.msdn.com/johnbreakwell/archive/2006/09/18/insufficient-resources-run-away-run-away.aspx
Cheers John Breakwell (MSFT)
I asked one of our sysadmins, and he said our magic number was a maximum of 4 web servers hitting an MSMQ box on virtual machines; they then moved to a hardware box to solve it. Also try a packet capture to see what is going on. Is there much authentication traffic going to AD as well? Given how chatty MSMQ is, you need to limit the network paths and possibly the authentication path.
HTH, Chuck.
Referencing your comment about the lack of remote administration: yes, it's not a great story with MSMQ and perf counters. For anyone following the thread who wants to know which combinations of OSes work, have a look at the Motley Queue blog:
MSMQ 4.0 Performance Counters and the NetNameForPerfCounters Registry Key http://blogs.msdn.com/motleyqueue/archive/2007/12/14/msmq-4-0-performance-counters-and-the-netnameforperfcounters-registry-key.aspx
Cheers John Breakwell (MSFT)