We have a front-end MS Exchange 2003 server, and recently people started complaining that emails weren't being delivered. Looking more closely at the logs, we found this error message:
> Event Type: Error
> Event Source: Service Control Manager
> Event Category: None
> Event ID: 7031
> Date: 6/4/2009
> Time: 11:08:00 AM
> User: N/A
> Computer: <Server>
> Description:
> The IIS Admin Service service terminated unexpectedly. It has done this 39 time(s). The following corrective action will be taken in 1 milliseconds: Run the configured recovery program.
We found MS KB article Q304166 and were able to determine which message in the Exchsrvr\Mailroot\vsi 1 folder was causing the problem by removing them one at a time and restarting the services.
The email causing all this havoc was a 200K PDF file sent to 3,500 addresses. Why would Exchange be crippled so badly? I realize 3,500 is a large number of people to email, but I would have guessed that the SMTP server would throttle the connections and slowly send out this email over the evening or even over a couple of days.
My question:
Is there a way in Exchange to determine maximum SMTP load? Have others seen this same reaction, or should we be looking for a misconfiguration on the server?
When/if we need to send to a large group again, is there a way to gauge how much the server can handle in one batch, or do I need to use Perfmon and just start testing to see how it handles 100, 250, 400, etc.?
It's not a capacity problem-- that's a bug. Any unhandled exception in an application is a bug. (Never let a developer tell you otherwise.)
Are you current on patches?
Edit: Sounds like you've found a bug to me, if you can repro the problem. I have no idea how to actually report such a bug to Microsoft, but it probably needs to be reported.
This is definitely a bug. The SMTP engine in Exchange 2003 is actually built as a set of extensions that are loaded and run by the IIS SMTP engine. If the IIS Admin Service (inetinfo.exe, which hosts the SMTP service) is crashing, it's most likely because of a malformed message (including the possibility of a bad address or misbehaving recipient server)-- not because of the load generated by the size of the file or the number of recipients.
If you wanted to test this further, you could have the user send the message to smaller subgroups of the original 3500 in an effort to narrow down the problem.
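To make the subgroup idea concrete, here is a minimal Python sketch of splitting the list and bisecting it. It is only an illustration: `split_into_batches`, `find_bad_recipient`, and the `triggers_crash` callback are hypothetical names, not part of Exchange or any Microsoft tool. In practice `triggers_crash` would be you sending a test message to that subgroup and checking the event log for Event ID 7031. The bisection also assumes the full list reproduces the crash and that a single address is enough to trigger it, which may not hold.

```python
from typing import Callable, List, Sequence


def split_into_batches(recipients: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the recipient list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(recipients[i:i + batch_size])
            for i in range(0, len(recipients), batch_size)]


def find_bad_recipient(recipients: Sequence[str],
                       triggers_crash: Callable[[Sequence[str]], bool]) -> str:
    """Narrow the list down to one address by halving it repeatedly.

    Assumption: the full list triggers the crash and one address alone
    is responsible. triggers_crash is a hypothetical callback standing in
    for "send a test message to this subgroup and see if the service dies".
    """
    if not recipients:
        raise ValueError("recipients must not be empty")
    candidates = list(recipients)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # Keep whichever half still reproduces the crash.
        if triggers_crash(first_half):
            candidates = first_half
        else:
            candidates = candidates[mid:]
    return candidates[0]
```

Under those assumptions, halving a list of 3,500 addresses takes about 12 test sends, versus seven sends just to identify which batch of 500 contains the problem.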
The server shouldn't have crashed on this. I'd call MS to follow up; I've had great experiences with MS support, and it's well worth it. If you pay for a ticket with MS and the issue turns out to be their fault (as this looks like), they refund the ticket.
As a side note, you can purchase TechNet Plus from MS, which costs less than two support incidents and includes two support calls plus access to test software and managed newsgroups.