I have three linux-based mail routers that run postfix and relay mail to our on-premise exchange server as well as to outlook.com, splitting the mail based on ldap attributes. What I've observed sporadically since upgrading this spring from Exchange 2007 to 2010 is that all three of the mail relays will, for about 20 minutes, fail to connect to exchange.
Postfix logs it as "lost connection with exchange.contosso.edu". The problem almost always hits all three mail relays at the same time, and lasts for slightly under 20 minutes. If I catch it while it's occurring and manually do "telnet exchange.contosso.edu 25" from one mail relay and force a message through (helo, mail from, rcpt to, data, etc), then that relay clears up.
The exchange "server" is actually two machines with the HT role on them, load balanced via windows NLB.
I've worked pretty hard to figure out what's happening from the postfix side and I can't see any evidence of any misbehavior. My question is, how do I attack the problem from the exchange side? Is there a connection log, or a debug setting, or something I can do to log all of the inbound connections and tell me what's causing exchange to drop them?
After numerous false starts (taking the NLB out of the equation, adjusting the postfix queue_run_delay, disabling TCP window scaling on the postfix machines), the solution was to disable "smtp_connection_cache_on_demand", i.e. set it to "no", in postfix's main.cf.
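A minimal sketch of that change, assuming the stock config location /etc/postfix/main.cf (the path is an assumption; adjust it for your install). The parameter defaults to "yes", so it has to be turned off explicitly, and postfix needs a "postfix reload" afterwards to pick it up:

```
# /etc/postfix/main.cf (assumed path)
# Stop postfix from caching and reusing outbound SMTP connections
# on demand for destinations with a backlog of queued mail.
smtp_connection_cache_on_demand = no
```

With this set, each delivery to exchange opens a fresh connection instead of reusing one that the other side may already have dropped.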
I don't know whether exchange was at fault in silently closing the connections, or postfix, or the Windows 2008 TCP stack, or Linux -- but disabling smtp_connection_cache_on_demand solved our problem.