I noticed a problem with postfix logs, that information on some of emails delivery status is missing. The issue affects about 1% of emails.
"Healthy" log:
<server># grep 8EB992EFBB44 postfix_log/mail04.log
Jun 5 03:09:29 mail04 postfix/smtpd[8537]: 8EB992EFBB44: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 5 03:09:29 mail04 postfix/cleanup[34349]: 8EB992EFBB44: message-id=<[email protected]>
Jun 5 03:12:02 mail04 postfix/qmgr[76377]: 8EB992EFBB44: from=<[email protected]>, size=48845, nrcpt=1 (queue active)
Jun 5 03:15:12 mail04 postfix/smtp[35058]: 8EB992EFBB44: to=<[email protected]>, relay=mx.baz.com[xxx.xxx.xxx.xxx]:25, conn_use=70, delay=343, delays=153/190/0/0.24, dsn=2.0.0, status=sent (250 ok)
Jun 5 03:15:12 mail04 postfix/qmgr[76377]: 8EB992EFBB44: removed
"Broken" log:
<server># grep F3C362EF37CA postfix_log/mail04.log
Jun 5 04:03:27 mail04 postfix/smtpd[39666]: F3C362EF37CA: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 5 04:03:27 mail04 postfix/cleanup[41287]: F3C362EF37CA: message-id=<[email protected]>
Jun 5 04:03:28 mail04 postfix/qmgr[76377]: F3C362EF37CA: from=<[email protected]>, size=48892, nrcpt=1 (queue active)
** here should be a log line from postfix/smtp but there is none **
Jun 5 04:03:29 mail04 postfix/qmgr[76377]: F3C362EF37CA: removed
Background information:
system: FreeBSD xxx.xxx.xxx 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Feb 17 02:41:51 UTC 2011 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
Postfix is installed inside jail. Logs are on the same machine, log dir is mounted thru nullfs. The site has spikes of heavy load, causing disks (local) to run at 100%.
Update
The log is rotated daily, current size is ~ 500MB.
I made a test by queuing 99000 messages to same destination (in order to rule out dns/network/mx issues). 5715 messages don't have any DSN record. Failed messages queue time is spread evenly over time, i don't see any time-bound issues.
Some undelivered emails:
envelopeid | processed_time
--------------+----------------------------
8D7652EF3BAE | 2012-06-06 13:19:11.072715
DD53A2EF3C5C | 2012-06-06 13:33:24.374783
8C52F2EF4E3F | 2012-06-06 13:39:15.810616
BBC572EF525C | 2012-06-06 13:44:22.762812
E95822EF54D1 | 2012-06-06 13:52:01.134533
839DD2EF4FBB | 2012-06-06 14:13:48.511236
017EE2EF6234 | 2012-06-06 15:04:48.618963
Those are a few picks, such records of undelivered email occur almost every second.
<server># egrep '(8D7652EF3BAE|BBC572EF525C|017EE2EF6234)' mail04.log
Jun 6 13:19:10 mail04 postfix/smtpd[20350]: 8D7652EF3BAE: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 6 13:19:10 mail04 postfix/cleanup[21024]: 8D7652EF3BAE: message-id=<[email protected]>
Jun 6 13:19:10 mail04 postfix/qmgr[7939]: 8D7652EF3BAE: from=<[email protected]>, size=63718, nrcpt=1 (queue active)
Jun 6 13:19:11 mail04 postfix/qmgr[7939]: 8D7652EF3BAE: removed
Jun 6 13:44:22 mail04 postfix/smtpd[20346]: BBC572EF525C: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 6 13:44:22 mail04 postfix/cleanup[24811]: BBC572EF525C: message-id=<[email protected]>
Jun 6 13:44:22 mail04 postfix/qmgr[7939]: BBC572EF525C: from=<[email protected]>, size=63758, nrcpt=1 (queue active)
Jun 6 15:04:49 mail04 postfix/smtpd[20344]: 017EE2EF6234: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 6 15:04:49 mail04 postfix/cleanup[35585]: 017EE2EF6234: message-id=<[email protected]>
Jun 6 15:04:49 mail04 postfix/qmgr[7939]: 017EE2EF6234: from=<[email protected]>, size=63706, nrcpt=1 (queue active)
<server>#
<server># find /var/spool/postfix/active/ -type f -print | wc -l
1
<server>#
IMPORTANT: As you can see above some of emails doesn't event have the removed
line.
WhIch postfix version is this ?
Have you tried tuning syslog ?
If the disk is at or near its I/O capacity, syslogs may be delayed longer than you expect.
Also, what is "mail04.log" ? How big is it ? Are you certain the entirety of that message was logged in that file ?
Send the logs offsite using either UDP or (better) TCP remote syslog and check that the message is missing from all logs.
If any Postfix processes are configured in
$config_directory/master.cf
to run in the chroot jail, you need to configure syslogd to listen on$queue_directory/dev/log
in addition to the default/dev/log
socket (so that chrooted processes will have access to/dev/log
inside chroot). If you don't do this, initially things seem to work, but logging from chrooted processes breaks when syslogd is restarted for some reason (which may happen during log rotation).