We run a service for one of our clients that sends statements to their customers. Unfortunately, the client's validation of email addresses isn't always the best, so when they submit a batch for us to send, we usually see a lot of messages that fail because of misspellings of common domains (e.g. gnail.com, htmail.com). We collect the bounce-backs and give them to our client so they can fix their email list, but until the bounce-back/NDR occurs, these messages sit in our Postfix mailq and are retried again and again until they reach the queue expiry time.
Is there a good way to have Postfix send an NDR for these messages immediately on the first failure instead of retrying again and again? We only want this to happen for bad domains, though: sometimes we see messages deferred due to greylisting, and they go out on the second attempt.
I have tried my Google-fu, but it seems to be failing me on this one...
Start by rejecting completely invalid mail by adding reject_non_fqdn_recipient and reject_unknown_recipient_domain to the appropriate restriction list.
Postfix can additionally REJECT messages via check_recipient_mx_access and check_recipient_access, but how you generate an appropriate lookup table depends on your acceptable level of error (rejecting non-typos versus accepting typos).
Ideally, never deliver mail to a recipient domain with a small Levenshtein distance to a much more popular recipient domain. The larger the volume difference between the designated recipient domain and a similar-looking target, the less likely it is that the recipient is the intended one (one does not simply operate a large email service whose name is spelled almost like gmail.com).
What is much simpler, and worked for me, was a pcre table with the most commonly mistyped domains hardcoded.
I kept false positives at an acceptable level by never matching both a TLD change and a permutation at once; the big recipient domains do not sound too similar to other legitimate businesses anyway. Why was this sufficient for me, even though I did not match all permutations? Because 90% of typos fall into the same 5% of possible permutations, so it caught most typos.
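As a rough illustration of the restrictions and the hardcoded table described above, here is a minimal sketch. It is not the table I actually used: the file name recipient_typos.pcre, the rejection texts, and every pattern other than the gnail.com and htmail.com examples from the question are assumptions made up for the example.

```
# /etc/postfix/main.cf
# Keep your existing entries. Restrictions are evaluated in order, so these
# need to come before anything that would already permit the recipient.
smtpd_recipient_restrictions =
    reject_non_fqdn_recipient,
    reject_unknown_recipient_domain,
    check_recipient_access pcre:/etc/postfix/recipient_typos.pcre

# /etc/postfix/recipient_typos.pcre  (hypothetical file name, example patterns)
# Permutations of the name, TLD unchanged:
/@(gnail|gmial)\.com$/          REJECT Recipient domain looks like a typo of gmail.com
/@(htmail|hotmial)\.com$/       REJECT Recipient domain looks like a typo of hotmail.com
# Mistyped TLD, name unchanged (never both kinds of change in one pattern):
/@(gmail|hotmail)\.(con|cmo)$/  REJECT Recipient domain has a mistyped TLD
```

A table like this can be checked without sending mail, for example with postmap -q "someone@gnail.com" pcre:/etc/postfix/recipient_typos.pcre, which should print the REJECT line for a matching address and nothing for a correctly spelled one. This assumes your Postfix build includes pcre support.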
If you have at least a few months of logs, you can probably get a pretty good start by grepping for messages that expired from the queue after connection timeouts, removing false positives by hand, and completing the list as I did above.
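The log search could be sketched roughly like this. It assumes the usual Postfix smtp log format with to=<...> and status=deferred (... Connection timed out) on one line; the function name deferred_domains is made up for the example, and counting deferrals is only an approximation of "expired after connection timeouts".

```shell
# Count recipient domains whose deliveries were deferred by connection timeouts.
# Prints "count domain" lines, most frequent first. $1 is the mail log to scan.
deferred_domains() {
  grep 'status=deferred' "$1" \
    | grep -i 'connection timed out' \
    | sed -n 's/.* to=<[^@>]*@\([^>]*\)>.*/\1/p' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'
}
```

Called as deferred_domains /var/log/mail.log (the path depends on your system), it gives a candidate list that still needs pruning by hand, since a legitimate domain whose mail server was down for a while will show up in it too.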