I have been working on getting SpamAssassin up and running for a while now and am pretty close to being finished. However, there is one last thing nagging at me that I can't seem to figure out. I have searched around a bit but have not found a conclusive answer, so I just want a little clarity so I can sleep better at night.
I have read that SpamAssassin needs at least 200 messages, preferably 1,000, to do an effective job of Bayesian filtering. I have been feeding it spam (at least I think so) by issuing the following command:
sa-learn --showdots --mbox --spam spamfolder
As far as I can tell it is being processed by SpamAssassin. So I run:
sa-learn --dump magic
and get the following output:
bruticus@bruticus:~$ sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 306 0 non-token data: nspam
0.000 0 210 0 non-token data: nham
0.000 0 68430 0 non-token data: ntokens
0.000 0 1318421928 0 non-token data: oldest atime
0.000 0 1319141693 0 non-token data: newest atime
0.000 0 1319142287 0 non-token data: last journal sync atime
0.000 0 1319142287 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
Are the values in the nspam and nham rows indicative of the actual number of messages SpamAssassin has learned and is using for its Bayesian analysis?
Do I need to get these two numbers up into the thousands before SpamAssassin really starts doing its job, or how do I know when I have fed it enough mail to start working correctly?
You always need both spam and ham samples. If you feed it only spam, SpamAssassin refuses to activate the Bayesian filter.
By issuing a
spamassassin -D < /path/to/a/complete.mail
you can check whether Bayesian filtering is activated or not (it shows up somewhere in the debug messages).
Hopefully you didn't train SpamAssassin on old spam (months old). It only works well when trained on recent spam that you (personally or as a company) actually received.
If you don't have ham or spam samples right now, you are better off setting SA to autolearn. The filter then gets trained over time. This takes longer and you won't see the benefit right away, but the outcome will impress you in the end.
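The debug output is long, so it can help to cut it down to the Bayes-related lines. A minimal sketch, assuming the debug messages are written to stderr and contain the word "bayes"; the function name bayes_debug is made up for this example and is not part of SpamAssassin:

```sh
# bayes_debug is a hypothetical helper, not a SpamAssassin command.
# Usage: bayes_debug /path/to/a/complete.mail
bayes_debug() {
    # Route stderr (the debug messages) into the pipe and discard the
    # processed mail that spamassassin prints on stdout.
    spamassassin -D < "$1" 2>&1 >/dev/null | grep -i 'bayes'
}
```

If nothing is printed, either the assumption about the message wording is wrong or the Bayes code was never reached, so fall back to reading the full debug output.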
Yes, your numbers show how many messages are currently learned. Once both numbers are greater than 200 you are finished. Everything above that just makes the results more reliable and accurate. With autolearning these numbers will increase over time and also decrease, as statistics from old mail are dropped over time.
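To avoid eyeballing the dump each time, the two counters can be pulled out and compared against the threshold. A rough sketch, assuming the dump keeps the layout shown in the question (count in the third field, name in the last); check_bayes_counts is a made-up name, and 200 is simply the figure quoted in this thread:

```sh
# check_bayes_counts is a hypothetical helper. It reads `sa-learn --dump magic`
# output on stdin; the optional first argument overrides the threshold of 200.
check_bayes_counts() {
    awk -v min="${1:-200}" '
        $NF == "nspam" { nspam = $3 }   # third field holds the count
        $NF == "nham"  { nham  = $3 }
        END {
            ok = (nspam >= min && nham >= min)
            printf "nspam=%d nham=%d min=%d %s\n", nspam, nham, min, (ok ? "ok" : "too few")
            exit (ok ? 0 : 1)
        }'
}

# Demo on the two counter lines from the question.
check_bayes_counts <<'EOF'
0.000 0 306 0 non-token data: nspam
0.000 0 210 0 non-token data: nham
EOF
# prints: nspam=306 nham=210 min=200 ok
```

Against a live database it would be run as: sa-learn --dump magic | check_bayes_counts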