I'm using SpamAssassin on Debian (the default configuration with Pyzor, AWL and Bayes disabled, and sa-compile enabled), and each of the spamd child processes consumes around 100 to 150MB of memory (around 50MB of real memory) on the 32-bit servers, and about double this (logically enough) on the 64-bit servers. There are generally two child processes, but at busy times there can be five (the maximum) running.
ISTM that 200 to 600MB is a lot of memory for this task. I'd like to continue using SA as part of my filtering structure, but it's becoming difficult to justify so much memory.
Are there any ways to reduce the amount of memory that each child process uses? (Or alternatively, to make a single child process so fast that I can set the maximum children to something like 2?) I'm willing to consider any options, including ones that will or may result in reduced accuracy.
I've already read the "Out of Memory Problems" page on the SA wiki; nothing there is of any use. Messages larger than 5 MB are not scanned with SA.
I think you're misunderstanding the way Linux reports memory usage. When a process forks, the resulting child process shares a lot of resources with the original process, including memory. Linux uses a technique known as Copy On Write (COW) for this: each forked child sees the same data in memory as the original process, and only when one of them (child or parent) writes to a page does the kernel copy that page, so that the writer gets its own private copy at a new location.
Until one of the processes makes changes to that data, they are sharing the same copy. As a result, I could have a process that uses 100MB of RAM, and fork it 10 times. Each of those forked processes would show 100MB of RAM being used, but if you looked at the overall memory usage on the box, it might only show that 130MB of RAM is being used (100MB shared between the processes, plus a few MB of overhead, plus another dozen MB or two for the rest of the system).
As a final example, I have a box right now with 30 apache processes running. Each process is showing a usage of 22MB of RAM. However, when I run free -m to show my overall RAM usage, I get:
As you can see, this box doesn't even have enough RAM to run 30 processes that were each using 18MB of "real" RAM. Unless you're literally running out of RAM or your apps are swapping heavily, I wouldn't worry about things.
UPDATE: Also, check out this tool called smem, mentioned by jldugger in the answer to another question on Linux memory usage here.
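To make the shared-versus-private distinction concrete, here is a minimal Python sketch of the kind of accounting smem does. It assumes a Linux kernel that exposes Pss (proportional set size) lines in /proc/<pid>/smaps, and that you have permission to read that file for the process in question. The function names and the sample numbers are made up for illustration; they are not measurements from any real spamd or apache process.

```python
import re


def sum_smaps_kb(smaps_text):
    """Sum the Rss and Pss fields (in kB) over all mappings in an smaps dump.

    Rss counts every resident page in full, even if other processes share it.
    Pss divides each shared page by the number of processes sharing it, so
    adding up Pss across forked children does not double-count COW pages.
    """
    totals = {"Rss": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        # Anchored match, so fields such as "Pss_Anon:" or "SwapPss:" are skipped.
        match = re.match(r"(Rss|Pss):\s+(\d+) kB", line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return totals


def read_smaps(pid):
    """Return the raw text of /proc/<pid>/smaps (Linux only)."""
    with open(f"/proc/{pid}/smaps") as handle:
        return handle.read()


# Hypothetical sample: one 100MB mapping shared by ten processes,
# plus one 2MB mapping that is private to this process.
SAMPLE = """\
Size:             102400 kB
Rss:              102400 kB
Pss:               10240 kB
Size:               2048 kB
Rss:                2048 kB
Pss:                2048 kB
"""

print(sum_smaps_kb(SAMPLE))
```

On a live system, something like sum_smaps_kb(read_smaps(pid)) for each spamd child should give a more honest picture than the per-process figure in top or ps: if Pss is much smaller than Rss, most of that memory is shared with the parent.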
Using sa-compile you might be able to improve the matching speed of many rules.
Here's what I have done.
I have a set-up where a lot of messages tend to be delivered at roughly the same time; for a series of experiments I run SA on messages which are copied to a temporary spool and then delivered by a cron job every five minutes. spamd would keep on printing "maybe you should increase the max-children parameter" and I had it raised up to 40 at one point, but then the server consumed all its swap space and crashed.
Now I have implemented a different regime where delivery is governed by a Procmail lock file. Because it was simple to do, I just use the last digit of the process ID, and run with 10 children. I'm not at all sure this is optimal, but it has already helped avoid the insane load peaks I would experience from time to time.
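For readers who don't use Procmail, the idea of keying a lock on the last digit of the process ID can be sketched in Python. This is not the actual Procmail recipe: it swaps in fcntl.flock for Procmail's own lock files, and the file naming, the function names and the lock directory are all hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


def slot_lock_path(pid, lock_dir, slots=10):
    """Map a process ID to one of `slots` lock files.

    With slots=10 this is simply the last decimal digit of the PID.
    """
    return os.path.join(lock_dir, f"spamc-slot-{pid % slots}.lock")


@contextmanager
def delivery_slot(lock_dir, pid=None, slots=10):
    """Hold an exclusive lock on this process's slot for the duration of the block.

    At most `slots` deliveries can be inside the block at once; any others
    wait in flock() until the holder of their slot is done.
    """
    path = slot_lock_path(os.getpid() if pid is None else pid, lock_dir, slots)
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the slot is free
        try:
            yield path
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

A delivery script would wrap its call to the spam filter in "with delivery_slot(some_lock_dir):". Because PIDs are not spread evenly over the slots, some slots can queue up while others sit idle, which may be part of why this is unlikely to be optimal; it does cap the number of concurrent scans, though.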
In addition, I start up spamd with a number of ulimit restrictions. The numbers were taken out of http://svn.apache.org/repos/asf/spamassassin/trunk/contrib/run-masses except I removed the ulimit -u restriction. (Not sure what's going on. 32 is way too small in any event. With something like 500 I could keep spamd running for a while, but it would eventually run into the limit.)
I guess I will end up with delivery failures if the load is too high for an extended time, but so far, it seems I have managed to reduce the load to manageable levels with this; and a bunch of failed deliveries is still much better than the machine running out of swap.
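If you launch the daemon from a script rather than a shell, the same kind of restriction can be applied with setrlimit. The following is a rough Python sketch, not the run-masses script itself: run_with_limits is a hypothetical helper, the limit value is a placeholder rather than one of the numbers from that script, and the probe command only exists to show that the limit reaches the child.

```python
import resource
import subprocess
import sys


def run_with_limits(cmd, soft_limits):
    """Run `cmd` with the given soft resource limits applied to the child only.

    `soft_limits` maps resource constants (for example resource.RLIMIT_AS,
    the counterpart of `ulimit -v`) to soft limit values. Hard limits are
    left unchanged.
    """
    def apply_limits():
        # Runs in the child after fork() and before exec().
        for res, soft in soft_limits.items():
            _, hard = resource.getrlimit(res)
            resource.setrlimit(res, (soft, hard))

    return subprocess.run(cmd, preexec_fn=apply_limits,
                          capture_output=True, text=True)


# Probe: a child Python process that reports its own open-files soft limit.
PROBE = [sys.executable, "-c",
         "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"]

result = run_with_limits(PROBE, {resource.RLIMIT_NOFILE: 64})
print(result.stdout.strip())
```

For spamd the command would be the real daemon invocation and the dictionary would hold whichever limits you settle on. Note that the units differ from the shell built-in: ulimit -v takes kilobytes, while RLIMIT_AS is in bytes. RLIMIT_NPROC corresponds to the ulimit -u restriction that was removed above.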
High load averages are (sometimes) an indirect symptom that your machine is running out of RAM (and using lots of CPU swapping processes back and forth from virtual memory), so you could try configuring your mail server to not pass mail through SpamAssassin if the load averages are too high.
You don't mention which MTA you're running, but if you're calling SA from an access control list in exim4, then the suggestion at the bottom of this message is effective.
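Independently of the MTA, the "skip scanning when the load is too high" idea boils down to a check like the one below. This is a generic Python sketch, not the Exim ACL from the linked message; the function names and the threshold of 2.0 per CPU are arbitrary assumptions you would need to tune.

```python
import os


def should_scan(load_1min, cpu_count, max_load_per_cpu=2.0):
    """Return True if the 1-minute load average is low enough to run SA.

    The per-CPU threshold is a made-up default, not a recommended value.
    """
    return load_1min <= cpu_count * max_load_per_cpu


def scan_now():
    """Apply should_scan() to the current load of this machine."""
    load_1min, _, _ = os.getloadavg()
    return should_scan(load_1min, os.cpu_count() or 1)
```

A delivery script could call scan_now() and, when it returns False, either deliver the message unscanned or defer it, depending on which failure mode you prefer.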
Also, you can relieve the load on SA, and thus reduce its memory usage, by putting some other, less resource-intensive spam-filtering methods in front of it (i.e. so they process and reject some spam before it gets to SA) - for instance, greylisting and sender verify callouts use relatively little RAM.
We were in a similar situation several months ago. SpamAssassin and ClamAV were using lots of memory on a hosted server. We had the option of adding more memory to the server, but it turned out to be more cost- and time-effective to switch over to Postini. YMMV.