I'm running a small Debian compute cluster on 8-core PCs with 16 GB of RAM. I run batches of about 1k tasks (each batch has a total expected run time of about a month). A single task is single-threaded (so I can run several of them in parallel on each PC), does little IO (it loads several megabytes of data on start and dumps several megabytes on exit; otherwise it does not communicate with the outside world), its run time is unknown (from a few minutes to about a week), and its memory consumption is unknown (from several megabytes to ~8 GB; usage may grow slowly or quickly). I'd like to run as many such tasks as possible in parallel on a single PC, but I want to avoid excessive swapping.
So I got an idea: I could monitor the memory usage of these tasks and suspend (kill -SIGSTOP) or hibernate (using a tool like CryoPID) tasks which consume too much memory, and restart them later. By memory usage I mean the number of "active virtual pages": allocated, non-shared memory pages that have actually been touched (these tasks may allocate memory without ever using it).
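For reference, the closest proxy I've found for "allocated and touched, non-shared" pages is summing the Private_Dirty entries of /proc/&lt;pid&gt;/smaps; a rough sketch (the helper names are mine, and Private_Dirty is only an approximation of what I described):

```python
import re

def private_dirty_kib(smaps_text: str) -> int:
    """Sum the Private_Dirty fields (KiB) of a /proc/<pid>/smaps dump.

    Private_Dirty counts pages the process has actually written to and
    does not share with anyone else, unlike VSZ, which also counts
    allocations that were never touched.
    """
    return sum(int(m) for m in
               re.findall(r"^Private_Dirty:\s+(\d+) kB", smaps_text, re.M))

def measure(pid: int) -> int:
    """Current private-dirty usage of one task, in KiB (Linux only)."""
    with open(f"/proc/{pid}/smaps") as f:
        return private_dirty_kib(f.read())
```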
I started looking for tools to do that. I know that I can use ulimit or run a task inside a memory-limited cgroup, but, if I understand them correctly, these solutions kill the process instead of suspending it. I want to avoid killing tasks, because I would have to start them from scratch later, which means wasted time. Also, neither actually measures the number of active virtual pages.
I could use real virtual machines, but they seem to carry significant overhead in this case: each has its own kernel, memory allocations, and so on, which would cut into the memory available to the tasks, and I'd have to run 8 of them. As far as I know, they would add computational overhead too.
I imagine that a tool implementing such behavior would hook into page-fault notifications and decide, on each fault, whether it is time to suspend the process. But I don't know of any tool that works this way either.
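Short of such a page-fault hook, the best I can see is a polling loop; a sketch of what I have in mind (the threshold and helper names are placeholders, and VmRSS is a cruder metric than the "active pages" count I'd prefer):

```python
import os
import signal

def rss_kib(pid: int) -> int:
    """Resident set size in KiB, read from /proc/<pid>/status (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def stop_if_over(pid: int, limit_kib: int) -> bool:
    """Suspend the task with SIGSTOP if it exceeds the memory limit.

    A stopped task keeps its pages (the kernel can still swap them out
    under pressure); SIGCONT later resumes it exactly where it was.
    """
    if rss_kib(pid) > limit_kib:
        os.kill(pid, signal.SIGSTOP)
        return True
    return False
```

A companion loop would send SIGCONT to a stopped task once memory frees up.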
Are there other choices?
What you are referring to is process checkpointing. There is some work in recent kernels to offer this (in conjunction with the freezer cgroup), but it's not ready yet.
This is unfortunately very difficult to achieve well, because certain shared resources go stale after being unavailable for some period of time (TCP springs to mind, although this may also apply to applications that use a wall clock, or to shared memory whose state changes while the process is offline).
As for stopping the process when it reaches a certain memory utilization, there's a hack I can think of that will do this: put the tasks in a memory cgroup, register a notification through cgroup.event_control, and set a memory threshold that you do not want to exceed (this is somewhat explained in the kernel cgroup documentation); when the notification fires, move the task into a frozen freezer cgroup. Note the freezer cgroup will not evict pages to a persistent medium, but they will be swapped out once enough time has passed and the memory is needed for something else.
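Concretely, under cgroup v1 the notification is armed by writing "&lt;event_fd&gt; &lt;fd of memory.usage_in_bytes&gt; &lt;threshold&gt;" into cgroup.event_control; a rough sketch of the plumbing (requires root, a v1 memory and freezer hierarchy mounted under /sys/fs/cgroup, and Python 3.10+ for os.eventfd; the helper names are mine):

```python
import os

def event_control_line(event_fd: int, usage_fd: int, threshold: int) -> str:
    """The line cgroup v1 expects in cgroup.event_control: an eventfd to
    signal, an open fd on memory.usage_in_bytes, and a byte threshold."""
    return f"{event_fd} {usage_fd} {threshold}"

def arm_threshold(memcg: str, threshold: int) -> int:
    """Arm a memory-threshold notification on a v1 memory cgroup.

    Returns an eventfd; a read on it blocks until usage crosses the
    threshold. Requires root.
    """
    efd = os.eventfd(0)                       # Linux only, Python 3.10+
    ufd = os.open(f"{memcg}/memory.usage_in_bytes", os.O_RDONLY)
    with open(f"{memcg}/cgroup.event_control", "w") as f:
        f.write(event_control_line(efd, ufd, threshold))
    return efd

def freeze(freezer_cg: str) -> None:
    """Freeze every task in a v1 freezer cgroup once the event fires."""
    with open(f"{freezer_cg}/freezer.state", "w") as f:
        f.write("FROZEN")
```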
Even if this does work (it's pretty hacky if it does), you need to consider whether it is really doing anything to solve your problem: frozen tasks still hold on to their memory, so you only gain if that memory can be reclaimed in the meantime. If the concern is CPU oversubscription rather than memory, the scheduler already arbitrates between runnable tasks, and you can tune its behavior with knobs such as sched_min_granularity_ns. Unfortunately, the best solution would be the ability to checkpoint your tasks. It's a shame that most of the implementations are just not concrete enough yet.
Alternatively, you could wait a couple of years for proper checkpoint/restore to be available in the kernel!
I'm guessing this question is a bit over my head, or I misunderstood it, but are you looking for something like running ps auxww and checking the VSZ column? Then, if VSZ hits a certain amount, you send your signal to that process, and you just run the command at your favorite interval?
From the ps man page:

    vsz    VSZ    virtual memory size of the process in KiB (1024-byte
                  units). Device mappings are currently excluded; this
                  is subject to change. (alias vsize).
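In script form, that interval check could look like this (the parsing helper and threshold are invented for the example; note VSZ counts reserved, not necessarily touched, memory):

```python
import subprocess

def parse_vsz(ps_output: str, threshold_kib: int):
    """Pick (pid, vsz_kib) pairs above the threshold out of
    `ps -eo pid=,vsz=` output (VSZ is in KiB per the ps man page)."""
    hits = []
    for line in ps_output.splitlines():
        pid, vsz = line.split()
        if int(vsz) > threshold_kib:
            hits.append((int(pid), int(vsz)))
    return hits

def over_vsz(threshold_kib: int):
    """Run ps and return the processes currently above the threshold."""
    out = subprocess.run(["ps", "-eo", "pid=,vsz="],
                         capture_output=True, text=True, check=True).stdout
    return parse_vsz(out, threshold_kib)
```

Run it from cron or a watch loop, sending SIGSTOP to each hit instead of just listing it.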