There are many open source message queues for Linux, and I need some help deciding which one to use.
My problem is simple: I get sent a list of files that need to be processed. A job can't be split up, but jobs are self-contained and can be spread across multiple computers.
I'm thinking of solving this using a message queue. Multiple clients send a message to a central queue. Each queue has a number of subscribers that will take jobs from that queue when they have finished processing the current job.
Ideally it should have the following qualities:
- The message queue must be able to store unprocessed messages across a shutdown/reboot
- A job can only be processed by a single subscriber (I don't want duplicate jobs)
- The subscribers should be able to send jobs of their own, which will be processed by a different set of subscribers
Can anyone suggest a simple to use message queue?
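You have RabbitMQ and ZeroMQ, but afaik ZeroMQ doesn't store unprocessed messages in case of a crash. They're both open source. RabbitMQ implements AMQP, an open messaging protocol; ZeroMQ is a brokerless messaging library with its own wire protocol (ZMTP), so there is no central broker to hold messages for you.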
A very simple one to use is MemcacheQ, which uses the same API as memcached, so you can use the same client libraries. It uses a Berkeley DB (BDB) backend, so unlike memcached it's not RAM-only.
Beanstalkd is a simple job-queue system that matches your basic needs. It can use a binary log to provide persistence if the queue itself fails, and it will only allow one worker to hold a job at a time. Jobs are also reserved with a timeout: if a job is not deleted or released back to the queue before the timeout expires, it is made available again (in case of worker problems).
I did a presentation on beanstalkd for a local user-group, which has some more information.
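To make the reserve/timeout behaviour concrete, here is a minimal in-memory sketch. It is not Beanstalkd's client API; the class `ReservingQueue`, its method names, and the fake clock are all made up for illustration, and only mimic the idea that one worker holds a job until it deletes it or the timeout runs out.

```python
import itertools
from collections import deque


class ReservingQueue:
    """Hypothetical in-memory stand-in for reserve/delete/timeout semantics."""

    def __init__(self, ttr_seconds, clock):
        self.ttr = ttr_seconds
        self.clock = clock            # callable returning the current time in seconds
        self.ready = deque()          # (job_id, body) pairs waiting for a worker
        self.reserved = {}            # job_id -> (body, deadline)
        self._ids = itertools.count(1)

    def put(self, body):
        job_id = next(self._ids)
        self.ready.append((job_id, body))
        return job_id

    def _expire(self):
        # Jobs whose worker overran the timeout go back to the ready list.
        now = self.clock()
        for job_id, (body, deadline) in list(self.reserved.items()):
            if now >= deadline:
                del self.reserved[job_id]
                self.ready.append((job_id, body))

    def reserve(self):
        """Hand the next ready job to exactly one caller, or None if there is none."""
        self._expire()
        if not self.ready:
            return None
        job_id, body = self.ready.popleft()
        self.reserved[job_id] = (body, self.clock() + self.ttr)
        return job_id, body

    def delete(self, job_id):
        """Called by the worker once the job is finished."""
        self.reserved.pop(job_id, None)


now = [0.0]                                   # fake clock so the example runs instantly
q = ReservingQueue(ttr_seconds=30, clock=lambda: now[0])
q.put("file-a.dat")

first = q.reserve()       # worker 1 takes the job
second = q.reserve()      # worker 2 gets nothing while the job is reserved
now[0] += 31              # worker 1 "crashes": the timeout passes without a delete
third = q.reserve()       # the same job is handed out again
q.delete(third[0])        # worker 2 finishes it, so it is gone for good
```

The point of the timeout is that a crashed worker cannot hold a job forever. The flip side is that a slow worker can see its job handed to someone else, so the timeout should be comfortably longer than the longest expected job.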
I just went through this in my latest architecture planning.
Basically, "the message queues" all have the same problem: none of them guarantees both of the following characteristics at the same time:
- Guaranteed receipt of every message
- Guaranteed absence of duplicate messages
So what is currently offered as an open source solution cannot perform these two imperative tasks simultaneously (unless you want to spend 50K with IBM).
There is one great video which suggests that Cassandra can handle this with quorum reads/writes, but it does not take high-scale concurrency into account :/
In the end I settled on Redis, actually (I avoided the clustered solution).
It is simple and effectively single-threaded (which avoids duplicates). It offers an atomic blocking pop (BLPOP) or even a pub/sub feature for broadcasting to queue workers.
A homegrown solution was developed to manage "lost jobs" that never arrived (for reliability).
It's quite a simple model actually, and seemingly easy to maintain as well.
Hope this helps.
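As a rough illustration of that model (not the actual homegrown code, and not Redis itself), here is a sketch using only the Python standard library. `queue.Queue.get()` stands in for the blocking pop, and the `in_flight` ledger and `requeue_lost` helper are hypothetical names for the "lost jobs" bookkeeping.

```python
import queue
import threading
import time

jobs = queue.Queue()          # stand-in for the Redis list; get() blocks like a blocking pop
in_flight = {}                # job -> time it was taken (hypothetical "lost job" ledger)
done = []                     # (worker, job) pairs, kept so duplicates can be checked
lock = threading.Lock()
STOP = object()               # sentinel telling a worker to exit


def worker(name):
    while True:
        job = jobs.get()      # each item is delivered to exactly one thread
        if job is STOP:
            break
        with lock:
            in_flight[job] = time.monotonic()
        # ... process the file here ...
        with lock:
            done.append((name, job))
            del in_flight[job]


def requeue_lost(max_age_seconds):
    """Put back jobs that were taken but not finished within max_age_seconds."""
    cutoff = time.monotonic() - max_age_seconds
    with lock:
        lost = [job for job, taken in in_flight.items() if taken < cutoff]
        for job in lost:
            del in_flight[job]
    for job in lost:
        jobs.put(job)
    return lost


threads = [threading.Thread(target=worker, args=(f"w{i}",)) for i in range(4)]
for t in threads:
    t.start()
for n in range(100):
    jobs.put(f"file-{n}")
for _ in threads:
    jobs.put(STOP)            # one sentinel per worker, queued after all real jobs
for t in threads:
    t.join()

processed = sorted(job for _, job in done)
```

Note the trade-off this makes visible: the pop alone gives "no duplicates", and it is the requeue step that buys back reliability at the risk of running a slow job twice. In Redis itself, a command such as BRPOPLPUSH (BLMOVE in newer versions) can move the item to a second list atomically, which closes the gap between taking a job and recording that it was taken.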
I've used Beanstalkd for this type of task. It can be configured to persist jobs to disk between reboots. To help with removing duplicates, I pushed each job identifier into memcached: if the identifier was already in memcached, I dropped the job rather than queueing it in Beanstalkd.
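The duplicate check looks roughly like the sketch below. It runs without any servers: `SeenStore` is a dict-backed stand-in that imitates memcached's `add` command (store only if the key is absent), a `deque` stands in for the Beanstalkd queue, and `enqueue_once` and the job names are invented for the example.

```python
from collections import deque


class SeenStore:
    """Dict-backed stand-in for memcached; add() stores a key only if it is absent."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        if key in self._data:
            return False
        self._data[key] = value
        return True


def enqueue_once(seen, job_queue, job_id, payload):
    """Queue the job unless its identifier has been seen before."""
    if not seen.add(job_id, 1):
        return False          # duplicate: drop it instead of queueing
    job_queue.append((job_id, payload))
    return True


seen = SeenStore()
job_queue = deque()           # stand-in for the Beanstalkd queue
results = [
    enqueue_once(seen, job_queue, "report-17", "/data/report-17.csv"),
    enqueue_once(seen, job_queue, "report-18", "/data/report-18.csv"),
    enqueue_once(seen, job_queue, "report-17", "/data/report-17.csv"),  # repeat
]
```

Doing the check with a single add-if-absent call, rather than a get followed by a set, means two producers racing on the same identifier can't both get through. Keep in mind that memcached can evict keys, so this is a best-effort filter and not a guarantee.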