I have 7 computers running Gentoo Linux with quad-core processors, and I want to be able to distribute the execution of a program across all of those machines. I have some multi-threaded programs, and I want to use all 28 available cores on my cluster instead of running 7 copies of the program, one on each node.
It's like the idea of distcc: I have my C/C++ project, and if I compile the sources with distcc instead of gcc, it distributes the compilation across several computers, and I don't have to change anything, not even the Makefile.
For the cluster, it'd be better if I didn't have to change anything in my program's source code (although I think that's impossible). But I can change the program to use an external API, if needed.
There are a few ways to do this but I doubt any of them will let you run your code as is.
Hadoop seems to be a good option for certain types of workloads and is widely used and maintained by Yahoo and others.
A Beowulf cluster is more of a traditional cluster. If you look at the Beowulf Wikipedia page, there are links to alternatives as well as to Linux distros that focus on clusters, such as Rocks.
From your question it sounds like you want all of your machines to bind together magically to make one big computer that you can log into and run programs on. This magic is called SSI (Single System Image), and there are a number of clustering packages that do this. Try any of the ones listed on the Wikipedia page.
If you want a traditional cluster or grid setup, you just need a job manager like Torque or Grid Engine. Rocks is a quick way to get started with this type of setup.
The answer to this is highly application-specific. 3dinfluence already mentioned the possibility of Hadoop, which is great if your application breaks down into the Map-Reduce execution model.
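To get a feel for whether a job fits that model, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop's API at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are made up for illustration. In a real framework the map and reduce calls would run on different nodes and the shuffle would be done for you.

```python
from collections import defaultdict

def map_phase(record):
    """Map: turn one input record (a line of text) into (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group all values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce in sequence (hypothetical driver, one process)."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = run_job(["the quick brown fox", "the lazy dog", "The end"])
```

If your program's work can be written as independent `map_phase` calls followed by a per-key combine like this, it is probably a reasonable Map-Reduce candidate; if the pieces need to talk to each other while they run, it probably is not.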
If you are planning to distribute your workload to multiple nodes but still want to have just one instance of your application in a thread-like execution model, you need to look into some form of MPI.
MPI is a standard with a common interface, but there are multiple implementations of it, such as Open MPI and MPICH. Essentially, you design your application to spawn multiple copies that pass messages to each other. MPI abstracts away the actual method of communication and gives you a set of primitive operations, such as send, receive, and broadcast, to use in your application design. The communication itself is then handled by a module in your chosen MPI stack.
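The "multiple copies" idea usually means every copy runs the same code and uses its own rank to decide which part of the work it owns. The sketch below shows only that decomposition step in plain Python, with the ranks simulated by a loop in one process. It is not MPI code: in a real MPI program the rank and size would come from the library (`MPI_Comm_rank` and `MPI_Comm_size` in the C API) and the final sum would be a reduce operation such as `MPI_Reduce`. The helper names `block_range` and `local_sum` are hypothetical.

```python
def block_range(n_items, rank, size):
    """Return the half-open [start, stop) slice of the work owned by `rank`.

    Items are split as evenly as possible; the first `n_items % size`
    ranks each take one extra item.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def local_sum(data, rank, size):
    """The work one copy of the program would do on its own slice."""
    start, stop = block_range(len(data), rank, size)
    return sum(data[start:stop])

SIZE = 28                  # 7 nodes x 4 cores, as in the question
data = list(range(1000))   # stand-in workload

# Simulate all ranks in one process; under MPI each rank is a separate process.
partials = [local_sum(data, rank, SIZE) for rank in range(SIZE)]

# Stands in for a reduce that collects the partial results on one rank.
total = sum(partials)
```

The point is that the program has to be written around this kind of explicit split, which is why existing multi-threaded code generally cannot be dropped onto MPI unchanged.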
Open MPI includes many transports, including shared memory, TCP/IP, InfiniBand, Myrinet Express, and a bunch more. Which of those you use and how you configure them is again highly application-dependent.
Usually your MPI tasks will be allocated nodes on the cluster by some sort of batch queueing system such as Torque or Sun Grid Engine. Such schedulers become more useful if you share your cluster among multiple users and need to schedule your cluster resources.
I suggest you check out the Gentoo Cluster project site and have a look through some of the linked resources. These will help you get a better understanding of running applications in a clustered environment, and help you narrow down which areas you need more help with.
You might consider testing this idea with a LiveCD distribution like Cluster Knoppix.