I am looking for a job management solution for local processes, which usually run for several weeks. At the moment I use Jenkins, but there the server cannot be restarted (for security updates) and there is no redundancy. If one server goes offline, all of its jobs should be rebalanced onto the servers that are still online. It is okay to just start the script again with the same parameters, but it should be possible to disable this behavior. It should also be easy to add and remove servers.
I don't need a full solution for everything, but I have searched for software like this and did not really find what I was looking for. I appreciate any hints (including search keywords) pointing in the right direction. So far I have basically only found CI software, but I want a solution that tolerates server failures.
There are a number of ways to skin this cat.
One solution is to use a 'workflow' tool chain. Generally you start with a message queue, which is where you queue up jobs: something like RabbitMQ, Redis, or AWS SQS. That is followed by some sort of task runner or executor, like Sidekiq or Celery.
The benefit of this sort of workflow is that you can scale it out, and that it handles things like failed tasks, failed servers, retry logic, reporting, etc.
You can spin up clusters of the queue/database component and clusters of the worker component, which lets you build in redundancy.
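To make the queue-plus-worker idea concrete, here is a minimal single-process sketch in Python using only the standard library. It is not how RabbitMQ, Celery, or Sidekiq work internally; the names `run_jobs`, `handler`, and `max_attempts` are made up for illustration. It only shows the pattern: a failed job goes back on the queue until it succeeds or runs out of attempts.

```python
import queue


def run_jobs(jobs, handler, max_attempts=3):
    """Run each job through handler, re-queueing failures up to max_attempts.

    Hypothetical helper for illustration; a real setup would use a
    network-backed queue so that several servers can pull jobs.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put((job, 1))  # (job, attempt number)

    done, failed = [], []
    while not pending.empty():
        job, attempt = pending.get()
        try:
            handler(job)
            done.append(job)
        except Exception:
            if attempt < max_attempts:
                # Put the job back so it is tried again later.
                pending.put((job, attempt + 1))
            else:
                failed.append(job)
    return done, failed
```

Setting `max_attempts=1` turns the automatic restart off, which corresponds to the "it should be possible to disable this behavior" requirement in the question.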
There are also compute schedulers, something like Kubernetes. Here you play Tetris with the available resources you have across servers, and jobs will be scheduled until you run low on resources.
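The "Tetris" picture can be illustrated with a toy first-fit placement in Python. This is only a sketch of the general idea, not Kubernetes' actual scheduling algorithm, which takes many more factors into account; the function `schedule` and the single "CPU units" resource are assumptions made for the example.

```python
def schedule(jobs, servers):
    """Place each job on the first server with enough free CPU (first-fit).

    jobs:    list of (job_name, cpu_needed)
    servers: dict of server_name -> free CPU units
    """
    free = dict(servers)  # copy so the caller's dict is not modified
    placement, unscheduled = {}, []
    for name, cpu in jobs:
        for server, available in free.items():
            if cpu <= available:
                free[server] = available - cpu
                placement[name] = server
                break
        else:
            # No server had enough room left for this job.
            unscheduled.append(name)
    return placement, unscheduled
```

If a server disappears, you would call `schedule` again with the remaining servers, which is roughly the rebalancing the question asks for.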
A third solution could be to use process monitoring tools, such as Monit or Supervisord, which are designed to monitor processes and restart them when they go down. This approach requires you to handle most of the edge cases yourself, but would likely be easier to get going quickly.
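What such a monitoring tool does at its core can be sketched in a few lines of Python. This is not how Monit or Supervisord are implemented or configured; `supervise`, `max_restarts`, and `delay` are hypothetical names, and a real tool adds logging, backoff, and daemonizing on top.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay=0.1):
    """Run cmd; restart it when it exits non-zero, up to max_restarts times.

    Returns (last exit code, number of restarts performed).
    """
    restarts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code == 0 or restarts >= max_restarts:
            return code, restarts
        restarts += 1
        time.sleep(delay)  # short pause before starting the process again


if __name__ == "__main__":
    # Example: a command that always fails is restarted max_restarts times.
    print(supervise([sys.executable, "-c", "import sys; sys.exit(1)"]))
```

Note that this only restarts the process on the same machine; it does nothing if the whole server goes offline, which is one of the edge cases you would have to cover yourself.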
A fourth, simpler still, solution could be to use something like cron jobs or Windows scheduled tasks. Here your code runs on a schedule. You could scale this out by adding more servers, but you run into the same issue as with the solution above: you'd have to handle things like race conditions in your own code.
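One common way to deal with the race condition is a lock that only one run can take. The Python sketch below uses an atomically created lock file; `try_lock` and `release_lock` are made-up helper names, and the sketch assumes a local filesystem. To coordinate several servers you would need a lock in something they all share (a database row, for example), and you would still have to decide what to do with a stale lock left behind by a crashed run.

```python
import os


def try_lock(path):
    """Atomically create a lock file; return True only if we created it."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so two runs
        # cannot both succeed here.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(path):
    """Remove the lock file so the next scheduled run can start."""
    os.remove(path)
```

A scheduled script would call `try_lock` first, exit immediately if it returns `False`, and call `release_lock` in a `finally` block once the work is done.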
All of the above solutions can be managed with infrastructure and config management tools such as Terraform and Ansible, which allow you to keep things uniform and to ease updates and redeployments.