Looking at using DRBD or a clustered file system to help with uptime when downtime strikes in a small business environment.
We currently use a single server box as a file server running Linux and Samba, with the web server and database running in a VM. I was looking at adding a second server and putting the files and the VM onto the distributed file system. The base OS is more static and can easily be managed manually (copy config files at the time of a change, restore the base OS from full backups if needed, etc.).
My question is about the failover scenario when it is done manually. If server 1 goes down, is failover completed by simply setting server 2's static IP to server 1's address (again, server 1 is down and would be in a state of needing repair), starting Samba, starting the VM (which would keep the same static IPs it had when running on server 1), and starting the backup services?
This sounds like a quick and simple process, almost too simple. Am I missing something? This could also easily be automated through a script that someone with little proficiency could be directed to run in the event of a failure.
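Something like the sketch below is what I have in mind. The DRBD resource name (r0), mount point, interface, service IP, VM name and unit names are all just placeholders, and I'm assuming the VM runs under libvirt:

```
#!/bin/bash
# Manual failover sketch -- run on server 2 only after confirming that
# server 1 is really down. All names below are placeholders for our setup.
set -e

# Promote the DRBD resource and mount the replicated data.
drbdadm primary r0
mount /dev/drbd0 /srv/data

# Take over the service address that clients expect (server 1's old IP).
ip addr add 192.168.1.10/24 dev eth0

# Start the file server (unit names are smbd/nmbd on Debian, smb/nmb on RHEL).
systemctl start smbd nmbd

# Start the VM with the web server and database (libvirt domain name made up).
virsh start webdb-vm

# Finally, start the backup services.
systemctl start backup.service
```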
Without a second server, downtime from a hardware failure could easily be days, given the lack of on-call IT support and the wait for parts. With the second server, downtime would be at most a matter of hours (if no one in the office is proficient enough to perform these operations; minutes if someone is).
The failover process you're describing is as simple as it is correct. Using DRBD is the key step to creating redundancy, as you eliminate a single point of failure such as shared storage.
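For illustration, a minimal two-node DRBD resource for the share and the VM images could look roughly like this sketch (hostnames, backing disks and addresses are placeholders you'd adapt to your environment):

```
# /etc/drbd.d/r0.res -- sketch only; hostnames, backing disks and addresses
# must match your actual nodes.
resource r0 {
    device    /dev/drbd0;
    meta-disk internal;

    net {
        protocol C;          # synchronous replication: both nodes ack every write
    }

    on server1 {
        disk    /dev/vg0/data;
        address 192.168.1.1:7789;
    }
    on server2 {
        disk    /dev/vg0/data;
        address 192.168.1.2:7789;
    }
}
```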
The failover you mentioned can easily be automated with Pacemaker/Corosync, so there's no need for manual intervention. I would prefer this over self-written scripts, as it also takes care of fencing failed nodes so that you don't run into a split-brain scenario (which could screw up all your data).
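As a rough sketch (not a drop-in config), the setup you described could be expressed with pcs along these lines, assuming a DRBD resource named r0, a mount point of /srv/data and a libvirt-managed VM:

```
# Sketch of the equivalent Pacemaker resources via pcs; resource names,
# paths and the IP are placeholders. Fencing (stonith) has to be set up
# separately for this to be safe.
pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 promotable

pcs resource create fs_data ocf:heartbeat:Filesystem \
    device=/dev/drbd0 directory=/srv/data fstype=ext4
pcs resource create service_ip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.10 cidr_netmask=24
pcs resource create samba systemd:smb
pcs resource create webdb_vm ocf:heartbeat:VirtualDomain \
    config=/etc/libvirt/qemu/webdb-vm.xml

# Keep everything on the node where DRBD is primary and start it in order.
pcs resource group add fileserver fs_data service_ip samba webdb_vm
pcs constraint colocation add fileserver with drbd_r0-clone INFINITY with-rsc-role=Master
pcs constraint order promote drbd_r0-clone then start fileserver
```

With something like that in place, Pacemaker moves the whole group to the surviving node on its own, but only once fencing is configured, which is exactly the part a hand-written script tends to get wrong.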
Keep in mind that "real" HA requires complete (or at least the maximum achievable) separation of systems: separate rooms (or at least racks), separate UPSes, redundant switching, etc. Single points of failure usually undermine your whole effort to improve availability.