If you can't afford or don't need a cluster or a spare server waiting to come online in the event of a failure, it seems like you might split the services provided by one beefy server onto two less beefy servers. Thus if Server A goes down, clients might lose access to, say, email, and if Server B goes down, they might lose access to the ERP system.
While at first this seems like it would be more reliable, doesn't it simply increase the chance of hardware failure? So any one failure isn't going to have as great an impact on productivity, but now you're setting yourself up for twice as many failures.
When I say "less beefy", what I really mean is lower component spec, not lower quality. So one machine spec'd out for virtualization vs. two servers each spec'd out for less load.
Oftentimes a SAN is recommended so that you can use either clustering or migration to keep services up. But what about the SAN itself? If I were to put money on where a failure is going to occur, it wouldn't be the basic server hardware; it would be something to do with storage. If you don't have some sort of redundant SAN, those redundant servers wouldn't give me a great feeling of confidence. Personally, for a small operation it would make more sense to me to invest in servers with redundant components and local drives. I can see a benefit in larger operations where the price and flexibility of a SAN is cost effective. But for smaller shops I'm not seeing the argument, at least not for fault tolerance.
This all boils down to risk management. Doing a proper cost/risk analysis of your IT systems will help you figure out where to spend the money and what risks you can, or have to, live with. There's a cost associated with everything... and that includes both HA and downtime.
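As a rough illustration of that cost/risk trade-off, here is a back-of-the-envelope comparison; every number below is made up for the example and should be replaced with your own estimates:

```python
# Back-of-the-envelope cost/risk comparison with made-up example numbers.
# Swap in your own estimates; the point is the comparison, not the figures.

outage_probability_per_year = 0.10   # chance of a serious outage in a given year
expected_outage_hours = 8            # how long recovery takes without HA
cost_per_hour_of_downtime = 500      # lost productivity/revenue per hour (USD)

expected_annual_downtime_cost = (
    outage_probability_per_year * expected_outage_hours * cost_per_hour_of_downtime
)

annual_cost_of_ha = 3000             # amortized cost of the redundant hardware/licences

print(f"Expected annual downtime cost: ${expected_annual_downtime_cost:,.0f}")
print(f"Annual cost of HA:             ${annual_cost_of_ha:,.0f}")
print("HA pays for itself" if annual_cost_of_ha < expected_annual_downtime_cost
      else "Living with the risk is cheaper")
```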
I work at a small place, so I understand this struggle. The IT geek in me wants no single points of failure anywhere, but the cost of doing that at every level is not realistic. Here are a few things I've been able to do without a huge budget; they don't always remove the single point of failure, though.
Network Edge: We have two internet connections, a T1 and Comcast Business. We're planning to move our firewall over to a pair of old computers running pfSense, using CARP for HA.
Network: Getting a couple of managed switches for the network core and using bonding to split the critical servers between the two switches prevents a switch failure from taking out the entire data closet.
Servers: All servers have RAID and redundant power supplies.
Backup Server: I have an older system that isn't as powerful as the main file server, but it has a few large SATA drives in RAID 5 and takes hourly snapshots of the main file server. I have scripts set up so it can switch roles and become the primary file server should the main one go down (see the sketch after this list).
Offsite Backup Server: Similar to the onsite backup, we do nightly backups over a VPN tunnel to a server at one of the owners' houses.
Virtual Machines: I have a pair of physical servers that run a number of services inside virtual machines using Xen. These run off an NFS share on the main file server, and I can do live migration between the physical servers if the need arises.
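To give an idea of the kind of scripting involved, here is a minimal Python sketch of the hourly-mirror-plus-promotion idea from the backup-server item above. The hostnames, paths, and the promotion step are hypothetical placeholders; a real setup would use whatever copy mechanism (rsync, LVM or filesystem snapshots) and failover method (DNS change, IP takeover) fits your environment.

```python
#!/usr/bin/env python3
"""Hourly pull of the main file server, plus a crude promotion check.

This is only a sketch: hostnames, paths and the promotion step are
placeholders, and a real deployment would add locking, logging and alerting.
"""
import subprocess

PRIMARY = "fileserver.example.local"     # hypothetical primary file server
SOURCE = f"{PRIMARY}:/srv/shares/"       # data to mirror
DEST = "/srv/shares-mirror/"             # local copy on the backup box

def primary_is_up() -> bool:
    """Consider the primary up if it answers a single ping."""
    return subprocess.run(["ping", "-c", "1", "-W", "2", PRIMARY],
                          capture_output=True).returncode == 0

def pull_mirror() -> None:
    """Mirror the primary's share to local disk (run hourly from cron)."""
    subprocess.run(["rsync", "-a", "--delete", SOURCE, DEST], check=True)

def promote_to_primary() -> None:
    """Placeholder: serve the share locally and repoint clients
    (e.g. start the Samba/NFS exports and swap a DNS record or takeover IP)."""
    subprocess.run(["systemctl", "start", "smb"], check=False)

if __name__ == "__main__":
    if primary_is_up():
        pull_mirror()
    else:
        promote_to_primary()
```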
I think this is a question with many answers, but I would agree that in many smaller shops the several-server solution works and, as you say, at least something keeps going if there is a failure. It does depend on what fails, though.
It's very hard to cover all bases, but redundant power supplies, good-quality power, and good backups can help.
We have used Backup Exec System Recovery for some critical systems. Not so much for daily backup but as a recovery tool. We can restore to different hardware, if available, and we also use the software to convert the backup image to a Virtual Machine. If the server fails and we need to wait for hardware repairs, we can start a VM on a different server or workstation and limp along. Not perfect but it can be up and running quickly.
Regarding SANs: almost any SAN you use will be redundant internally. Even if it's a single enclosure, inside will be dual power supplies, dual connectors, and dual 'heads', each with links to all the disks. Even something as simple as an MD3000 sold by Dell has all these features. SANs are designed to be the core of your infrastructure, so they're built to survive just about any random hardware failure.
That being said, you have a point that redundancy isn't always the best option, ESPECIALLY if it increases complexity (and it will). A better question to ask is: "How much downtime will the company accept?" If the loss of your mail server for a day or two isn't a big deal, then you probably shouldn't bother with two of them. But if a web server outage starts losing you real money every minute, then maybe you should spend the time building a proper cluster for it.
The more servers you have, the more chances of something breaking; that's one way of looking at it. Another is that if your only server breaks, you're up the creek 100%, just as you're saying.
The most common hardware failure is hard drives, as you said above. Regardless of how many machines you split operations between, you need to be RAIDing your storage.
I would vote for a couple of servers (RAIDed, of course) instead of one massive one, both for operational stability and for performance: less software bumping into each other asking for resources, reduced clutter, more disks to read from and write to, and so on.
I would personally opt for multiple servers. I don't think equipment failure is more likely in this scenario. Yes, you have more equipment that could fail, but the odds of any given unit failing should be constant.
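To put rough numbers on that (purely illustrative, assuming independent failures and a made-up 5% annual failure rate per server): with two servers the chance that something fails roughly doubles, but the chance that everything is down at once becomes tiny.

```python
# Illustrative only: assumes independent failures and a made-up 5% annual rate.
p = 0.05                      # chance one server fails during the year

one_server_outage = p                         # everything is down
two_server_any_failure = 1 - (1 - p) ** 2     # at least one service degraded
two_server_total_outage = p ** 2              # both down at the same time

print(f"Single server, total outage: {one_server_outage:.2%}")        # 5.00%
print(f"Two servers, some failure:   {two_server_any_failure:.2%}")   # 9.75%
print(f"Two servers, total outage:   {two_server_total_outage:.2%}")  # 0.25%
```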
What having multiple servers in a non-redundant/non-HA configuration gives me is the ability to off-load some of the work to another server in the event of a failure. So, say my print server goes down. If I can map a few printers to the file server while I'm fixing the print server, the impact to operations is lessened. And that's where it really matters. We often tend to talk about hardware redundancy, but the hardware is only a tool for continuity of operations.
I work in a small shop (one-man IT department) and wouldn't swap my multiple servers for a single one under any circumstances. If any one of the servers goes down, I have the option of either adding the now-missing services to another machine or even just setting them up on a spare PC. We can live with an outage of an hour or two for most things, but we can't live with a complete outage of all systems. While I can replace any of our servers with a PC, at least temporarily, I don't have, nor can I readily get hold of, anything anywhere near powerful enough to replace all the servers at once.
Your original post hypothesizes that you can't afford a cluster, but you consider solutions with two servers (not including backups). That would imply that you most likely have three servers on hand, which is enough to start a cluster.
There are intermediate solutions that can avoid SPoFs and still be appropriate for small/medium-sized businesses: node-to-node replication without SAN storage.
This is supported, for example, by Proxmox (and I think it is also supported by XCP-ng/XenServer, and probably by ESXi).
Let's consider a three-node setup, all with RAID, redundant PSUs, and redundant networking.
Then there are two options: either run all the VMs on a single node, or split them between nodes A and B. In either case, each VM is replicated to another node.
This kind of setup can tolerate a network failure or a total, major failure of any one of the three nodes, with a downtime of about one minute (roughly the time a VM needs to boot up). The downside is the loss of data since the last replication (which, depending on your settings and hardware performance, can be as little as one minute or as much as a few hours).
With the second option (VMs normally split between nodes A and B), you have to prioritize which VMs are allowed to come back online: since your VM load is usually split between two servers, having all of them running on a single node might exhaust that node's RAM or congest its CPU.
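Here is a minimal sketch of that prioritization, assuming a simple list of VMs with their RAM needs and a priority; the names and numbers are hypothetical, and a real setup would encode these rules in whatever HA or automation tooling you use:

```python
# Hypothetical example: decide which VMs to restart on the surviving node
# without exhausting its RAM. Names, sizes and priorities are made up.

surviving_node_ram_gb = 64

# (name, ram_gb, priority) - lower priority number = more important
vms = [
    ("erp-db",     24, 1),
    ("mail",       16, 1),
    ("fileserver", 16, 2),
    ("intranet",    8, 3),
    ("test-env",   16, 9),
]

to_start, ram_left = [], surviving_node_ram_gb
for name, ram, _prio in sorted(vms, key=lambda vm: vm[2]):
    if ram <= ram_left:
        to_start.append(name)
        ram_left -= ram

print("Start:", to_start)          # ['erp-db', 'mail', 'fileserver', 'intranet']
print("Leave down for now:", [n for n, *_ in vms if n not in to_start])
```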
"While at first this seems like it would be more reliable, doesn't it simply increase the chance of hardware failure?"
It is never this simple; big beefy servers may be better or worse made. They may have higher-quality parts, but they may also produce more heat and not be cooled properly. A beefy server has more RAM, more CPUs, etc., so in the end you may have just as many CPUs in both scenarios, and maybe a server is not the right unit to think about.
Because the probabilities are so hard to pin down, I think whatever is most cost effective wins. If you have to pay for licenses, one big server may be cheaper than a few smaller servers, depending on the licensing structure.
My default approach is to avoid any centralized infrastructure. For example, this means no SAN, no Load Balancer. You can also call such a centralized approach "monolithic".
As a software architect, I'm working with the customer's infrastructure. That might mean using their own private data-center, or using something like AWS. So I don't usually have control over whether they use a SAN or not. But my software usually spans multiple customers, so I build it as if it will be run on individual machines in isolation on a network.
The Email Example
Email is weird, because it's a legacy system (that works). If email were invented today, it would probably use RESTful APIs on web servers, and the data would live in a database that could be replicated using normal tools (transactional replication, incremental backups).
The software-architecture solution is that a web application connects to one of a list of available nodes (at random), and if that node is unavailable it tries another node (at random). A client might get kicked off a server if it's too busy. Here there's no need for a load balancer in front of a web farm, and there's no need for a SAN for high availability. It's also possible to shard the database per department or geography.
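Here is a bare-bones sketch of that client-side failover, using only the Python standard library; the node URLs and the `fetch` helper are hypothetical:

```python
# Client-side failover sketch: try nodes in random order until one answers.
# Node URLs are hypothetical; a real client would also handle auth, retries
# with backoff, and "busy, go elsewhere" responses from overloaded nodes.
import random
import urllib.error
import urllib.request

NODES = [
    "https://node1.example.com",
    "https://node2.example.com",
    "https://node3.example.com",
]

def fetch(path: str, timeout: float = 2.0) -> bytes:
    """GET `path` from whichever node answers first, trying them in random order."""
    candidates = NODES[:]
    random.shuffle(candidates)
    last_error = None
    for node in candidates:
        try:
            with urllib.request.urlopen(node + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc          # node down or too busy; try the next one
    raise RuntimeError(f"all nodes unavailable: {last_error}")

# Usage: body = fetch("/api/mail/inbox")
```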
Commodity means...
So instead of having one or two expensive servers and a SAN with internal redundancy measures, you can use several commodity low-power, low-cost machines.
Simplicity - redundancy comes purely from the number of devices. You can easily verify your redundancy by counting machines, and you correctly assume each one has a higher chance of failure and prepare for that.
Redundancy percentage - if you have 2 servers and one fails, you have 1 left (50%). If you have 10 commodity servers and one fails, you have 9 left (90%); see the sketch after this list.
Inventory - a commodity device is readily available from any nearby shop for a great price.
Compatibility - with Fibre Channel and all the different standards for disk volume formats, a SAN can lock you in; commodity devices plus a software-level architecture mean you are not tied to a single device model or brand.
Performance - with 2 devices on a SAN, they need to be in the same room. With the commodity-machine approach, if you have 5 offices, you can have 2 machines in each office, with VPN WAN redundancy between offices. This means software and comms stay on the LAN with <1 ms access times.
Security - building on the high level of redundancy, you can rebuild nodes as a routine, commoditized process. Want to rebuild a monolithic 2-server cluster? Get out the manual. By rebuilding machines often (with automation) you keep software up to date and prevent any hacker or virus from gaining a foothold on your network.
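For the redundancy-percentage point above, a tiny sketch of the arithmetic (the fleet sizes are arbitrary examples):

```python
# Remaining capacity after one node fails, for a few fleet sizes (illustrative).
for total_nodes in (2, 5, 10):
    remaining = total_nodes - 1
    print(f"{total_nodes} nodes, 1 failure: "
          f"{remaining}/{total_nodes} = {remaining / total_nodes:.0%} capacity left")
```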
Note: You would still need redundant switches and gateway routers.