Question

Soviero

Asked: 2012-02-25 17:35:02 +0800 CST

Do Dual CPUs Provide Fault Tolerance?

772

Let's say I bought two Intel Xeons and installed them in server-class hardware. If one CPU failed, would the other still function and pick up the slack, thereby providing fault tolerance?

This does not seem very likely, but I figured I would ask instead of making any assumptions.

6 Answers

Mark Henderson · Answer 1 · 2012-02-25T17:39:52+08:00

Best Answer

In a normal dual-socket system, no, although there are servers that do permit hot-swapping of processors and RAM. So these things do exist, but they're at the very, very high-end of the market.

It's not really a big deal - of everything in your server that can fail, the processor is right at the bottom of the list, next to those little brass risers that hold the motherboard off the chassis.

29

aseq · Answer 2 · 2012-02-25T17:47:11+08:00

On x86 commodity hardware, if a CPU fails while the system is running, things will normally grind to a halt. However, the system will function fine after a reboot, albeit somewhat slower.

Multiple CPUs are mostly there for parallel processing, not really for fault tolerance. But it's nice to have a system that still boots should a CPU (or more) fail.

I would say it's a bit more likely that your CPU fails than Mark Henderson suggests, but it is still very unlikely. In my experience it mostly happens when the system frequently overheats and shuts itself down (which is quite easy in a badly air-conditioned office server room). CPUs don't tend to like that a lot.

Of course if you had a nice IBM mainframe or similar, hot swapping a CPU (board) is "easy" enough.

9

fluffy · Answer 3 · 2012-02-25T20:47:22+08:00

If a CPU were to fail - which is extremely unlikely, per the other answers - there is basically nothing that the system could do to recover. Depending on the way it fails, it could end up corrupting memory in strange ways, or destroying the process table, or who knows what else.

You could have some sort of active monitoring system that keeps tabs on the CPU to make sure it's working well (and is able to, say, roll back any changes made by the CPU during its death throes), but that would be yet another system that can fail. Determining software failure programmatically is also pretty dang difficult. Basically the only practical way to do it is to have another CPU do the exact same work at the exact same time and compare the results, which ends up slowing things down such that there's no point in having another CPU to begin with.
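The "do the same work twice and compare" idea can be sketched roughly as follows. This is only an illustration in software, not how any real lockstep hardware works; `run_in_lockstep`, `LockstepMismatch`, and the two replica callables are hypothetical names made up for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class LockstepMismatch(Exception):
    """Raised when the two replicas disagree on a result."""


def run_in_lockstep(replica_a: Callable[[], T], replica_b: Callable[[], T]) -> T:
    """Run the same computation on two (hypothetical) replicas and compare.

    With only two replicas a mismatch can be detected but not attributed:
    there is no way to tell which replica produced the wrong answer.
    """
    result_a = replica_a()
    result_b = replica_b()
    if result_a != result_b:
        raise LockstepMismatch(f"replicas disagree: {result_a!r} != {result_b!r}")
    return result_a
```

Note that both replicas spend their time on the same job, so the second one adds checking rather than throughput, which is the trade-off described above.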

That said, as rare as a CPU failure is, increasing the CPU count in a system will actually make your failure rate go up, as you now have twice as many things that can fail. You also have other subsystems that can fail, such as those which keep the CPUs' caches synchronized, and the increase in power consumption and thermal output also contributes to overall system failure (and of course, active cooling fans are another point of failure).
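As a rough back-of-the-envelope sketch of the "twice as many things that can fail" point: assuming each CPU fails independently with the same probability over some period (a simplifying assumption, and the 1% figure in the usage note is made up, not a measured rate), the chance that at least one fails is:

```python
def prob_any_failure(p_single: float, count: int) -> float:
    """Probability that at least one of `count` independent components fails.

    Assumes every component fails independently with probability `p_single`.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if count < 0:
        raise ValueError("count must be non-negative")
    # P(at least one fails) = 1 - P(none fail)
    return 1.0 - (1.0 - p_single) ** count
```

With a hypothetical 1% failure chance per CPU, `prob_any_failure(0.01, 1)` is 0.01 while `prob_any_failure(0.01, 2)` is about 0.0199, i.e. nearly double.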

Anders Sjöqvist · Answer 4 · 2012-02-25T23:34:20+08:00

You'll have to define exactly what kind of failures you want to handle. If we regard a collection of cores/CPUs/computers working together as a network, one type of failure is that a node simply stops answering. A much more severe failure is when a node starts to corrupt data and sends faulty information to the others. This is called a Byzantine failure, and in the worst case it's actively disrupting the operation of the network through strategic "lies". It's relatively easy to show that no system could handle a third or more of its nodes going Byzantine.

What you need to do is decide exactly what kind of failures you're expecting, design your system with that in mind, and accept the fact that the problem of handling an arbitrary number of malicious nodes is unsolvable. In your case, you need at least four CPUs to tolerate one of them being faulty in this way.

On a side note: In quantum physics there are no impossibilities, but if we have to wait longer than the age of the universe to statistically have a chance to observe a certain behavior, we don't have to say that it's possible. Keep that in mind when you design your system. ;)
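The "fewer than a third" limit and the "four CPUs for one faulty one" figure are the same rule seen from two sides. A minimal sketch of the arithmetic, assuming the classic bound for Byzantine agreement without signed messages (n ≥ 3f + 1); the function names are made up for illustration:

```python
def min_nodes_for_byzantine_faults(faulty: int) -> int:
    """Smallest node count that can tolerate `faulty` Byzantine nodes (3f + 1)."""
    if faulty < 0:
        raise ValueError("faulty must be non-negative")
    return 3 * faulty + 1


def max_tolerable_byzantine_faults(nodes: int) -> int:
    """Largest number of Byzantine nodes that `nodes` nodes can tolerate."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # Strictly fewer than a third of the nodes may be faulty.
    return (nodes - 1) // 3
```

So a dual-CPU box (`max_tolerable_byzantine_faults(2)`) tolerates zero such faults, and `min_nodes_for_byzantine_faults(1)` gives the four nodes mentioned above.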

ewwhite · Answer 5 · 2012-02-25T17:47:07+08:00

CPU failure is mighty rare. A failure would probably result in other problems at the OS level. I would not think of this as any form of fault tolerance.

2

Coré · Answer 6 · 2012-02-25T23:50:38+08:00

As the other answers say, it is very rare for a CPU to fail, and on the average server you can't hot-swap one. What you can probably do is run the server with one CPU until the failed one is replaced. Of course, this procedure is done entirely offline, so you need to shut the server down.

1
