The company I work for just bought 3 PowerEdge 2970 servers and they all have the same problem.
- Is this server worth buying, or do the problems that come with it make it not worth it?
- Are there a lot of issues with using AMD processors (it's an Opteron)?
- Are you guys able to pinpoint the problem if I give details on which errors I get in the event logs?
Here is the problem:
1. Power on the server. It boots up to the Red Hat splash screen.
2. In the middle of the boot the server crashes with the following errors:
-CPU Machine Chk: processor sensor, transition to non-recoverable was asserted
-PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 1 FUNC 0)
Then I tried updating the BIOS and the BMC, but the problem was still there. After that I tried to update the OS (it had Red Hat Enterprise 5.1) to Red Hat 5.3. There was something odd there too. I booted the server with the Build and Update Utility, then selected Install OS. I selected Red Hat Enterprise 5.3 x86_64. It asked me for the x86_64 media, so I put in the disc labeled "supplementary disc 1 of 1 for 64-bit AMD64 and Intel 64". It said wrong disc. So then I used the disc labeled "installation disc 1 of 1 for 64-bit Intel Itanium". My guess is that's the disc I needed to use all along.
After this the system was able to boot up to the command-line login screen. I logged in and typed startx to get into the GUI environment. At that point less than a page of text scrolled by quickly and the server crashed without showing anything GUI related.
At that point I had 2 different errors (notice the device is 4 now; I'm going to check which device it is):
-PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 4 FUNC 0)
-PCI System Error: critical event sensor, PCI SERR (BUS 0 DEVICE 4 FUNC 0)
So today the tech guy came with a bunch of parts and basically rebuilt the server on site (PCI riser, motherboard, DIMMs, a SAS card and something else I can't remember off the top of my head), but after that the problems were even worse. Some of the errors were (mind you, at that point he was putting back some of the original parts, so things got messy):
ECC uncorr Err: memory sensor, uncorrectable ECC (DIMM1 DIMM2) was asserted.
E1231 1.2V HT core power GD
E1911 <3 ERRORS check log
E1000 failsafe
Tomorrow he is coming back with a power supply...
UPDATE: Seems like I can't waste any more time on this. We are calling the sales people and asking for new servers.
I have run into similar problems with Dell of late. The tech support doesn't seem to be able to directly associate the errors with the failed part. A lot of the time they just send out what I like to call "The I Have No Idea What's Wrong Parts Pack". It usually consists of a system board, PCI riser, replacement memory and sometimes a replacement CPU and RAID controller.
One thing they often forget to replace is the riser for the integrated PERC card. And I have seen that be the issue a few times.
Anyway, as I commented before, unless you are in a real rush to deploy these servers I would contact Dell customer care and demand that all three servers be replaced or refunded.
I've seen this with bad RAID cards before. I would suggest:
1) pulling all the cards you can and seeing if it can boot, and more importantly:
2) CALL DELL. Their enterprise tech support is really good, and honestly it sounds like you have a hardware error.
As far as your questions...
1) That's completely subjective
2) Opterons should be just as reliable as an Intel part
3) You'll need to ask the question first
As for the problem you posted, I'd start by running Memtest on it if you want to troubleshoot (it sounds like a memory error message - the PCI bus & device numbers should tell you specifically though). On the other hand, I'd simply insist that the support reps fix the problem with the servers they sold you.
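If it helps with matching the bus and device numbers to actual hardware, here is a rough Python sketch that turns the "(BUS b DEVICE d FUNC f)" part of those event-log lines into the bus:device.function address that lspci prints. The function name and regex are my own illustration, and it assumes the numbers in the log are decimal, which I have not confirmed for this BMC (for single digits like the ones posted it makes no difference).

```python
import re

# Matches the "(BUS b DEVICE d FUNC f)" suffix from the posted event-log lines.
# Assumption: the numbers in the log are decimal.
BDF_PATTERN = re.compile(
    r"BUS\s*(\d+)\s*DEVICE\s*(\d+)\s*FUNC\s*(\d+)", re.IGNORECASE
)


def bdf_from_log_line(line):
    """Return an lspci-style 'bb:dd.f' address, or None if the line has none."""
    match = BDF_PATTERN.search(line)
    if match is None:
        return None
    bus, device, func = (int(group) for group in match.groups())
    # lspci shows bus and device as two hex digits and the function as one.
    return f"{bus:02x}:{device:02x}.{func:x}"


if __name__ == "__main__":
    entries = [
        "PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 1 FUNC 0)",
        "PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 4 FUNC 0)",
    ]
    for entry in entries:
        address = bdf_from_log_line(entry)
        # Print the lspci command to run on the server for this address.
        print(f"lspci -v -s {address}")
```

Running the printed lspci command on the server (if you can get it booted far enough) should show which device or bridge sits at that address.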
It's good to rule out the OS first. Try installing Windows Server. Windows has the widest driver support. If Windows can't even install, then you know there is almost certainly some hardware fault. If you don't have a copy of that, then Ubuntu Server works well on most hardware as far as I know.
We had a server that refused to install one very common Linux distro. As soon as I put Ubuntu Server on it, it worked first time. Perhaps at some point Red Hat was on there and working, but a kernel update was unsuccessful?
You might also want to try setting the BIOS to defaults. Also try reinitializing the RAID drives and setting that back up again.
I'll second the suggestion to test a different OS, but what I would really be doing at this point in the exercise is yelling down the phone at my sales rep about how I want those servers replaced now. You've just bought them, they're brand new, so they should be covered by the standard sales warranty that Dell are legally obliged by consumer law to have, irrespective of the maintenance/support plan you've chosen.
It looks to me as though you're being given something of a run-around here, and I think you've put up with enough. It's time to get known-good equipment in.