Background: I am forced to remotely upgrade a server from Ubuntu 8.04 LTS to 10.04 LTS due to an incompatibility issue with the RAID controller.
The internet connection to the server is reasonably stable and seldom drops. Even so, I am concerned about losing the SSH connection during the upgrade and leaving the server in an unreachable state. I am also worried that the server might not boot after the upgrade, in which case I would have no way of knowing what the problem is.
Action plan: What I am looking for is advice on minimizing the risk of losing the server; I am aware that what I am doing is very risky. This is my current action plan:
1) Backup everything that matters, locally and externally.
2) Temporarily disable the boot-time disk checks (fsck). (I will have no clue what is going on if a disk check takes a long time to finish.) This would be done through fstab by changing the very last field, the fsck pass number, from 1 to 0, so the root entry ends up as below (a blkid/tune2fs sketch follows the list):
UUID=5b1ff964-7608-44fd-a38d-7e43ad6b4c11 / ext3 relatime,errors=remount-ro 0 0
3) Starting all upgrade processes inside screen so that they can be resumed if I lose the connection (a fuller screen workflow is sketched below), e.g.:
sudo screen apt-get upgrade
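A fuller sketch of how I would use screen, with a named session so it is easy to find again; this is a slight variation on the command above, running screen as my own user and sudo inside it:
# Start a named session and run the upgrade inside it
screen -S upgrade
sudo apt-get update && sudo apt-get dist-upgrade
# If the SSH connection drops, log back in, list sessions and reattach
# (screen -dr forces a detach first if the session is still marked attached)
screen -ls
screen -r upgrade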
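And for item 2, before touching fstab I would double-check which device that UUID actually maps to; as an alternative (just a sketch, /dev/sda1 is a placeholder for my root device), tune2fs can switch off the periodic mount-count/interval checks without editing fstab at all:
# Confirm which block device carries the root filesystem UUID
sudo blkid | grep 5b1ff964-7608-44fd-a38d-7e43ad6b4c11
# Alternative: disable the mount-count and interval based ext3 checks
# (0 = never force a check); can be re-enabled later with sensible values
sudo tune2fs -c 0 -i 0 /dev/sda1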
Questions:
- Does my proposed action plan seem reasonable?
- Is disabling the boot-time disk check a bad idea?
- What else could be done to decrease the risk of losing the server?
Update: Almost all answers suggested that I set up DRAC/IPMI, which I have now done. This feels like a really great achievement that will surely make the risk much, much smaller, as I can follow the entire power cycle over KVM/console redirection. For future reference, this is what I did:
1) Installed ipmitool to set up the IP address, gateway, etc. for IPMI v2.0 (a verification sketch follows this list):
sudo ipmitool lan set 1 ipaddr 192.168.1.99
sudo ipmitool lan set 1 defgw ipaddr 192.168.1.1
2) Installed FreeIPMI to change the NIC selection mode to shared (I have only one network interface connected to the network):
sudo ipmi-oem dell set-nic-selection shared
3) Used the DRAC's HTTPS interface on https://192.168.1.99 to launch the console redirection viewer. This lets me follow the entire boot sequence as well as configure the BIOS, RAID controllers, etc. Awesome.
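To double-check steps 1 and 2, a quick verification sketch I can run on the server (the get-nic-selection sub-command is an assumption about my FreeIPMI version; ipmitool lan print is standard):
# Show the LAN configuration of channel 1 (IP, netmask, gateway, access modes)
sudo ipmitool lan print 1
# Confirm the NIC selection mode the DRAC reports
sudo ipmi-oem dell get-nic-selection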
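The part that really lowers the risk is being able to reach the BMC from outside the server itself, so from another machine I can now query (and if necessary cycle) the power over IPMI; the IP and user below are just my placeholders:
# Check and control power remotely via IPMI-over-LAN (prompts for the password)
ipmitool -I lanplus -H 192.168.1.99 -U root chassis power status
ipmitool -I lanplus -H 192.168.1.99 -U root chassis power cycle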
Update 2: Done. Everything went like a charm, and the whole job took less than 30 minutes. I ended up not turning off the disk check, since the redirected console gave me the freedom to interrupt it whenever I wanted, but I let it run to the end.
Thank you guys, your wisdom is invaluable!
If the hardware does not break, there isn't anything you can't do with a serial console, so that's the way to go.
Also, install the new system on another disk or partition if at all possible, so you can test the new system before erasing the old one. I usually do that on two-disk systems: I take one disk out of the mirror, create a new (degraded) mirror with the freed disk, install there, and if everything is OK I destroy the old mirror, hot-add the 'old' disk to the new mirror and let it rebuild.
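As a rough sketch of that procedure with Linux software RAID, assuming a two-disk RAID 1 at /dev/md0 built from /dev/sda1 and /dev/sdb1 (made-up device names, adapt to your layout):
# Drop one disk out of the existing mirror
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
# Build a new, degraded mirror on the freed disk and install the new system on it
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 missing
# Once the new system checks out: stop the old array, wipe the old
# RAID superblock and hot-add the disk to the new mirror to resync
mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sda1
mdadm /dev/md1 --add /dev/sda1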
EDIT: I read it's a Dell R710; AFAIK that should have IPMI 2.0. Configure it by running ipmitool locally on the system, then test the serial-over-LAN feature with ipmitool sol activate from another system. Bang! You have your serial console. Dells can also redirect the BIOS to the serial console (which IPMI will in turn redirect over serial-over-LAN). You should do that anyway so you can get at the system if anything goes really wrong. I manage a couple of old Dell PE1425s over null modem cables with BIOS, GRUB and system serial consoles, and a couple of Dell R300s the same way but using IPMI serial-over-LAN in place of the actual serial cable.
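A minimal sketch of the serial-over-LAN part, assuming the BMC's LAN side is already configured and the host/user below are placeholders:
# On the server: make sure SOL is enabled on LAN channel 1
sudo ipmitool sol set enabled true 1
# From another machine: attach to the redirected serial console
ipmitool -I lanplus -H 192.168.1.99 -U root sol activate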
Personally, depending on how important this server is to you (your business, etc.), I'd get my hands on a similar system, reproduce the environment, and then upgrade it via SSH while it is in the room (or otherwise physically accessible to you) so you can test your procedure. If you can upgrade that without losing your configuration/connection, you stand a pretty good chance of being able to upgrade the remote server.
This won't be 100% exact, but it should at least eliminate errors caused by software upgrades, software configuration, alterations and the like, as long as you configure the test system as closely to your remote server as possible.
EDIT: Another solution is to set up a second server as a failover first. That way, if the server dies you still have a backup for customers/users until the primary server comes back up. This should alleviate some of the butterflies you're experiencing from having your only server so far away. Again, this may be overkill in many circumstances; how much you're willing to spend on keeping the service available in the event of total failure depends on how important this server is to your company and on the impact downtime would have.
I think that out-of-band management (I'm most familiar with HP's iLO), or even an IP KVM, would be your best bet.
As Bart mentioned, Testing is invaluable if you have the resources (read: a spare similar box or fellow cluster member).
Finally (or first, actually): backups. Tested backups. Backups you can be proud of...
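On the backup point, one way to get an off-box copy you can actually test is a plain rsync over SSH; a minimal sketch, where backuphost and the target path are placeholders and the pseudo-filesystems are excluded:
# Push a copy of the filesystem to another host over SSH
sudo rsync -a --numeric-ids --delete \
    --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/tmp \
    / user@backuphost:/backups/server/
Whatever the method, restore a few files (or the whole tree onto a spare box) to prove the backup is usable before starting the upgrade.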