Please note: The answers and comments to this question contain content from another, similar question that received a lot of attention from outside media but turned out to be a hoax question in some kind of viral marketing scheme. As we don't allow ServerFault to be abused in such a way, the original question has been deleted and the answers merged into this question.
Here's an entertaining tragedy. This morning I was doing a bit of maintenance on my production server, when I mistakenly executed the following command:
sudo rm -rf --no-preserve-root /mnt/hetznerbackup /
I didn't spot the last space before the /, and a few seconds later, when warnings were flooding my command line, I realised that I had just hit the self-destruct button. Here's a bit of what burned into my eyes:
rm: cannot remove `/mnt/hetznerbackup': Is a directory
rm: cannot remove `/sys/fs/ecryptfs/version': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/inode_readahead_blks': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/mb_max_to_scan': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/delayed_allocation_blocks': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/max_writeback_mb_bump': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/mb_stream_req': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/mb_min_to_scan': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/mb_stats': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/trigger_fs_error': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/session_write_kbytes': Operation not permitted
rm: cannot remove `/sys/fs/ext4/md2/lifetime_write_kbytes': Operation not permitted
# and so on..
I stopped the task and was relieved when I discovered that the production service was still running. Sadly, the server no longer accepts my public key or password for any user via SSH.
How would you move forward from here? I'll swim an ocean of barbed wire to get that SSH-access back.
The server is running Ubuntu 12.04 and is hosted at Hetzner.
Fact is, at this point there's no simple or easy automatic fix for this. Data recovery is a science, and even the basic, common tools need someone to sit down and ensure the data is there. If you're expecting to recover from this without massive amounts of downtime, you're going to be disappointed.
I'd suggest using testdisk or some filesystem-specific recovery tool. Try one system, see if it works, and so on. There's no real way to automate the process, but you can probably do it carefully in batches.
That said, there are a few very scary things in the question and comments that ought to be part of your after-action report.
Firstly, you ran the command everywhere without checking it first. Run a command on one box, then a few, then more. Basically, if something goes wrong, it's better to have it affect a few systems rather than all of them.
Secondly, the way your backups were set up scares me. File-level, one-way backups are a solved problem: rsync can preserve permissions and copy files one way to a backup site. Accidentally delete something? Reinstall (preferably automatically), rsync back, and things work. In future you might use filesystem-level snapshots with btrfs or zfs and ship those for system-level backups. I'd also toy with separating application servers, databases and storage, and introducing the principle of least privilege, so you'd split up the risk of something like this.
After something has happened is the worst time to consider this.
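For illustration only (the host name and paths below are made up, not from the question), a one-way, pull-style rsync backup can be as simple as a job run on the backup host, so the production box never holds credentials to touch its own backups:

# Run on the backup host: pull from production, never push from it.
# Host name and destination path are placeholders.
rsync -aAX \
  --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/run \
  --exclude=/tmp --exclude=/mnt \
  root@production.example.com:/ /srv/backups/production/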
What can we learn from this?
Never run a command everywhere at once. Separate out test and production machines, and preferably do production machines in stages. It's better to have to fix 1 or 10 machines rather than 100 or 1000.
Double- and triple-check commands. There's no shame in asking a co-worker to double-check: "Hey, I'm about to dd a drive, could you sanity-check this so I don't end up wiping the wrong one?" A wrapper might help as well (a sketch follows), but nothing beats a less tired set of eyes.
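As for what such a wrapper could look like, here's a minimal sketch; the script name and wording are my own, and it only covers interactive use:

#!/bin/sh
# careful-rm: hypothetical wrapper that shows the exact rm invocation
# and demands an explicit "yes" before handing off to the real rm.
printf 'About to run: rm %s\nType yes to continue: ' "$*"
read -r answer
if [ "$answer" = "yes" ]; then
    exec /bin/rm "$@"
fi
echo "Aborted."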
What can you do now? Get an email out to customers. Let them know there's downtime and a catastrophic failure. Talk to your higher-ups, legal, sales and so on, and see how you can mitigate the damage. Start planning for recovery: at best you're going to have to hire extra hands, and at worst plan on spending a lot of money on recovery. At this stage you're going to be working on mitigating the fallout as well as on the technical fixes.
Boot into the rescue system provided by Hetzner and check what damage you have done.
Transfer any files out to a safe location and redeploy the server afterwards.
I'm afraid that is the best solution in your case.
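A rough sketch of what that might look like from the rescue system; the device name is only a guess based on the md2 paths in the error output, so verify with lsblk or mdadm before mounting anything:

# Inside the Hetzner rescue system (device names are assumptions -- check first)
lsblk
mkdir /mnt/oldroot
mount /dev/md2 /mnt/oldroot
# copy whatever survived to a machine you trust (destination is a placeholder)
rsync -aAX /mnt/oldroot/ user@safe-host.example.com:/srv/salvage/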
When you delete stuff with rm -rf --no-preserve-root, it's nigh impossible to recover. It's very likely you've lost all the important files. As @faker said in his answer, the best course of action is to transfer the files to a safe location and redeploy the server afterwards.
To avoid similar situations in future, I'd suggest you:
Take backups weekly, or at least fortnightly. This would help you get the affected service back up with the least possible MTTR.
Don't work as root when it's not needed, and always think twice before doing anything. I'd suggest you also install safe-rm (a sample blacklist sketch follows this list).
Don't type options that you don't intend to invoke, such as --no-preserve-root or --permission-to-kill-kittens-explicitly-granted, for that matter.
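As far as I recall, safe-rm works off a blacklist of protected paths; a system-wide config along these lines is a reasonable sketch, but treat the exact file location as an assumption and check the safe-rm documentation:

# /etc/safe-rm.conf (location as I remember it -- verify on your system)
# One protected path per line; safe-rm refuses to delete anything listed here.
/
/etc
/usr
/var
/mnt/hetznerbackup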
I've had the same issue, though only while testing with a hard drive, and I lost everything. I don't know if it'll be useful, but don't install anything and don't overwrite your data; you need to mount your hard drives and launch some forensics tools such as Autopsy, PhotoRec or TestDisk.
I strongly recommend TestDisk; with some basic commands you can recover your data if you didn't overwrite it.
The best way to fix a problem like this is to not have it in the first place.
Do not manually enter an "rm -rf" command that has a slash in the argument list. (Putting such commands in a shell script with really good validation/sanity routines to protect you from doing something stupid is different.)
Just don't do it.
Ever. If you think you need to do it, you aren't thinking hard enough.
Instead, change your working directory to the parent of the directory from which you intend to start the removal, so that the target of the rm command does not require a slash:
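For example (the paths here are made up for illustration):

# Risky: with a slash in the argument, one stray space can split it into
#   rm -rf /home/someuser/foo /      <- which also targets the root directory
# Safer: change to the parent directory and name the target with no slash
cd /home/someuser
rm -rf foo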
I would try to recover the backup machine, where all copies were stored:

Make an image of the whole disk with the dd command.
Try to recover files from that image with testdisk.

So let's say you want to recover 1 TB: you will need an extra 2 TB, 1 TB for the backup image (1st step) plus 1 TB for the recovery (2nd step).
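A sketch of those two steps; the device name and mount points are assumptions, not your actual layout:

# 1st step: image the damaged 1 TB disk onto spare storage (device is a placeholder)
dd if=/dev/sdX of=/mnt/spare/backup.img bs=4M conv=noerror,sync
# 2nd step: point testdisk at the image and recover into the remaining free space
testdisk /mnt/spare/backup.img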
I made a similar mistake with an aliased rm -fr [phone rang] and a cd into a precious directory. Now I always think twice and re-check a couple of times before I use an rm or dd command.
As mentioned in another answer, Hetzner has a rescue system. It includes both a netboot option with SSH access and a Java applet that gives you screen and keyboard on your vserver.
If you want to recover as much as possible, reboot the server into the netboot system, then log in and download an image of the filesystem by reading from the appropriate device node.
I think something like this should work:
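The command itself appears to have been lost in editing; from the description below it was presumably something along these lines (the host name is a placeholder):

# Run from your own machine; the rescue system reads the disk and the shell
# redirects the stream into a local file called server.img
ssh root@your-server.example.com 'dd if=/dev/sda bs=4M' > server.img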
Of course the redirection is done by the shell before the ssh command is invoked, so server.img is a local file. If you want just the root file system and not the full disk, replace sda with sda3, assuming you are using the same image as me.

I would swear off using rm for the rest of my life, and I think it's madness that trash-cli isn't the default removal command on *nix systems. https://github.com/andreafrancia/trash-cli
I would make sure it is the first thing I install on a brand-new system, and I would alias rm to something that tells people to use trash-cli instead. It would also include a note about another alias that actually runs /bin/rm but tells them to avoid using it in most cases.

:( True story
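A sketch of what those aliases might look like; the alias names and wording are my own invention, while trash-put is the command trash-cli itself provides:

# Hypothetical ~/.bashrc snippet
alias rm='echo "rm is disabled here; use trash-put, or realrm if you really mean it." #'
alias realrm='/bin/rm'

The trailing # in the first alias swallows whatever arguments were typed after rm, so nothing gets deleted by accident.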
What I would advise in such a case is to unmount the filesystem and use debugfs: with the help of lsdel you can list all recently removed files which have not yet been cleaned out of the journal, and then dump the needed files. A quick search turns up a howto: http://www.linuxvoodoo.com/resources/howtos/debugfs

Hope it will help someone. ;)
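A minimal sketch of that workflow; the device name is an assumption, and the filesystem must be unmounted (or approached from the rescue system) first:

debugfs /dev/md2
# inside the debugfs prompt:
#   lsdel                          list inodes of recently deleted files
#   dump <123456> /tmp/recovered   dump one of those inodes into a local file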
And yes, one of the suggestions is to make a script which moves the real rm to real.rm and symlinks mv to rm. ;)