Yesterday I had a problem with a server (S1) that lost its connection. That server shares a directory over NFS to another server (S2). It is not a home directory and not in $PATH, just a directory for storing old files for archiving. S1 came back online after a few hours, but now I cannot log in to S2, and I'm sure the stale NFS mount is the cause because all other services are running without any problem. The ssh connection hangs here: debug1: Entering interactive session. I know a reboot will do the job, but this is the NAS of a big app, and my bosses will kill me if I do it. Is there any other way to get past this? I tried different users, but all of them hang in the same place. I also connected through HP iLO, and even there I cannot log in with my username.
Thanks in advance.
(You don't perchance have automounted directories on S2, do you?)
Try using ssh without an interactive session:
The "-vv" makes ssh print extra debugging output (it can't hurt), and the "-t" tells it to allocate a TTY even though it's running a command instead of starting an interactive shell. The command, env, sets a bunch of MAIL* environment variables to nothing, which can be useful to know about if you have mail-on-NFS, and then, finally, launches a simple shell.
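The command itself did not survive in this copy of the thread. A sketch consistent with the description above might look like the following. The user@S2 target and the exact list of MAIL* variables (MAIL, MAILPATH, MAILCHECK) are assumptions, not the original author's text. The script only prints the command so it can be reviewed before anyone runs it.

```shell
#!/bin/sh
# Hypothetical target; replace with your own user and host.
TARGET="${TARGET:-user@S2}"

# Remote side: env blanks the mail variables (assumed list), then /bin/sh
# starts a plain shell instead of the user's normal login shell.
REMOTE_CMD="env MAIL= MAILPATH= MAILCHECK= /bin/sh"

# -vv prints extra debug output; -t forces a TTY for the remote command.
FULL_CMD="ssh -vv -t $TARGET $REMOTE_CMD"

# Dry run: show the command rather than connecting.
echo "$FULL_CMD"
```

Copy the printed line into a terminal to actually attempt the connection.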
Alternately, try HOME=/ /bin/su - instead of the sh, if appropriate.
If you do get in, definitely try to unmount the NFS mounts. If it fails (likely), try it with -f. If that fails (still likely), Linux has a -l option to do a lazy unmount: it'll detach the mountpoint from the filesystem tree, which should make any new processes responsive. Any existing processes will still be hung, though, and there's no way around that except a reboot.
If I read your report correctly: a user is trying to ssh to user@S2; S2 mounts a filesystem from S1; S1 earlier had a problem that caused an NFS error on S2; and the filesystem mounted on S2 is NOT a home directory.
Are you using an automounter? Is this Linux or some other flavor of UNIX?
This type of problem makes sense if the missing NFS mount is a home directory or somehow accessed during the user's login process - the login process attempts to access that directory and it gets caught in disk wait. As the authentication is succeeding, it pretty much has to be one of these issues.
So you are 1000% sure that the user's home directory is not on NFS? If it is not, you should be able to read the user's dot files on S2 by logging into the system as root, and check them for any place where they touch the problematic NFS filesystem.
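As a rough sketch of that dot-file check, a small helper can grep the usual login files for the mount path. The function name scan_dotfiles, the list of dot files, and the example paths are all hypothetical; adjust them for the shells actually in use on S2.

```shell
#!/bin/sh
# scan_dotfiles HOME_DIR MOUNT_PATH
# Prints file:line:text for every login dot file that mentions MOUNT_PATH.
# The list of dot files below is an assumption; extend it as needed.
scan_dotfiles() {
    home_dir=$1
    mount_path=$2
    for f in .profile .bash_profile .bash_login .bashrc .login .cshrc; do
        if [ -f "$home_dir/$f" ]; then
            # -F: fixed string, -n: line numbers.
            # The extra /dev/null argument makes grep print the file name.
            grep -n -F -- "$mount_path" "$home_dir/$f" /dev/null || true
        fi
    done
    return 0
}

# Example call with hypothetical values:
# scan_dotfiles /home/someuser /mnt/archive
```

Any line it prints is a candidate for what the login is blocking on. System-wide files such as /etc/profile are worth a look as well.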
You should be able to verify this by logging into the system as root (via the iLO console if nothing else) and running: ps auxww | grep D. Processes stuck in disk wait show a D in the STAT column.
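Since a plain grep for D also matches any line that merely contains a capital D, a narrower filter on the state column can cut the noise. The following is only a sketch: it assumes a ps that accepts -eo pid,stat,args (as procps on Linux does), and filter_dstate is a hypothetical helper name.

```shell
#!/bin/sh
# Keep the header line plus rows whose STAT column (2nd field) starts with D,
# i.e. processes in uninterruptible sleep.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Column selection is an assumption about the local ps implementation.
ps -eo pid,stat,args | filter_dstate
```

If login shells or their children show up in this list, that supports the theory that something in the login path is touching the dead mount.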
You can get onto the system as root, correct? Or is there something I don't understand?
Forcing an unmount and then restarting the NFS processes on S2 and then remounting should fix this, though you may have a bunch of stuck processes that won't go away until reboot.
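The unmount escalation mentioned earlier in the thread (plain umount, then -f, then the Linux-only lazy -l) could be sketched roughly as below. The /mnt/archive mount point and the DRY_RUN switch are hypothetical. By default the script only prints what it would do, and a real run needs root.

```shell
#!/bin/sh
# Hypothetical mount point; replace with the stale NFS mount on S2.
MOUNTPOINT="${MOUNTPOINT:-/mnt/archive}"
# Set DRY_RUN=0 to really call umount (requires root).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
        return 1   # report failure so the dry run shows every step
    fi
    "$@"
}

escalate_umount() {
    run umount "$1" && return 0      # plain unmount
    run umount -f "$1" && return 0   # forced unmount
    run umount -l "$1" && return 0   # lazy unmount (Linux only)
    return 1
}

escalate_umount "$MOUNTPOINT" || echo "no unmount succeeded (or dry run)"
```

Restarting the NFS client services and remounting afterwards depends on the distribution, so those steps are left out here.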