For the last few days, it has suddenly been taking a very long time (up to 30 seconds) to establish an SSH connection to most, but not all, of my Amazon EC2 instances. I've raised the issue with Amazon to see whether it's environmental on their side, but I wonder if there's anything I can check on the instances themselves.
Most of that time is spent at the step:
Authenticating with public key "imported-openssh-key"
Once on the instance, changing users via
su - newusername
hangs indefinitely.
Other commands (ps, top, find) run fast as ever.
My application running on the instances (a web service) is very responsive. CPU, I/O and disk load on the instances are not high.
EDIT:
Last few lines of output from strace su - myusername as suggested by Dave:
connect(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("W.X.Y.Z")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll(
This output keeps recurring at 10-second intervals: it hangs at poll( for 10 seconds, then repeats the same output.
The IP address referenced is the public IP address of our LDAP server.
The problem turned out to be that the instances were trying to reach the LDAP server via its public IP address rather than its private one. Allowing the other instances to connect to the public IP resolved the issue.
Usually when weird hangs happen, I can trace it back to RDNS - either the host you are connecting from doesn't have it set up or there is an issue with the server resolving RDNS.
However su shouldn't be doing anything with RDNS as far as I know.
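If you want to rule RDNS in or out quickly, you can time a reverse lookup on the server. This is only a rough sketch in Python; the function name time_reverse_lookup is made up for illustration, and you would pass the address of the host you connect from.

```python
import socket
import time


def time_reverse_lookup(ip):
    """Return (hostname or None, seconds taken) for a reverse DNS lookup of ip."""
    start = time.monotonic()
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError,
        # so this covers a missing PTR record or a resolver failure.
        hostname = None
    return hostname, time.monotonic() - start


# Example (the address is a placeholder, not from the question):
# name, seconds = time_reverse_lookup("127.0.0.1")
# print(f"{name!r} resolved in {seconds:.2f}s")
```

A lookup that takes several seconds, or fails only after a long wait, would point towards RDNS; a fast answer suggests looking elsewhere.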
What happens when you trace the process with 'strace'?
EDIT:
So it seems like the connection to the LDAP server is timing out. Have you confirmed the LDAP server is working with other systems? Maybe you can trace the traffic on the LDAP server to get an idea of what happens when it is connecting.
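You can also check from an affected instance whether the LDAP port is reachable at all by attempting a plain TCP connection with a short timeout. A minimal sketch in Python, assuming the server listens on port 389 as in the strace output; check_tcp is a hypothetical helper name, and the 5-second timeout is an arbitrary choice.

```python
import socket
import time


def check_tcp(host, port=389, timeout=5.0):
    """Try a TCP connection; return (reachable, seconds taken)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers timeouts (packets silently dropped, for example by a
        # firewall rule) as well as refusals and unreachable hosts.
        reachable = False
    return reachable, time.monotonic() - start


# Example: run once per address of the LDAP server and compare.
# print(check_tcp("W.X.Y.Z"))
```

If the call fails only after the full timeout has elapsed, the packets are most likely being dropped somewhere along the way, which would match the connect() followed by a hanging poll( in the strace output. An immediate failure usually means the host answered but nothing is listening on that port.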