We have a hosted dedicated server running CentOS 6.5 that is the production server for our website. I also have a local server, repurposed as a backup server and also running CentOS 6.5, whose only job is to store backup files. Both servers have all current updates installed.
On the production server, I have backup scripts all scheduled to run via cron to both create the backup files and to rsync them to the backup server. The scripts all execute on time, but rsync fails due to an SSH timeout while trying to contact the backup server.
Here's where I am stuck. I can use both PuTTY and WinSCP to log in to the backup server via SSH even while the scripts are timing out. As soon as I restart the sshd service on the backup server, the scripts on the production server run without a hitch (both from the command line and from cron).
It's as if the backup server stops listening for the production server after a certain amount of time.
Some additional details before they are asked:
- Backup server firewall permits all connections from the production server
- Scripts work fine from both command line and cron so long as the sshd service on the backup server has been restarted (i.e. not a script issue)
- SSH uses public key authentication to validate the connection
- I cannot find any errors in the SSH logs for the backup server. Again, it's like it simply stops listening for the production server even though I can connect from a separate machine.
I really need some assistance on what to look for. I could set up a script to restart the sshd service on the backup server right before the production server runs its backup scripts, but that feels too much like a hack rather than a fix. Any assistance would be greatly appreciated.
Edit
Sample script requested. This backs up the databases and rsyncs them along with an rsync of the entire website directory:
#!/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
# -----------------
# NIGHTLY BACKUP SCRIPT
# -----------------
# --Set log file and capture parameters
exec &> /path/logfile.log
#
# --Set Current Date Time
now=$(date +"%Y-%m-%d")
#
# --Backup Database 1
/usr/bin/mysqldump -u USER -pPASSWORD DATABASE1 | /bin/gzip > /path/database1-$now.sql.gz
#
# --Backup Database 2
/usr/bin/mysqldump -u USER -pPASSWORD DATABASE2 | /bin/gzip > /path/database2-$now.sql.gz
#
# --Sync Database Backups to Remote Server
/usr/bin/rsync -avz -e "ssh -v -p # -i /path/key" /path USER@IP:/path
#
# --Sync all Website Files to Remote Server
/usr/bin/rsync -avz --delete -e "ssh -v -p # -i /path/key" /path USER@IP:/path
Edit 2
Log output requested. Below is the log output for the rsync line from the script above labeled "Sync Database Backups to Remote Server":
OpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to IP [IP] port #.
debug1: connect to address IP port #: Connection timed out
ssh: connect to host IP port #: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
It was also requested that I run the following command: nc -v IP PORT
However, the results were virtually the same as the log:
nc: connect to IP port # (tcp) failed: Connection timed out
After restarting the sshd service on the backup server and re-running the 'nc' command, I got the following:
Connection to IP # port [tcp/fpo-fns] succeeded!
SSH-2.0-OpenSSH_5.3
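For anyone who wants to repeat the 'nc' check from a script, here is a minimal sketch that does roughly the same thing using only bash. The function name probe_port is hypothetical, and it assumes bash's /dev/tcp support and the coreutils timeout command are available on the production server. It only tests whether the TCP connection opens; unlike nc, it does not print the SSH banner.

```shell
#!/bin/bash
# probe_port HOST PORT [SECONDS]  (hypothetical helper, not part of the original scripts)
# Returns 0 if a TCP connection to HOST:PORT opens within SECONDS (default 5).
# Returns 124 if timeout(1) had to kill the attempt, i.e. the connect hung,
# and another non-zero status if the connection was refused or failed.
probe_port() {
    local host=$1 port=$2 wait=${3:-5}
    # The inner bash opens the socket on fd 3 through /dev/tcp; host and port
    # are passed as $0 and $1 so they are never spliced into the command text.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (IP and PORT are placeholders, as in the rest of this post):
#   if probe_port IP PORT 5; then echo "port open"; else echo "no connection (status $?)"; fi
```

A status of 124 would match the "Connection timed out" behaviour shown above, while a refused connection returns quickly with a different non-zero status, which helps tell a silently dropped connection apart from sshd simply not listening.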
As a test, I created a script that runs hourly and rsyncs the website directory from the production server to the backup server. I hoped this would at least show roughly when the backup server stopped accepting connections from the production server. Instead, the hourly script and all other scripts have run without issue since yesterday.
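This is not the actual hourly script, just a minimal sketch of how such a check could record when connections start failing. The name run_logged and the log path are hypothetical; the rsync arguments in the usage comment are the same placeholders as in the nightly script above.

```shell
#!/bin/bash
# run_logged LOGFILE COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND, appends one timestamped line with its exit status to LOGFILE,
# and returns that same exit status to the caller.
run_logged() {
    local logfile=$1
    shift
    local rc=0
    "$@" || rc=$?
    printf '%s rc=%s cmd=%s\n' "$(date +'%Y-%m-%d %H:%M:%S')" "$rc" "$*" >> "$logfile"
    return "$rc"
}

# Usage inside an hourly script (placeholders only, nothing runs here):
#   run_logged /path/hourly.log /usr/bin/rsync -az -e "ssh -p PORT -i /path/key" /path USER@IP:/path
# Matching crontab entry, assuming the script is saved as /path/hourly-check.sh:
#   0 * * * * /path/hourly-check.sh
```

With one line per run, the first entry showing rc=255 (the code rsync reported in the log above when ssh could not connect) would bracket the failure to within an hour.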
While I don't consider this an actual "fix", it appears to have resolved the issue, and I no longer have to kick off the backups manually by restarting the sshd service on the backup server and then running the scripts on the production server. If anyone has any insight into why this would fix the issue, please let me know in the comments, as I would still like to find the root cause.