I've just had a server outage at 4:59 AM on a Sunday morning, and looking through our uptime records going back to 2006, all but 4 of 20 outages occurred between 11 PM and 6 AM. (I'm only looking at unplanned downtime on web and database servers, not app servers on an internal LAN.)
Do others also find similar behaviour for their servers? Is this just a fluke?
Edit: It was the fact that so many outages (these are unplanned, not scheduled maintenance, and occurred on our hardware, not on the ISP's network) have occurred between 11 PM and 6 AM that made me wonder whether it's just us...
The servers are busiest in terms of visitors between 1 PM and about 10 PM, while database backups happen throughout the day and one big backup (where compression uses more of the CPU) runs around 4:30 every morning. But the outages have occurred at any time during this window (and these 20 outages are events occurring on 1 of 5 servers or 2 firewalls, about a third of which were the result of two different machines' hard drives failing). There is nothing indicating the servers were doing anything special just because it was the small hours of the morning.
Typical "working hours" are no more than 40 hours of a week. Less in some parts of the world. A week contains a total of 168 hours. 40/168 = less than 24% of the time of a week is 'working hours'.
That suggests that failures of systems that are running 24/7 will occur 3-times more often during non-working hours than working hours.
Obviously, there are many other considerations that could go into this; multiple shifts, peak times (which for many, might further bias failures toward non-working hours), etc.
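If you want to put rough numbers on it, here's a minimal back-of-the-envelope sketch in plain Python. The 20-outage count and the 11 PM-6 AM window come from the question; the 40-hour week and the assumption that failures are uniform in time are just the simplifications made in this answer:

```python
# Back-of-the-envelope check: share of the week that is "working hours",
# and how surprising "16 of 20 outages overnight" would be if failures
# were spread uniformly over the week (uniformity is an assumption).
from math import comb

HOURS_PER_WEEK = 7 * 24          # 168
WORKING_HOURS = 40               # "no more than 40 hours a week"
working_frac = WORKING_HOURS / HOURS_PER_WEEK
print(f"working-hours share of the week: {working_frac:.1%}")                         # ~23.8%
print(f"non-working vs working ratio:    {(1 - working_frac) / working_frac:.1f}x")   # ~3.2x

# Probability of seeing 16 or more of 20 outages inside a 7-hour nightly
# window (11 PM-6 AM is 7 of 24 hours) under the uniform assumption.
p_window = 7 / 24
p_16_or_more = sum(comb(20, k) * p_window**k * (1 - p_window)**(20 - k)
                   for k in range(16, 21))
print(f"P(>=16 of 20 outages in the nightly window): {p_16_or_more:.2e}")
```

Even with the non-working hours bias, 16 of 20 in that narrow a window would be very unlikely under a uniform model, which is why the other answers about nightly jobs and overnight maintenance are worth chasing.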
Yes, we find it, and no, it's no fluke. Your servers hate you, I'm sure. I know my servers hate me, and whilst they'd happily see me dead, if they feel themselves flagging I'm sure they hold on until their ntp daemons whisper in their ears that it's the middle of the night, and now is a good time to die. They know that to fail at 1030h will ruin my day, but to fail at 0345 will ruin my night, drag me down to London in the dark, and ruin the next day as well. They love that.
After having had a corporate firewall fail on me at a most inconvenient time due to a failed HDD, I separated the disc controller board from the HDD, cut it into four, and thereafter wore - and still wear - a quarter of its board, like a scalp, hanging from my "chain of office" (the lanyard with all the various access tokens I use at all my various sites). I am sure the sight of this grisly relic, in their plain view, kept its brother and sister servers largely in line henceforth, the penalty for failure being thus clearly displayed.
(In case anyone suffers a sense of humour failure, this post is a joke; except the bit about the HDD controller, which is absolutely true, and works.)
The time between 11 PM and 6 AM seems like a typical window for nightly cron jobs to run. Perhaps some of them put a bit of extra strain on your servers, increasing the risk that a pending failure happens just then.
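A quick way to see what's scheduled in that window is to scan the system crontab. This is only a rough sketch: it assumes the standard `min hour dom mon dow user command` format of `/etc/crontab`, only understands plain numbers and comma-separated lists in the hour field, and ignores ranges, steps, and per-user crontabs:

```python
# List /etc/crontab entries whose hour field falls between 23:00 and 05:59.
NIGHT_HOURS = set(range(23, 24)) | set(range(0, 6))

def hours_from_field(field: str) -> set:
    """Hours a crontab hour field covers (plain numbers and comma lists only)."""
    if field == "*":
        return set(range(24))          # runs every hour
    return {int(part) for part in field.split(",") if part.isdigit()}

with open("/etc/crontab") as crontab:
    for line in crontab:
        line = line.strip()
        if not line or line.startswith("#") or "=" in line.split()[0]:
            continue                   # skip comments and VAR=value lines
        fields = line.split(None, 6)   # min hour dom mon dow user command
        if len(fields) < 7:
            continue
        hour_field, command = fields[1], fields[6]
        if hours_from_field(hour_field) & NIGHT_HOURS:
            print(f"hour field '{hour_field}': {command}")
```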
Overnight is when most infrastructure changes occur. Networks and other resources may go down. If you are using remote monitoring, you will see your site go down because it is not reachable. Knowing the maintenance windows for your various resources will help you separate these apparent outages from actual outages.
As others have noted, purely on hours of the clock, outages are more likely to fall outside office hours. Given weekday availability and an 8-hour workday, only about a third of outages should occur during office hours. Add in weekends and even fewer of the outages fall during workdays.
Track the reasons for the outages, and how they were detected. You will find some outages are due to resources like the network being down. These may appear as mysterious outages where the site disappeared for a few minutes and came back without intervention. I would expect many of your overnight outages were caused by infrastructure changes.
Infrastructure changes are usually scheduled, so you should be able to arrange to be notified of them. You can then adjust your response accordingly. Your outage log should reflect that the outage was due to the change. Also record any intervention that was required. You may need to add recovery code to your application to handle database restarts or other such resource changes.
Knowing the maintenance windows for various resources can help identify which resources are causing unplanned outages. You may need to trace your resource dependencies, as networked disk and databases will depend on the network infrastructure. Likewise, the database may depend on networked disk storage.
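To make that cross-check concrete, here's a minimal sketch of matching an outage log against known maintenance windows. The window definitions and outage timestamps below are made-up placeholders; in practice the log would come from your monitoring system and the windows from your providers' published schedules:

```python
# Flag outages that fall inside a known recurring maintenance window.
from datetime import datetime, time

# Hypothetical recurring windows: (resource, weekday, start, end)
# weekday: Monday = 0 ... Sunday = 6
MAINTENANCE_WINDOWS = [
    ("ISP core network", 6, time(2, 0), time(5, 0)),     # Sunday 02:00-05:00
    ("SAN firmware",     2, time(23, 0), time(23, 59)),  # Wednesday late evening
]

def possible_maintenance(outage: datetime) -> list:
    """Resources whose maintenance window covers this outage timestamp."""
    return [resource for resource, weekday, start, end in MAINTENANCE_WINDOWS
            if outage.weekday() == weekday and start <= outage.time() <= end]

# Placeholder outage log entries.
outages = [datetime(2009, 5, 3, 4, 59), datetime(2009, 4, 12, 14, 10)]
for outage in outages:
    hits = possible_maintenance(outage)
    print(outage, "->", hits if hits else "no scheduled maintenance; investigate")
```

Anything that doesn't line up with a window is a genuine unplanned outage and worth a closer look at the hardware and application logs.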
I had a VoIP server die on me within the last 3 months. "Die" perhaps isn't the best word, since the machine was still bootable after a kernel panic. Typically, the machine would function flawlessly between 7 AM and 7 PM. Then, at random intervals separated by 1-30 days, it would be locked up and unresponsive at the system console when I returned to the office at 7 AM.
After about 12 iterations of this situation, which invariably happened between 11 PM and 7 AM, it was determined that the motherboard had failed; specifically, the capacitors were to blame. I think I read somewhere that temperature extremes can hasten this kind of death. I suppose my small office is not unusual, but during off hours I typically allowed the temperature to swing as much as 15°F above and 20°F below 75°F. So I believe that small-time operations that aren't using a chilled data center are likely to suffer temperature-induced failures in the wee hours of the morning.
My recollection, again, is that the logs showed failure during the 8 hours before we opened our shop in the morning -- always.