I'm a moderately experienced web developer, and I haven't managed any high-traffic websites. Generally, I observe that it's the high-traffic websites that go down for maintenance; even stackoverflow.com goes down for maintenance from time to time.
I always wonder: what kind of maintenance do they do? I mean, the whole process is automated:
user request --> web server --> server-side programs --> database server
What is there to maintain?
Usually the highest traffic sites don't go down for maintenance. They're designed so they don't have to. (Depending on the site, that can be very tricky. It's not just a case of running multiple servers, although obviously that's the starting point.)
However, "site down for maintenance" usually means one (or more) of the things described below.
They may want to run updates (or fixes) on many of the different pieces of software running on the server, including (but not limited to) the operating system, the web server itself, the database server, and the application frameworks or language runtimes.
Beyond that, they could also be doing hardware maintenance, such as adding a new hard drive, upgrading a motherboard, putting in faster RAM, or swapping out network cards. There are plenty of things, both hardware and software, that can be upgraded or modified, really.
Now if they have a backup server (or a cluster or something of the sort), this can be transparent, but if it's literally one box serving the pages...well, it pretty much has to go down.
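To make the "transparent" case concrete, here's a minimal sketch of the idea: take servers out of rotation one at a time while the rest keep serving traffic. The helper functions (drain, do_maintenance, enable) are hypothetical placeholders for illustration, not any particular load balancer's API.

```python
import time

SERVERS = ["web1", "web2", "web3"]

def drain(server):           # hypothetical: stop sending new traffic to this server
    print(f"draining {server}")

def enable(server):          # hypothetical: put the server back into rotation
    print(f"re-enabling {server}")

def do_maintenance(server):  # hypothetical: patch, upgrade, reboot, etc.
    print(f"patching {server}")
    time.sleep(1)

for server in SERVERS:
    drain(server)            # the remaining servers keep handling user requests
    do_maintenance(server)
    enable(server)

# With literally one box, there is nothing left to take the traffic,
# so the site simply has to go down.
```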
Since you're coming from a coding background, I'll base my analogy there. Imagine that being a sysadmin is just like programming, except you'll be called on to code in a different language every couple of hours. And sometimes it's Pascal.
Truly, though, it could mean anything. Sometimes a mouse chews its way into a warm place. Or a single point of failure makes itself known. Eliminating downtime is what we pursue ... like writing code that works perfectly on the first compile.
Liken a single server to a running vehicle. If you turn off the vehicle, your 'server' is down.
There are some things you can do while the car is running - add fuel, oil, washer fluid, clean the windshield, change gears, etc.
However, you can't replace the fuel line in the car while it's running - liken fuel to data; you don't want to lose any, or you'll have unhappy customers.
These downtimes vary with the administrators' skill level and the complexity of the changes. On larger, high-traffic sites, the only way this can feasibly happen is if there's a major architecture change: something where, no matter how many servers and redundancies you have, the architecture needs to change all at once.
This is rare for very large systems. I liken it to replacing the fuel line on a running vehicle: for many, it's not feasible (or worth the effort and risk) at certain skill and resource levels. However, places with the skills and resources can pull off the equivalent of a fuel line replacement on a running vehicle. Liken that to an architecture migration; it's the same idea, just far more complex.
It could be any of:
- an upgrade of servers, frameworks, or databases
- moving to a new datacenter and shutting the old servers down so that nobody can connect
- patching of the operating systems or software that runs on those servers
Basically, anything that could make the site unavailable for a certain amount of time.
Regular maintenance items would be things like rebuilding caches, upgrading software and/or templates, doing some data trawling for statistics, and routine maintenance tasks like backups (which work better on quiet systems), along with a variety of other expensive, infrequent tasks.
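As a rough illustration only: for a small, single-box site, that kind of quiet-hours routine might boil down to a script along these lines. The maintenance-page and cache helpers are made-up placeholders, and pg_dump is used purely as an example backup tool, assuming a PostgreSQL database named mysite.

```python
import subprocess
from datetime import date

def enable_maintenance_page():   # hypothetical: start serving the "down for maintenance" page
    print("serving 'down for maintenance' page")

def disable_maintenance_page():  # hypothetical: put the real site back online
    print("site back online")

def rebuild_caches():            # hypothetical: regenerate caches / statistics tables
    print("rebuilding caches and statistics")

enable_maintenance_page()
try:
    # A full database dump runs much faster while nothing is writing to the DB.
    subprocess.run(
        ["pg_dump", "--file", f"backup-{date.today()}.sql", "mysite"],
        check=True,
    )
    rebuild_caches()
finally:
    disable_maintenance_page()
```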
Some tasks just require poring over a lot of data, and it's not really efficient to do that after each change. Recommendation databases are one thing that comes to mind: you don't need up-to-the-second data, and it's rather expensive to calculate common purchase patterns across many different users. This is an N^2-complexity problem with some algorithms, and it tends to take both a lot of data trawling and a lot of memory.
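As a hedged sketch of the kind of batch job being described, here is one naive way to compute common purchase patterns: count how often each pair of items appears in the same order. Comparing every item against every other item in an order is the roughly-N^2 part, and across millions of orders it's exactly the sort of thing you run offline rather than per request. The data shapes and names here are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(orders):
    """Count how often each pair of items appears in the same order.

    `orders` is an iterable of item-ID collections, one per order.
    """
    pair_counts = Counter()
    for items in orders:
        # Every pair of distinct items in the order: this inner loop is the N^2 part.
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Toy example: items often bought together become
# "customers who bought X also bought Y" recommendations.
orders = [
    {"book", "lamp"},
    {"book", "lamp", "desk"},
    {"lamp", "desk"},
]
print(co_purchase_counts(orders).most_common(3))
```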
Financial institutions may use the downtime to calculate and make interest payments to accounts, or to close outstanding transactions and calculate reconciliation balances. This data in theory should never change after reconciliation, so it makes sense to write it to WORM storage at this point.
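For a flavour of the batch arithmetic involved, here is a minimal sketch of a daily interest accrual step, assuming a simple actual/365 day-count. Real institutions have their own day-count conventions, rounding rules, and posting schedules; the numbers and function below are purely illustrative.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def accrue_daily_interest(balance, annual_rate, days_in_year=365):
    """Accrue one day's interest on a balance (simple actual/365 accrual).

    Hypothetical batch step: shows why this is a nightly job posted during
    the maintenance window rather than something computed per request.
    """
    interest = balance * annual_rate / Decimal(days_in_year)
    return interest.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

print(accrue_daily_interest(Decimal("1500.00"), Decimal("0.031")))
# Decimal('0.13') -- one day's interest, posted to the account overnight
```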
Backups are a major item that's often done during downtime, because heavy disk I/O tends to bring even very powerful servers to their knees, and taking the site offline can help speed up the backup process. I remember one organization I was at where they had a very large customer RAID array, and the backup team kept complaining because their backup window for this one customer typically ran 22-24 hours, and at one point 26. A small amount of quiet time can decrease that window substantially.
Defragmenting the disk arrays. It's faster and safer to defrag servers when they are offline, allowing the CPU and disks to focus on that task rather than on running 1,000 websites. It's better to tell people to come back later than to give them a poor user experience.
If it's a Windows server, you can crash it by running defrag while memory usage is over 50%, because at that point Windows starts to rev up the page file. I learned this the hard way.