As part of my job I manage a few tens of CentOS 5 servers, using puppet for the main setup. About half of our servers have a standardised setup for hosting various django sites, while the rest are a mishmash of applications.
I'm gradually sorting out our hosting practices, and I've now got to the point of working out how to manage security updates at the OS level. I'm wary of just having a cron job run yum -y update, but I also don't want to have to go round each server in turn and review every package with updates available, as that would take a while.
So I'm wondering if there are any good shortcuts or working practices that would minimise the risks involved and minimise the amount of time I need to spend. Or, to put it another way, are there any tools or practices that can automate a lot of the work while still giving me control?
Steps I've decided on so far:
- disable all third-party repositories and set up our own repository, so I can control which updates go through (a rough puppet sketch of this follows the list below).
- we have staging servers for (most of) our production servers where I could do testing (but how much testing is enough testing?)
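Something along these lines is what I have in mind for the repository step; the repo id, the baseurl and the third-party repo name are placeholders, not our real setup:

    # Point every server at the internal mirror; repo id and baseurl are placeholders.
    yumrepo { 'internal-updates':
      descr    => 'Internal CentOS 5 updates mirror',
      baseurl  => 'http://yum.example.internal/centos/5/updates/$basearch/',
      enabled  => '1',
      gpgcheck => '1',
    }

    # Explicitly switch off a third-party repo we no longer want pulled from.
    yumrepo { 'rpmforge':
      enabled => '0',
    }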
Also note that I've looked into the yum security plugin but it does not work on CentOS.
So how do you manage updates for significant numbers of CentOS servers running a heterogeneous array of applications?
In most of my environments, it's usually a kickstart and post-install script to get the main system up and current with updates at that moment. I'll usually have a local repo that syncs with a CentOS mirror daily or weekly. I tend to freeze the kernel package at whatever's current as of installation time and update packages individually or as necessary. Often, my servers have peripherals whose drivers are closely tied to kernel versions, so that's a consideration.
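As a rough sketch of the sync side (not my exact setup; reposync comes from yum-utils, and the paths and repo id here are just examples):

    # Nightly pull from an upstream CentOS mirror into the local repo,
    # then rebuild the repo metadata. Paths and repo id are examples only.
    cron { 'sync-centos-updates':
      command => '/usr/bin/reposync --repoid=updates --download_path=/var/www/repo && /usr/bin/createrepo /var/www/repo/updates',
      user    => 'root',
      hour    => 3,
      minute  => 15,
    }

The kernel freeze itself is normally just an exclude=kernel* line in /etc/yum.conf on the clients.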
CentOS 5 has matured to the point where constant updates aren't necessary. But also keep in mind that CentOS 5 is winding down: the rate of updates has slowed somewhat, and the nature of the updates is more in line with bug fixes and less about major functionality changes.
So in this specific case, the first thing you could do is build a local mirror/repo. Use your existing configuration management to control access to third-party repos. Perhaps schedule a policy to yum update the critical or public-facing services (ssh, http, ftp, dovecot, etc.). Everything else will require testing, but I get the feeling that most environments don't run with fully-updated/patched systems anyway.
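A minimal sketch of that "keep the exposed services current" policy in puppet (the package names are illustrative, and ensure => latest will pull whatever your local repo publishes):

    # Let public-facing packages float to the newest version in our repo;
    # everything else stays at the version it was installed with.
    $exposed = ['openssh-server', 'httpd', 'vsftpd', 'dovecot']

    package { $exposed:
      ensure => latest,
    }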
There are many tools that can help with this! In general, the package system and which packages go where is handled by configuration management. These tools usually cover more than just yum and RPMs, though, and will save you time and prevent many, many headaches!
The tool I'm most familiar with is puppet, which I use to manage virtually every config in my environment. Here are some puppet examples for managing yum specifically:
http://people.redhat.com/dlutter/puppet-app.html
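The link above has fuller recipes; the basic shapes look something like this (my own sketch, not copied from that page, and the version string is just an example):

    # Three common ways to express yum package policy in puppet:
    package { 'httpd':
      ensure => installed,              # present, version left alone
    }
    package { 'openssh-server':
      ensure => latest,                 # always take the newest from the repo
    }
    package { 'mysql-server':
      ensure => '5.0.77-4.el5_4.2',     # pin to a known-good build (example version)
    }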
There are a number of configuration management tools currently available, each with a pretty big user community.
Implementing one of these in your environment will add years to your life. It reduces the number of headaches caused by poorly configured systems and allows easy upgrading/updating. Most of these tools also provide some audit-level functionality, which can greatly reduce the time-to-repair for configuration mistakes.
Regarding your question about testing: I've been using a staging environment that we direct some customer load to (usually beta customers, or a small subset of production traffic). We usually let this cluster run new code for at least a couple of days, and up to a week (depending on the gravity of the change), before we deploy it to production. I've found this setup works best if you try to figure out how long most errors take to discover. In heavily used systems this can be a matter of hours; in most environments I've seen, a week is long enough to discover even uncommon bugs in staging/QA.
One really important part of testing is replication of data/usage. You mentioned you have staging versions of most of your production hardware. Do they also have identical copies of the production data? Can you replay any of the production load against them? Can you even make them part of the production cluster using traffic mirroring? This usually comes down to a direct trade-off against the amount of resources the business is willing to spend on testing/QA. The more testing the better; try not to self-limit (within reason), see what the business will support, and then find a way to do 10% more.