"Can we upgrade our existing production EL5 servers to EL6?"
A simple-sounding request from two customers with completely different environments prompted my usual best-practices answer of "yes, but it will require a coordinated rebuild of all of your systems"...
Both clients feel that a complete rebuild of their systems is an unacceptable option for downtime and resource reasons... When asked why it was necessary to fully reinstall the systems, I didn't have a good answer beyond, "that's the way it is..."
I'm not trying to elicit responses about configuration management ("Puppetize everything" doesn't always apply) or how the clients should have planned better. This is a real-world example of environments that have grown and thrived in a production capacity, but don't see a clean path to move to the next version of their OS.
Environment A:
Non-profit organization with 40 x Red Hat Enterprise Linux 5.4 and 5.5 web, database and mail servers, running a Java web application stack, software load balancers and Postgres databases. All systems are virtualized on two VMware vSphere clusters in different locations, each with HA, DRS, etc.
Environment B:
High-frequency financial trading firm with 200 x CentOS 5.x systems in multiple co-location facilities running production trading operations, supporting in-house development and back-office functions. The trading servers are running on bare-metal commodity server hardware. They have numerous sysctl.conf, rtctl, interrupt binding and driver tweaks in place to lower messaging latency. Some have custom and/or realtime kernels. The developer workstations are also running similar versions of CentOS.
In both cases, the environments are running well as-is. The desire to upgrade comes from a need for a newer application or feature available in EL6.
- For the non-profit firm, it's tied to Apache, the kernel and some things that will make the developers happy.
- In the trading firm, it's about some enhancements in the kernel, networking stack and GLIBC, which will make the developers happy.
Both are things that can't be easily packaged or updated without drastically altering the operating system.
As a systems engineer, I appreciate that Red Hat recommends full rebuilds when moving between major version releases. A clean start forces you to refactor and pay attention to configs along the way.
Being sensitive to business needs of clients, I wonder why this needs to be such an onerous task. The RPM packaging system is more than capable of handling in-place upgrades, but it's the little details that get you: /boot requiring more space, new default filesystems, RPM possibly breaking mid-upgrade, deprecated and defunct packages...
What's the answer here? Other distributions (.deb-based, Arch and Gentoo) seem to have this ability or a better path. Let's say we find the downtime to accomplish this task the right way:
- What should these clients do to avoid the same problem when EL7 is released and stabilizes?
- Or is this a case where people need to resign themselves to full rebuilds every few years?
- This seems to have gotten worse as Enterprise Linux has evolved... Or am I just imagining that?
- Has this dissuaded anyone from using Red Hat and derivative operating systems?
I suppose there's the configuration management angle, but most Puppet installations I see do not translate well into environments with highly-customized application servers (Environment B could have a single server whose ifconfig output looks like this). I'd be interested in hearing suggestions on how configuration management can be used to help organizations get across the RHEL major version bump, though.
(Author's Note: This answer refers to RHEL 6 and prior versions. RHEL 7 now has a fully supported upgrade path from RHEL 6, the details of which are at the end.)
To start, I should note that there are two ways to do the in-place upgrade:
1. Boot the installation media with the linux upgradeany option.
2. Update the redhat-release RPM manually, run yum distro-sync (this is oversimplified a bit) and reboot.
Method 1 is merely unsupported. Method 2 is for Real Cowboys. In addition to the recommended fresh installs, I have done both of these...
Do I need support?
Support has two complementary meanings in our world. The first is that a product has a given feature (e.g. "Postfix supports SMTP"). The second is that the vendor will talk to you about it. Which definition is meant is not always clear from context.
To accomplish a task, you obviously need support in the first sense. Where vendor support comes in is to assist you in resolving issues and giving the vendor feedback as to what features need to exist or be improved. Many sites pay a fortune for vendor support when they have the in-house expertise to resolve any issues that may arise, faster and even cheaper than the vendor could. Whether to buy vendor support is ultimately a business decision you will have to make (or advise management on).
Why not do an in-place upgrade?
This is what Red Hat says about it:
They further warn:
Of course, they then describe how to do an in-place upgrade via method 1, just in case you really want to do it. The feature exists and Red Hat puts development time into it, so it is supported in that the feature exists. But if something goes wrong, Red Hat will tell you to install fresh; they will not provide vendor support for things that break as a result of the upgrade.
For the record, I've never actually had a problem with an in-place upgrade of a RHEL/CentOS or Fedora system that I couldn't resolve myself. The typical problems come from renamed packages, third party repositories and the occasional version mismatch between the i386 and x86_64 architectures of a package. The installer is a bit better at handling these than yum, I think.
How should I upgrade?
I generally warn people that they should plan on a maintenance window every 3-4 years to update RHEL systems from one major version to the next. While upgrades generally go smoothly, the unexpected can always happen.
For both of your environments, I expect an in-place upgrade would work, though I strongly recommend testing it thoroughly first. P2V a representative sample of the servers and run through the in-place upgrade on the virtual systems to see what problems you're going to run into. You can then plan the actual production upgrade based on better knowledge of what will happen.
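One of the typical problems mentioned earlier, a version mismatch between the 32-bit and x86_64 builds of the same package, can be looked for before the test upgrade even starts. The following is only a rough sketch in Python, not a tested tool. It assumes the package list was collected with something like rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE} %{ARCH}\n'; the function name and the sample package versions are made up for illustration.

```python
from collections import defaultdict

# Architectures treated as "32-bit x86" for this check (an assumption;
# adjust to whatever actually shows up on your systems).
X86_32_ARCHES = ("i386", "i586", "i686")


def find_arch_mismatches(rpm_listing):
    """Return packages whose 32-bit and x86_64 builds differ in version.

    rpm_listing is text with one "name version-release arch" entry per line.
    """
    versions = defaultdict(lambda: defaultdict(set))
    for line in rpm_listing.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, version_release, arch = parts
        versions[name][arch].add(version_release)

    mismatches = {}
    for name, by_arch in versions.items():
        builds_32 = set()
        for arch in X86_32_ARCHES:
            builds_32 |= by_arch.get(arch, set())
        builds_64 = by_arch.get("x86_64", set())
        # Only report packages installed for both architectures
        # where the installed versions are not identical.
        if builds_32 and builds_64 and builds_32 != builds_64:
            mismatches[name] = {
                "32-bit": sorted(builds_32),
                "x86_64": sorted(builds_64),
            }
    return mismatches


# Made-up sample data, for illustration only.
sample = """\
glibc 2.5-49 i686
glibc 2.5-49 x86_64
openssl 0.9.8e-12.el5 i686
openssl 0.9.8e-12.el5_4.6 x86_64
bash 3.2-24.el5 x86_64
"""

for package, builds in find_arch_mismatches(sample).items():
    print(package, builds)
```

Anything this reports is a candidate for a yum update (or removal of the unneeded 32-bit package) before the upgrade, so that the installer or yum has one less conflict to trip over.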
For a large deployment such as you have here, consider using Limoncelli's "one-some-many" approach. Upgrade one machine, see what problems occur, solve them, then use lessons learned when upgrading a small batch of machines, repeat the lessons learned thing, then when you believe you have all the kinks worked out, upgrade large batches of them.
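To make the batching concrete, here is a minimal sketch in Python of how a host list could be split for a one-some-many rollout. The batch sizes (5 and 25) and the host names are arbitrary assumptions; pick sizes that match how much breakage you can tolerate at once.

```python
def one_some_many(hosts, some=5, many=25):
    """Yield upgrade batches: a single host, a small batch, then large batches."""
    hosts = list(hosts)
    if not hosts:
        return
    # "One": a single canary machine.
    yield hosts[:1]
    # "Some": a small batch to confirm the lessons learned from the first.
    if len(hosts) > 1:
        yield hosts[1:1 + some]
    # "Many": everything else, in large batches.
    for start in range(1 + some, len(hosts), many):
        yield hosts[start:start + many]


# Hypothetical host names standing in for Environment A's 40 servers.
hosts = [f"web{i:02d}" for i in range(1, 41)]
batches = list(one_some_many(hosts))

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} host(s), starting with {batch[0]}")
```

The point of generating the batches up front is that each one becomes a checkpoint: upgrade the batch, fix whatever broke, fold the fix into the procedure, and only then move on to the next one.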
At a time like this, I also recommend taking a long hard look at your application deployment process. If it isn't sufficiently automated that you can kick it off with a single command and be reasonably sure that the app will be deployed correctly, then perhaps the developers need to get to work on that. Having such a deployment process would make it much easier to do a fresh installation of the newer version of EL and then deploy onto it.
Will switching distributions help?
Debian-based distributions do have a supported in-place upgrade method, and it mostly works, but it is not immune from problems. Lots of things broke for people upgrading from Ubuntu 10.04 LTS to 12.04 LTS via the supported method, for instance. It's not clear that Debian or Canonical are putting a sufficient amount of development time into "supporting" this feature, i.e., making sure it works. And you still actually have to buy vendor support for this distribution if you want someone to hold your hand. So I doubt you will gain much from switching to such a distribution.
You may gain by switching to a rolling-release distribution such as Gentoo or Arch. However, this also doesn't make you immune to problems; it just means you have to deal with the upgrade problems continuously over the life of the server (e.g. whenever you or the developers decide to update something on the system), rather than all at once at a well-planned distribution upgrade time. You also have no vendor to provide support.
What does the future hold?
The Fedora Project is working on a tool to improve in-place upgrades. They had a tool called preupgrade which was abandoned and replaced with a new tool called fedup beginning with Fedora 18. This was added to RHEL 7 and now in-place upgrades have full support, at least from RHEL 6 to RHEL 7. From my own experience I can say that while fedup still has some kinks, it is shaping up to be a very useful tool.
CentOS is also experimenting with a rolling-release type of repository, but it only applies between minor versions (e.g. 6.3 to 6.4).
My take on your last paragraph:
I think the real value of configuration management systems, especially in the context of Environment B, is that they provide the tools to construct a service independently of the servers which run it. If a CMS wasn't used to create the existing services, then it probably won't help very much in recreating the services.
I know this doesn't solve your immediate problem, but to me it stems from the organisation thinking in terms of servers rather than services. In service-focused thinking, the personality of individual servers need not be maintained as long as the service continues to function. If a CMS is used in a disciplined manner to build the entire service, then moving that service to another system should be relatively straightforward, because all of the machine's personality will be built by the CMS.
P.S. I'm not exactly sure what's significant about the ifconfig output in this context - it's produced by a configuration file and some scripts (otherwise it wouldn't be there on boot), and those can be managed by a CMS, if needed.