So we've all probably had this situation: you debug some problem, only to realize it was caused by a config change you made six months ago, and you can't remember why you did it. So you undo it and fix the problem, and now some other problem comes back. Oh yeah, NOW I remember! Then you fix it properly.
It's because you didn't take proper notes, you fool! But what's a good way to do this?
In engineering we have loads of software meant to help us detect and track changes. Source control, code reviews, and so on. Every change is tracked, every change requires a comment as to what it is. And typical engineering departments require good comments so that in six months when you're figuring out why you broke it like that, you can use a historical 'blame' feature or binary search builds to pinpoint the problem. These tools are very effective communication tools and historical records.
But in serverland, we have 500 different services, all with different ways of configuring them. And they don't always have a text format (consider setting permissions on a folder or altering the pagefile location) though they may have a textual representation.
In our environment, we check whatever config files we can into Perforce, but there are very few of those. We can't exactly check in the Active Directory DB... though perhaps a dump that could be diff'd...
In the past I have tried keeping a manual change log in our wiki, but it's super hard to maintain the discipline to do this (I know, not a good excuse, but it really is tough).
MY QUESTION: What strategies and tools do you use to cope with this problem of tracking configuration changes to your servers?
-- Update --
Note: I'm not looking for shared note-taking tools (I'm familiar with OneNote, etc.) so much as automated tools specifically meant to help with tracking server changes. There's no comprehensive tool for tracking server config changes, but perhaps there are some for specific applications like GPOs.
Also I am very interested in specific strategies that you've found useful. "We share notes in Sharepoint" is pretty vague. How do you maintain the discipline? What format do you use to track your changes? How do you organize your change data? I'd really like examples as well as ideas.
In Linux land, people are pursuing a couple of different strategies:
One of the problems in this situation is that, really, it's a combination business process/technological problem. And it is definitely bigger than just tracking what changes an admin made. You also need to keep an eye out for unexpected changes, and you need good coordination between admins or units so that a change on an AD controller doesn't break a database permissions setting on some departmental server. I.e., your question is a giant can of worms :)
In my organization, we are about a year into rolling out processes and systems to address this. For the business process side we formed a Change Management team. According to SOP, all changes to production environments are coordinated through them. They compile all the changes, along with scope, systems affected, services affected, etc. They enforce good documentation on the changes, as well as both roll-out and roll-back plans. They host weekly (open) meetings to go over upcoming environment changes, then send out emails detailing all of these changes. The end goal of this process is that, effectively, everybody in IT knows everything else that is going on. This helps stop the problem of, for example, a SysAdmin installing a kernel patch and rebooting a system when that reboot will take down the timeclock database.
As for the technological side, I can only speak for the Unix/Linux guys since I don't deal with Windows. They have been rolling out Puppet, by Reductive Labs, for configuration management of all of those systems. Simply put, it is a client/server system where one defines a machine configuration on the server, and the client pulls those changes every so often (30 minutes by default). Additionally, if any changes are made to managed files locally, they are reverted at that time as well. We use it for managing running services, firewall configurations, user authorization, etc.
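To make the "pull, then revert local changes" idea concrete, here is a minimal Python sketch of that kind of convergence step. To be clear, this is not how Puppet is implemented and it uses none of Puppet's language or API; the `converge` function and the `desired` mapping are hypothetical names I made up for illustration. It only shows the general pattern: compare each managed file against the centrally defined content and rewrite it if it has drifted.

```python
from pathlib import Path


def converge(desired: dict[str, str], root: Path) -> list[str]:
    """Make files under `root` match `desired`.

    `desired` maps a relative path to the content the file should have
    (a hypothetical stand-in for a centrally defined configuration).
    Returns the relative paths that had to be created or rewritten.
    """
    corrected = []
    for rel_path, content in desired.items():
        target = root / rel_path
        current = target.read_text() if target.exists() else None
        if current != content:
            # Missing or locally modified: put the managed content back.
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
            corrected.append(rel_path)
    return corrected
```

Run on a schedule, the returned list doubles as a drift report: an empty list means nothing was touched by hand since the last run, and a non-empty one tells you which files somebody edited locally.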
I would also recommend looking into something like TippingPoint. It is a client service that watches system configuration, and sends alerts on changes. It makes us security folks most happy. It is largely used for tracking malicious or unpublished changes.
I have been at 4 or 5 companies now; I don't really remember.
We all had this problem. None of us have solved it 100 percent, but at the company I am at now we have what I think is the best strategy to date.
Sharepoint/Wiki/Evernote/PINs
There's probably better tools for some of these, but this is what we use:
For Windows, check out Microsoft's System Center series or any competitor in configuration and service management for that platform.
The changes need to be routed through a decent change management routine which by itself approves and logs them before they're actually done. This can be 100% manual for starters. With some of the better integrated tools you could ask the tool to do the actual changes and get "automatic" logging out of it to a central configuration database - rather than go bare-hands into an individual server's console, digging through settings by hand to try and fix a problem cowboy-style.
You absolutely should have a change management process in place, especially if there are multiple people who have the ability/access to make changes on the system level in your environment. This also provides a way for management to sign off on potential changes. The downside is that it adds latency to the change process, since changes can no longer be made on the fly.
Some ways of tracking changes might include the validation of events in your SEM (assuming you have a Security Event Manager) or tools such as Nessus (which, with a lot of work, can audit your environment to find changes).
This is a more localised, *nix based answer. I've not found any good tools to emulate it under Windows.
There's a few ways to implement this ... and to catch it when you forget.
Revision control systems like subversion, git, cvs or RCS are a good way of tracking the history of a config file. If you don't want to install a revision control system on your production servers, storing configuration file directories either locally or remotely using something like rsnapshot will give you most of the benefits of an RCS, but you lose the possibility of auditing or leaving commit logs (although this could be worked around with comments inside the files themselves).
To help you remember to log the changes, automated reporting of configuration changes via a nightly, cron'ed tripwire run is a good start. After building tripwire's database of the current state of files, any change to them will result in an email during the next run. You will continue to receive this mail until the database is updated, thus "resetting" the tripwire.
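For anyone who hasn't seen this style of check, here is a rough Python sketch of the baseline-and-compare idea. It is not tripwire and doesn't use tripwire's database format or policy files; `snapshot` and `changed_since` are hypothetical helper names, and the list of watched paths is whatever you decide to pass in. The point is only to show why the report keeps repeating until the baseline is rebuilt.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to the SHA-256 of its contents (None if it is not a file)."""
    result = {}
    for p in paths:
        path = Path(p)
        if path.is_file():
            result[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            result[str(path)] = None
    return result


def changed_since(baseline, paths):
    """Return the paths whose current hash differs from the stored baseline."""
    current = snapshot(paths)
    return sorted(p for p in current if current[p] != baseline.get(p))
```

A nightly job would call `changed_since` against the saved baseline and mail out whatever comes back. Only once the change has been written up would you call `snapshot` again and store the result as the new baseline, which is the "reset" step.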
I would use an issue tracking system such as flyspray (any will do, but I like flyspray for non-programming stuff). Before anyone touches a config, the improvement/problem should be logged. When you fix/implement it, the changes go in the ticket.
A wiki can be nice to document the current setup, but it's easy for it to get out of date - and it seems to take more effort to update IMO.
You're not going to find something automated to do this - although you could probably set it up so changes to certain config files got automatically emailed to the issue tracker if you wanted.
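As a rough idea of what that could look like, here is a Python sketch that turns a before/after copy of a config file into a mail message containing a unified diff. Everything specific here is an assumption: `build_change_mail` is a hypothetical helper, the addresses are placeholders, and whether your issue tracker accepts tickets or comments by email depends on the tracker. The sketch only builds the message; it doesn't send it.

```python
import difflib
from email.message import EmailMessage


def build_change_mail(path, old_text, new_text, sender, tracker_address):
    """Build a mail containing a unified diff of a config file.

    Returns None when the two versions are identical, so the caller
    can skip sending anything.
    """
    diff = "\n".join(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=f"{path} (before)",
            tofile=f"{path} (after)",
            lineterm="",
        )
    )
    if not diff:
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = tracker_address  # placeholder: the tracker's inbound address
    msg["Subject"] = f"Config change detected: {path}"
    msg.set_content(diff)
    return msg
```

The returned message could then be handed to `smtplib.SMTP(...).send_message(msg)` from a cron job. You'd still need a human to add the "why" to the ticket, which is the part no tool will do for you.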
I think it's just a matter of a good policy, low-barrier tools and discipline.
We created something homegrown to do change log tracking in our environment; it's not anything super-complicated, and it works quite well.
As I said, nothing fancy. It uses Perl CGI (it was written a billion years ago) and a Google Search Appliance for indexing.
Shortcomings:
Anyway, if after all that you'd be interested in the code, let me know and I can probably grab it to share.
As said, it's often a cultural issue - after all, some development shops don't bother with comments anymore (self-documenting code is a fashionable buzzword today!) and some use a version control system as a holy grail of historical records. Obviously, these aren't perfect.
So, the only true way to fix it is to make it a cultural solution. Ensure all reasons for change are logged in a bug tracker (or knowledgebase, or wiki), and ensure all changes are logged in a change control system.
We have emergency service customers. Every change that happens to their systems is logged, and every time we log into their systems, we have to log it. For some of them, we have to phone for permission first (and I guess they log that too!). It is a disciplinary offence to change a customer system without logging it.
It sounds onerous, but it's not. You quickly get into the habit of adding yourself to the access log and change log - it's no worse than having to write a comment when checking in a code change.
I recommend a bugtracker as the change control reason log, as they're usually easy to update (I use Mantis).
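If it helps to see what a minimal change log entry could contain, here is a Python sketch that appends one JSON record per change to a plain file. This is not what we run and it has nothing to do with Mantis's own data model; the field names (`admin`, `system`, `summary`, `ticket`) and both helper functions are assumptions chosen for illustration. The one design point is that an entry without a summary is rejected, the same way a check-in without a comment would be.

```python
import json
from datetime import datetime, timezone


def make_entry(admin, system, summary, ticket, when=None):
    """Build one change log record; refuse entries with no explanation."""
    if not summary.strip():
        raise ValueError("a change entry needs a summary of what changed and why")
    when = when or datetime.now(timezone.utc)
    return {
        "time": when.isoformat(),
        "admin": admin,
        "system": system,
        "summary": summary,
        "ticket": ticket,  # hypothetical: the bug tracker issue holding the reason
    }


def append_entry(log_path, entry):
    """Append the record as one JSON line so existing history is never rewritten."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
```

One line per change keeps the file easy to grep and diff, and the ticket reference points back to the bug tracker entry where the reasoning lives.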