I've been using Puppet to deploy infrastructure, and most of the work I do is with Web 2.0 companies that are heavily into test-driven development for their web applications. Does anyone here use a test-driven approach to developing their server configurations? What tools do you use to do this? How deep does your testing go?
I don't think you can apply test-driven development in the strict sense, but you can certainly unit-test new servers.
Basically, you would deploy the servers, start the services in a test mode, run tests against them from another server (or a series of servers), and finally put them into production.
Python scripts that connect to the databases, web pages, and SSH services and return a PASS/FAIL would be a good start for you.
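A minimal sketch of what such a script might look like (the host name, ports, and URL below are just placeholders for whatever services your new server actually runs):

    #!/usr/bin/env python
    """Minimal post-deployment smoke test: connect to a few services and
    report PASS/FAIL. Host name, ports, and URL are placeholders."""

    import socket
    import sys
    import urllib.request

    HOST = "test-web01.example.com"   # hypothetical freshly built server

    def check_tcp(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def check_http(url, timeout=5):
        """Return True if the URL answers with HTTP 200."""
        try:
            return urllib.request.urlopen(url, timeout=timeout).status == 200
        except Exception:
            return False

    checks = {
        "ssh":   lambda: check_tcp(HOST, 22),
        "mysql": lambda: check_tcp(HOST, 3306),
        "http":  lambda: check_http("http://%s/" % HOST),
    }

    failed = False
    for name, check in checks.items():
        ok = check()
        print("%-6s %s" % (name, "PASS" if ok else "FAIL"))
        failed = failed or not ok

    sys.exit(1 if failed else 0)

Run it against each newly built server before it goes live; a non-zero exit code means at least one service failed the check.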
Or you could just roll this up into a monitoring solution like Zenoss, Nagios, or Munin. Then you can test during deployment and keep monitoring during production.
I think Joseph Kern is on the right track with the monitoring tools. The typical TDD cycle is: write a new test that fails, then update the system so that all existing tests pass. This would be easy to adapt to Nagios: add the failing check, configure the server, re-run all checks. Come to think of it, I've done exactly this sometimes.
If you want to get really hard-core, you would write scripts to check every relevant aspect of the server configuration. A monitoring system like Nagios might not cover some of those aspects (e.g., you don't usually "monitor" your OS version), but there's no reason you couldn't mix and match as appropriate.
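For instance, a hard-core check script might assert the OS release, installed packages, or individual config settings. A rough sketch, where the file paths and expected values are only examples and Nagios-style exit codes are used so the same script could double as a check:

    #!/usr/bin/env python
    """Assertion-style checks for configuration details that a monitoring
    system might not normally watch (OS release, package presence, a
    config setting). Expected values and file paths are only examples."""

    import subprocess
    import sys

    def os_release_is(expected):
        # assumes a Debian-family host; adapt for your distribution
        release = open("/etc/debian_version").read().strip()
        return release.startswith(expected)

    def package_installed(name):
        # dpkg-query exits non-zero if the package is not installed
        return subprocess.call(["dpkg-query", "-W", name],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    def sshd_disallows_root():
        with open("/etc/ssh/sshd_config") as f:
            return any(line.split()[:2] == ["PermitRootLogin", "no"]
                       for line in f
                       if line.strip() and not line.startswith("#"))

    checks = [
        ("OS release 7.x",     lambda: os_release_is("7.")),
        ("ntp installed",      lambda: package_installed("ntp")),
        ("root ssh disabled",  sshd_disallows_root),
    ]

    status = 0
    for label, check in checks:
        ok = check()
        print("%-20s %s" % (label, "OK" if ok else "FAIL"))
        if not ok:
            status = 2   # Nagios-style CRITICAL exit code

    sys.exit(status)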
While I haven't been able to do TDD with Puppet manifests yet, we do have a pretty good cycle to prevent changes from going into production without testing. We have two puppetmasters set up: one is our production puppetmaster and the other is our development puppetmaster. We use Puppet's "environments" to set up the following workflow:
Our application developers do their work on virtual machines which get their Puppet configurations from the development Puppetmaster's "testing" environment. When we are developing Puppet manifests, we usually set up a VM to serve as a test client during the development process and point it at our personal development environment. Once we are happy with our manifests, we push them to the testing environment where the application developers will get the changes on their VMs - they usually complain loudly when something breaks :-)
On a representative subset of our production machines, there is a second puppetd running in noop mode and pointed at the testing environment. We use this to catch potential problems with the manifests before they get pushed to production.
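As a rough illustration, a small wrapper could scan such a noop run for pending changes before promotion; the exact agent flags and the "(noop)" marker in the output depend on your Puppet version, so treat both as assumptions to adjust for your installation:

    #!/usr/bin/env python
    """Run the Puppet agent in noop mode against the testing environment
    and flag any resources it *would* have changed. The command-line
    flags and the '(noop)' output marker vary by Puppet version."""

    import subprocess
    import sys

    cmd = ["puppet", "agent", "--test", "--noop", "--environment", "testing"]
    result = subprocess.run(cmd, capture_output=True, text=True)

    pending = [line for line in result.stdout.splitlines() if "(noop)" in line]

    if pending:
        print("Manifests in 'testing' would change this host:")
        for line in pending:
            print("  " + line)
        sys.exit(1)

    print("No pending changes - safe to promote to production.")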
Once the changes have passed, i.e. they neither break the application developers' machines nor produce undesirable output in the logs of the production machines' "noop" puppetd processes, we push the new manifests into production. We have a rollback mechanism in place so we can revert to an earlier version.
I worked in an environment that was in the process of migrating to a TDD operations model. For some things, like monitoring scripts, this worked very well. We used buildbot to set up the testing environment and run the tests. In this case you approach TDD from the perspective of "legacy code". In TDD, "legacy code" is existing code that has no tests, so the first tests don't fail; they define correct (or expected) operation.
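For instance, a first round of "legacy code" tests might simply pin down the behaviour an existing monitoring script already has; in this sketch the script name, its options, and its expected exit codes are all hypothetical:

    #!/usr/bin/env python
    """Characterization tests for an existing (untested) monitoring
    script. These don't start out failing; they capture the behaviour we
    currently rely on. Script name, options, and codes are hypothetical."""

    import subprocess
    import unittest

    SCRIPT = "./check_diskspace.sh"   # hypothetical legacy monitoring script

    class CheckDiskspaceBehaviour(unittest.TestCase):

        def run_check(self, *args):
            return subprocess.run([SCRIPT] + list(args),
                                  capture_output=True, text=True)

        def test_ok_on_roomy_threshold(self):
            # With a generous threshold the check should exit 0 (OK)
            result = self.run_check("--warn", "99")
            self.assertEqual(result.returncode, 0)

        def test_critical_on_impossible_threshold(self):
            # With an impossible threshold it should exit 2 (CRITICAL)
            result = self.run_check("--crit", "0")
            self.assertEqual(result.returncode, 2)

    if __name__ == "__main__":
        unittest.main()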
For many configuration jobs, the first step is to test whether the configuration can be parsed by the service at all. Many services provide a facility to do just this: Nagios has a preflight mode, cfagent has a "no act" mode, and Apache, sudo, BIND, and many others have similar options. This is basically a lint run for the configurations.
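A small driver that runs each service's own syntax check could look something like this; the config file paths are examples, so adjust them to your layout:

    #!/usr/bin/env python
    """Run each service's own syntax check ('lint') against freshly
    generated configuration before it is deployed anywhere. Config file
    paths are examples."""

    import subprocess
    import sys

    lint_runs = [
        ("nagios", ["nagios", "-v", "/etc/nagios/nagios.cfg"]),   # preflight check
        ("apache", ["apachectl", "configtest"]),
        ("sudo",   ["visudo", "-c", "-f", "/etc/sudoers"]),
        ("bind",   ["named-checkconf", "/etc/bind/named.conf"]),
    ]

    failures = 0
    for name, cmd in lint_runs:
        rc = subprocess.call(cmd)
        print("%-7s %s" % (name, "OK" if rc == 0 else "SYNTAX ERROR"))
        failures += (rc != 0)

    sys.exit(1 if failures else 0)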
For example, if you use Apache with the configuration split into separate files, you can test those parts individually; just use a different httpd.conf to wrap them when running on your test machine. Then you can check that the web server on the test machine gives the correct results.
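Something like the following could verify that the wrapped configuration actually serves what you expect on the test machine; the host name, paths, and expected strings are placeholders:

    #!/usr/bin/env python
    """After loading the config parts under a test-only httpd.conf
    wrapper, verify that the test web server serves what we expect. The
    host name, paths, and expected strings are placeholders."""

    import unittest
    import urllib.request

    BASE = "http://testweb.example.com"   # hypothetical test machine

    class WrappedApacheConfig(unittest.TestCase):

        def fetch(self, path):
            return urllib.request.urlopen(BASE + path, timeout=10)

        def test_front_page_is_served(self):
            response = self.fetch("/")
            self.assertEqual(response.status, 200)

        def test_vhost_serves_application(self):
            body = self.fetch("/app/status").read().decode()
            self.assertIn("OK", body)

    if __name__ == "__main__":
        unittest.main()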
Every step along the way follows the same basic pattern: write a test, make the test pass, refactor what you've done. As mentioned above, when following this path the tests may not always fail first in the accepted TDD manner.
Rik
I believe the following links could be of interest:
cucumber-nagios - a project that lets you turn your Cucumber suite into a Nagios plugin, and which comes with step definitions for SSH, DNS, Ping, AMQP, and generic "execute command" tasks:
http://auxesis.github.com/cucumber-nagios/
http://www.slideshare.net/auxesis/behaviour-driven-monitoring-with-cucumbernagios-2444224
http://agilesysadmin.net/cucumber-nagios
There is also some effort on the Puppet/Python side of things: http://www.devco.net/archives/2010/03/27/infrastructure_testing_with_mcollective_and_cucumber.php