Expert

No offense, but you couldn't break the build even if you tried

To integrate means to merge your code onto the same branch that your colleagues are working on. So if your code breaks something, you are potentially jeopardizing the workspace, and the pace, of your teammates as well.
To have a pristine integration branch means that it is buildable at all times.

Code should be verified against some kind of toll-gate criteria before it’s accepted onto the integration branch. Anything that doesn’t meet the criteria is rejected and will not enter the mainline. With such a gate in place, it is simply impossible for a developer to break the build.
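As a rough illustration of the gate idea, here is a minimal sketch in Python. The names `Check`, `toll_gate` and `integrate` are hypothetical, and the mainline is modelled as a plain list; a real setup would let your CI server or version control host do the merging.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    """One toll-gate criterion (hypothetical model, not a real CI API)."""
    name: str
    run: Callable[[], bool]  # returns True when the criterion is met


def toll_gate(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check; the change passes only if none of them fail."""
    failed = [check.name for check in checks if not check.run()]
    return (not failed, failed)


def integrate(change_id: str, checks: List[Check], mainline: List[str]) -> bool:
    """Append the change to the mainline only when the gate accepts it."""
    accepted, failed = toll_gate(checks)
    if accepted:
        mainline.append(change_id)
    else:
        print(f"{change_id} rejected, failed: {', '.join(failed)}")
    return accepted
```

The point of the design is that `integrate` is the only way onto the mainline, so a rejected change never gets the chance to break anyone else's build.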

Oooh - If only you knew what happened before the smoke came

When something goes south, it usually happens in the production environment, where you don’t have access to debug or profiling information.

Design your code so it can produce an audit trail: a complete profile of states, sequences, and data in and out. That should give you clues when you do your code-scene forensics.

At least you’ll get some clues on how to reproduce the error in your development environment.
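One possible way to get such a trail is to record every call, return value and error as a numbered, structured entry. The sketch below assumes an in-memory list (`AUDIT_LOG`) as a stand-in for a real log file or log shipper; `record`, `audited` and `apply_discount` are made-up names for illustration.

```python
import functools
import itertools
import json
import time
from typing import Any, Callable, Dict, List

_sequence = itertools.count(1)  # gives every entry its place in the sequence
AUDIT_LOG: List[str] = []       # stand-in for a file or log shipper


def record(event: str, **fields: Any) -> None:
    """Append one JSON line describing what just happened."""
    entry: Dict[str, Any] = {"seq": next(_sequence), "ts": time.time(),
                             "event": event, **fields}
    # default=repr keeps the trail alive when a value is not JSON-serializable
    AUDIT_LOG.append(json.dumps(entry, default=repr))


def audited(func: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator that records data in, data out, and errors for a function."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        record("call", function=func.__name__, args=args, kwargs=kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            record("error", function=func.__name__, error=repr(exc))
            raise
        record("return", function=func.__name__, result=result)
        return result
    return wrapper


@audited
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical business function used to demonstrate the trail."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)
```

After a failure, the last few lines of the log show which call came in, with which arguments, and what went wrong, which is often enough to replay the same input in your development environment.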

Murphy and me - Errors will eventually happen in production

A word of caution: testing in production is not to be confused with releasing untested code.

It starts with the acknowledgement that all serious problems are discovered in production and occur because unforeseen things happen.

Deliberately go to your production environment and do unforeseen things: turn off a server, kill a process, pour coffee on your keyboard, upgrade a service during high load.

If your system is built to survive it, then it should! You’re only sure it will if you (dare) test it.
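The shape of such an experiment can be sketched as: confirm the system is healthy, inject a fault, check whether it is still healthy, then restore. The code below is only a toy, in-process simulation with hypothetical `Replica` and `Service` classes; in real life the fault would be an actual stopped server or killed process.

```python
import random
from typing import Callable, List


class Replica:
    """Toy stand-in for one server or process."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Tries each replica in turn, so it survives while at least one is alive."""
    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas

    def handle(self, request: str) -> str:
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no replica available")


def is_healthy(service: Service) -> bool:
    """Steady-state check: can the service still answer a request?"""
    try:
        service.handle("ping")
        return True
    except RuntimeError:
        return False


def kill_random_replica(service: Service, rng: random.Random) -> Replica:
    """The unforeseen thing: take down one replica chosen at random."""
    victim = rng.choice(service.replicas)
    victim.alive = False
    return victim


def run_experiment(steady_state: Callable[[], bool],
                   inject_fault: Callable[[], None],
                   restore: Callable[[], None]) -> bool:
    """Return True if the system stays healthy while the fault is active."""
    if not steady_state():
        raise RuntimeError("system is unhealthy before the experiment, aborting")
    inject_fault()
    try:
        return steady_state()
    finally:
        restore()  # always clean up, even if the check itself blows up
```

Refusing to start when the system is already unhealthy, and always restoring afterwards, is what separates a deliberate experiment from simply causing an outage.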

A full ride produces a release candidate

Split up your builds and verifications into a pipeline consisting of multiple stages. Use this approach to keep your builds as fast as possible, your feedback loop as short as possible and your developers notified as quickly as possible despite having long-running builds.

In your pipeline, each step provides more confidence in your code than the previous one.
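A staged pipeline could look roughly like the sketch below: stages run in order, the first failure stops the ride and notifies right away, and only a full ride yields a release candidate. `Stage`, `run_pipeline` and the `rc-` naming are assumptions made for the example, not the API of any particular build server.

```python
from typing import Callable, List, NamedTuple, Optional


class Stage(NamedTuple):
    """One pipeline stage (hypothetical model)."""
    name: str
    run: Callable[[str], bool]  # takes a revision, returns True on success


def run_pipeline(revision: str,
                 stages: List[Stage],
                 notify: Callable[[str], None]) -> Optional[str]:
    """Run stages in order and stop at the first failure.

    Cheap, fast stages go first so that most problems are reported
    long before the slow stages would have finished.
    """
    for stage in stages:
        if not stage.run(revision):
            notify(f"{revision}: stage '{stage.name}' failed")
            return None
        notify(f"{revision}: stage '{stage.name}' passed")
    # Only a full ride through every stage produces a release candidate.
    return f"rc-{revision}"
```

For instance, with stages ordered as compile, unit tests, integration tests, a failing unit test is reported without ever waiting for the integration tests to start.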

It's 10pm. Do you know where your code is?

Your software is in production, but how is it doing? You want to have insight into the runtime health of your system.

This includes easy access to runtime statistics such as feature usage, transaction throughput and error situations, so you can ensure the service level. It also includes access to environment health metrics such as disk usage, memory usage and CPU load.

Bonus points if your system can alert you before an error occurs.
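A small sketch of the "alert before the error" idea: collect a few environment metrics and compare them to warning levels set below the point where things actually break. It uses only the Python standard library (`shutil.disk_usage`, and `os.getloadavg` where the platform provides it); memory usage is left out because the standard library has no portable call for it. The function names and threshold values are assumptions for the example.

```python
import os
import shutil
from typing import Dict, List


def collect_health(path: str = ".") -> Dict[str, float]:
    """Gather a few environment metrics for the machine we run on."""
    usage = shutil.disk_usage(path)
    metrics = {"disk_used_percent": 100.0 * usage.used / usage.total}
    # os.getloadavg is only available on Unix-like systems
    if hasattr(os, "getloadavg"):
        try:
            metrics["load_1min"] = os.getloadavg()[0]
        except OSError:
            pass  # load average could not be obtained; skip it
    return metrics


def warnings_for(metrics: Dict[str, float],
                 thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that has crossed its warning level."""
    return [
        f"{name} is {value:.1f}, above warning level {thresholds[name]:.1f}"
        for name, value in sorted(metrics.items())
        if name in thresholds and value > thresholds[name]
    ]
```

Calling `warnings_for(collect_health(), {"disk_used_percent": 80.0})` on a schedule would warn at 80% disk usage, while there is still time to act before the disk is full.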

Aim for a high "bus factor"

The bus factor measures how many people in your corporation need to be run over by a bus before you go out of business. If you have a key player who’s indispensable, then your bus factor is 1.

To raise the bus factor, you must make sure that important knowledge is shared and accessible to whoever needs it.
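To make the number concrete, here is a simplified sketch that estimates the bus factor from a map of knowledge areas to the people who master them. It assumes that losing all knowledge of any single area is fatal, so the bus factor is the size of the smallest group; the function name and the example data are invented.

```python
from typing import Dict, List, Set, Tuple


def bus_factor(knowledge: Dict[str, Set[str]]) -> Tuple[int, List[str]]:
    """Return the bus factor and the areas that determine it.

    Simplifying assumption: each area is critical, so the factor is the
    smallest number of people who share knowledge of any one area.
    """
    if not knowledge:
        raise ValueError("need at least one knowledge area")
    factor = min(len(people) for people in knowledge.values())
    weakest = sorted(area for area, people in knowledge.items()
                     if len(people) == factor)
    return factor, weakest


# Invented example data: who knows what
team_knowledge = {
    "deployment": {"ann", "bob", "cho"},
    "billing": {"ann"},
    "database": {"bob", "cho"},
}
```

With this data, `bus_factor(team_knowledge)` reports a factor of 1 with "billing" as the weak spot, which tells you where sharing knowledge pays off first.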

Don’t document your processes to the brink of boredom or maintain an internal wiki the size of Wikipedia itself. Build a learning organization that encourages people to share with colleagues, allocates time for research, designs for change and accepts automation as documentation.