What I tell my boss when I break stuff
The guys from Standard Bank gave everyone a virtual sledgehammer at the door and told them to go out and break things on their distributed systems. Bonus points if you also give your tools cool names like Hulk and Necromancer (used to automatically revive the processes that Hulk smashed).
Derick Chung and his team demonstrated how they are currently experimenting with anti-fragility testing (chaos engineering) on their distributed systems, in order to build confidence in those systems’ ability to withstand turbulent conditions in production.
Key points of chaos engineering:
- Know the state of your system (is it steady?)
- How will the system react?
- Real-world examples
- Run continuously
- Low tolerance for outages
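To make the first two points a little more concrete, here is a minimal sketch of a single experiment: check the steady state, break something, check again, then clean up. This is not the Standard Bank team's tooling. The function `run_experiment`, the toy `replicas` dictionary and the "at least two replicas up" definition of steady are all assumptions made up for illustration.

```python
from typing import Callable


def run_experiment(
    is_steady: Callable[[], bool],
    inject_fault: Callable[[], None],
    recover: Callable[[], None],
) -> str:
    """Run one chaos experiment and report 'skipped', 'passed' or 'failed'."""
    # Never experiment on a system that is already unhealthy.
    if not is_steady():
        return "skipped"
    inject_fault()
    try:
        # Hypothesis: the steady state still holds despite the fault.
        return "passed" if is_steady() else "failed"
    finally:
        # Always undo the fault, whatever the outcome.
        recover()


# Toy system (hypothetical): three replicas, steady if at least two are up.
replicas = {"a": True, "b": True, "c": True}


def at_least_two_up() -> bool:
    return sum(replicas.values()) >= 2


def kill_a() -> None:
    replicas["a"] = False


def revive_a() -> None:
    replicas["a"] = True


result = run_experiment(at_least_two_up, kill_a, revive_a)
print(result)
```

In a real setup the three callables would talk to your monitoring and orchestration systems, and the whole thing would run on a schedule rather than once.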
Once you have reached a level of maturity and trust in your experimentation and testing, it is advisable to also test in live production environments, because users there behave differently from how they do in a test environment. WARNING: always try to minimise the blast radius, i.e. anything that might be critically affected. E.g. don’t take down one node when you only have two nodes in operation! (Learnt that the hard way!)
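The two-node warning can be turned into a simple pre-flight check that runs before anything gets smashed. The sketch below is only an illustration: the function name `safe_to_kill` and the `min_healthy` threshold are assumptions, and the right threshold depends entirely on your own system.

```python
def safe_to_kill(healthy_nodes: int, nodes_to_kill: int, min_healthy: int) -> bool:
    """Return True only if enough healthy nodes remain after the kill."""
    if healthy_nodes < 0 or nodes_to_kill < 0 or min_healthy < 0:
        raise ValueError("node counts must not be negative")
    return healthy_nodes - nodes_to_kill >= min_healthy


# Assumed policy for this example: keep at least two healthy nodes at all times.
two_node_cluster = safe_to_kill(healthy_nodes=2, nodes_to_kill=1, min_healthy=2)
three_node_cluster = safe_to_kill(healthy_nodes=3, nodes_to_kill=1, min_healthy=2)
print(two_node_cluster, three_node_cluster)
```

With that policy, the two-node cluster is refused and the three-node cluster is allowed, which is exactly the lesson learnt the hard way above.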
The Standard Bank team were also nice enough to share with the delegates the tools they are currently using for chaos engineering:
- Skynet for automated VM builds
- Chef for configuration management
- Ability to capture metrics
- Bots to automatically do stuff
- An awesome boss
Was this blog post helpful? Don’t you think your buddies would like to know more about what was learnt at Africa DevOps Day 2017? Please share on Twitter, Facebook & LinkedIn.