Engineered Chaos Will Help You Test Your Mature BCP
What Is Chaos Engineering?
Modern business is complicated. Processes, staff, IT systems, suppliers, capital equipment, desk-side software, and end-user devices all interact in a system that delivers value to the business. A disruption to one part of the system can affect the others, often in unpredictable ways.
Modern technology platforms are complex in and of themselves. In the context of IT, events as different and as unpredictable as an update script run amok or severe weather incapacitating critical data center infrastructure can lead to massive business disruptions, lost customer goodwill, and lost business value. Chaos engineering addresses the challenge of developing resilient cloud-based and distributed IT systems.
“Chaos Engineering is the discipline of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent conditions in production.”
– Principles of Chaos
The concept was pioneered by Netflix to develop resilient cloud-based streaming services. How would services keep running if a particular server, node, or rack failed? In the words of their engineering team, “Just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these ‘once in a blue moon’ failures.” Rather than planning for every possible failure, the team introduced real failures into its production environment to ensure the system and the team could react to and resolve incidents as intended.
To put the idea into practice, Netflix engineers built Chaos Monkey – a tool that “randomly disables our production instances to make sure we can survive [common failures] without any customer impact.” Artificial “experiments” created real breakdowns in production during the work day so Netflix’s engineering team could react, resolve the issue, and build solutions to automatically prevent or resolve similar incidents in the future. When a similar problem happened off-hours, a battle-tested solution was already in place.
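To make the idea concrete, here is a minimal sketch of the "randomly disable an instance" experiment, run against a simulated fleet rather than real infrastructure. It is not Netflix's implementation: the fleet dictionary, the function names, and the `min_healthy` threshold are all assumptions made for illustration.

```python
import random

def disable_random_instance(fleet, rng):
    """Mark one randomly chosen healthy instance as down and return its name."""
    healthy = [name for name, is_up in fleet.items() if is_up]
    if not healthy:
        raise RuntimeError("no healthy instances left to disable")
    victim = rng.choice(healthy)
    fleet[victim] = False
    return victim

def service_is_available(fleet, min_healthy):
    """The service counts as available while enough instances remain up."""
    return sum(fleet.values()) >= min_healthy

# Hypothetical fleet of five instances, all healthy at the start.
fleet = {f"instance-{i}": True for i in range(5)}
rng = random.Random(42)  # seeded so the experiment is repeatable

victim = disable_random_instance(fleet, rng)
print(f"disabled {victim}; service available: "
      f"{service_is_available(fleet, min_healthy=3)}")
```

The point of the sketch is the shape of the experiment: inject one failure, then check an externally visible condition (here, remaining capacity) rather than inspecting the failed component itself.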
What Does Chaos Engineering Mean for BCP?
Chaos engineering was originally conceived to test and improve the resiliency of IT systems. How might we apply the ideas from chaos engineering to improve business resiliency?
Chaos engineering recommends running live, controlled tests during business hours with little or no warning, reviewing the results and making changes to existing plans. Rather than testing IT systems, you’re testing the resiliency of business processes.
Imagine running experiments from this list somewhere in your own organization:
- Unexpectedly close off the space where a department resides, requiring immediate relocation.
- Remove critical business unit leaders during peak production times.
- Direct 40 percent of staff not to come in for a given day or shift, or to only work from home.
- Only use recovery team alternates to manage a response and recovery, while the primary people sit on the bench.
- Shut off the phone system unexpectedly.
- Shut down one or more critical systems.
- Introduce a systems disruption that requires data synchronization.
- Run the department or entire building using the alternate worksite strategy for a couple of days.
- Require the business unit to use manual workaround processes for a shift.
- Require the use of an alternate supplier.
List above from BCMMETRICS
I’d wager you’re feeling at least a little queasy reading that list. You may know that no part of your organization is ready to do any of the above. But there’s no arguing that you’d get real insight into your business’s ability to withstand real disasters. To make it worth your while, follow the steps below to manage the risk of this type of testing.
Is This Actually Doable?
To make this approach practical, you’ll have to carefully manage the risk of tests. Running tests during business hours has real potential to damage your business, frustrate your staff, and harm your customers. It’s your responsibility to coordinate with business unit managers to minimize that “blast radius.”
It’s also important to clearly define what you’re doing with testing. Follow these steps when you're designing a test.
- Identify a “steady state” for the process you decide to test. Define the output of the process in that steady state. Designate a control and an experimental group, and assume that all things being equal the steady state output will continue in both groups. (For example, one part of a team might work from home, and another might stay in the office.)
- Modify variables in the experimental group in ways that reflect real world events – for example, a system failure, a supplier disruption, or an office closure.
- Try to disprove the “steady state” hypothesis by comparing the outputs of the control and experimental groups. Would customers notice degraded performance? Focus on the outputs of the system, not its internal performance.
List above adapted from Principles of Chaos
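The comparison in the last step can be sketched in a few lines. This is an illustration under stated assumptions, not a method prescribed by Principles of Chaos: the output metric (orders completed per hour), the sample numbers, and the 10 percent tolerance are all hypothetical, and a real test would set the tolerance with the business unit manager.

```python
from statistics import mean

def output_drop(control, experimental):
    """Fractional drop in mean output of the experimental group vs. control."""
    baseline = mean(control)
    if baseline == 0:
        raise ValueError("control group produced no output; no baseline to compare")
    return (baseline - mean(experimental)) / baseline

def steady_state_holds(control, experimental, tolerance=0.10):
    """True if the experimental group's output stays within the tolerance."""
    return output_drop(control, experimental) <= tolerance

# Hypothetical hourly outputs (e.g., orders completed) for each group.
control = [52, 48, 50, 50]        # team working under normal conditions
experimental = [47, 45, 46, 46]   # team working from the alternate site

if steady_state_holds(control, experimental):
    print("Steady state held: customers are unlikely to notice a difference.")
else:
    print("Steady state disproved: investigate the gap before the next test.")
```

Note that the check looks only at what the process delivers, in line with the advice to focus on outputs rather than internal performance.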
I was recently working with the IT team from a hospitality group that told me a story about one of their restaurants. In the restaurant, customer orders entered by wait staff would display on screens for the line cooks. Problem was, system downtime and user errors meant that orders often displayed incorrectly, or not at all. Compounding the problem, the team didn’t communicate well when things went wrong.
The head chef became so frustrated with the problems that one night he started unplugging the kitchen display system during the dinner rush, forcing the back-of-house team to rely on verbal communication to assemble orders. A few nights later, he did it again. And then again. It was painful – particularly at first – but it worked. The next time the system went down, the back-of-house team kept going without missing a beat. And even more than that, communication between team members improved.
Where our chef decided to jump in with both feet, you may not be in a position to do so. If the ideas above interest you, we recommend following the steps below.
- Real business continuity plans are the first step. You need a BIA to set recovery objectives and a business resumption plan to guide recovery efforts. These exercises will help you identify gaps between your recovery requirements and recovery capabilities, and will highlight areas where testing might be valuable.
- Seek approval from your business unit managers. Continuity plans, after all, are owned by those business unit managers.
- Don’t start with live testing. Do thought experiments and tabletop exercises first to mitigate the risk of live tests. You’ll likely come across obvious improvements that can help minimize the impact of testing.
- Decide how you’ll measure success. Are you measuring time to recover? Lost productivity? Customer satisfaction? Figure out how you’ll measure what you’re trying to measure, and compare the results to past exercises or to your steady-state baseline.
- Start small. Limit your ambitions for early tests. For example, limit the impact to a small group, less critical systems, and off-peak hours.
- Select high-impact, high-frequency events for simulation first. Our chef in the story above was frustrated to distraction by system outages that compromised customer experience, so he decided to improve his team’s resiliency for that problem first.
- Dedicate time to analyze the results and recommend improvements. The goal is not to survive from one test to the next; it is to improve resiliency, and that needs dedicated time.
- Iterate on and streamline tests. Add new complexities. See where you can automatically schedule and delegate running the exercise and post-mortem to each business unit.
- Communicate successes to make the case for further testing. You’ll also likely find you can’t go back to the same well over and over – consider the other groups you’ll work with to continue testing exercises.
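If time to recover is the measure you choose, the steps above lend themselves to a simple scorecard that compares each exercise against the recovery objectives set in the BIA. The sketch below assumes recovery time objectives (RTOs) expressed in minutes; the process names and figures are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class ExerciseResult:
    process: str
    rto_minutes: int        # recovery time objective taken from the BIA
    recovered_minutes: int  # recovery time observed during the exercise

def minutes_over_rto(results):
    """Return {process: minutes over RTO} for processes that missed their objective."""
    return {
        r.process: r.recovered_minutes - r.rto_minutes
        for r in results
        if r.recovered_minutes > r.rto_minutes
    }

# Hypothetical results from one round of testing.
results = [
    ExerciseResult("order entry", rto_minutes=60, recovered_minutes=45),
    ExerciseResult("payroll", rto_minutes=240, recovered_minutes=300),
]

for process, overrun in minutes_over_rto(results).items():
    print(f"{process}: missed RTO by {overrun} minutes")
```

Keeping the same record for every exercise makes it straightforward to compare results across rounds and to show improvement when you make the case for further testing.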
The principles of chaos engineering can support business continuity testing, but it’s likely an approach best suited to organizations with mature business continuity plans. Build your BCP first – then decide if live testing is feasible.