3 Ways to Test Operational Resilience
Failure is common in complex business processes. In recent times, there have been serious process failures in the charity sector (Oxfam, Save the Children) as well as various banking systems failures (RBS, TSB). Complexity is increasing, which means the opportunities for failure are far greater. Most organisations fail to build systems and processes that are operationally resilient to either internal or external risk.
The Oxfam example: Recently we have seen catastrophic failure at Oxfam, where years of poor processes resulted in ongoing serious misconduct by staff. The regulator said that the charity had poor systems in place for safeguarding and talked of “systemic weaknesses”. Looking at the bigger picture, the main cause of these ‘systemic weaknesses’ was the organisation’s culture, which was patriarchal towards those it was trying to help. A complex system cannot be fully captured in rules that describe how the whole system works (e.g. culture cannot be codified).
“Oxfam systems and processes are important but not sufficient. Unwritten ways of working, hidden and informal power, and the culture of an organisation all underlie the current challenges and solutions.” (1)
Catastrophic failures sometimes occur because shortcomings are seen as one-off events or single points of failure rather than as systemic. At Oxfam, there was evidence of failure around culture and values, but this was not addressed, and the misconduct incidents in Haiti were subsequently seen as rogue incidents rather than a failure of operational processes.
The RBS example: In 2012, a major systems failure at the Royal Bank of Scotland left millions of people unable to access their money for four days. RBS Group paid £175 million in compensation. It seems that its processes for dealing with new batch-task software going wrong were not sufficiently tested, and that inexperience (an element of a complex system) contributed to the issue.
Often, software changes to a process are not seen as part of the whole but are instead regarded as changes to an individual ‘sub-system’. However, complexity in a system requires the sub-systems to work together cohesively. Failure is often due to the interaction between the sub-systems rather than to each sub-system itself. So, in this case, the new software and the batch tasks failed as a whole rather than individually.
“While staff had successfully tested the new software, it had not done so for the patched version that it actually implemented.”
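The point about sub-systems can be illustrated with a small sketch in Python. It is not a reconstruction of the RBS incident: the two functions, the job name and the date formats are all hypothetical, chosen only to show how each sub-system can pass its own test while the join between them fails.

```python
from datetime import date, datetime


def schedule_batch(run_date: date) -> dict:
    """Sub-system A (hypothetical): emits a batch job with the date as DD/MM/YYYY."""
    return {"job": "overnight-payments", "run_date": run_date.strftime("%d/%m/%Y")}


def post_to_ledger(job: dict) -> date:
    """Sub-system B (hypothetical): expects run_date in ISO format, YYYY-MM-DD."""
    return datetime.strptime(job["run_date"], "%Y-%m-%d").date()


# Each sub-system passes its own isolated test.
a_ok = schedule_batch(date(2024, 1, 31))["run_date"] == "31/01/2024"
b_ok = post_to_ledger({"job": "x", "run_date": "2024-01-31"}) == date(2024, 1, 31)

# The end-to-end test exercises the join between the two sub-systems.
try:
    post_to_ledger(schedule_batch(date(2024, 1, 31)))
    end_to_end_ok = True
except ValueError:
    # strptime rejects the DD/MM/YYYY string produced by sub-system A.
    end_to_end_ok = False

print("A alone:", a_ok, "| B alone:", b_ok, "| end to end:", end_to_end_ok)
```

Only the last check reveals the mismatch, because it is the only one that passes real output from one sub-system into the other.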
Running End-to-End Scenarios
One way to test the whole system and the interaction between its elements is to use end-to-end scenarios. With complex systems, we can never understand all of the inner workings. This means that we want to test whether the whole system is resilient, using hypotheses and models as a starting point. The complexity of technology, processes and people is tested at the same time.
Before running an end-to-end process test, ask these questions:
- How can the process be tested to get real insight? Which scenarios work best?
- What are the start and end points of your process test? Does the test start from the right point, and does it run through to the true end point?
- How can you test the process to bring out points of failure?
For new services, operational testing is often considered at the end of the design, rather than as an integral part of it. At the end of the design, changes can be very costly and the points of leverage for change are smaller, which makes problems more difficult to resolve.
So testing is best carried out throughout the design. There is an implied contradiction here: how can we test a design without it being complete? Iterative testing during design gives better insight into interactions and yields some key lessons. As the system build gets bigger and more complex, these key lessons help the overall tuning of the system, including its design.
Here are some ways to set up tests of complex processes:
- Day in the life: Putting oneself in the place of the user and using the system to see what obstacles confront them.
- Modelling and hypotheses: Creating a model of the whole system that allows assumptions and joins in the system to be visually represented.
- ‘What If’ analysis: Considering variations of inputs to see if the system can adapt and be operationally resilient. For example:
- What if someone tried to commit fraud?
- What if demand patterns changed?
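A ‘What If’ analysis can be sketched as a set of named scenarios run against a simple model of the process. The example below is a minimal illustration in Python; the capacity limits, scenario values and failure messages are all assumptions made up for the sketch, not figures from a real process.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scenario:
    name: str
    daily_requests: int  # assumed demand level
    fraud_rate: float    # assumed share of requests that are fraudulent


# Hypothetical limits of the process under test.
DAILY_CAPACITY = 1000
MAX_FRAUD_REVIEWS = 20


def run_scenario(s: Scenario) -> List[str]:
    """Return the points of failure this scenario exposes (empty if none)."""
    failures = []
    if s.daily_requests > DAILY_CAPACITY:
        failures.append("demand exceeds processing capacity")
    fraud_cases = round(s.daily_requests * s.fraud_rate)
    if fraud_cases > MAX_FRAUD_REVIEWS:
        failures.append("fraud cases exceed review capacity")
    return failures


scenarios = [
    Scenario("baseline", 800, 0.01),
    Scenario("demand spike", 1500, 0.01),  # what if demand patterns changed?
    Scenario("fraud wave", 800, 0.05),     # what if someone tried to commit fraud?
]

results = {s.name: run_scenario(s) for s in scenarios}
for name, failures in results.items():
    print(name, "->", failures or "no failure found")
```

A model like this only tests the assumptions written into it, so it works best as a starting hypothesis that is then checked against observation of the real process.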
Throughout the test, it is vital to keep a log of all activities, gather customer testimonials, and use video and transactional recording in order to spot points of failure. Don’t assume that analysis of data alone will spot something; some elements of failure can’t be seen directly in the data.
Asking the Right Questions for Operational Resilience: Once the tests are complete, the following questions can be answered:
- How can you build greater resilience in the areas of greatest risk without committing resources in advance?
- What can you put in place to remove the risk completely?
- How will you continue to test complex processes in the future, as the external environment changes?
In conclusion, building operational resilience for complexity is a task that requires study and awareness. This is because errors come from failures between sub-systems as much as from failures within each sub-system. Building good test scenarios and using iteration as a way of learning helps build operational resilience that works for your individual complex process.
For more about managing complex change, read The Art of Transformational Change.
Ketan Varia, with editorial support from Burcu Atay and Chloe Haimes
References
(1) Source: ‘Commitment to Change, Protecting People’. Independent Commission. 2019