WTF   //   April 10, 2023

WTF are chaos monkeys?

Chaos monkey is a term that’s likely familiar to software teams, where it’s known as a tool to test the resilience of IT infrastructure. And with our reliance on technology in the workplace only increasing, ensuring that infrastructure can withstand failures is more critical than ever.

But it’s also being applied in a different, broader way across some businesses – to describe either major mindset changes or culture overhauls.

For example, Shopify, the Canadian multinational commerce company, launched its first so-called chaos monkey in early January, when it culled a substantial number of meetings from the calendars of its 11,600 employees. 

Deann Evans, Shopify’s director of EMEA partnerships and expansion, told WorkLife that following the enforced changes, time spent in meetings was down by 33% per employee in the first two months of the year compared to the same period in 2022. And yet, the change caused a period of havoc for teams as they got to grips with such a different daily structure.

For those who are unfamiliar with the term, here’s an explainer.

So WTF is a chaos monkey, exactly?

Well, firstly, it doesn’t need bananas to survive. And it doesn’t look like a monkey at all. But it does share a simian tendency to wreak havoc and make mischief. That, though, is the point.

“The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption,” wrote Netflix’s Yury Izrailevsky, then the director of cloud and systems infrastructure, and Ariel Tseitlin, former director of cloud solutions at the streaming company, in a July 2011 blog that initially unleashed the beast.

Netflix engineers invented the chaos monkey?

That’s correct. In the post, titled The Netflix Simian Army, the authors explained how they built the “tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact.”

To illustrate the philosophy, they used the analogy of getting a flat tire while driving. “Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right?”

The blog answered these important puzzlers: “One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.”

Right, it’s about testing resilience, then?

Yes. Disabling production instances on its Amazon Web Services infrastructure at random uncovered vulnerabilities that the Netflix engineers could then fix by building better solutions.

“By running chaos monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them,” the blog added. “So next time an instance fails at 3 a.m. on a Sunday, we won’t even notice.”
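The idea described above can be sketched in a few lines of Python. This is a toy simulation for illustration only, not Netflix’s actual tool: the Service class, the chaos_monkey function, the instance names and the capacity numbers are all assumptions made up for this example. It shows a monkey disabling one randomly chosen instance per round while a simple automatic recovery step replaces whatever failed.

```python
import random


class Service:
    """A toy service backed by a pool of interchangeable instances (hypothetical)."""

    def __init__(self, instance_count, min_healthy):
        # Map instance name -> True if the instance is up.
        self.instances = {f"i-{n}": True for n in range(instance_count)}
        self.min_healthy = min_healthy
        self._next_id = instance_count

    def healthy(self):
        return [name for name, up in self.instances.items() if up]

    def is_serving(self):
        # Customers are unaffected while enough instances remain up.
        return len(self.healthy()) >= self.min_healthy

    def recover(self):
        """Automatic recovery: replace every failed instance with a fresh one."""
        failed = [name for name, up in self.instances.items() if not up]
        for name in failed:
            del self.instances[name]
            self.instances[f"i-{self._next_id}"] = True
            self._next_id += 1
        return len(failed)


def chaos_monkey(service, rng):
    """Disable one randomly chosen healthy instance and return its name."""
    victim = rng.choice(service.healthy())
    service.instances[victim] = False
    return victim


def run_drill(rounds=5, seed=0):
    """Run several rounds of failure injection and count customer-facing outages."""
    rng = random.Random(seed)
    service = Service(instance_count=4, min_healthy=3)
    outages = 0
    for _ in range(rounds):
        chaos_monkey(service, rng)
        if not service.is_serving():
            outages += 1
        service.recover()
    return service, outages
```

With four instances and a requirement of three, losing any single instance never interrupts service, which is the property the drill is meant to confirm. Setting instance_count to 1 in the same sketch turns that lone instance into a single point of failure, and the monkey’s first strike takes the service down.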

Alison Watson, head of the school of leadership and management at Arden University in the U.K., said chaos engineering created a can-do culture in a crisis. “These experiments help teams build muscle memory in resolving outages – similar to how we run test fire drills,” she said. “By breaking things on purpose, we surface unknown issues that could impact a commonly used system. For businesses, this provides vital insight which can help alleviate the customer experience.”

And given the pace of digitalization, it’s presumably more vital than ever to ensure things are working correctly?

Indeed. California-based Sesh Tirumala, CIO of digital operations management firm PagerDuty, is a fan of chaos monkeys and regularly lets them loose. He argued that although wreaking havoc with usual ways of working might be tiresome for employees, it would be worth it in the longer term. With change the only constant, being agile enough to cope with anything is the goal. 

However, Tirumala stressed the importance of a “principled” methodology for unleashing chaos monkeys. “The risks to the business, to its people, customers, even cash flow, are too high for amateurish approaches,” he warned.

Should every organization let loose a chaos monkey?

No, according to Jaco Vermeulen, CTO of digital transformation consultancy BML Digital. “Chaos monkeys might work for some organizations, but not all,” he said. In particular, resilience testing is to be expected in the technology industry. “Randomized spot-checks on points of failure are the hallmarks of pure tech companies,” he added. “Continually testing the agility and composability is vital to ensure these companies can experiment and function.”

Not every industry or business requires this level of testing, though, and for those that don’t, releasing a chaos monkey would be counterproductive. “Any company with relatively normal operations, where there are potential single points of failure, will not embrace the idea,” added Vermeulen.

“It is far too disruptive.”