Amazon Web Services (AWS) is the Titanic of cloud hosting. It provides on-demand cloud computing platforms to individuals, companies, and governments on a metered, pay-as-you-go basis. The platform is designed as a backup to the backups’ backups, meant to keep hosted websites – including some of the largest in the world – and applications from failing.
Yet, like the Titanic, AWS crashed in April 2011, and for four days it took down popular websites such as Reddit, Quora, Foursquare, HootSuite, and The New York Times, among many others.
It faced another major outage in February 2017, which again brought a large number of key websites to their knees.
There was, however, one site that kept chugging along during both outages, despite being hosted on AWS on both occasions.
This was Netflix, the world’s leading streaming video website, which accounts for a dominant share of downstream Internet traffic in North America during peak evening hours – almost 35%, double YouTube’s share.
Before we understand how Netflix survived this Internet debacle, let’s understand a bit about the cloud.
The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), companies need to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, a company’s cloud architecture needs to be stronger than its weakest link. And the company must constantly test its ability to survive “once in a blue moon” failures such as the AWS outages.
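To see why redundancy lets a system be stronger than its weakest link, here is a small back-of-the-envelope sketch in Python. The function names and the 99% figure are my own illustrative assumptions, not numbers from AWS or Netflix, and the math assumes that replicas fail independently of each other, which real outages often violate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def redundant_availability(component_availability, replicas):
    """Availability of a group that stays up while at least one replica is up.

    Assumes replicas fail independently (an illustrative simplification).
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - component_availability) ** replicas


def expected_downtime_minutes_per_year(availability):
    """Average minutes of downtime per year implied by an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one server
    for replicas in (1, 2, 3):
        combined = redundant_availability(single, replicas)
        downtime = expected_downtime_minutes_per_year(combined)
        print(f"{replicas} replica(s): {combined:.6f} available, "
              f"about {downtime:.2f} minutes down per year")
```

Under these assumptions, a single 99% server is down for about 5,256 minutes a year, while two of them side by side are down together for only about 53 minutes. Chaining components in series works the other way: three 99% components that all have to be up give only about 97% overall, which is why every link needs its own backup.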
Despite the 99.99 percent availability that AWS’s agreement promises, when you are on the cloud, you must believe in Murphy’s Law, “Anything that can go wrong, will go wrong.”
So, what helped Netflix survive these outages when other large sites hosted on AWS faced blackouts?
It was seemingly Netflix’s deep faith in Murphy’s Law, and thus the creation of a simian army termed the Chaos Monkey.
Chaos Monkey is a tool developed internally at Netflix. Its name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud) to randomly “chew through cables” and thereby disrupt the system. In simpler words, Chaos Monkey is a bug deliberately introduced into Netflix’s systems that makes things go wrong with its service on a regular basis.
By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems (it is activated only during normal business hours), Netflix has learned where its system is weak and has built automatic recovery mechanisms to deal with those weaknesses.
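The idea described above can be sketched as a toy simulation in Python. This is not Netflix’s actual tool or its API; the `Cluster` class, the `chaos_monkey_round` function, and the 9-to-5 weekday window are all hypothetical names and choices I made up for illustration.

```python
import random
from datetime import datetime
from typing import Optional


class Cluster:
    """A toy service that is available while enough instances are healthy."""

    def __init__(self, instance_ids, min_healthy=1):
        self.healthy = set(instance_ids)
        self.min_healthy = min_healthy

    def terminate(self, instance_id):
        # discard() is a no-op if the instance is already gone.
        self.healthy.discard(instance_id)

    def is_available(self):
        return len(self.healthy) >= self.min_healthy


def in_business_hours(now, start_hour=9, end_hour=17):
    """True on weekdays between start_hour and end_hour (assumed window)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def chaos_monkey_round(cluster, now, rng) -> Optional[str]:
    """Terminate one random healthy instance, but only during business hours.

    Returns the id of the terminated instance, or None if nothing was done.
    """
    if not in_business_hours(now) or not cluster.healthy:
        return None
    victim = rng.choice(sorted(cluster.healthy))
    cluster.terminate(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random()
    monday_morning = datetime(2017, 6, 5, 10, 0)

    redundant = Cluster(["web-1", "web-2", "web-3"])
    killed = chaos_monkey_round(redundant, monday_morning, rng)
    print(f"killed {killed}, still available: {redundant.is_available()}")

    fragile = Cluster(["web-only"])
    killed = chaos_monkey_round(fragile, monday_morning, rng)
    print(f"killed {killed}, still available: {fragile.is_available()}")
```

The three-instance cluster shrugs off the loss of one server, while the single-instance cluster goes down at once. That is the kind of weakness you would rather discover at 10 am on a Monday, with engineers watching, than at 4 am on a Sunday.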
So, Netflix’s goal is to have the system so resilient that a failure at 4 am on a Sunday will not even be noticed.
That type of no-holds-barred testing can help unearth and resolve unknown issues before they become major outages. By having that constant idea that something’s going to break, Netflix has within their engineering department the mindset that they must make sure that no single point can take down the entire site.
Best Way to Avoid Failure
Netflix’s Chaos Monkey approach shows how the best defense against major unexpected failures is to fail often. By frequently causing failures, the company forces its services to be built in a way that is more resilient.
I see a great application of the Chaos Monkey approach in life and investing.
When the Chaos Monkey causes failures, Netflix engineers must respond well and treat such failures as opportunities to learn and improve.
They must answer these questions –
- How did this failure occur?
- What can be done to prevent it from happening again?
- How can we make our systems stronger by responding effectively to each failure?
By continually inducing failures in a blameless environment, and then methodically figuring out how to prevent the same failure from repeating, the Netflix team continually makes their systems harder to break.
Nassim Taleb’s concept of ‘antifragility’ comes to mind while I imagine the Chaos Monkey chewing wires and disrupting systems at Netflix.
Taleb writes in his book Antifragile: Things That Gain from Disorder –
Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.
In our culture so obsessed with success, failing often intentionally and embracing each failure like Netflix does is harder than it sounds.
In fact, life does not give us so many chances to fail and then come back stronger. Also, unlike Netflix, which launches its Chaos Monkey during normal business hours, we do not get to choose when failure strikes, in life or in investing.
And then, repeating the same failures over and over will only make us big losers. Over time, repeated failures will make our whole life seem as brittle as glass.
The real trick to Taleb’s antifragility or Netflix’s Chaos Monkey is to ask yourself a few questions when failure happens –
- How could I have detected this sooner?
- Now that this has happened, how do we deal with it in a way that makes us stronger?
- What can I do to prevent it from happening again?
In dealing with these questions, the key character attributes I believe we need to survive the Chaos Monkey in life and investing are – preparation, flexibility, and acceptance.
We often have ambitious expectations of our goals, habits, and resolutions – in life and investing. But often, we ignore the fact that chance occurrences will disturb our best-laid, most thought-out plans. And by trying to ignore this random factor, we become extremely vulnerable to it.
Life’s Chaos Monkey is very skilled at tripping up our best-intentioned goals, habits, and resolutions. Ask anyone. Ask yourself.
I recently challenged myself to ride my bicycle 21 km daily for 21 days. “Come what may,” I announced to my wife at the start of this challenge, “I will complete this challenge over the next 21 days.”
Then, on the 15th day, I lost my grandfather. Life’s Chaos Monkey hit me and my plan hard. This wasn’t a failure on my part. There was nothing to introspect. But you can see how the monkey chewed on my well-intentioned plan, apart from adding pain to me and my family’s personal life.
Failure = Opportunity
Antifragility happens when you accept your failure and take complete ownership of it with a blameless mindset.
Soichiro Honda, the founder of Honda Motor Company, said, “Success can only be achieved through repeated failure and introspection.”
If you use failure as an opportunity for introspection and re-learning, you don’t let yourself off the hook by blaming others. Instead, you learn about yourself.
Instead of fearing its random strikes, know that Chaos Monkey is here to help us. It helps us become antifragile. It helps us get stronger and smarter about our life so that we can survive even bigger monkeys in the future.
By the way, Netflix also employs an army of Chaos Gorillas who don’t just turn off individual servers, but occasionally wipe out an entire system, as if Godzilla had destroyed an entire portion of the country.
So, dealing with the Chaos Monkey’s constant appearance in our lives and in investing with preparation, flexibility, and acceptance helps us deal better with the Chaos Gorilla when it strikes us, and hard.
Before I end, let me share with you a few tricks I use in my investing (and life) to deal with the Chaos Monkey better. It still strikes me often, but I am prepared (I think) at most times.
- Stick with simple rules, processes, and practices (and own simple businesses; stuff that I thoroughly understand)
- Build in a margin of safety (no leverage, adequate diversification, ownership of high-quality businesses; knowing that the Chaos Monkey can strike anytime)
- Avoid predicting the future, because the future is random and thus unpredictable
- Accept that the Chaos Monkey will strike, and accept the reality when it really strikes (only then can I deal with it)
- Experiment small, like with only small amounts of money (I call it sin money)
- Avoid big risks (the Chaos Gorilla) that could wipe me out completely
- Keep my options open (like selling a business when the facts about the business change for the worse, and before the monkey becomes a gorilla)
- Focus more on avoiding things that don’t work than trying to find out what does work (Munger’s thought – “All I want to know is where I’m going to die so I’ll never go there.”)
- Respect the old (Taleb’s Lindy Effect) – learn the rules and practices that have stood the test of time, like the ones taught by Graham etc.
Jeff Bezos said in an interview in 2011 (emphasis mine) –
If everything you do needs to work on a three-year time horizon, then you’re competing against a lot of people. But if you’re willing to invest on a seven-year time horizon, you’re now competing against a fraction of those people, because very few companies are willing to do that. Just by lengthening the time horizon, you can engage in endeavors that you could never otherwise pursue.
At Amazon we like things to work in five to seven years. We’re willing to plant seeds, let them grow—and we’re very stubborn. We say we’re stubborn on vision and flexible on details.
In some cases, things are inevitable. The hard part is that you don’t know how long it might take, but you know it will happen if you’re patient enough.
The general underlying principle in dealing well with the Chaos Monkey is the same. You need to lengthen the time horizon and then play the long game, keep your options open and avoid total failure while trying lots of different things and maintaining an open mind.
Note: This post was originally published in the June 2017 issue of our premium newsletter – Value Investing Almanack (VIA).
SAIMANOHAR P says
Adversity brings out the best in you. This post bolsters the fact. Cheers for all the chaos monkeys in real life.
Abhijith says
Good article. It helped me both professionally (I got to learn how we can use Chaos Monkey in our platform, which depends heavily on cloud services) and personally (on where and how to invest, especially this line: “Focus more on avoiding things that don’t work than trying to find out what does work”, along with Munger’s thought, “All I want to know is where I’m going to die so I’ll never go there.”)
Vishal Kataria says
Stellar article, Vishal. It’s in sync with the stoic philosophy that everything, including failure, is an opportunity.
Taleb’s questions, in the consulting world, can be summed up by the term CAPA = Corrective Action, Preventive Action.
Loved how you highlighted that failing like Netflix is harder than it sounds in today’s failure-obsessed world.
Mukul says
Completely agree with this. The only right attitude to deal with failures is to see them as an opportunity to learn and improve. Then apply those learnings to scale new heights.
Thanks
Mayur Kherdikar says
Brilliant article, Vishal. Thank you for introducing me to the Chaos Monkey theory. I read about it for the first time.