April 25, 2011
Late last year, the Netflix Tech Blog wrote about five lessons they learned moving to Amazon Web Services. AWS is, of course, the preeminent provider of so-called "cloud computing", so this can essentially be read as key advice for any website considering a move to the cloud. And it's great advice, too. Here's the one bit that struck me as most essential:
We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.
If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
Which, let's face it, seems like insane advice at first glance. I'm not sure many companies even understand why this would be a good idea, much less have the guts to attempt it. Raise your hand if where you work, someone deployed a daemon or service that randomly kills servers and processes in your server farm.
Now raise your other hand if that person is still employed by your company.
Who in their right mind would willingly choose to work with a Chaos Monkey?
Sometimes you don't get a choice; the Chaos Monkey chooses you. At Stack Exchange, we struggled for months with a bizarre problem. Every few days, one of the servers in the Oregon web farm would simply stop responding to all external network requests. No reason, no rationale, and no recovery except for a slow, excruciating shutdown sequence requiring the server to bluescreen before it would reboot.
We spent months -- literally months -- chasing this problem down. We walked the list of everything we could think of to solve it, and then some:
- swapping network ports
- replacing network cables
- a different switch
- multiple versions of the network driver
- tweaking OS and driver level network settings
- simplifying our network configuration, removing TProxy in favor of a more traditional setup
- switching virtualization providers
- changing our TCP/IP host model
- getting kernel hotfixes and applying them
- involving high-level vendor support teams
- some other stuff that I've now forgotten because I blacked out from the pain
At one point in this saga our team almost came to blows because we were so frustrated. (Well, as close to "blows" as a remote team can get over Skype, but you know what I mean.) Can you blame us? Every few days, one of our servers -- no telling which one -- would randomly wink off the network. The Chaos Monkey strikes again!
Even in our time of greatest frustration, I realized that there was a positive side to all this:
- Where we had one server performing an essential function, we switched to two.
- If we didn't have a sensible fallback for something, we created one.
- We removed dependencies all over the place, paring down to the absolute minimum we required to run.
- We implemented workarounds to stay running at all times, even when services we previously considered essential were suddenly no longer available.
Every week that went by, we made our system a tiny bit more redundant, because we had to. Despite the ongoing pain, it became clear that Chaos Monkey was actually doing us a big favor by forcing us to become extremely resilient. Not tomorrow, not someday, not at some indeterminate "we'll get to it eventually" point in the future, but right now where it hurts.
Now, none of this is new news; our problem is long since solved, and the Netflix Tech Blog article I'm referring to was posted last year. I've been meaning to write about it, but I've been a little busy. Maybe the timing is prophetic; AWS had a huge multi-day outage last week, which took several major websites down, along with a constellation of smaller sites.
Notably absent from that list of affected AWS sites? Netflix.
When you work with the Chaos Monkey, you quickly learn that everything happens for a reason. Except for those things which happen completely randomly. And that's why, even though it sounds crazy, the best way to avoid failure is to fail constantly.
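If the idea still sounds abstract, here's roughly what a homegrown monkey boils down to. This is a minimal sketch, not Netflix's actual tool: the host inventory is invented, and "termination" is approximated by a forced reboot over SSH.

```python
import random
import subprocess
import time

# Hypothetical inventory of redundant, non-critical instances that are
# fair game for termination. In a real setup this would come from your
# cloud provider's API or a service registry, not a hardcoded list.
CANDIDATES = ["web-01", "web-02", "web-03", "search-01", "search-02"]

def kill_random_instance():
    victim = random.choice(CANDIDATES)
    # Stand-in for a real termination call (cloud API, IPMI, etc.).
    # A forced reboot over SSH approximates sudden, unclean death.
    subprocess.call(["ssh", victim, "sudo", "reboot", "-f"])
    print("Chaos Monkey killed", victim)

if __name__ == "__main__":
    while True:
        # Strike at a random moment within each hour. The real tool is
        # smarter about business hours and opt-outs; this sketch is not.
        time.sleep(random.randint(0, 3600))
        kill_random_instance()
```

The point isn't the script; it's that once something like this is running, every single point of failure in your architecture becomes somebody's pager problem within days, not months.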
(update: Netflix released their version of Chaos Monkey on GitHub. Try it out!)
Posted by Jeff Atwood
Sounds like a great model. I think Google had the same idea for their distributed file system, right? Assume failure.
Incidentally, the people who used distributed file systems hardly noticed the outage. If you market something as indestructible (or close to it), you'd be amazed at how infrastructure is designed around that assumption.
We have something very similar (we call it drunken monkey) that randomly shuts down OVS VLANs and does 'bad' things to our API.
As far as storage goes, and the people who plan around it .. united we fail, but sometimes in a good way.
"If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine."
Great idea for a distributed system, but when you have the whole kitten kaboodle deployed to AWS and AWS has issues, chances are all your systems will have issues too!
Very good advice Jeff, and a very interesting read. Luckily I don't need to deploy anything... yet. But the advice will come in handy some day, I'm sure.
PS: Looks like when signing with Blogger your name does not appear in the comment by default?
I feel the same way about bugs. They suck to find, but in the end you almost always come out with a better system (besides obviously/hopefully fixing the bug). Whether it be a better architecture, logging, personal understanding, whatever.
BTW Michael, the phrase is "kit and caboodle", though "kitten" is a hilarious homonym for that. :)
I'm reminded that what seems like a failure in Windows, the "reboot to fix it" mindset, is a similar advantage. If you pull the plug on more ostensibly stable OSs, you have a much higher chance of ending up with an even bigger mess on your hands.
This reminds me of the “crash-only software” concept — that is, avoid writing a “shutdown” mechanism and instead ensure the system can restart when terminated at any point. The idea being that then your recovery system isn't a rarely-invoked special case, so it is more likely to work when you need it (and it has pressure to be efficient), and also that you don't have the cost of performing the shutdown when you need to.
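In code terms, the crash-only trick is mostly about making every write recoverable, so that normal startup and crash recovery are the same path. A rough Python sketch (the file name and state shape are invented):

```python
import json
import os

STATE_FILE = "state.json"  # hypothetical checkpoint location

def save_state(state):
    # Write to a temp file, then atomically rename it over the old
    # checkpoint. A kill at any instant leaves either the old or the new
    # state intact, never a half-written file.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, STATE_FILE)

def load_state():
    # Startup and crash recovery are deliberately the same code path.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"processed": 0}
```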
Or, failure and function continually engender each other. Code is Poetry, says WordPress, and I agree.
Yes, yes, a thousand times yes. Distributed systems that rely on all the pieces being up all the time are simply at odds with reality. Every interaction with someone else can result in a success, a failure, a rejection, or your request simply getting lost. Your design doesn't have to accept that fact, but failing to design for it doesn't make it go away. I wrote about this two years ago:
When you build internet-scale distributed systems, you should always assume you are in flaky connection mode. Maybe the tubes are down today. Maybe your vendor’s server went down. Even with all the contracts and SLAs and angry phone calls in the world, you fundamentally don’t have any control over that box staying up and reachable when you need it.
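In practice that mindset looks something like the sketch below (the endpoint and the fallback list are invented): every call to a remote dependency gets a timeout and a degraded-but-working answer.

```python
import json
from urllib import request

def get_recommendations(user_id):
    """Ask a (hypothetical) recommendations service, but never let its
    failure take the page down; fall back to popular titles instead."""
    url = "http://recs.internal/api/v1/users/%s" % user_id  # made-up endpoint
    try:
        with request.urlopen(url, timeout=2) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # Down, unreachable, slow, or returning garbage: degrade gracefully.
        return popular_titles()

def popular_titles():
    # Static fallback; in practice this would be a cached "most popular" list.
    return [{"title": "A Popular Movie"}, {"title": "Another Popular Movie"}]
```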
Like Morgan Tiley in the first comment, I'm also reminded of Google's approach to distributed systems. There, the collections of systems are big enough that you will have a chaos monkey just from hardware failures, so you have to build the system to deal with that -- at which point, they famously asked why use expensive high-reliability hardware when the cheap stuff is vastly cheaper and only less-than-vastly less reliable?
Seems to work well for them. And, on the face of it, hardware failures are much less friendly than an artificial Chaos Monkey that you can simply reboot from.
It's definitely an interesting approach to include a mild one voluntarily, though!
Good advice. Regarding the server that was causing trouble: I've dealt with a server that had very similar symptoms. Weeks of troubleshooting that led nowhere resulted in me throwing my hands up and assuming the motherboard itself was just bad and swapping hardware with an available hot spare. Problem solved. Months later after trying to redeploy the original server, I realized the DRAC card was both faulty and misconfigured, causing the aforementioned nightmare. Card removed, problem solved.
@JeffAtwood Don't keep us in suspense -- what was the root cause of the dropped server problem? (The Broadcom NIC thing you've mentioned before?)
Actually AWS appears to still be partially down. metabase.cpantesters.org is still down, which has crippled the ENTIRE Perl testing infrastructure.
- No one can upload new test reports.
- CPAN authors aren't getting up-to-date reports of failure.
- Reports from people testing distributions against the latest version of Perl are not getting through. Which means that we can only hope that there aren't any new failures that aren't getting sent to the mailing list. (Perl v5.14.0 may come out on April 28th)
- CPAN authors may be leery of putting out new versions of their modules while this black-hole exists. (I know I am)
This is not the first time that there was a problem with sending reports. Last time there was a simple work-around. This time, the only work-around is to set up a relay server, and put it into offline mode until further notice.
I was gonna post this link to reddit, an AWS site, but it was down. Now it's back up. It's like a Katy Perry song or something.
Voluntarily killing services and shutting down servers is hardcore testing, but it seems like a great way to stress-test web apps, software, and computers.
In the mid 70s, Dick Morse (Mrs. Morse's son Helmut, as he was dubbed by Hugh Rundell) and I talked through the idea of having software fire drills built into systems.
The idea, at the time, was that there were points in protocols where errors could be injected to ensure that the recovery procedures worked and also that operators saw them enough (but always recoverable) to know what it was like (and avoid the Maytag repairman syndrome).
I actually designed a real-time subsystem for operating multiple terminals off of a Xerox 530 minicomputer in which there were fire-drill points.
It was a valuable design exercise but I never needed to pull a fire drill.
It happened that there were some heuristics for estimating the size of data blocks needed to satisfy a terminal request or response that would guess wrong often enough that the recovery code for that was exercised regularly; it was visible (to those who knew what was happening) and it recovered properly. Meanwhile, cases of dropped responses from the controller, a situation that could have been injected, happened often enough that we never had to do that. We did expose a problem in the hardware architecture, however. The terminal controller was on the other side of a cheapo adapter that provided no way for the minicomputer to force a reset of the controller. So if the controller (or the adapter) went unresponsive, all we knew was that all of our requests were timing out, and all we could do was slowly shut down all of the sessions as if the terminal operators had simply all walked away without logging off.
My interest in this kind of fire drill was inspired by an earlier experience in the late 60s when Sperry Univac was building a System/360 semi-clone. (It was not plug compatible, and it could use some of the same devices but not the operating system.) In the test center, when early production machines were being used to develop the operating system, including all of the device drivers, IBM disk drives were being used until we had delivery of our own. Everything was going along great until newly-manufactured competitive drives were installed. These drives were not so reliable, and the OS started crashing, because the error recovery paths in the drivers had never been exercised and they failed.
This reminds me of an article I read a while ago about a custom JVM with a high-speed garbage collector.
We didn’t take the typical approach where you try and optimize for the common fast case, but remain stuck with some things that are really hard to do, which you push into the future. Then you tune and tune to make those events rare, maybe once every ten minutes or every hour—but they are going to happen. We took the opposite approach. We figured that to have a smooth, wide operating range and high scalability we pretty much have to solve the hardest problem all the time. If we do that well, then the rest doesn’t matter. Our collector really does the only hard thing in garbage collection, but it does it all the time. It compacts the heap all the time and moves objects all the time, but it does it concurrently without stopping the application. That’s the unique trick in it, I’d say, a trick that current commercial collectors in Java SE just don’t do.
Pretty much every collector out there today will take the approach of trying to find all the efficient things to do without moving objects around, and delaying the moving of objects around—or at least the old objects around—as much as possible. If you eventually end up having to move the objects around because you’ve fragmented the heap and you have to compact memory, then you pause to do that. That’s the big, bad pause everybody sees when you see a full GC pause even on a mostly concurrent collector. They’re mostly concurrent because eventually they have to compact the heap. It’s unavoidable.
Our collector is different. The only way it ever collects is to compact the heap. It’s the only thing we ever do. As a result, we basically never have a rare event. We will compact the young generation concurrently all the time. We will compact the old generation concurrently all the time. It’s the only thing we know how to do. And we do it well.
So if you have a hard case that you can delay but never completely avoid, try spiting sanity and making it more common – ubiquitous in fact – rather than less.
I'm a programmer, but for a while all I did was web design, which I left because most web designers aren't programmers and the code they write is ugly, slow, and overall the worst of worst practices.
We did use PHP and I do know they built in redundancy in case a database couldn't be accessed and whatever else they do.
Now back as a programmer, I adhere to this quite well. I do think it has made me a more thoughtful programmer. I consider things like calling a function with the wrong types which happens more often than not. What if a module cannot be accessed... all this stuff that really shouldn't go wrong, but can, especially when the app is accessing internal server farms to retrieve information.
But...just a month ago, Netflix was down. A "rare technical issue".
I think the problem here is that entropy constantly increases. How many processes does the Chaos Monkey kill? Does it run more often over time? You have to fail more over time to compensate for the law of entropy.
Following up from @Kevin Reid: http://dslab.epfl.ch/pubs/crashonly/ is the core paper. "There is only one way to stop such software – by crashing it – and only one way to bring it up – by initiating recovery."
There is a downside to all of this, though. While reliability is a good thing, it's not free. Chaos Monkey may make a system very robust, but the time and expense that it imposes may be more than the occasional downtime it prevents. Obviously, this will vary from system to system, but for every additional 9 added to uptime there is something else that must be foregone.
That makes sense. It reminds me of graceful degradation as far as web features go, or assuming users will put something vile in contact forms... always prep your system for failure. I'll keep that in mind!
We've been designing software with this idea in mind for over 10 years. When you design with the idea of "everything fails, deal with it", your software becomes much more robust. Not only does it become more robust, but the maintenance becomes easier as well. Need to take down a server? Who cares, just do it, the system will respond properly. Database died? Who cares, just go back to sleep and deal with it at a reasonable hour.
There is a downside however. You must make sure you have proper monitoring systems in place so that you will know when something has failed unexpectedly. Otherwise you could have your system running along in a potentially degraded state and you're not aware of it.
It definitely takes all the stress out of managing a large system.
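The monitoring half can start out embarrassingly simple. Something along these lines (the endpoints are made up) is enough to notice the degradation that the system itself shrugs off:

```python
import time
from urllib import request

# Health endpoints for the services the system can quietly live without.
# These URLs are placeholders for whatever your services actually expose.
CHECKS = {
    "search":   "http://search.internal/healthz",
    "database": "http://dbproxy.internal/healthz",
}

def is_healthy(url):
    try:
        with request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def run_forever():
    while True:
        for name, url in CHECKS.items():
            if not is_healthy(url):
                print("ALERT: %s is degraded" % name)  # or page someone
        time.sleep(60)

if __name__ == "__main__":
    run_forever()
```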
"The best way to avoid failure is to fail constantly"
"The best way to avoid failure is to fail constantly"
"The best way to avoid failure is to fail constantly"
I definitely agree with Corey. What was the issue, Jeff?
The suspense is killing me!
Talk about a cliffhanger.
Please tell us what the cause of the server problems was. It gives us a headache when you only tell us half the story.
This is great advice not just for the technical field but for life in general. Practice skills until perfected. Practice in rough conditions, while handicapped, or both. When required to perform (during concerts, sporting events, survival situations, or in this case on the internet) you will be adequately prepared.
Incidentally, the Android developer kit includes a program called Monkey which generates a random stream of user events.
I like this idea of using a Monkey for user facing apps :)
But for background processes, I would say it depends. Since the incidence of calamities is rare, the cost to benefit ratio would be different for each company / application.
Go double redundancy!
I agree with Edward Chick and am also very curious.
Wow, the Chaos Monkey keeps everyone honest. It is the ultimate environment that many enterprises are striving for, although many still struggle to understand the basics of high availability and where single points of failure still fester. I do sympathise with some of the folks hit by the Amazon EC2 outage, despite the fact that in an *ideal world* they could have avoided their fate. Sometimes the best learning happens by failing, aka "failing fast", the management fashion du jour.
I'd think of the Chaos Monkey as the architecture that everyone should be building towards or aiming for. Only in the cloud could you even discuss an architecture like that.
My discussion of the Amazon outage: http://www.iheavy.com/2011/04/26/amazon-ec2-outage-failures-lessons-and-cloud-deployments/
Interesting, but let's remember it was also a chaos monkey who caused the Chernobyl disaster. They simulated a power outage to test the stability of their systems: http://en.wikipedia.org/wiki/Chernobyl_disaster
So I guess as with everything, we always have to balance out things to find a good middle ground.
(I work at Netflix, and this should not be considered in any way official.)
One point worth making about the need to enhance Chaos Monkey is that killing instances isn't enough. Some of the most interesting (in a bad way) issues we've seen involved instances that got into a weird state (e.g. instance is still up as far as an ASG is concerned but not up as far as an ELB is concerned). As you note, Chaos Monkey by itself is important, but the next step is to find ways to mess with your environment that are more complex than simply a clean death for your instances.
Of course, sometimes we get to have our extreme scenario testing done for us, for free. Like when Amazon messes up their EBS environment ...
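To make the "weird state" point concrete, the thing to hunt for is disagreement between two views of health. A toy sketch, with invented instance IDs and stand-in functions for the ASG and ELB views:

```python
# Stand-ins for the two health views; in practice these would come from
# the Auto Scaling and load balancer APIs respectively.
def asg_instances():
    return {"i-1", "i-2", "i-3"}

def elb_healthy_instances():
    return {"i-1", "i-3"}

def find_zombies():
    # Instances the ASG considers fine but the load balancer won't route
    # to: alive enough not to be replaced, dead enough to be useless.
    return asg_instances() - elb_healthy_instances()

if __name__ == "__main__":
    for instance in sorted(find_zombies()):
        print("zombie instance:", instance)
```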
Boeing started using a technique called FTA (Fault Tree Analysis) circa 1966 when designing civil aircraft. It's basically a formal method to ensure no system is left without backup. While the "Chaos Monkey" is a neat idea, I'd never design a critical system without FTA.
@uala Thank you... I remembered working with a tool similar to the Monkeylives back in my Palm days, but couldn't remember the name. Gremlins it was.
How about a similar thing on the source-code control side?
A mutation testing tool will mutate your codebase, e.g. changing a '<' operator into a '<='. It does this in order to check that one (and ideally just one) test fails as a result.
To this conventional mix, add the idea that if we find a code mutation which does not cause any tests to fail, then the mutated code is automatically committed back into our repo.
This script should be run over production code unpredictably, several times per week.
It will teach your development team to write thorough tests, with great coverage! Or else!!!
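Setting aside the evil auto-commit part, the mutate-and-check loop itself is easy to sketch. This assumes pytest and a made-up module path, and does a naive textual replace where a real tool would work on the syntax tree:

```python
import pathlib
import random
import subprocess

# One-character mutations to try; a surviving mutant means no test noticed.
MUTATIONS = [("<", "<="), (">", ">="), ("==", "!=")]

def mutate_and_test(source_path):
    src = pathlib.Path(source_path)
    original = src.read_text()
    old, new = random.choice(MUTATIONS)
    if old not in original:
        return None  # nothing to mutate with this operator
    try:
        src.write_text(original.replace(old, new, 1))
        result = subprocess.run(["pytest", "-q"], capture_output=True)
        return result.returncode != 0  # True: some test caught the mutant
    finally:
        src.write_text(original)       # always restore the original code

if __name__ == "__main__":
    print("mutant caught by tests:", mutate_and_test("app/logic.py"))
```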
Good story. As a side note, it makes me remember my old PalmOS developer days, with one of the greatest tools I've ever worked with: PalmOS Emulator with Gremlins!
"requiring the server to bluescreen before it would reboot."
I think I already found your problem and the solution. ;)
Any tips for this kind of redundancy implementation and testing with limited memory and throughput (specifically embedded systems)? I mean, if I had an infinite amount of memory and hardware, it would be easier to add much more error handling and checking, but in an embedded system you're limited by both storage and speed. Any tips for improving these kinds of systems (not just massive server-type systems)?