Amazon Simple Storage Service (S3), part of Amazon Web Services (AWS), powers a huge chunk of the internet as you know it. It's usually dependable enough that companies building on it can reasonably expect their websites to stay up. But no internet service can promise 100% uptime, and AWS is no exception: on Tuesday, S3 failed, taking plenty of popular services and sites down with it.
Now, Amazon has finally explained exactly what happened on Tuesday, when the outage took down much of the web.
In a lengthy and rather technical note to customers, Amazon explained that an employee entered a command incorrectly and set off a chain of events that ultimately sent parts of the internet into chaos. Amazon has protocols in place to recover from incidents like this, but the problem was so widespread that not even Amazon could fix it in a timely manner. That's why the disruption lasted for several hours, though technically it wasn't a full outage, since only subsets of AWS's complex architecture went down.
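To get a rough sense of how a single mistyped input can snowball, here's a minimal, purely hypothetical sketch (none of these names come from Amazon's actual tooling): a capacity-removal command that selects servers by a name prefix, where a too-short prefix matches far more hosts than the operator intended.

```python
# Purely hypothetical illustration of a capacity-removal command.
# It only shows how an overly broad input can select far more
# servers than the operator meant to take offline.

FLEET = [
    "index-billing-01", "index-billing-02",            # the small set meant to be removed
    "index-main-01", "index-main-02", "index-main-03", # the rest of the index subsystem
    "placement-main-01", "placement-main-02",          # the placement subsystem
]

def select_for_removal(fleet, prefix):
    """Return every server whose name starts with the given prefix."""
    return [host for host in fleet if host.startswith(prefix)]

# Intended command: remove only the billing-related servers.
print(select_for_removal(FLEET, "index-billing"))  # 2 hosts

# Mistyped command: a truncated prefix sweeps up the whole index subsystem.
print(select_for_removal(FLEET, "index"))          # 5 hosts, far more than intended
```

Once that much capacity disappears at once, the affected subsystems have to be restarted and revalidated, which is where the hours of recovery time come from.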
“We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes,” Amazon said. “While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.”
The good news is that Amazon was able to restore order in the universe, and it’s now looking to prevent this from ever happening again. If you want to read the entire explanation, check out Amazon’s full post here.