Recently I read read about two outages, the AWS S3 being the bigger one. The other outage, being at GiLab.com. In both cases the root cause of the problem boiled down to human error. Even with tons and tons of automation around, we need to depend on System Operators to perform certain tasks and this is where human error gets induced. Also remember, not every automation tool is fool proof. You never know which corner condition it was not designed for and that could also induce problems. For now let us concentrate on human error.
I am sure each of the system administrator has his/her own horror story to related regarding human errors. I have known too many. I will tell you a few of them here.
When I worked for my company, in the late 80s, getting the root password was not a difficult thing. Lots of people had the root password for the systems. Once a sysadmin went to a lab of another department as he wanted to copy some files from there. He had root access on the system. After copying files, he some some unnecessary files in the system and gave rm -rf *.* Unfortunately he was not in the same directory where those unwanted files existed but at a directory at a higher level. Before he ould realize his mistake the system went down. It was later said that whenever the department people saw him coming that side, they would shut down all systems till he left the place.
This was a minor one as it impacted only system. The major one I heard of was in the private cloud segment, where they were hosting database as a service. It seems that one of the DB administrators had to manually connect the database to a client system. Unfortunately he connected the DB of another client instead of the correct one. So the first client was able to see the database of another company!! All hell broke loose and the client had to be pacified by people at the very top.
If you look at the GitLab.com case, you will see another standard horror story. People take backups but never test if the backups are good. A friend of mine related a story wherein some major design drawings were being backed up regularly. One day their servers crashed and became non recoverable. So they tried to restore from the backups only to find that though backup jobs were run daily there were failures which the sysadmin had not noticed. So there were nothing in the tapes. To add to their horror the sysadmin had quit only a few weeks before. So almost 6 months of effort had to be repeated !!
The more complex the system, the more impact any such error has. Additionally the complexity, as in the case of AWS, induces its own error checking and consistency checks, so that recovering from errors will not be an easy task.
The job of System Administrator will grow more and more tense with the evolving complexity of systems. The fact is that some of the best SysAdmins are chosen for such jobs and yet there could always be an instance wherein due to tiredness, temporary lack of focus, oversight or sheer bad luck an error could be made. Unfortunate in this cloud era, if you are a service provider, the repercussions are bound to be heavy. The System Administrators must be more vigilant than ever and the organizations need to put in lots of checks and balances and ofcourse automate wherever they can.
You can read about the AWS S3 outage and what was impacted, here: https://www.theregister.co.uk/2017/03/01/aws_s3_outage/
Here is an explanation of what how the AWS Outage happened: https://techcrunch.com/2017/03/02/aws-cloudsplains-what-happend-to-s3-storage-on-monday/
Here is a writeup on the GitLab.com outage: https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/