Passing the AWS Solutions Architect Professional certification exam

 

[Image: Professional Certificate]

If someone were to ask me how they should prepare for the AWS Solutions Architect Professional exam, I would advise them not to prepare the way I did. I went into the exam quite under-prepared and had to spend considerable time on each question in the initial stages before I got the hang of the questions. As the test progressed, I was able to speed up my responses.

I had set a target of the end of March to complete this certification. My earlier Associate certification was expiring at the end of March, and instead of getting re-certified I thought I would attempt this certification. Unfortunately, I got involved in getting my online courses ready (you should see them in a couple of months’ time) and didn’t have much time to prepare. Most of my preparation happened in the last week, and I don’t think that is enough.

My friend Kalyan had sent me links to videos that needed to be watched, along with links to important white papers. Kalyan is a certified professional himself, and these were helpful, though I did not watch all the videos or read all the white papers. What I did was read the developer documentation for most of the services and then depend on my logical ability to deduce the answers. This will backfire if you do not have a good grip on the AWS services.

A few points from what I could gather from the exam:

1. Quite a few questions involve Big Data services: Kinesis, Redshift, ElastiCache and EMR. So understand these services well; you must know when to use which service.

2. I got a few questions on SWF and Data Pipeline. Again, you need to understand which is used in which situation.

3. A lot of questions on hybrid cloud, so be very thorough with Direct Connect, VPN and Route 53.

4. A lot of questions about costs, involving CloudFront, S3 and Glacier.

5. Understand when you must use RDS and when you must use DynamoDB. Quite a few questions offer both of these services as answer options.

6. Understand the difference between Layer 4 and Layer 7 in networking.

7. If you know your theory well, you can easily discard some of the options. This is the approach I used for most of the questions. To paraphrase Sherlock Holmes, “Remove all the impossible answers. Whatever remains, however improbable, must be true.”

The major problem with this exam is that you may not have used many of the services. Many of us will not have had a chance to use Direct Connect or VPN or Redshift or ElastiCache and so on, so we must rely on theory and an understanding of these services to answer the questions. It is therefore imperative that you read the documentation in detail and watch the 300- and 400-level videos to understand the theory thoroughly. A good understanding of the theory, coupled with good analytical reasoning skills, will let us cross the line.

All the best if you are trying for this certification.

Human errors and the burden on the SysOps engineer

[Image: Facepalm statue, Tuileries Garden, Paris]

Recently I read about two outages, the AWS S3 one being the bigger of the two; the other was at GitLab.com. In both cases the root cause of the problem boiled down to human error. Even with tons and tons of automation around, we need to depend on system operators to perform certain tasks, and this is where human error gets introduced. Also remember, not every automation tool is foolproof. You never know which corner case it was not designed for, and that too could cause problems. For now, let us concentrate on human error.

I am sure every system administrator has his or her own horror story to relate regarding human errors. I have known too many. I will tell you a few of them here.

When I worked for my company in the late 80s, getting the root password was not a difficult thing; lots of people had the root password for the systems. Once, a sysadmin went to a lab of another department because he wanted to copy some files from there. He had root access on the system. After copying the files, he saw some unnecessary files on the system and ran rm -rf *.*  Unfortunately, he was not in the directory where those unwanted files existed but in a directory at a higher level. Before he could realize his mistake, the system went down. It was later said that whenever the department people saw him coming their way, they would shut down all their systems until he left.

This was a minor one, as it impacted only one system. The major one I heard of was in the private cloud segment, where they were hosting database as a service. It seems that one of the DB administrators had to manually connect a database to a client system. Unfortunately, he connected the database of another client instead of the correct one, so the first client was able to see the database of another company! All hell broke loose, and the client had to be pacified by people at the very top.

If you look at the GitLab.com case, you will see another standard horror story. People take backups but never test whether the backups are good. A friend of mine related a story wherein some major design drawings were being backed up regularly. One day their servers crashed and became unrecoverable, so they tried to restore from the backups, only to find that although the backup jobs ran daily, there had been failures which the sysadmin had not noticed. There was nothing on the tapes. To add to their horror, the sysadmin had quit only a few weeks before. Almost six months of effort had to be repeated!
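The lesson is simple: a backup you have never verified is not really a backup. As a minimal sketch in Python (the path, file pattern and thresholds below are hypothetical and would need to be adapted to your own setup), even a small check run after the nightly job can catch the kind of silent failure that went unnoticed in the story above:

    import sys
    import time
    from pathlib import Path

    # Hypothetical location and thresholds -- adapt to your own backup setup.
    BACKUP_DIR = Path("/var/backups/designs")
    MAX_AGE_HOURS = 26          # a daily job, plus a little slack
    MIN_SIZE_BYTES = 1024       # anything smaller than this is suspicious

    def latest_backup(directory):
        """Return the most recently modified backup archive, or None."""
        files = sorted(directory.glob("*.tar.gz"), key=lambda p: p.stat().st_mtime)
        return files[-1] if files else None

    def main():
        backup = latest_backup(BACKUP_DIR)
        if backup is None:
            print("FAIL: no backup files found at all")
            return 1
        age_hours = (time.time() - backup.stat().st_mtime) / 3600
        if age_hours > MAX_AGE_HOURS:
            print("FAIL: newest backup %s is %.0f hours old" % (backup.name, age_hours))
            return 1
        if backup.stat().st_size < MIN_SIZE_BYTES:
            print("FAIL: backup %s is only %d bytes" % (backup.name, backup.stat().st_size))
            return 1
        print("OK: %s looks plausible -- now do a test restore of it" % backup.name)
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Even this only tells you that the latest archive looks plausible. The real test, as the GitLab episode showed, is periodically restoring from the backup and confirming the data is actually there.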

The more complex the system, the bigger the impact of any such error. In addition, the complexity brings its own error checking and consistency checks, as in the case of AWS, which means recovering from errors will not be an easy task either.

The job of the system administrator will only grow more stressful with the evolving complexity of systems. The fact is that some of the best sysadmins are chosen for such jobs, and yet there could always be an instance wherein, due to tiredness, a temporary lack of focus, oversight or sheer bad luck, an error is made. Unfortunately, in this cloud era, if you are a service provider, the repercussions are bound to be heavy. System administrators must be more vigilant than ever, and organizations need to put in lots of checks and balances and, of course, automate wherever they can.

You can read about the AWS S3 outage and what was impacted here: https://www.theregister.co.uk/2017/03/01/aws_s3_outage/

Here is an explanation of how the AWS outage happened: https://techcrunch.com/2017/03/02/aws-cloudsplains-what-happend-to-s3-storage-on-monday/

Here is a writeup on the GitLab.com outage:  https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/