Updated: Apr 21, 2022
Welcome back to this Toronto-specific data engineering project.
We left off last time by concluding that finance has the largest demand for data engineers with AWS skills, and by sketching out what our data ingestion pipeline will look like.
I began building out the data ingestion pipeline by launching an EC2 instance. I should note that if you have created an AWS account but have not yet created an Identity and Access Management (IAM) admin role, and are therefore still using root credentials, I strongly urge you to set that up before moving forward. To illustrate the difference between IAM roles and root credentials, consider the following:
The Master Key
Giving away root credentials to engineers is the same as giving a person the master key to your home. They can do whatever they want, whenever they want.
You wouldn't let all the people in your city or all over the world have access to your belongings, would you?
If you only required an engineer to perform a minor task, why give them access to all levels of the account?
And so we create IAM roles that have specific permissions to limit what a person can do when logged in. Even though this is a personal project and I intend to be the only one working on it, it is still a best practice to create an admin IAM role, and stop using root credentials as soon as you can.
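To make the master key analogy concrete, here is a minimal sketch in Python of what a narrowly scoped IAM policy document might look like next to a full-access one. The policy contents and the helper name allows_everything are my own illustrative assumptions, not something taken from this project's account.

```python
import json

# Hypothetical example: a policy for an engineer who only needs to
# look at EC2 instances (a "minor task").
READ_ONLY_EC2_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:DescribeInstances"],
            "Resource": "*",
        }
    ],
}

# For contrast: the shape of a full-access policy, i.e. the master key.
ADMIN_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}


def allows_everything(policy: dict) -> bool:
    """Return True if any Allow statement grants every action on every resource."""
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        resources = statement["Resource"]
        if isinstance(resources, str):
            resources = [resources]
        if statement["Effect"] == "Allow" and "*" in actions and "*" in resources:
            return True
    return False


if __name__ == "__main__":
    print(json.dumps(READ_ONLY_EC2_POLICY, indent=2))
    print("read-only is a master key:", allows_everything(READ_ONLY_EC2_POLICY))
    print("admin is a master key:", allows_everything(ADMIN_POLICY))
```

The first policy only lets its holder list instances, while the second lets them do anything in the account, which is roughly what handing over root credentials amounts to.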
EC2 & Session Manager
Now that you have created an admin IAM role and have stopped using root credentials (seriously, do it now), we can resume with launching the instance.
There are a few ways AWS will let you access an EC2 instance once it is launched. You can SSH into it, use EC2 Instance Connect (a browser-based SSH connection), or use Session Manager.
I have chosen Session Manager because it creates a log of who accessed the instance. Note that Session Manager requires an additional IAM role permission, which can be set up when launching the instance.
I have already created the role 'MyEC2Role'; you can create your own by clicking "Create New IAM Role" beside the IAM role drop-down menu.
1. Click "Create role".
2. Select EC2 and click "Next: Permissions".
3. Select the SSM policy to attach to the role.
You'll have the option to add tags to describe the role as well, but for a simple project in a brand-new account like this one, I have opted not to.
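The console steps above could also be scripted. The following is a minimal sketch, assuming boto3 is installed and your credentials belong to an admin identity (not root). The role name 'MyEC2Role' comes from this post; the use of the AWS managed policy AmazonSSMManagedInstanceCore is my assumption about which SSM policy to pick, and the policy shown in your console may be named differently.

```python
import json

ROLE_NAME = "MyEC2Role"  # role name used in this post
# Assumption: the AWS managed policy that grants Session Manager access.
SSM_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"


def ec2_trust_policy() -> str:
    """Trust policy that lets the EC2 service assume the role (step 2)."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    })


def create_ssm_role(role_name: str = ROLE_NAME) -> None:
    """Create the role, attach the SSM policy, and add it to an instance profile."""
    import boto3  # third-party; imported here so the rest runs without it

    iam = boto3.client("iam")
    iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=ec2_trust_policy())
    iam.attach_role_policy(RoleName=role_name, PolicyArn=SSM_POLICY_ARN)
    # The console creates an instance profile for you; the API does not.
    iam.create_instance_profile(InstanceProfileName=role_name)
    iam.add_role_to_instance_profile(InstanceProfileName=role_name, RoleName=role_name)


if __name__ == "__main__":
    print(ec2_trust_policy())
    # create_ssm_role()  # uncomment to make real changes in your account
```

Running the file as-is only prints the trust policy; the call that changes your account is left commented out on purpose.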
If you don't see your new SSM role (which I named 'MyEC2Role'), click the refresh button to the left of "Create New IAM Role". Then look in the IAM role drop-down menu and it will be available to select.
Great news: now that you've selected that IAM role, we can connect via Session Manager.
After clicking 'Connect', a new browser tab will open and you'll be able to access the instance. Note that Session Manager logs you in as ssm-user by default, so you'll need the additional step of switching to ec2-user with the following command:
sudo su - ec2-user
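If you would rather open the session from your own terminal than from the browser, the AWS CLI offers an equivalent. Here is a minimal sketch in Python that builds and runs that command, assuming the AWS CLI and its Session Manager plugin are installed locally. The instance ID below is a hypothetical placeholder, and the function names are mine.

```python
import subprocess
from typing import List


def start_session_command(instance_id: str) -> List[str]:
    """Build the AWS CLI command that opens a Session Manager session."""
    return ["aws", "ssm", "start-session", "--target", instance_id]


def start_session(instance_id: str) -> None:
    """Open an interactive session; needs the AWS CLI and Session Manager plugin."""
    subprocess.run(start_session_command(instance_id), check=True)


if __name__ == "__main__":
    # Hypothetical instance ID; replace it with your own.
    command = start_session_command("i-0123456789abcdef0")
    print(" ".join(command))
    # start_session("i-0123456789abcdef0")  # uncomment to actually connect
```

Once the session is open, the same switch to ec2-user still applies.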
Next, I'll be documenting how I install Spark and Kafka on this instance.
You can find my LinkedIn here: https://www.linkedin.com/in/steven-aranibar-8891a2103/
And you can contact me at email@example.com