Hello! I was hoping to get some general guidance on what to learn first. I recently accepted a position at a software company as a Site Reliability Engineer. I will be supporting their analytics product, which relies heavily on the following technologies:
- MapReduce
- Spark
- Hive
- Kafka
- Druid
- HDInsight

My question is: given the tools/technologies above, in what order should I learn these to best understand how they all work together? I have found good resources for most of them (with the exception of Hive and Druid) but need some guidance on where to start. Essentially, I am just looking to establish enough familiarity with these products to have a good foundation going in. I have about 10 days to study these topics before I start. Thank you!
Hey Steven, sorry for the late answer! I hadn't checked the forum in some time.
I hope this is still relevant:
MapReduce in its original form is quite old and hardly anyone uses it directly these days. The pattern of mapping and reducing, however, lives on, for instance in Spark. So look into how those two stages work together.
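To make the two stages concrete, here is the classic word count as a minimal PySpark sketch. It assumes a local Spark installation; the input path and app name are just placeholders:

```python
from pyspark import SparkContext

# Minimal sketch of the map/reduce pattern in Spark.
# Assumes a local Spark install; "input.txt" is a placeholder path.
sc = SparkContext("local[*]", "wordcount-sketch")

lines = sc.textFile("input.txt")

# Map stage: split lines into words and emit (word, 1) pairs
pairs = lines.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1))

# Reduce stage: sum the counts per word
counts = pairs.reduceByKey(lambda a, b: a + b)

for word, count in counts.collect():
    print(word, count)

sc.stop()
```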
I would start with Kafka, because it is very often the source of the data your pipelines process.
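For a first feel of Kafka, a tiny produce/consume round trip is enough. A sketch using the kafka-python client, assuming a broker on localhost:9092; the topic name "clicks" and the payload are made up:

```python
from kafka import KafkaProducer, KafkaConsumer

# Sketch only: assumes a broker on localhost:9092 and a
# hypothetical topic called "clicks".
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"user": 42, "page": "/home"}')
producer.flush()

consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=5000,      # stop iterating after 5s of silence
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```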
Then look into processing with Spark, using Kafka as the data source and maybe as the sink.
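Roughly, that looks like the following with Spark Structured Streaming. This is a sketch: the broker address and the topic names ("clicks", "clicks-upper") are assumptions, and you would need the spark-sql-kafka connector package on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Sketch: requires the spark-sql-kafka connector; broker address
# and topic names are made up.
spark = SparkSession.builder.appName("kafka-spark-sketch").getOrCreate()

# Kafka as source: each row carries the message key/value as binary columns
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "clicks")
          .load())

# Trivial transformation: upper-case the message payload
out = events.select(upper(col("value").cast("string")).alias("value"))

# Kafka as sink: write the transformed stream to another topic
query = (out.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "clicks-upper")
         .option("checkpointLocation", "/tmp/ckpt")  # required for the Kafka sink
         .start())

query.awaitTermination()
```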
Since you mention HDInsight, I figure you will be working on a Hadoop-style system. Learn how HDFS works and how you can set up a Hive data warehouse schema on top of the data in HDFS; that way you can query it over an ODBC/JDBC connection.
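To give you an idea of what that schema layer looks like: a Hive external table is just metadata over files that already sit in HDFS. A sketch, run here through Spark's SQL interface; the table name, columns and HDFS path are all invented:

```python
from pyspark.sql import SparkSession

# Sketch: needs a Spark build with Hive support and access to the
# Hive metastore; table, columns and HDFS path are invented.
spark = (SparkSession.builder
         .appName("hive-schema-sketch")
         .enableHiveSupport()
         .getOrCreate())

# External table: Hive only stores the schema; the CSV files in
# HDFS stay where they are and remain the source of truth.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clicks (
        user_id INT,
        page    STRING,
        ts      TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///data/clicks'
""")
```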
I'm guessing that in your case Hive is set up with Spark as the execution engine, so any query against Hive runs as a Spark job under the hood.
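Either way, from the client side a Hive query is just SQL. From PySpark it looks like this (reusing the hypothetical "clicks" table from the sketch above; in practice a BI tool would hit the same warehouse through the ODBC/JDBC driver instead):

```python
from pyspark.sql import SparkSession

# Sketch: reuses the hypothetical "clicks" table from above.
spark = (SparkSession.builder
         .enableHiveSupport()  # talk to the Hive metastore
         .getOrCreate())

top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM clicks
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```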