So I was curious and wanted to know what the tools of choice are these days for on-prem and cloud-based platforms.
Based on my research, I found out the following:
For cloud, it might be platform dependent.
For on-prem, Hive, Spark, NiFi, Airflow, and Kafka seem to be good.
I would like to know which tools are in demand in your opinion. It would also be nice to have some comments on how Docker and Kubernetes connect with these data tools. I heard the new version of Spark supports both, along with YARN.
Hi Usman,
the question of the best tools is always a difficult one, because it depends on your use case. In general I found that, as you write, tools like Spark, Kafka and Airflow are in very high demand.
They are very versatile and can therefore be used for a wide variety of tasks. Especially the combination of Spark and Kafka is great for building streaming pipelines.
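To make that concrete, here is a rough PySpark Structured Streaming sketch that reads a Kafka topic and writes it out as Parquet. The broker address, topic name and paths are just placeholders, and you need the spark-sql-kafka connector package available when you submit the job.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Minimal sketch: read a Kafka topic with Spark Structured Streaming.
# Broker, topic and paths below are placeholders for your setup.
spark = (
    SparkSession.builder
    .appName("kafka-streaming-sketch")
    .getOrCreate()
)

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .load()
    # Kafka delivers key/value as binary, so cast to string before processing
    .select(col("key").cast("string"), col("value").cast("string"))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/events")            # placeholder output path
    .option("checkpointLocation", "/tmp/chk")  # required for streaming sinks
    .start()
)
query.awaitTermination()
```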
Spark in general is very useful. The thing with these parallel processing frameworks is always that they need a source that can handle, and is optimized for, that kind of load.
Hadoop with HDFS is in general a great combination with Spark: HDFS can be the storage layer and Spark does the processing. It's really awesome.
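A small sketch of that storage/processing split, with placeholder HDFS paths and a made-up orders dataset:

```python
from pyspark.sql import SparkSession

# Sketch of "HDFS as storage, Spark as processing".
# The namenode host and paths are placeholders for your cluster.
spark = SparkSession.builder.appName("hdfs-processing-sketch").getOrCreate()

# Read raw data straight from HDFS
raw = spark.read.csv("hdfs://namenode:8020/data/raw/orders.csv", header=True)

# Example aggregation done by Spark
daily = raw.groupBy("order_date").count()

# Write the result back to HDFS as Parquet
daily.write.mode("overwrite").parquet("hdfs://namenode:8020/data/curated/orders_daily")
```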
For Hadoop: once you already have everything in place, it's super simple to set up a Hive data warehouse.
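If the Hive metastore is already running on the cluster, Spark can write straight into that warehouse as managed tables. This is just a sketch with example database and table names:

```python
from pyspark.sql import SparkSession

# Sketch: write a managed table into an existing Hive warehouse.
# Database, table and path names are just examples.
spark = (
    SparkSession.builder
    .appName("hive-warehouse-sketch")
    .enableHiveSupport()   # use the cluster's Hive metastore
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS sales")

df = spark.read.parquet("hdfs://namenode:8020/data/curated/orders_daily")
df.write.mode("overwrite").saveAsTable("sales.orders_daily")

# The table is now queryable from Hive or Spark SQL
spark.sql("SELECT * FROM sales.orders_daily LIMIT 10").show()
```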
Airflow is also one of those scheduling tools that Data Scientists love. It's great for scheduling jobs and also useful for Engineers.
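As a minimal example, an Airflow DAG that runs a Spark job once a day could look roughly like this (assuming Airflow 2.x imports; the spark-submit command and script path are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal DAG sketch: one daily task that submits a Spark job.
with DAG(
    dag_id="daily_spark_job",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark = BashOperator(
        task_id="run_spark_job",
        # Placeholder command and script path
        bash_command="spark-submit /opt/jobs/orders_daily.py",
    )
```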
I like tools like NiFi or StreamSets (a commercial tool) a lot for building pipelines. They make it very easy to manage and monitor the data flows on your platform.
Thank you @Andreas Kretz for the reply. What about concepts and techniques? What do you think are the basic essentials every data engineer should know? For example: star and snowflake schemas, DWH indexes, design patterns in general.
Languages like Python, Java, Scala.
Which concepts and languages do you think are essential?