Hi! I've had a tough time coming up with a good side project to practice data engineering concepts. I'm thinking about something that would require building a data pipeline and using technologies like Hive, Spark, and Redshift. I currently work at a company that uses these technologies, and the use is justified since we handle a lot of data. I've been involved in the design and development of these pipelines, hence my interest in data engineering.

Other branches of software development have, I think, more straightforward options when it comes to side projects. For example, if you want to go into mobile development, a small, simple app makes a good side project. Similar options exist for game dev, backend dev, frontend dev, ad ops, and more. The closest thing I've found is more closely related to machine learning and involves consuming large one-off data sets from places like registry.opendata.aws or the AWS Marketplace.

Is my approach wrong? Should I first come up with an interesting challenge or question and then fit data engineering into the solution?
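To make the "pipeline over a public dataset" idea concrete, here is a toy extract-transform-load sketch using only the Python standard library. The CSV data and its city/rides schema are invented purely for illustration; a real side project would swap the in-memory string for a download from something like registry.opendata.aws and swap the dict aggregate for a warehouse write.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data standing in for a real public dataset.
RAW = """city,rides
NYC,10
NYC,5
SF,7
"""

def extract(text):
    """Extract: parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast types and silently drop malformed rows."""
    out = []
    for r in rows:
        try:
            out.append((r["city"], int(r["rides"])))
        except (KeyError, ValueError):
            continue  # skip rows missing fields or with bad numbers
    return out

def load(pairs):
    """Load: aggregate rides per city (stand-in for writing to a warehouse)."""
    totals = defaultdict(int)
    for city, n in pairs:
        totals[city] += n
    return dict(totals)

totals = load(transform(extract(RAW)))
print(totals)  # {'NYC': 15, 'SF': 7}
```

Even at this scale, the three stages map directly onto what Spark or Hive jobs do in production, so the project can grow by replacing each stage with the real tool.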
So I was curious and wanted to know what the tools of choice are these days for on-prem and cloud-based platforms. Based on my research, I found the following: for cloud, it might be platform dependent; for on-prem, Hive, Spark, NiFi, Airflow, and Kafka seem to be good. I would like to know which tools are in demand, in your opinion. It would also be nice to get some comments on how Docker and Kubernetes connect with these data tools; I heard the new version of Spark supports both, along with YARN.
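On the Spark point: since Spark 2.3, spark-submit can target a Kubernetes API server directly as the cluster manager, alongside YARN and standalone mode, with Docker supplying the driver/executor images. A hedged sketch of what that looks like (the API server host, image name, and jar path are placeholders, not a working config):

```shell
# Submit a Spark app with Kubernetes as the cluster manager.
# <k8s-apiserver-host>, the container image, and the jar path are
# placeholders; a real cluster also needs a service account with
# the right RBAC permissions for Spark to create executor pods.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=my-registry/spark:3.5.0 \
  local:///opt/spark/examples/jars/spark-examples.jar
```

The division of labor is roughly: Docker packages the Spark runtime into an image, and Kubernetes schedules the driver and executors as pods where YARN would otherwise allocate containers.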