Screenshot of the COVID-19 dashboard from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins
Too often, it is tempting to pick a random dataset to work with. In my opinion, choosing a great dataset is essential to any data engineering project.
What makes a dataset valuable?
A valuable dataset should let you produce high-quality output. According to my mentor Andreas Kretz, these are the factors that make a dataset good:
CSV files, because they are easy to understand and work with
Lots of rows instead of columns
Time Series data
Text columns (useful for search indexing or further processing)
Columns with categories
Numeric values that you can calculate stuff with
Unique column names
Enough rows so you can simulate streaming over a few minutes/hours
Most important of all, it should be the dataset that interests you the most.
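The criteria above can be sanity-checked programmatically before committing to a dataset. Here is a minimal sketch using only the standard library; the file contents and column names are made up for illustration, and the row-count threshold is arbitrary:

```python
# Sanity-check a CSV against the criteria above (stdlib only).
# The sample data and column names are hypothetical.
import csv
import io

# Stand-in for a real CSV file opened from disk.
sample = io.StringIO(
    "date,country,category,confirmed\n"
    "2020-03-01,US,travel,12\n"
    "2020-03-02,US,community,30\n"
    "2020-03-01,IT,community,45\n"
)

reader = csv.DictReader(sample)
rows = list(reader)
header = reader.fieldnames

# Unique column names
assert len(header) == len(set(header))

# Enough rows to work with (threshold is arbitrary here)
assert len(rows) >= 3

# Numeric values you can calculate with
total = sum(int(r["confirmed"]) for r in rows)
print(total)  # 87
```

For a real project you would point the reader at the actual file and raise the row threshold to whatever lets you simulate streaming for a few minutes or hours.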
COVID-19 is an urgent matter that every country is going through. I am curious to work with this dataset and to understand the trend of confirmed cases over time.
I picked my dataset from Kaggle:
This dataset is excellent because:
It is real data that is currently updated daily
The data comes as CSV files
Its column names are unique
It contains text columns, numeric values, and time series values
It offers great options for visualization
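As a first pass at the trend of confirmed cases over time, you can aggregate totals per day. This is a minimal stdlib sketch; the column names (`Date`, `Country/Region`, `Confirmed`) are my assumption about the Kaggle file's layout, so verify them against the actual download:

```python
# Sketch: total confirmed cases per day from a COVID-19 style CSV.
# The sample rows and column names are assumptions, not the real file.
import csv
import io
from collections import defaultdict

sample = io.StringIO(
    "Date,Country/Region,Confirmed\n"
    "2020-03-01,Italy,1694\n"
    "2020-03-01,US,75\n"
    "2020-03-02,Italy,2036\n"
    "2020-03-02,US,100\n"
)

# Sum confirmed cases across countries for each date.
daily_total = defaultdict(int)
for row in csv.DictReader(sample):
    daily_total[row["Date"]] += int(row["Confirmed"])

for date in sorted(daily_total):
    print(date, daily_total[date])
```

The resulting date-to-total mapping is exactly what you would feed into a plotting library to visualize the trend.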
Thanks for reading! If you have any comments or thoughts, I would love to hear them.