In this post, I take you through my recent experience of finding a dataset for my project.
To work with data, I first need to narrow down the industry: health care, finance, insurance, or something else. I listed a few sources in my earlier blog post, which gives a sneak peek at techniques for extracting industries.
For instance, most job listings open their job description with a line like,
One of the top insurance client looking for Data Engineer
which exposes the industry. Grouping multiple job listings by industry gives you the statistics.
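The grouping idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming a simple keyword match is good enough: the keyword list, the function names `detect_industry` and `count_by_industry`, and every sample listing except the first are made up for this example.

```python
import re
from collections import Counter

# Hypothetical keyword-to-industry mapping; extend it for your own search.
INDUSTRY_KEYWORDS = {
    "insurance": "Insurance",
    "bank": "Banking",
    "banking": "Banking",
    "finance": "Finance",
    "financial": "Finance",
    "health care": "Health care",
    "healthcare": "Health care",
}


def detect_industry(description):
    """Return the industry of the first keyword found as a whole word, else 'Other'."""
    text = description.lower()
    for keyword, industry in INDUSTRY_KEYWORDS.items():
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            return industry
    return "Other"


def count_by_industry(descriptions):
    """Group job descriptions by detected industry and count each group."""
    return Counter(detect_industry(d) for d in descriptions)


# Made-up sample listings (only the first one comes from the post).
sample_listings = [
    "One of the top insurance client looking for Data Engineer",
    "Leading bank is hiring a Data Engineer",
    "Healthcare startup needs a Data Engineer",
    "Insurance firm seeks an ETL developer",
    "Retail company looking for a Data Analyst",
]

industry_counts = count_by_industry(sample_listings)
print(industry_counts.most_common())
```

A keyword match like this will miss listings that never name the industry, so treat the counts as a rough guide rather than exact statistics.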
Define a simple layout for your dataset, with elements like size, column types, and format.
For example, I am looking for a CSV dataset with 12 columns and 60k rows, containing text, numeric, and date-time columns.
I described my credit card complaints dataset in the picture.
Link to the dataset: https://data.world/dataquest/bank-and-credit-card-complaints
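Once you have a candidate file, you can check it against the layout you defined. Below is a rough sketch using only the Python standard library. The helper names `infer_type` and `describe_layout`, the sample column names, and the assumption that dates look like yyyy/mm/dd are all mine, not something taken from the dataset above.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Guess a column type ('numeric', 'datetime', 'text', or 'empty') from its values."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "empty"

    def all_parse(parser):
        # True only if every non-empty value parses without a ValueError.
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numeric"
    # Assumption: dates are written as yyyy/mm/dd.
    if all_parse(lambda v: datetime.strptime(v, "%Y/%m/%d")):
        return "datetime"
    return "text"


def describe_layout(csv_file):
    """Return the column count, row count, and a guessed type per column."""
    reader = csv.reader(csv_file)
    header = next(reader)
    rows = list(reader)
    # Transpose rows into columns; assumes every row has as many fields as the header.
    columns = list(zip(*rows)) if rows else [() for _ in header]
    return {
        "n_columns": len(header),
        "n_rows": len(rows),
        "types": {name: infer_type(col) for name, col in zip(header, columns)},
    }


# Tiny made-up sample standing in for a real CSV file.
sample = io.StringIO(
    "complaint_id,product,date_received\n"
    "1,Credit card,2016/03/01\n"
    "2,Bank account,2016/03/02\n"
)
layout = describe_layout(sample)
print(layout)
```

For a real file, pass an open file object instead of the sample, for example `open("complaints.csv", newline="")`, where the file name is a placeholder.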
Ways to find
Industry + Criteria + Google = Dataset
I picked banking and insurance as my industries and used Google to search for open datasets.
Look for datasets on popular data science competition platforms like Kaggle, Analytics Vidhya, KDnuggets, and DrivenData before jumping to a raw search on Google.
Once you have found some, try to visualize the data by framing a set of questions that you want to answer with the dataset.
Data Set makeup
Like any other dataset, this one also required some pre-processing before I could make good use of the data:
In my dataset, I didn't have data for the sub-issue column. I figured out the issue types for credit card complaints and populated 87718 rows by randomly choosing from that set of sub-issues (I did it just to have some data rather than nothing).
Dates were formatted to yyyy/mm/dd.
Some missing values were imputed using Python.
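The three steps above could look roughly like this in Python. This is a sketch under stated assumptions and not the exact code I ran: the column names, the list of sub-issues, the mm/dd/yyyy source date format, and the choice of the median for imputation are all hypothetical.

```python
import random
from datetime import datetime
from statistics import median

# Hypothetical set of sub-issues to draw from.
SUB_ISSUES = ["Billing dispute", "Late fee", "Interest rate"]


def fill_sub_issue(rows, choices, seed=0):
    """Fill empty sub_issue values with a random pick from choices."""
    rng = random.Random(seed)  # fixed seed so the fill is repeatable
    for row in rows:
        if not row.get("sub_issue"):
            row["sub_issue"] = rng.choice(choices)


def reformat_date(rows, column, source_format):
    """Rewrite non-empty dates in the given column as yyyy/mm/dd."""
    for row in rows:
        if row.get(column):
            parsed = datetime.strptime(row[column], source_format)
            row[column] = parsed.strftime("%Y/%m/%d")


def impute_median(rows, column):
    """Convert a numeric column to float and fill gaps with the median (an assumed strategy)."""
    known = [float(row[column]) for row in rows if row.get(column) not in (None, "")]
    if not known:
        return  # nothing to base the imputation on
    fill = median(known)
    for row in rows:
        value = row.get(column)
        row[column] = fill if value in (None, "") else float(value)


# Made-up rows standing in for the complaints data.
rows = [
    {"sub_issue": "", "date_received": "03/01/2016", "amount": "120.5"},
    {"sub_issue": "", "date_received": "03/02/2016", "amount": ""},
    {"sub_issue": "Late fee", "date_received": "03/03/2016", "amount": "80"},
]

fill_sub_issue(rows, SUB_ISSUES)
reformat_date(rows, "date_received", "%m/%d/%Y")
impute_median(rows, "amount")
print(rows)
```

Randomly filled values are fine for practicing pipelines, but they carry no real signal, so avoid drawing conclusions from that column.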
Make an effort to search, so that you can explore
Look into the awesome link below,
The main aim of this post is to help you finalize a dataset for learning Data Engineering.
Please let me know in the comments if I am missing anything.