
Simplistic Ways to Find Interesting Data Sets

In this post, I walk you through my recent experience of finding a dataset for my project.


Industry Search


To work with data, I first need to narrow down the industry, such as health care, finance, or insurance. I listed a few sources in my earlier blog post, which gives a sneak peek at techniques for extracting industries.

For instance, most job listings open their job description with a line like this:

One of the top insurance client looking for Data Engineer

This line exposes the industry. Grouping multiple job listings by industry gives you a count per industry.
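The grouping step can be sketched in a few lines of Python. This is only a minimal illustration: the keyword list, the helper names (`detect_industry`, `count_by_industry`) and the sample listings are assumptions made for the example, not something taken from a real job board.

```python
from collections import Counter

# Hypothetical keyword-to-industry mapping; extend it for your own search.
INDUSTRY_KEYWORDS = {
    "insurance": "Insurance",
    "bank": "Banking",
    "financ": "Finance",
    "health": "Health care",
}


def detect_industry(description: str) -> str:
    """Return the label of the first keyword found in the text, else 'Unknown'."""
    text = description.lower()
    for keyword, label in INDUSTRY_KEYWORDS.items():
        if keyword in text:
            return label
    return "Unknown"


def count_by_industry(descriptions):
    """Count how many descriptions fall into each industry."""
    return Counter(detect_industry(d) for d in descriptions)


# Made-up sample listings, for illustration only.
sample_listings = [
    "One of the top insurance client looking for Data Engineer",
    "Leading bank seeks a Data Engineer for its risk team",
    "Health care startup hiring a Data Engineer",
    "Insurance carrier needs an ETL developer",
]

industry_counts = count_by_industry(sample_listings)
for industry, count in industry_counts.most_common():
    print(f"{industry}: {count}")
```

Simple substring matching is crude, but it is usually enough to see which industries dominate a batch of listings.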


Criteria


Define a simple layout for your dataset, with elements such as size, column types, and file format.

For example, I am looking for a CSV dataset with 12 columns and 60k rows, containing text, numeric, and date-time columns.
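Criteria like these can be checked automatically once a candidate file is downloaded. The sketch below uses only the Python standard library; the helper names (`infer_type`, `profile_csv`, `meets_criteria`), the assumed date format, and the tiny inline sample with its column names are all hypothetical and would need adjusting for a real file.

```python
import csv
import io
from datetime import datetime


def infer_type(values, date_format="%Y-%m-%d"):
    """Classify a column as 'numeric', 'datetime', 'text' or 'empty'."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "empty"

    def all_parse(parser):
        # True only if every non-empty value parses without error.
        for value in non_empty:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(float):
        return "numeric"
    if all_parse(lambda v: datetime.strptime(v, date_format)):
        return "datetime"
    return "text"


def profile_csv(file_obj):
    """Return row count, column count and an inferred type per column."""
    reader = csv.DictReader(file_obj)
    rows = list(reader)
    columns = reader.fieldnames or []
    types = {c: infer_type([row[c] or "" for row in rows]) for c in columns}
    return {"rows": len(rows), "columns": len(columns), "types": types}


def meets_criteria(profile, min_rows, n_columns, required_types):
    """Check the profile against the size and column-type criteria."""
    return (
        profile["rows"] >= min_rows
        and profile["columns"] == n_columns
        and set(required_types) <= set(profile["types"].values())
    )


# Tiny made-up sample; a real check would open the downloaded CSV instead.
sample = io.StringIO(
    "complaint_id,product,date_received\n"
    "1,Credit card,2016-03-01\n"
    "2,Bank account,2016-03-02\n"
)
profile = profile_csv(sample)
print(profile)
print(meets_criteria(profile, 60_000, 12, {"text", "numeric", "datetime"}))
```

For a real file, replace the `io.StringIO` sample with `open(path, newline="")` and pass the date format the file actually uses.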

I described my credit card complaints dataset in the picture.


Link to the dataset: https://data.world/dataquest/bank-and-credit-card-complaints


Ways to find


Industry + Criteria + Google = Dataset

I picked banking and insurance as my industry type and used Google to search for open datasets.

Before jumping to a raw Google search, look for datasets on popular data science platforms such as Kaggle, Analytics Vidhya, KDnuggets, and DrivenData.


Once you have found some candidates, try to visualize the data by framing a set of questions that you want the dataset to answer.



Data Set makeup


Like any other dataset, this one needed some pre-processing before I could make good use of it:

  • The sub-issue column in my dataset had no data. I worked out the issue types for credit card complaints and populated 87718 rows by randomly choosing from that set of sub-issues (I did this just to have some data rather than nothing).

  • Reformatted the dates to yyyy/mm/dd.

  • Imputed some missing values using Python.
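The three steps above could look roughly like the following standard-library sketch. It is not the exact code I ran: the column names, the sub-issue values, the source date format (`%m/%d/%Y`), and the choice of imputing with the most common value are all assumptions made for illustration.

```python
import random
from collections import Counter
from datetime import datetime

# Hypothetical sub-issue values; substitute the ones that fit your data.
SUB_ISSUES = ["Billing dispute", "Late fee", "Interest rate"]


def fill_sub_issue(rows, choices, seed=0):
    """Fill empty 'sub_issue' cells with a randomly chosen value."""
    rng = random.Random(seed)  # fixed seed keeps the fill reproducible
    for row in rows:
        if not row.get("sub_issue"):
            row["sub_issue"] = rng.choice(choices)


def reformat_date(rows, column, source_format="%m/%d/%Y"):
    """Rewrite non-empty dates in `column` as yyyy/mm/dd."""
    for row in rows:
        if row.get(column):
            parsed = datetime.strptime(row[column], source_format)
            row[column] = parsed.strftime("%Y/%m/%d")


def impute_most_common(rows, column):
    """Replace empty cells in `column` with the column's most common value."""
    present = [row[column] for row in rows if row.get(column)]
    if not present:
        return  # nothing to learn a fill value from
    fill_value = Counter(present).most_common(1)[0][0]
    for row in rows:
        if not row.get(column):
            row[column] = fill_value


# Made-up rows standing in for the complaints data.
rows = [
    {"sub_issue": "", "date_received": "03/01/2016", "state": "CA"},
    {"sub_issue": "", "date_received": "12/15/2015", "state": ""},
    {"sub_issue": "", "date_received": "01/07/2016", "state": "CA"},
]

fill_sub_issue(rows, SUB_ISSUES)
reformat_date(rows, "date_received")
impute_most_common(rows, "state")

for row in rows:
    print(row)
```

Random filling produces values with no real meaning, so it is fine for practicing pipelines but not for drawing conclusions from the data.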


Handy repositories


Make the effort to search on your own, so that you get to explore,

Or

look into the awesome list linked below:

https://github.com/andkret/Cookbook/blob/master/sections/07-DataSources.md


Conclusion


The main aim of this post is to help you settle on a dataset for learning data engineering.

Please let me know in the comments if I have missed anything.


© 2020 Team Data Science - Andreas Kretz
