For building any data engineering pipeline, the first and foremost step is of course to get the data. Data engineering pipelines are like the stages of a game: you have to complete a particular task to proceed to the next level. We all know, data is the soul of data engineering! To sum it up: no data… no data engineering…
Great, you now know that you need the data. Now what?
- Which data do I need to get?
- Where do I get the data from?
- How do I get the data?
- What format of data do I need to get?
If these questions are coming to your mind, kudos to you! Your brain is on the right track!
All of the above questions lead to something known as ‘Data Ingestion’. The name itself suggests that it is the process of bringing data in and putting it somewhere, just like the food we eat ultimately gets stored in our stomach (on a side note, I am feeling the urge to ingest some food right now!).
Let’s try to answer each question in the simplest manner to get started with a data ingestion pipeline.
- Which data do I need to get?
An extremely important question which you need to ask yourself frequently. It will train your mind to find and query the critical data that can drive valuable business insights and forecasts, as well as improve your analytical skills. Data Engineers work closely with Data Analysts, Data Scientists, DW/BI Developers, AI/ML Engineers etc., where they literally build the infrastructure aka ‘Pipelines’ for real-time, batch and streaming data to be loaded, transformed, moved, stored, processed & visualized. In short: get the data ready for the extended team. For relational data, a good understanding of database modelling or ER diagrams can help in finding the right data set.
- Where do I get the data from?
Well, you can literally get data from anywhere. Yes, anywhere! Be it databases, files, logs, feeds, social media, sensors etc. If a device or machine is capable of generating data, you can get it, or, to sound great, you can ingest it. If you work on a company project, your organization will definitely have multiple databases, files, logs or other storage where you can get the data from. If you are working on a personal project, you can pull data from any publicly available data set. Just Google it! I like to grab mine from Kaggle.
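To make the public-data-set route concrete, here is a minimal Python sketch that downloads a CSV file from a URL and counts its rows, using only the standard library. The URL in the usage comment is hypothetical; in practice you would substitute a real data set link (Kaggle also offers its own CLI and API for authenticated downloads).

```python
import csv
import urllib.request
from pathlib import Path

def download_dataset(url: str, dest: str) -> int:
    """Download a CSV data set from `url` to `dest`; return its row count."""
    with urllib.request.urlopen(url) as resp:
        Path(dest).write_bytes(resp.read())
    # Count rows so we can sanity-check the pull.
    with open(dest, newline="") as f:
        return sum(1 for _ in csv.reader(f))

# Usage (hypothetical URL, for illustration only):
# rows = download_dataset("https://example.com/titanic.csv", "titanic.csv")
```

This keeps the ingestion step reproducible: re-running the script re-fetches the data, instead of relying on a one-off manual download.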
- How do I get the data?
Okay, life is good, we have identified the data. Now what? You have the food but you don’t know how to eat it! Trust me, life is still good! In the data engineering world, we have tons of ways to eat food, or rather ingest the data you have identified. Depending on what your organization has, an on-premises Big Data setup or a cloud subscription, there are different tools that can be leveraged for data engineering pipelines. Several are available as open source as well as on the different cloud platforms, such as Apache Kafka, NiFi, Storm, Flume, Amazon Kinesis, Azure Data Factory, Google Cloud Dataflow, Sqoop, Hue etc. You can also automate ingestion from a website directly via some of the listed tools, or by writing your own client against a REST API in Python, JavaScript etc. Automating the data ingestion process is the recommended way here, as it avoids the hassle of manually downloading and importing the data.
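As a small taste of the “automate it” advice, here is a minimal Python sketch of a REST-style pull: fetch JSON from an endpoint, then normalize the records for downstream use. The endpoint URL and the `id`/`value` field names are assumptions for illustration, not a real API.

```python
import json
import urllib.request

def fetch_json(url: str) -> list:
    """Pull a list of JSON records from a (hypothetical) REST endpoint."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def normalize(records: list) -> list:
    """Keep only the fields downstream consumers need (assumed schema),
    filling a default for missing values."""
    return [{"id": r["id"], "value": r.get("value", 0)} for r in records]

# A scheduled job (cron, Airflow, etc.) could then simply run:
# records = normalize(fetch_json("https://api.example.com/v1/readings"))
```

Separating the fetch from the normalization makes the pipeline easier to test: the parsing logic can be exercised on sample records without hitting the network.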
- What format of data do I need to get?
Data is available in different formats. Some formats are easier to extract than others and require different storage solutions. Typically, data falls into three categories:
o Unstructured Data
Basically, this is data in its raw form: text, images, audio or video. It is typically stored in a file store, like a directory on your PC’s hard drive.
o Structured Data
A straightforward data format. Structured data is usually relational or tabular data with rows and columns, stored in databases like MS SQL Server, MySQL, Oracle, DB2 etc. Usually we use SQL to get the required data.
o Semi-structured Data
This data is neither fully structured nor unstructured and is often stored as files: email, XML, JSON, CSV, TSV etc.
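The three categories above can be illustrated in a few lines of Python: raw bytes for the unstructured case, an in-memory SQLite table queried with SQL for the structured case, and JSON/CSV snippets for the semi-structured case. All values here are made up for the sketch.

```python
import csv
import io
import json
import sqlite3

# Unstructured: raw bytes, e.g. the contents of an image file in a file store.
raw = b"\x89PNG\r\n"  # placeholder bytes standing in for an image

# Structured: rows and columns, queried with SQL (in-memory SQLite for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0)])
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Semi-structured: self-describing records stored as files (JSON, CSV, ...).
record = json.loads('{"region": "east", "amount": 100.0}')
rows = list(csv.reader(io.StringIO("region,amount\neast,100.0\n")))
```

Notice that each category pairs naturally with a different access pattern: file I/O for unstructured data, SQL for structured data, and parsers like `json` and `csv` for semi-structured data.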
Congratulations! Once you have answers for all these questions, you are ready to build your data ingestion pipeline!