When it comes to choosing a storage option for your data pipeline, an initial analysis of the business and technical requirements is needed before proceeding.
This article belongs to a three-part series about our case study analysis (you can check the case study details here):
Part I – A three-step procedure to identify your storage options
This part explains the approach used to analyze those requirements.
Step #1: Business problem analysis
This is the entry point for the analysis and consists of:
defining the use cases and the business problem you would like to solve.
Use cases need to be specific and to address one unique problem. Let's explain this with the following example:
Saying "I would like to visualize my sales data" is not precise and is considered a bad formulation.
Instead, saying "I would like to visualize sales data per product within 30 to 45 seconds once the payment is done" is much more specific: it gives an indication of the data availability time to take into consideration when designing the data pipeline.
Define Service Level Objectives (SLOs), metrics, and KPIs. Different types of SLOs can be identified(1):
end-to-end SLOs
per-stage / per-component SLOs
timeliness vs. skewness vs. completeness SLOs
We will concentrate on end-to-end SLOs for our use cases.
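To make the end-to-end SLO concrete, here is a minimal sketch (plain Python, no GCP dependency) that measures what fraction of records meets the 45-second availability target from the sales example above. The timestamps and the attainment threshold are illustrative assumptions, not values from the case study.

```python
from datetime import datetime, timedelta

def end_to_end_slo_attainment(events, target_seconds):
    """Fraction of records whose end-to-end latency (payment completed
    to data visible in the dashboard) stays within the target."""
    within = sum(
        1 for produced, available in events
        if (available - produced) <= timedelta(seconds=target_seconds)
    )
    return within / len(events)

# Hypothetical measurements: (payment completed, data visible) pairs.
t0 = datetime(2020, 1, 1, 12, 0, 0)
events = [
    (t0, t0 + timedelta(seconds=20)),  # within the 45 s window
    (t0, t0 + timedelta(seconds=40)),  # within the window
    (t0, t0 + timedelta(seconds=70)),  # misses the window
    (t0, t0 + timedelta(seconds=30)),  # within the window
]
print(end_to_end_slo_attainment(events, target_seconds=45))  # 0.75
```

An end-to-end SLO would then be a statement such as "99% of sales records are visible within 45 seconds", checked against this attainment figure.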
Step #2: Technical requirements
Once your use cases and SLOs are defined, we can translate them into technical requirements. To do that, we need to answer the following questions:
1- What are my data sources?
2- Is the data structured, unstructured, or both?
3- What data models will I receive?
4- What are the volumes of such data (per second/per hour/per day)? What are the volumes during busy hours?
5- What is the end-to-end availability time?
And then take the following rules into consideration:
Separate storage and compute: data needs to be stored in a way that separates it from the compute layer.
Scalability: your data storage needs to scale when necessary.
Data retention policy and costs.
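One convenient way to keep the answers to these questions next to each use case is a small record per use case. The sketch below is an illustrative Python dataclass; the field names and the sample values are assumptions for the sales example, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class TechnicalRequirements:
    """Answers to the step #2 questions, captured per use case."""
    data_sources: list        # question 1
    structured: bool          # question 2
    unstructured: bool        # question 2
    data_models: list         # question 3
    volume_per_day_gb: float  # question 4
    volume_busy_hour_gb: float
    end_to_end_availability_seconds: int  # question 5
    retention_days: int       # retention policy & cost driver

# Hypothetical values for the "sales per product" use case.
sales = TechnicalRequirements(
    data_sources=["payment service", "product catalog"],
    structured=True,
    unstructured=False,
    data_models=["sales transaction", "product"],
    volume_per_day_gb=50.0,
    volume_busy_hour_gb=8.0,
    end_to_end_availability_seconds=45,
    retention_days=365,
)
print(sales.end_to_end_availability_seconds)  # 45
```

Filling one such record per use case gives you a checklist to carry into step #3.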
Step #3: Identify the storage option
This final step maps the requirements identified in step #2 to data storage options according to your context, whether you are using a public cloud provider or other options.
For now, we are going to use GCP and the storage options offered by this platform.
GCP offers various storage options, and choosing the appropriate one is up to you. Here is the general guidance.
Storage options offered by GCP:
1- Cloud SQL
Cloud SQL is a fully managed database service that makes it easy to set up, maintain, manage, and administer your relational databases on Google Cloud Platform(3).
2- Datastore
Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development(4).
3- Cloud Bigtable
A fully managed, scalable NoSQL database service for large analytical and operational workloads(5).
4- BigQuery
Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility(6).
5- Cloud Spanner
Fully managed relational database with unlimited scale, strong consistency & up to 99.999% availability.(7)
In addition, there is also "temporary" storage in message broker systems; here let's mention:
6- Cloud Pub/Sub
Messaging and ingestion for event-driven systems and streaming analytics(8).
Integrated with Dataflow and BigQuery to form the Google Cloud-native Stream Analytics solution
Auto-scaling and auto-provisioning with support for up to 100 GB/second
Independent quota and billing for publishers and subscribers
Global message routing to simplify multi-region systems
Push and pull message delivery
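To illustrate the difference between the push and pull delivery styles listed above, here is a toy in-memory model in plain Python. It is only a conceptual sketch: real Pub/Sub adds acknowledgements, retries, quotas, and global routing, and the class and method names here are invented for illustration.

```python
from collections import deque

class MiniTopic:
    """Toy in-memory topic modeling push vs. pull delivery only."""
    def __init__(self):
        self.queue = deque()      # pull subscribers fetch from here
        self.push_callbacks = []  # push subscribers are invoked directly

    def subscribe_push(self, callback):
        self.push_callbacks.append(callback)

    def publish(self, message):
        for callback in self.push_callbacks:
            callback(message)        # push: the broker calls the subscriber
        self.queue.append(message)   # pull: the message waits to be fetched

    def pull(self):
        return self.queue.popleft() if self.queue else None

received = []
topic = MiniTopic()
topic.subscribe_push(received.append)
topic.publish({"order_id": 1, "amount": 19.99})
print(received)      # the push subscriber got the message immediately
print(topic.pull())  # the same message is also available via pull
```

The practical difference: with push, the broker drives delivery to an endpoint you expose; with pull, your subscriber controls when and how fast it consumes.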
7- Cloud Storage
Cloud Storage is a service for storing your objects in Google Cloud. An object is an immutable piece of data consisting of a file of any format. You store objects in containers called buckets. All buckets are associated with a project, and you can group your projects under an organization.(9)
How to choose the appropriate option?
Below is a diagram explaining, at a high level, how to choose a storage option on GCP among the different options listed above, according to your analysis in step #2.
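The decision logic can also be sketched in code. The function below is a rough approximation of the common GCP decision tree (unstructured data to Cloud Storage; transactional workloads to Datastore, Cloud SQL, or Spanner; analytical workloads to Bigtable or BigQuery); the parameter names are illustrative assumptions, and the diagram plus your step #2 analysis remain the authoritative guide.

```python
def suggest_gcp_storage(structured, transactional, needs_sql,
                        global_scale=False, low_latency_analytics=False):
    """Rough sketch of a GCP storage decision tree, not official guidance."""
    if not structured:
        return "Cloud Storage"        # files, images, backups, raw data
    if transactional:
        if not needs_sql:
            return "Datastore"        # NoSQL document workloads
        # Relational: pick by required scale/availability
        return "Cloud Spanner" if global_scale else "Cloud SQL"
    # Analytical workloads
    return "Bigtable" if low_latency_analytics else "BigQuery"

print(suggest_gcp_storage(structured=False, transactional=False,
                          needs_sql=False))                    # Cloud Storage
print(suggest_gcp_storage(structured=True, transactional=True,
                          needs_sql=True))                     # Cloud SQL
print(suggest_gcp_storage(structured=True, transactional=False,
                          needs_sql=False))                    # BigQuery
```

For the sales example (structured, analytical, 30–45 s availability), this sketch would point toward BigQuery, possibly fed by Pub/Sub and Dataflow for the streaming part.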
If your use case requires Google Cloud Storage, you also need to consider the file format for your data. Here is a full explanation and analysis of how to choose the appropriate format.
(1): https://landing.google.com/sre/workbook/chapters/data-processing/
(2): https://www.coursera.org/learn/gcp-big-data-ml-fundamentals/lecture/EY31t/choosing-the-right-approach
(3): https://cloud.google.com/sql/docs
(4): https://cloud.google.com/datastore/docs/concepts/overview
(5): https://cloud.google.com/bigtable
(6): https://cloud.google.com/bigquery
(7): https://cloud.google.com/spanner
(8): https://cloud.google.com/pubsub
(9): https://cloud.google.com/storage/docs/introduction
(10): https://www.coursera.org/learn/gcp-big-data-ml-fundamentals/lecture/s3wa2/approach-move-from-on-premise-to-google-cloud-platform