The purpose of this project is to solve a business problem given a dataset as input.
The case is about solving the cart abandonment problem to increase sales conversion and gain real-time insights in the e-commerce industry.
The main goals are:
Identify the most suitable approach to solve the given business problems
Implement/build and deploy the target architecture and data pipeline
Analyze the conformity of the output to the given business problems
Share the implementation details via blog posts and YouTube videos
What will we build?
A data pipeline for streaming and batch processing
A CI/CD chain to set up the infrastructure and the managed services used
A data visualization dashboard for business users
What will we learn?
How to choose the right service/product for data processing to fit the business requirements
How to implement streaming and batch processing data pipelines
How to implement machine learning models
How to communicate final results
1- Business problem#1
Retarget customers who abandon their carts within a maximum of 3 minutes after session expiration.
The use case is to increase sales conversion by reducing cart abandonment. The retargeting format is a personalized email sent to the subscriber offering free shipping.
2- Business problem#2
Identify the products with the strongest future sales potential in order to anticipate supply.
3- Business problem#3
Calculate customer lifetime value (CLV)
4- Business problem#4
Visualize sales data per product within 30 to 45 seconds of payment completion
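To make business problem #1 concrete, the core abandoned-cart detection can be sketched in a few lines of Python. This is a minimal sketch, not the project's implementation: the event shape (dicts with event_type and user_session keys) follows the dataset schema described later, and expired_sessions stands in for whatever session-expiration signal the streaming pipeline would emit.

```python
def find_abandoned_sessions(events, expired_sessions):
    """Return expired sessions that added items to the cart but never purchased.

    `events`: iterable of dicts with at least 'event_type' and 'user_session'.
    `expired_sessions`: set of session IDs whose inactivity window has elapsed
    (hypothetical input; in practice this comes from the streaming engine).
    """
    carted, purchased = set(), set()
    for e in events:
        if e["event_type"] == "cart":
            carted.add(e["user_session"])
        elif e["event_type"] == "purchase":
            purchased.add(e["user_session"])
    # Abandoned = carted but never purchased, restricted to expired sessions
    return (carted - purchased) & expired_sessions
```

In the real pipeline this check would run per session as it expires, so the 3-minute email deadline starts counting from the expiration event.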
1- Set up the data pipeline for streaming and batch processing
Ingest sales data in real time
Clean up the data / ensure data quality (handle missing and negative values)
Serve the cleaned data to:
Business analysts for visualizations
Data scientists and ML engineers to build ML models
e.g. a probability estimate of a sale for each product
React instantly to retarget customers who abandon their carts: make a new offer (free coupon or free shipping) and reach the subscriber via email within the next 3 minutes
Archive historical data for re-analysis by data scientists building future ML models
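The cleanup step above (missing values and negative ones) can be sketched as a plain Python filter. Field names follow the dataset schema; which fields are treated as required, and whether a zero price is valid, are assumptions made here for illustration.

```python
def clean_events(rows):
    """Drop rows with missing required fields or negative prices.

    `rows`: iterable of dicts keyed by the dataset's column names.
    The required-field list below is an assumption for this sketch.
    """
    required = ("event_type", "product_id", "price", "user_id", "user_session")
    cleaned = []
    for row in rows:
        # Missing value in a required field: discard the event
        if any(row.get(f) in (None, "") for f in required):
            continue
        try:
            price = float(row["price"])
        except (TypeError, ValueError):
            continue  # unparsable price
        if price < 0:
            continue  # negative price: data-quality violation
        cleaned.append({**row, "price": price})
    return cleaned
```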
2- Set up the CI/CD chain
1- Automate the creation and destruction of the infrastructure and managed services
2- Set up the continuous integration and continuous delivery chain
This dataset contains behavior data for 5 months (Oct 2019 – Feb 2020) from a medium-sized cosmetics online store.
Each row in the file represents an event. All events are related to products and users. Each event is like a many-to-many relation between products and users.
Note: if this dataset is too small for you, you can try the larger dataset from the multi-category store.
There are different types of events. See below.
Semantics (or how to read it):
User user_id during session user_session added to the shopping cart (property event_type equals cart) the product product_id of brand brand in category category_code with price price at event_time.
event_time: The time when the event happened (in UTC).
event_type: The kind of event. Events can be:
view - a user viewed a product
cart - a user added a product to the shopping cart
remove_from_cart - a user removed a product from the shopping cart
purchase - a user purchased a product
Typical funnel: view => cart => purchase.
product_id: ID of a product.
category_id: Product's category ID.
category_code: Product's category taxonomy (code name), if it was possible to derive it. Usually present for meaningful categories and skipped for various kinds of accessories.
brand: Downcased string of the brand name. Can be missing.
price: Float price of the product. Always present.
user_id: Permanent user ID.
user_session: Temporary session ID. Same for each of a user's sessions; it changes every time the user comes back to the online store after a long pause.
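Putting the schema together, an event row can be parsed with Python's csv module. The column order and the sample row below are illustrative assumptions, not values taken from the actual files.

```python
import csv
import io

# Assumed header and a made-up example row following the schema above
SAMPLE = """event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
2019-10-01 00:00:00 UTC,cart,5773203,1487580005134238553,,runail,2.62,463240011,26dd6e6e-4dac-4778-8d2c-92e149dab885
"""

# csv.DictReader maps each row to a dict keyed by the header columns
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
event = rows[0]
print(event["event_type"], event["price"])  # cart 2.62
```

Note that DictReader yields every field as a string; the empty category_code comes through as "", which is why the cleanup step has to handle missing values explicitly.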
The architecture below is the target to build on GCP. Component choices are explained in further articles of this series.