I'm what you could call an old-school data engineer, having worked for years in the Oracle/SQL Server quicksand, building ETLs and data plumbing to support analytics. I'm now belatedly starting to get my feet wet in the world of "Big Data" and modern data engineering in general (your fault :)) and am bewildered by the array of technology offerings out there. I've been asked to do a presentation as part of a lengthy job interview process to demonstrate how I would build a hypothetical system for storing and analyzing the following types of data:
- Excel / CSV
- External API-based data sources (e.g. websites)
- IoT data feeds
- Photographs
- PDFs of reports
- Spatial data (GeoTIFFs, .shp files, GeoJSON)
The system should be capable of:
- Storing all customer data linked to services and products Company X provides (these are data &
- Collecting and storing events and updated data on all data fields either daily, weekly, monthly or annually
- Data is used to provide customers (internal and external) with real-time actions/alerts/insights/support decisions, and therefore must have a minimum uptime of 98%, be reliable and operate at a reasonable speed
- Data is ready for machine learning techniques and tools to be utilised
- Must scale as company grows with additional customers onboarded at a variable rate
- Must scale as company develops the products to allow POC/NPD and operational use
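For a sense of scale, the 98% minimum uptime in the list above can be turned into a concrete downtime budget. This is only a minimal sketch of the arithmetic, assuming uptime is measured over plain calendar periods; the function name and the period lengths are illustrative and not part of the brief.

```python
HOURS_PER_DAY = 24


def allowed_downtime_hours(uptime_fraction: float, period_days: float) -> float:
    """Hours of downtime permitted in a period at the given uptime fraction."""
    return (1.0 - uptime_fraction) * period_days * HOURS_PER_DAY


if __name__ == "__main__":
    # Period lengths are assumptions chosen for illustration.
    periods = [("day", 1), ("week", 7), ("30-day month", 30), ("365-day year", 365)]
    for label, days in periods:
        hours = allowed_downtime_hours(0.98, days)
        print(f"Per {label}: up to {hours:.2f} hours of downtime")
```

Over a 365-day year this comes to roughly 175 hours (a little over 7 days), which is a fairly relaxed target compared with the "real-time" wording in the same requirement.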
My first question is which database(s) are best suited for managing that kind of data (MongoDB, Cassandra, relational, etc.). Secondly, which processing framework would be best suited? Thirdly, cloud vs. on-premise? (There are no software, hardware, or cost restrictions.)
I'm definitely going to take inspiration from your data science platform blueprint as my ...blueprint..., but I guess where I'm struggling is replacing the symbols with the appropriate software, especially in the context of self-managed vs. managed and cloud vs. on-premise.
Any help/suggestions would be very much appreciated!