Hello Andreas,
I'm what you could call an old-school data engineer, having worked for years mainly in the Oracle/SQL Server technology quicksands, building ETLs and data plumbing to support analytics. I'm now belatedly starting to get my feet wet in the world of "Big Data" and modern data engineering in general (your fault :)) and am bewildered by the array of technology offerings out there. I've been asked to do a presentation as part of a lengthy job interview process to demonstrate how I would build a hypothetical system for storing and analyzing the following types of data:
- Excel / CSV
- External API based data sources (e.g. websites)
- IoT data feed
- Photographs
- PDFs of reports
- Spatial data (GeoTIFFs, .shp files, GeoJSON)
The system should be capable of:
- Storing all customer data linked to services and products Company X provides (these are data & analysis based)
- Collecting and storing events and updated data on all data-fields either daily, weekly, monthly or annually
- Data is used to provide customers (internal and external) with real-time actions/alerts/insights/support decisions, therefore must have a minimum uptime of 98%, be reliable and operate at a reasonable speed
- Data is ready for machine learning techniques and tools to be utilised
- Must scale as company grows with additional customers onboarded at a variable rate
- Must scale as company develops the products to allow POC/NPD and operational use
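For a sense of scale on the 98% uptime requirement above, here is a back-of-the-envelope sketch of how much downtime that figure allows. The function name and the period lengths (30-day month, 365-day year) are illustrative assumptions, not part of the brief.

```python
HOURS_PER_DAY = 24


def downtime_budget_hours(uptime_pct: float, period_days: float) -> float:
    """Return the hours of downtime allowed in a period for a given uptime %."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return (100 - uptime_pct) / 100 * period_days * HOURS_PER_DAY


# Period lengths are assumptions chosen for illustration only.
for label, days in [("day", 1), ("week", 7), ("30-day month", 30), ("365-day year", 365)]:
    print(f"98% uptime over one {label}: {downtime_budget_hours(98, days):.2f} h of downtime allowed")
```

Under these assumptions, 98% works out to roughly 175 hours (a little over 7 days) of permitted downtime per year, which is a fairly relaxed target compared with the "three nines" or better that managed cloud services commonly advertise.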
My first question is which database(s) are best suited for managing that kind of data (MongoDB, Cassandra, relational, etc.). Secondly, which processing framework would be best suited? Thirdly, cloud vs on-premise? (There are no software/hardware restrictions and no cost restrictions.)
I'm definitely going to get inspiration from your data science platform blueprint as my ...blueprint... but I guess where I'm struggling is replacing symbols with the appropriate software especially in the context of self-managed vs. managed, cloud vs on-premise.
Any help/suggestions would be very much appreciated!
Best
Nico
@nicom FYI I saw that question, but I think it's too complex to just write an answer. I'll record a video about that and link it here. (if it's a live stream I'll add the time here as well)
Hi Andreas,
First of all, thank you very much for even considering replying! I know the question is broad and it was a bit cheeky of me to post it, but desperate times call for desperate measures.
I've had a first stab at designing a possible framework. Here's what I've come up with:
The premise is that there are no cost constraints, so I opted for a managed services/PaaS model.
I thought about adding Kafka/Kinesis for message queues but don't think the data ingestion volumes/frequency warrant it.
I'm still mulling over adding an RDBMS for the structured data though...
Next stage interview is Thursday :(
Best
Nico
How was your presentation? I'm intrigued.
The presentation was a tale of two halves. Although the massively simplified architecture was OK, what became apparent was my lack of experience with these technologies when it came to describing in detail how the proposed framework would be operationalised.
Didn't get the job in the end but it was a good experience/exercise to focus the mind.
So for me now it's study time, with Airflow and Spark as the low-hanging fruit.