Why You Never Throw Data Away in Data Science

Andreas Kretz
Apr 24, 2020

Colleagues outside the data science field often do not understand why we keep all data. In their opinion, data that seems useless should be thrown away. But this is a very important point: never throw away data!

Ideally, you never throw away any data. It is best to always save everything, because with machine learning and data science you never know when you will need the data again.

So put the data that seems useless into S3, but never throw it away!

You never know when and where you will be able to benefit from this data in the future!

Data that once seemed useless could help you build a new business stream or refine your algorithms.

Of course, there are cases where there is just so much data that you can no longer handle it. One of my YouTube videos is about a case study from CERN.

CERN has described how they basically throw away data step by step, because there is so much of it that storing everything is impossible.

But as long as you can, you should always save all the data.

Now back to your work colleagues:

If you want your work colleagues to understand why you store certain data and don't throw it away, I would explain it to them by means of a use case.

You certainly have an example from a past project where you initially considered data to be useless, but later, maybe in another project, it became useful again. Thank God you had saved the data: you still remembered the schema, so you could move it somewhere or reformat it to fit a key-value store and work with it.
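To make that last step a bit more concrete, here is a minimal sketch of reformatting saved records to fit a key-value store. Everything in it is an assumption for illustration: the field names (`sensor_id`, `timestamp`, `temperature`), the helper `to_key_value`, and the plain Python dict that stands in for a real key-value store.

```python
import json

# Hypothetical archived records; the schema is made up for this example.
archived_rows = [
    {"sensor_id": "s1", "timestamp": "2020-04-01T10:00:00", "temperature": 21.5},
    {"sensor_id": "s1", "timestamp": "2020-04-01T11:00:00", "temperature": 22.1},
    {"sensor_id": "s2", "timestamp": "2020-04-01T10:00:00", "temperature": 19.8},
]


def to_key_value(rows, key_fields):
    """Reshape flat records into key-value pairs.

    The key is built from the chosen fields, joined with "|".
    The value is the remaining fields, serialized as JSON.
    """
    store = {}  # a plain dict stands in for a real key-value store
    for row in rows:
        key = "|".join(str(row[field]) for field in key_fields)
        value = {k: v for k, v in row.items() if k not in key_fields}
        store[key] = json.dumps(value, sort_keys=True)
    return store


kv_store = to_key_value(archived_rows, key_fields=["sensor_id", "timestamp"])
print(kv_store["s1|2020-04-01T10:00:00"])
```

This only works because the data and its schema were kept: once you know which fields identify a record, you can pick them as the key and store the rest as the value.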

Do you think it makes sense to save data that seems to be useless to be able to use it at a later time? I am looking forward to your comment!

>> created by Mira Roth
