How to (truly) own your big data
In this article, we will dive into the principles of managing data. Organizations gather more data than ever, and that data comes in an ever wider variety of shapes and from an ever wider range of sources, made possible by the availability of newer and more advanced software. Much of the paperwork of the past has been replaced by a digital counterpart, which can decrease overhead, but also makes it more important to prevent data from getting lost (and to prevent yourself from getting lost in your data). This makes secure and robust data management an absolute must.
The V's of big data
If you only have data with a lot of volume or variety (or the combination), it is not considered big data yet.
Before we dive deeper into data management, let me first explain what I mean by big data. Commonly, data is called big data when it adheres to the V's of big data: Volume, Velocity, Variety, Veracity, and Value. This means that there must be a lot of data (volume) that comes in at high speed (velocity), spans a wide range of data types (variety), may be biased or contain errors and missing values (veracity), and creates value for the organization (value). Note that data is only called big data if it satisfies all of these characteristics. For example, if the data only has a high volume or variety (or the combination), it is not considered big data: without velocity, you can only use it to create static reports that will not change over time. Furthermore, if you look at these characteristics, you'll notice that big data is impossible to process manually. You therefore need to set up a robust infrastructure to process all the data that comes in (and to make sure you don't drown in it).
Extract, transform & load
Once you have this big data, the first thing you need to do is process it. A popular processing flow for big data is the ETL pipeline (Extract-Transform-Load). The general idea is that the first step extracts all the data from the various sources; the second step transforms this data, i.e., corrects, restructures, and reformats it and, if applicable, performs calculations on it; and the final step loads it into a database. When you automate this process, you'll have a database that is always up to date and contains the data you need, in the right format.
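To make the three steps concrete, here is a minimal sketch of an ETL pipeline in Python. The record fields, table schema, and in-memory source are all hypothetical stand-ins; in a real pipeline, extraction would read from files, APIs, or source databases, and loading would target a proper data warehouse.

```python
import sqlite3

def extract(source):
    # Extract: pull raw records from a source (here, an in-memory list;
    # in practice: files, APIs, or other databases).
    return list(source)

def transform(records):
    # Transform: correct, restructure, and reformat each record,
    # casting the raw string fields into proper types.
    return [
        {
            "order_id": int(r["order_id"]),
            "customer": r["customer"].strip().title(),
            "amount": round(float(r["amount"]), 2),
        }
        for r in records
    ]

def load(records, conn):
    # Load: write the transformed records into the target database.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO orders VALUES (:order_id, :customer, :amount)",
        records,
    )
    conn.commit()

# Hypothetical raw input, exactly as it might arrive from a source system.
raw_source = [
    {"order_id": "2", "customer": "bob", "amount": "5.0"},
    {"order_id": "1", "customer": "  alice  ", "amount": "19.99"},
]
conn = sqlite3.connect(":memory:")
load(transform(extract(raw_source)), conn)
rows = conn.execute("SELECT customer, amount FROM orders ORDER BY order_id").fetchall()
print(rows)
# prints [('Alice', 19.99), ('Bob', 5.0)]
```

Automating this script (for example, on a schedule) is what keeps the target database continuously up to date.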
To provide better integrity and make the pipeline more robust, usually there is data tiering involved.
Once you have an ETL pipeline, it seems like the process is nicely automated and you can trust the veracity of your database. However, to provide better integrity and make the pipeline more robust, data tiering is usually involved. This means that you keep different tiers of the same data, each processed to a different level. These tiers are commonly referred to as the bronze, silver, and gold tiers. The bronze tier stores the 'raw' data: the original, unprocessed data, kept exactly as it comes in. The silver tier holds the processed data: cleaned and structured so that it can be used for further processing or analytics. This is the tier from which data scientists typically work. The third tier is the gold tier. Here the data is not just cleaned but completely ready to use, so that you can easily retrieve whatever you need for analytics or applications. The gold tier is the layer from which data analysts usually retrieve data. These tiers allow you to process the data to your needs while retaining the original data, so you can trace back mistakes or restore from it when data gets lost.
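The tiering idea can be sketched with a local file system standing in for a data lake. The directory layout, record fields, and sensor data below are hypothetical; the point is that the bronze copy is never modified, so silver and gold can always be rebuilt from it.

```python
import json
import pathlib
import tempfile

# Use a temporary directory as a stand-in data lake with three tiers.
base = pathlib.Path(tempfile.mkdtemp())
for tier in ("bronze", "silver", "gold"):
    (base / tier).mkdir(parents=True, exist_ok=True)

# Bronze: store the raw payload exactly as it arrived, errors included.
raw_payload = '[{"id": "7", "temp_c": " 21.5 "}, {"id": "8", "temp_c": "bad"}]'
(base / "bronze" / "sensors.json").write_text(raw_payload)

# Silver: clean and structure the bronze data; drop records that fail
# parsing (they remain available in bronze for back-tracking).
records = json.loads((base / "bronze" / "sensors.json").read_text())
silver = []
for r in records:
    try:
        silver.append({"id": int(r["id"]), "temp_c": float(r["temp_c"])})
    except ValueError:
        pass  # invalid record: excluded from silver, still present in bronze
(base / "silver" / "sensors.json").write_text(json.dumps(silver))

# Gold: aggregate the silver data into a ready-to-use result for analysts.
gold = {
    "count": len(silver),
    "avg_temp_c": sum(r["temp_c"] for r in silver) / len(silver),
}
(base / "gold" / "sensor_summary.json").write_text(json.dumps(gold))

summary = json.loads((base / "gold" / "sensor_summary.json").read_text())
print(summary)
# prints {'count': 1, 'avg_temp_c': 21.5}
```

If a transformation later turns out to be wrong, only the silver and gold files are regenerated; the bronze file stays untouched as the source of truth.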
In conclusion, these are the general principles of managing big data. The V's of big data help you recognize that you are dealing with big data, and once you are, you can set up an ETL pipeline to handle it. The tiering principles give you more control over your data and the flexibility to implement changes in the pipeline or correct mistakes.
About the Author: Alex Hakvoort is a data engineer at Yabba Data Doo with a Master's degree in Data Science from the University of Amsterdam. He discovered his passion for data through his previous studies in Information Sciences. Working at a consultancy, he likes the challenge and diversity of each project. As data science is applicable within every field of society, no project is the same. In his free time, Alex loves to cook elaborate meals, work out, and travel.