Marc de Beurs
Approaching data science from two perspectives: Academia vs. Industry
During my PhD at Nikhef (the National Institute for Subatomic Physics), I analyzed the measurement results of the Large Hadron Collider (LHC) at CERN. The LHC is a particle accelerator and collider built to study high-energy particle collisions in order to understand our universe better. This type of analysis is best described as BIG data analysis, because around 20 collisions happen every 25 nanoseconds when the machine is operational.
After completing my PhD, I founded Yabba Data Doo together with two companions, where we help businesses unlock more value from their data. In this article, I would like to share the differences I experienced between the analysis I conducted during my PhD (academia) and the work I do now (in the industry).
Undebatable measurement results
Whenever people work with data, they apply tailor-made transformations to turn it into information that serves a specific purpose. This is natural, since different departments (or research teams) all have their own interests. In itself, this is not a problem; it only becomes one when the transformations are not logged, which makes it impossible to compare information that different teams extracted from the same data. What can you trust if there are different “versions” of the same information? Is it still possible to derive insights, or do you question everything derived from the data once the versions no longer agree?
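To make the idea of logged transformations concrete, here is a minimal Python sketch. It assumes a small in-memory dataset; the class name LoggedDataset, the field names and the two filter steps are all hypothetical and only serve as illustration. Every transformation is recorded by name, so two teams can see exactly where their versions of the data diverge.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoggedDataset:
    """Hypothetical dataset wrapper that remembers every transformation applied."""
    rows: list
    history: list = field(default_factory=list)

    def apply(self, name: str, fn: Callable[[list], list]) -> "LoggedDataset":
        # Return a new dataset so the input is never modified in place.
        return LoggedDataset(rows=fn(self.rows), history=self.history + [name])


# Made-up raw records shared by two teams (illustrative values only).
raw = LoggedDataset(rows=[
    {"id": 1, "energy": 12.5, "valid": True},
    {"id": 2, "energy": 7.0, "valid": False},
    {"id": 3, "energy": 30.1, "valid": True},
])


def drop_invalid(rows: list) -> list:
    return [r for r in rows if r["valid"]]


def energy_above_20(rows: list) -> list:
    return [r for r in rows if r["energy"] > 20]


# Team A only removes invalid records; team B adds an extra energy cut.
team_a = raw.apply("drop_invalid", drop_invalid)
team_b = raw.apply("drop_invalid", drop_invalid).apply("energy_above_20", energy_above_20)

print(len(team_a.rows), team_a.history)
print(len(team_b.rows), team_b.history)
```

The two teams end up with different row counts, but the recorded history shows that the only difference is the extra energy cut, so their results remain comparable.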
Down at the ATLAS detector, CERN
At CERN, these types of situations are always resolved by going back to the measurement results. In data science, this is called the Single Source of Truth (SSOT): having one location where all (untransformed) data is stored. In academia the SSOT always exists: the measurement results are the undebatable truth. Interestingly, I have found that this is not the case in the industry. If you practice data science in the industry, it is therefore extremely important to include an SSOT in the design of your data landscape.
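As a rough illustration of resolving a disagreement by going back to the source, the Python sketch below keeps the untransformed records read-only and recomputes each team's reported figure from them. The records, team names and counting rules are made up for the example and are not taken from any real CERN or company setup.

```python
from types import MappingProxyType

# Hypothetical untransformed records acting as the single source of truth.
# Read-only mappings inside a tuple make accidental edits fail loudly.
SSOT = tuple(
    MappingProxyType(row)
    for row in (
        {"event": 1, "energy": 12.5, "calibrated": True},
        {"event": 2, "energy": 7.0, "calibrated": False},
        {"event": 3, "energy": 30.1, "calibrated": True},
    )
)


def count_all(rows) -> int:
    return len(rows)


def count_calibrated(rows) -> int:
    return sum(1 for r in rows if r["calibrated"])


# Two teams report different event counts (assumed numbers).
reported = {"team_a": 3, "team_b": 2}
# Each team's documented definition of "number of events" (assumed rules).
definitions = {"team_a": count_all, "team_b": count_calibrated}


def reconcile(reported: dict, definitions: dict, source) -> dict:
    """Recompute every reported figure from the source and check that it matches."""
    return {
        team: definitions[team](source) == value
        for team, value in reported.items()
    }


checks = reconcile(reported, definitions, SSOT)
print(checks)
```

Both figures can be reproduced from the same source, so the mismatch turns out to be a difference in definition rather than an error in either team's data.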
From leaders in the industry, I have heard that you should not ask HR how many people work in your company, because they do not have a clue. You should ask finance instead, because salary transactions have the strictest policies in place.
Sharing is caring
After having a firm SSOT in place, it is actually good to encourage teams to transform this data differently so that it serves their specific purpose best. This will only make your teams more effective, since not every team needs all the data. Your company as a whole will therefore have Multiple Versions of the Truth (MVOTs). There are, however, a few common problems that could arise here:
Data definitions may be ambiguous and mutable,
Data rules are vague or inconsistently applied,
Feedback loops for improving data transformation are absent,
Mechanisms for sharing knowledge, analytical tools and best practices are missing.
All of this can be resolved (or more accurately, prevented from happening) by having data governance policies that are enforced and constantly monitored. This is where academia can learn from the industry. Not many universities (or research institutes) require their employees to follow a policy when it comes to doing data analysis. At CERN, there were tools for cleaning and calibrating the measurement results, but each research team was often reinventing the wheel when it came to statistical analyses. Of course, the teams learn by doing it themselves (and a university is also an educational facility), but it costs a lot of time that could instead be used to push scientific knowledge further.
Companies can more easily enforce data policies and make sure that employees do not waste time doing work that has already been done by others. People often dislike writing and following a policy, but everyone likes to talk about what they did and share their experience (and expertise). There are therefore other, more fun ways to have people in your company learn from each other. For example, at Yabba Data Doo we organize a monthly “Geeky Wise” session where employees talk about things they are working on and want to share with the rest. The talks are saved and openly shared with all employees, so if someone needs to do something similar, they already have some knowledge and examples of how to go about it.
A peek at a Geeky Wise session at Yabba Data Doo
How to get the most out of the data in your company
In conclusion, the best practices I would recommend (from both academia and the industry) are to:
have an SSOT by design,
allow your teams to create business-specific MVOTs,
make sure to enforce data governance policies.
This way, every team can use the data they need to arrive at useful information, without you losing the ability to compare information from different teams (and, with that, losing trust in all data). On top of that, with the right mechanisms in place, less time is lost reinventing the wheel and, as a side effect, your employees are happier because they learn from each other.
About the author: After studying Applied Physics at TU Delft, Marc continued his academic career as a PhD researcher at CERN. During this time, he mastered complex data modelling. Now, as Yabba Data Doo’s technical director, he unleashes his data superpowers for a sustainable world. Marc is as ‘geeky wise’ as he is a people person. Being responsible for the well-being of employees, his mission is to support the development of young data scientists so that they can use their force for good.