
Data Science Center Tilburg

The First Year of Data Science in Action Projects

DSC/t Blog - April 2016 | Marie Postma Nilsenova – Coordinator of MSc in Data Science: Business and Governance

The power of Data Science lies in the processing of massive amounts of data. It is thus a necessity for any student of Data Science to gain experience in dealing with large data sets. Publicly available data collections are easy to find through well-known sources such as KDnuggets, the UCI Machine Learning Repository and Kaggle. Some of the open data sets have become notorious among our students because they are particularly suitable for certain types of analyses. Classification? Try to predict the math grades of Portuguese students in the Student Performance data set. Regression? Use the Boston Housing Data with information about housing prices in Boston suburbs. The advantage of using publicly available data sets from repositories is that they have been cleaned and documented before they were put online for general use.

Dirty Data

However, real Data Science is not performed on csv-files with neatly ordered instances and well-behaved attributes. To practice the full workflow, students need to be exposed to dirty data sets that require considerable cleaning and preprocessing, so-called ‘data munging’, before machine-learning algorithms can be applied. In fact, according to a recent survey discussed in Forbes Tech (March 23, 2016), the sexiest job of the 21st century mostly involves cleaning and organizing data (60% of the time), followed by collecting data (19%). Neither of these is exactly a favorite activity of data scientists. To paraphrase @BigDataBorat: in Data Science, 80% of the time is spent on preparing data and 20% on complaining about the need to prepare data.

Dirty data may involve errors, missing values, values that need to be reformatted, incomplete records, duplicated records, and corrupt records, as well as poor metadata and inappropriate feature descriptions. Dealing with dirty data sets introduces considerable technical challenges; importantly, these challenges can only be overcome with the help of a domain expert.  
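To make the list above concrete, here is a minimal sketch of what a first cleaning pass might look like, using only the Python standard library. The records, field names (`id`, `date`, `grade`) and cleaning rules are invented for illustration and do not come from any of the projects described in this post. Note that the code can only detect the problems; whether a missing grade should be kept, imputed or dropped is exactly the kind of decision that needs a domain expert.

```python
from datetime import datetime

# Hypothetical raw records illustrating some of the problems listed above.
RAW_RECORDS = [
    {"id": "1", "date": "2016-03-23", "grade": "14"},
    {"id": "2", "date": "23/03/2016", "grade": ""},    # missing value, date to reformat
    {"id": "2", "date": "23/03/2016", "grade": ""},    # duplicated record
    {"id": "3", "date": "not a date", "grade": "12"},  # corrupt record
    {"id": "4", "date": "2016-04-01"},                 # incomplete record
]

# Date formats we assume may occur in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")
REQUIRED_FIELDS = ("id", "date", "grade")


def parse_date(text):
    """Return a date for any known format, or None if the text is unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean(records):
    """Split records into cleaned rows and rejected rows with a reason."""
    cleaned, rejected, seen = [], [], set()
    for rec in records:
        if any(field not in rec for field in REQUIRED_FIELDS):
            rejected.append((rec, "incomplete"))
            continue
        date = parse_date(rec["date"])
        if date is None:
            rejected.append((rec, "corrupt date"))
            continue
        grade_text = rec["grade"].strip()
        if grade_text and not grade_text.isdigit():
            rejected.append((rec, "corrupt grade"))
            continue
        # An empty grade is kept as an explicit missing value (None).
        grade = int(grade_text) if grade_text else None
        key = (rec["id"], date, grade)
        if key in seen:
            rejected.append((rec, "duplicate"))
            continue
        seen.add(key)
        cleaned.append({"id": rec["id"], "date": date, "grade": grade})
    return cleaned, rejected


if __name__ == "__main__":
    cleaned, rejected = clean(RAW_RECORDS)
    for row in cleaned:
        print("kept:", row)
    for rec, reason in rejected:
        print("rejected (%s): %s" % (reason, rec))
```

Keeping the rejected records together with a reason, rather than silently dropping them, gives the data set owner something concrete to review.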

Data set owners

The necessary domain expertise is typically supplied by data set owners. However, these experts are not available for data sets contained in public repositories. Therefore, an important component of the Data Science education offered by Tilburg University is the collaboration with external partner organizations that are willing to give access to internal data as well as domain knowledge. In exchange, they can make use of the scientific knowledge provided by the academic advisors and, of course, the time and effort that the Data Science students invest in the projects.

Data Science in Action  

In our Master track ‘Data Science: Business and Governance’, offered since 2015-2016 by the four Tilburg University Schools (Humanities, Law, Economics, and Social Sciences), students experience the real Data Science workflow in the context of their Master thesis projects, called ‘Data Science in Action’. In these projects, they collaborate with commercial organizations, governmental institutions, and external research groups. For example, student Tom Terneusen works together with The Hague Center for Strategic Studies (HCSS) to analyze their open data set of indexes that express relations between two foreign nations. The goal of his project is to provide a decision support tool for HCSS experts that would help them determine whether changes at different time scales can predict conflicts or escalations. In another project, students Sebastian Bertoli, Lan Nguyen, and Ander Gutierrez Bejarano collaborate with Greenpeace on data reflecting the behavior of vessels near the West African coast. The hope is that machine learning techniques can help Greenpeace detect possible instances of illegal fishing. In yet another Data Science in Action project, student Kaz Roomer cooperates with Karify (a provider of online e-health platforms) and Radboud University Medical Centre. The idea is that text mining and data mining techniques can help researchers at Radboud UMC develop tailored programs and interventions for cognitive behavioral change offered to cancer survivors.

Experiences so far

The Data Science: Business and Governance Master attracts students with various Bachelor backgrounds, and one of the challenges for the Data Science in Action projects was finding a good fit between student profiles and company needs. Students and external partners were matched during an initial kick-off meeting organized in November 2015. The companies present at the meeting included KPN, Heijmans, Yourzine, SDU, SAP and the Greenhouse Group, as well as governmental and research institutions such as the Municipality of ’s-Hertogenbosch, CBS and CenterData.

Thanks to the goodwill and enthusiasm of everyone involved, the cooperation between our program and external partners can be characterized as promising (we will know for sure in August 2016, when the majority of the students will have finished their theses). The challenges we have been dealing with, such as the formulation of non-disclosure agreements, information and data sharing and storage, and the scientific embedding of often highly applied research questions, have resulted in expertise that will hopefully be of use to other lecturers in the Grand Initiative Data Science.