Big Data

Accountability and transparency in Big Data land

DSC/t Blog - May 2016 | Ronald Leenes – Tilburg Institute for Law, Technology, and Society; TILT, Tilburg Law School

Large organisations, such as public administrations, have always dealt with large volumes of data. They are now increasingly adopting sophisticated big data technologies to further leverage those data. Data provided by citizens and customers, observed by sensors, and derived and inferred from diverse data sets are being used to discover patterns that can be transformed into actionable knowledge. The data are also used more directly to make predictions about the behaviour of individuals and groups of people, and to make decisions about them. The data and algorithms are neither neutral nor flawless; they contain biases and (hence) may cause harms to individuals, such as discrimination, loss of autonomy, and privacy infringements. Big data decision-making systems are usually opaque. In many cases this is problematic.

Data practices

Big data practices come in many shapes and forms and use different kinds of data. Of particular interest to me is personal data, which the Data Protection Directive defines as any information relating to an identified or identifiable individual. There is an enormous abundance of personal data out there. It is consciously produced by people, but also unconsciously produced by the devices and things they carry and use (such as smart phones and OV chip cards), or simply by the behaviour they exhibit (think of web browsing). Personal data raises concerns that do not exist with respect to, for instance, water levels.

Obtaining data

Personal data can get into the hands of the data industry (data controllers) in different ways. A common distinction is between provided, observed, derived and inferred data[1]. Provided data are data which originate from direct actions taken by the individual, who is fully aware of the actions that lead to data creation. Examples include: data disclosed by individuals in the context of a loan application (“initiated data”), data created when buying a product with a credit card (“transactional data”), or data shared (actively) via an online social network (“posted data”). While the individuals concerned may be unaware of the implications of providing these data, the fact that these data are being created should be obvious – or at least intuitive.

Observed data are data which have been observed by others and recorded in a digital format. Examples include: data originating in conjunction with online cookies, data generated by sensors (as in smart phones), and passively created observational data (e.g., data captured by CCTV cameras combined with facial recognition). Although individuals may sometimes be aware of the creation of observed data (e.g., because of a sign signalling CCTV surveillance), they often are not.

Derived data are data built in a relatively simple manner on already existing data, for instance by aggregating individual customer records into an annual overview of customer value.
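The aggregation step this describes can be sketched in a few lines of Python. The transaction records and field names below are invented for illustration; the point is only that derived data result from a simple, deterministic transformation of existing records.

```python
from collections import defaultdict

# Hypothetical transaction records: (customer_id, year, amount).
# Invented data for illustration only.
transactions = [
    ("c1", 2015, 120.0),
    ("c1", 2015, 80.0),
    ("c2", 2015, 40.0),
    ("c1", 2016, 60.0),
]

def annual_customer_value(records):
    """Aggregate individual transactions into per-customer annual totals."""
    totals = defaultdict(float)
    for customer, year, amount in records:
        totals[(customer, year)] += amount
    return dict(totals)

print(annual_customer_value(transactions))
# {('c1', 2015): 200.0, ('c2', 2015): 40.0, ('c1', 2016): 60.0}
```

Nothing probabilistic happens here: each derived value can be traced back mechanically to the underlying records, which is precisely what distinguishes derived from inferred data.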

Inferred data are the product of probability-based analytic processes. They are based on the detection of correlations and can be used to make predictions about behaviour. Typically, the individual is unaware of these kinds of predictions being made.
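A minimal sketch of such probability-based inference, with invented data: estimating from past purchase baskets the chance that a customer who bought product A will also buy product B. Real predictive analytics is far more sophisticated, but the structure is the same — a correlation observed in past data is turned into a prediction about future behaviour.

```python
# Invented purchase baskets, for illustration only.
baskets = [
    {"A", "B"},
    {"A"},
    {"A", "B", "C"},
    {"B"},
    {"A", "B"},
]

def conditional_probability(baskets, given, target):
    """Estimate P(target in basket | given in basket) by relative frequency."""
    with_given = [b for b in baskets if given in b]
    if not with_given:
        return 0.0
    return sum(1 for b in with_given if target in b) / len(with_given)

p = conditional_probability(baskets, "A", "B")
print(f"P(B | A) = {p:.2f}")  # 3 of the 4 baskets containing A also contain B -> 0.75
```

The customer never supplied "will probably buy B"; that datum is manufactured by the analysis, typically without the individual's knowledge.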

The distinction between data sources is relevant because it says something about the level of control that individuals have over their data. One can decide not to disclose certain data; it is more difficult to avoid being observed; and if one is unaware of data being combined and analysed, control becomes almost impossible. I would claim that, as a consequence, or at least as a starting point for a counterbalance, the lower the level of control the individual has, the greater the responsibility of the entity collecting and processing data about individuals regarding the use of those data.

Predictive analytics

Inferencing is pretty much the name of the Data Science game. There is much to be gained by being able to predict the future on the basis of data, especially customer and citizen behaviour, and to act on that knowledge. Acting often means taking decisions about individuals: what advertisements to serve them on the basis of inferred preferences, what prices to charge them, but also whether or not they represent a (financial, health, security) risk. These decisions (seriously) affect people, and if they are taken by machines, some caution is in order, because it may be hard to contest the decision and argue with the machine.

Consequential decision making in the age of Big Data

Automated decision making is not new. What may be new is the way machines gain the ‘knowledge’ to make decisions. Previously, regulation (in the case of public administration, such as in taxation) or business rules (in the private sector) were turned into decision rules that could be executed by machines. The mapping between input and output in these cases is known, because it is based on a logic that predates the computer algorithm that executes it.

In Big Data configurations, too, the mapping between input and output may be known. In the case of supervised machine learning, the mapping may be complex and fuzzy (there may be just a training set comprising (typical) examples the classifier needs to learn to distinguish), but in any case the training set is consciously provided by the ‘data scientist’. The decision whether or not to use the resulting algorithm for automated decision making is also taken by a human.
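The supervised setting can be illustrated with a deliberately simple sketch: a human supplies a labelled training set, a classifier is fitted to it, and the learned mapping can afterwards be inspected. The feature values, labels and threshold rule below are all invented for illustration.

```python
# Invented labelled training set: (feature value, label).
# In supervised learning, these examples are consciously provided by a human.
training_set = [
    (1.0, "low_risk"), (2.0, "low_risk"),
    (8.0, "high_risk"), (9.0, "high_risk"),
]

def train_threshold_classifier(data):
    """Learn a single decision threshold separating the two classes."""
    lows = [x for x, y in data if y == "low_risk"]
    highs = [x for x, y in data if y == "high_risk"]
    threshold = (max(lows) + min(highs)) / 2  # midpoint between the classes
    return lambda x: "high_risk" if x >= threshold else "low_risk"

classify = train_threshold_classifier(training_set)
print(classify(3.0))  # low_risk
print(classify(7.5))  # high_risk
```

Even in this toy case the human choices are visible everywhere: which examples to include, how to label them, and whether to deploy the resulting classifier at all.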

In other cases, this may all be less clear. Google’s search engine, for example, is totally opaque, yet very good at what it does. In a growing number of big data applications there is no human intervention in the process from knowledge discovery to decision making.

Risks and a need for checks and balances

Automated decision making requires appropriate checks and balances because, as mentioned above, it may be hard to contest the machine’s decision. This is the case in all machine decision making (and in fact in all decision making). Humans make algorithms, or decide whether they are used, and hence there is the risk that their biases make it into the algorithms.

This risk, and others, were acknowledged by regulators long ago. This is why due process, transparency and accountability, to name a few, are important requirements in the realm of public administration (and enshrined in legislation such as the General Administrative Law Act (Awb) in the Netherlands). This helps guarantee the fundamental rights of individuals.

Transparency and accountability to come

In the age of Big Data, due process, transparency and accountability are particularly important, certainly in cases where people don’t have control over their destiny or over their data (hence in cases of derived and inferred data).

And although there may be very valid reasons not to be transparent – for instance because it may facilitate gaming the system –, the European regulator has adopted a strong position on transparency and accountability in the context of the processing of personal data.

The General Data Protection Regulation, which will apply from May 2018, provides strong transparency provisions. A core provision in this context is Article 14(2)(g), which requires controllers to disclose “the existence of automated decision-making, including profiling, … , meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject.”

The message is clear: you can’t get away with “because the machine says so”. The consequences are also clear. If the supervisory authorities take this requirement seriously, the scope of the provision is enormous, and so are the sanctions the regulators can impose. What is less clear is what this means for anybody using automated decision-making, especially when the underlying algorithms emerge out of advanced machine learning techniques.

When the mapping between input and output is clear, explaining how a conclusion about an individual in a particular case came about is doable. But in cases of unsupervised learning?
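The contrast can be made concrete. When the mapping is an explicit formula — here an invented linear credit-scoring rule, with made-up weights, features and threshold — explaining an individual decision amounts to listing each feature’s contribution. A model that emerges from opaque learning offers no such direct decomposition.

```python
# Invented linear scoring model: weights, features and threshold are
# illustrative assumptions, not any real credit-scoring system.
weights = {"income": 0.5, "debt": -0.8, "years_customer": 0.3}

def explain_decision(features, weights, threshold=1.5):
    """Return the decision plus a per-feature breakdown of how it came about."""
    contributions = {k: weights[k] * v for k, v in features.items()}
    score = sum(contributions.values())
    decision = "approve" if score >= threshold else "reject"
    return decision, score, contributions

decision, score, contributions = explain_decision(
    {"income": 6.0, "debt": 2.0, "years_customer": 1.0}, weights
)
print(decision, round(score, 2))  # approve 1.7
print(contributions)  # e.g. debt contributed -1.6 to the score
```

Such a breakdown is exactly the kind of “meaningful information about the logic involved” that is easy to produce for transparent models — and an open research question for models learned without supervision.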


The new GDPR provides European citizens with many new and enforceable rights that should safeguard their personal data and protect their privacy. It also presents a set of challenges for anyone who uses personal data. For one, these so-called data controllers will have to be transparent about the algorithms (logic) they use in their processes. A challenge for us researchers will be to help them (and others) understand what meaningful algorithmic transparency is. I am sure the Data Science Center Tilburg (DSC/t), the Jheronimus Academy of Data Science (JADS) and the Data Science Center Eindhoven (DSC/e) will play a role in this venture.

[1] World Economic Forum, ‘Rethinking Personal Data: Strengthening Trust’, 2012 (pdf).