Dataset based on webscraping
In addition to public datasets, web scraping from (semi-)public sources is regularly used. If a researcher wants to scrape forums, social media or other (semi-)public websites, copyright and the terms of use of the source may apply.
What is webscraping?
Web scraping is a computer technique in which software extracts information from web pages, with or without analysing it. The software usually retrieves part of the World Wide Web either directly over the Hypertext Transfer Protocol (HTTP) or by simulating browsing behaviour in a web browser.
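To make the HTTP-based variant concrete, here is a minimal sketch in Python that uses only the standard library. The names `TextExtractor`, `extract_text` and `fetch_html`, the user-agent string and the sample HTML are illustrative assumptions, not part of any prescribed tooling. The sketch shows the two steps that most scrapers share: retrieving a page over HTTP and extracting the readable text from the HTML.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text fragments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def fetch_html(url, user_agent="research-scraper-example"):
    """Retrieve a page over HTTP(S). The user-agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Hypothetical page content; no network request is made here.
    sample = (
        "<html><body><h1>Trip report</h1><p>Great stay.</p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(extract_text(sample))
```

In practice, `extract_text(fetch_html(url))` would combine both steps for a real page. Before calling `fetch_html` on a real site, check the copyright and terms of use of that source.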
The use of webscraping
For academic research, a researcher may process personal data by means of web scraping under certain conditions: the data must be public and must have been collected for a similar purpose. This also applies to special personal data that have clearly been made public by the person concerned.
Note: Copyright and terms of use of the public source may apply.
In addition, the researcher should take into account the context in which the public information was placed. The public information may be used for academic research if it was written for a purpose compatible with that research. This also explicitly applies to special personal data that have clearly been made public by the person concerned.
A number of examples to clarify:
- A researcher uses public AirBnB blogs (in which travellers reflect on their experiences) to find out whether tourists are ethnocentric.
When writing the blogs, the authors could not have foreseen how their texts would be used and might not have given permission if asked. This information may not be used unless the researcher obtains explicit permission from the authors of the blogs.
- A researcher uses a public blog on Facebook in which someone writes about personal experiences with cancer, with the aim of informing peers and loved ones.
In this case, the researcher may use this information to compare patients' experiences.
- A researcher uses a blog on a private forum that is not public (but to which the researcher has access for the purpose of the research).
This information may not be used for scientific purposes.
Establishing lawfulness and purpose limitation
If personal data are processed in academic research, lawfulness and purpose limitation must first be established. After this, the material requirements must be taken into account to ensure that personal data are handled with care.
Lawfulness
- Do we have a legal basis for processing?
Purpose limitation
- What do we want to do?
Material requirements
- Do we handle personal data with care?
Processing basis
For academic research in which a new dataset is set up and the research data are not obtained directly from respondents, the processing basis is legitimate interest. In this case, the research data are collected from information that the persons concerned have made public themselves.
Example:
- Setting up a new dataset using webscraping.
Special personal data
According to the GDPR, special personal data may only be processed under strict conditions. In the case of academic research, the ban on processing special personal data may be lifted subject to certain conditions.
If obtaining permission for web scraping is impossible or would require a disproportionate amount of effort:
- Inform the public by means of a privacy statement.
Purpose limitation
The second requirement is purpose limitation: the personal data must be processed for a specific, clearly defined purpose.