Synthetic data

DUO (Dienst Uitvoering Onderwijs, the Dutch education executive agency) has announced that it can create synthetic datasets on education at the student level. Synthetic data resembles the real data in its statistical properties and relationships, the key difference being that the privacy of individuals is protected. Education researchers can apply to DUO for this data. If DUO's data fits the research question, the researchers receive a synthetic version of it.

Provided DUO has carried out the synthesis carefully, the synthetic dataset no longer contains personal data. The researchers therefore do not have to worry about the GDPR when working with it, and DUO itself can share the data without a problem, because it is no longer personal data.

HOW DOES IT WORK?

The idea behind synthetic data is to describe the original dataset in terms of probability distributions. This is a step from individual measurements and values to general properties. New individuals are then generated from these general properties. If the general properties of the original dataset are described well, the generated dataset will be hard to distinguish statistically from the original dataset.
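As a minimal sketch of this idea for a single numeric variable, the Python example below summarises a column by a normal distribution and then draws new values from it. The scores are made up, the normal distribution is an assumption chosen for illustration, and the function names `fit_normal` and `sample_normal` are hypothetical; this is not how DUO or any particular package does it.

```python
import random
import statistics

def fit_normal(values):
    """Summarise a numeric column by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def sample_normal(params, n, rng):
    """Draw n new values from the fitted normal distribution."""
    mean, sd = params
    return [rng.gauss(mean, sd) for _ in range(n)]

# Hypothetical "original" measurements (made-up exam scores).
original_scores = [6.1, 7.4, 5.8, 8.0, 6.9, 7.2, 5.5, 6.6, 7.8, 6.3]

rng = random.Random(42)               # fixed seed so the run is repeatable
params = fit_normal(original_scores)  # general properties of the column
synthetic_scores = sample_normal(params, len(original_scores), rng)
```

Only the two fitted parameters are carried over to the synthetic column, not the individual scores themselves.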

DUO used the R programming language and the synthpop package for this project. The package implements one specific approach to generating synthetic data, in which the variables are generated one after the other. The first variable is generated from the distribution of that variable in the original data alone; every subsequent variable uses the previously generated variables to determine its probability distribution. This preserves relationships between variables, for example that men are on average taller than women and that taller people tend to be heavier.
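The sequential approach can be sketched in a few lines of Python. This is an illustration of the principle only, not synthpop itself: the records are made up, and the choice of models (resampling for sex, a normal distribution per sex for height, a straight line plus noise for weight) is an assumption made for the example. The names `fit_line` and `synthesise` are hypothetical.

```python
import random
import statistics

# Hypothetical original records: (sex, height in cm, weight in kg). Values are made up.
original = [
    ("m", 183, 82), ("m", 178, 77), ("m", 190, 90), ("m", 175, 74),
    ("f", 168, 63), ("f", 172, 68), ("f", 160, 55), ("f", 165, 61),
]

def fit_line(xs, ys):
    """Least-squares line y = a + b*x, plus the standard deviation of the residuals."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, statistics.pstdev(residuals)

def synthesise(records, n, rng):
    """Generate n synthetic records, one variable after the other."""
    sexes = [r[0] for r in records]
    # Model for height given sex: one normal distribution per sex.
    height_params = {}
    for s in set(sexes):
        heights = [r[1] for r in records if r[0] == s]
        height_params[s] = (statistics.mean(heights), statistics.stdev(heights))
    # Model for weight given height: straight line plus normal noise.
    a, b, sd = fit_line([r[1] for r in records], [r[2] for r in records])

    synthetic = []
    for _ in range(n):
        sex = rng.choice(sexes)                  # 1st variable: its own distribution only
        height = rng.gauss(*height_params[sex])  # 2nd variable: depends on sex
        weight = rng.gauss(a + b * height, sd)   # 3rd variable: depends on height
        synthetic.append((sex, round(height), round(weight)))
    return synthetic

synthetic = synthesise(original, 1000, random.Random(1))
```

Because height is drawn conditionally on sex and weight conditionally on height, the synthetic records keep both relationships, even though no synthetic person is copied from an original row.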

PRIVACY

The main reason for using synthetic data is to protect the privacy of the people in the original dataset. If the synthetic data is generated properly, it is no longer possible to identify the people in the original dataset. The synthetic dataset is then anonymous, and the GDPR therefore no longer applies to it. Creating the synthetic version is itself still a use of the original personal data; this fits nicely with Article 89 of the GDPR, which sets the conditions for using personal data for scientific research, among other purposes.

What should you pay attention to when generating synthetic data? First, the method for generating the data must be capable of properly hiding the original data: it should not be possible to retrieve the original data from the generating model. Second, it is also important not to be too cautious. A model that is too general will generate a dataset that lacks the details of the original dataset and is therefore of little use for research. Finally, check the final synthetic dataset before sharing it, to make sure it contains nothing unexpected, such as synthetic records that coincide with real ones.
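Two simple checks of this kind can be sketched as follows: one looks for synthetic records that appear verbatim in the original data, the other compares a summary statistic to see whether detail has been lost. The records and the function names `replicated_records` and `mean_gap` are hypothetical, and a real review would use more checks than these two.

```python
import statistics

def replicated_records(original, synthetic):
    """Return the synthetic records that also appear verbatim in the original data."""
    originals = set(original)
    return [row for row in synthetic if row in originals]

def mean_gap(original, synthetic, column):
    """Absolute difference between the original and synthetic mean of one column."""
    return abs(statistics.mean(r[column] for r in original)
               - statistics.mean(r[column] for r in synthetic))

# Hypothetical records: (sex, height in cm). Values are made up.
original = [("m", 183), ("f", 168), ("m", 178), ("f", 172)]
synthetic = [("m", 181), ("f", 168), ("f", 170), ("m", 176)]

leaks = replicated_records(original, synthetic)  # privacy check
gap = mean_gap(original, synthetic, column=1)    # usefulness check on height
```

An exact match can arise by chance and is not necessarily a disclosure, but it is the sort of record worth reviewing before the dataset is shared.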

CONCLUSION

Synthetic datasets offer researchers a great opportunity to conduct research without privacy concerns. The technology also gives organizations with a lot of data, such as DUO, the chance to share this data with researchers without compromising the privacy of the people in those datasets. When organizations collaborate, synthetic data can be a solution to GDPR problems. The condition, of course, is that the method of synthesizing the data actually protects the individuals in the original dataset.

Details

Created 18-07-2023
Last Edited 20-07-2023
Subject Using data
