“Differential privacy provides a quantifiable measure of privacy”
We produce and exploit ever more data, which has become an all-important resource for policy makers, researchers and businesses, as it provides them with information that is useful for their activities.
However, the processing of this data generates risks for individuals, communities and organisations. The use of personal data in particular can constitute a breach of privacy (on this topic read “A taxonomy of privacy” by American legal expert Daniel J. Solove).
Combined with regulatory tools and cultural changes within companies, privacy-enhancing technologies (PETs) could help to limit these risks without cutting off access to precious data mines.
The PET acronym refers to a whole set of emerging technologies and approaches that aim to prevent the re-identification of a specific individual or organisation whilst preserving the data’s utility. Several examples follow.
Homomorphic encryption and multi-party computation
Homomorphic encryption makes it possible to perform computations on encrypted data, producing an encrypted result that only the holder of the private key can decrypt. It can be used to outsource certain operations to the cloud.
For example, a company may wish to perform a data analysis in the cloud to save investing in computing resources, but without sharing the data with its service provider. Instead, it can send an encrypted version of the data to the server and, once the analysis has been performed, decrypt the result with its private key.
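The workflow above can be sketched with a toy version of the Paillier cryptosystem. This is an assumption made for illustration: the article does not name a scheme, and Paillier is only additively homomorphic (it supports sums on ciphertexts, not arbitrary computations). The primes are tiny and hard-coded, so the sketch is in no way secure, and all function names are hypothetical.

```python
import math
import secrets


def keygen(p=1789, q=1861):
    """Build a toy Paillier key pair from two small primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    # With g = n + 1, L(g^lam mod n^2) equals lam mod n, so mu is its inverse mod n.
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return (n, g), (lam, mu)


def encrypt(public_key, message):
    """Encrypt an integer 0 <= message < n with fresh randomness."""
    n, g = public_key
    n_squared = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(public_key, private_key, ciphertext):
    """Recover the plaintext using the private key."""
    n, _ = public_key
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Server-side step: multiplying ciphertexts adds the hidden plaintexts."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    public_key, private_key = keygen()
    # The company encrypts two figures and sends only the ciphertexts to the server.
    c1 = encrypt(public_key, 1200)
    c2 = encrypt(public_key, 345)
    # The server combines them without ever seeing 1200 or 345.
    c_sum = add_encrypted(public_key, c1, c2)
    # Only the company, which holds the private key, can read the result.
    print(decrypt(public_key, private_key, c_sum))  # prints 1545
```

The server only ever handles `c1`, `c2` and `c_sum`; the plaintext sum becomes visible only after decryption by the key holder.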
Homomorphic encryption can help to implement other PETs such as secure multi-party computation (MPC). This approach makes it possible to analyse data held by several independent parties (competing businesses, healthcare organisations, etc.) who wish to pool their data without revealing its content and without relying on a trusted third party.
In the same way, MPC can be used for multi-party machine learning, whereby various parties share confidential data in encrypted format so as to obtain a better learning model from their combined data.
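To make the idea of pooling data without revealing it more concrete, here is a minimal sketch of a secure sum based on additive secret sharing. This is one common MPC building block, chosen here as an assumption (the article mentions homomorphic encryption as a possible basis, not this technique). The modulus, function names and figures are hypothetical, and the sketch simulates all parties in one process.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this value


def share(value, n_parties):
    """Split a value into n random-looking shares that add up to it (mod MODULUS)."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Compute the total of the parties' values from shares only."""
    n = len(private_values)
    # Each party i splits its own value; share j is sent to party j.
    all_shares = [share(value, n) for value in private_values]
    # Each party j adds the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Adding the published partial sums yields the total, and nothing else.
    return sum(partial_sums) % MODULUS


if __name__ == "__main__":
    # Three organisations with confidential figures.
    print(secure_sum([120, 75, 310]))  # prints 505
```

Each individual share is a uniformly random number, so a party learns nothing about another party's value from the share it receives; only the final total is disclosed.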
The first large-scale application of MPC took place in Denmark in 2008. It was used to work out the market clearing price of sugar beets by testing successive prices until the quantities that buyers and sellers were prepared to trade at the proposed price coincided.
In this situation, the bids made by the farmers reveal information on their economic situation or the productivity of their fields that could potentially be used in future negotiations. They would therefore have been highly reluctant to make this information public.
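The price search described above can be illustrated in the clear, purely to show the function being computed. In an MPC setting, the individual bids would stay hidden and only the resulting price would be revealed. The function name, the bid layout (one quantity per candidate price) and the figures are all hypothetical.

```python
def clearing_price(prices, demand_bids, supply_bids):
    """Return the first price at which total supply covers total demand.

    prices: candidate prices in ascending order.
    demand_bids / supply_bids: one list per participant, giving the quantity
    that participant would buy / sell at each candidate price.
    """
    for k, price in enumerate(prices):
        total_demand = sum(bid[k] for bid in demand_bids)
        total_supply = sum(bid[k] for bid in supply_bids)
        if total_supply >= total_demand:
            return price
    return None  # no candidate price clears the market


if __name__ == "__main__":
    prices = [10, 20, 30]
    buyers = [[50, 30, 10], [40, 20, 5]]    # demand falls as the price rises
    sellers = [[10, 30, 60], [5, 25, 40]]   # supply grows as the price rises
    print(clearing_price(prices, buyers, sellers))  # prints 20
```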
Differential privacy
Differential privacy has been implemented by technology-sector companies (Facebook, Google) as well as by public organisations such as the United States Census Bureau, for the 2020 census.
Introduced in 2006, this anonymisation technique is aimed at minimising the risks of identification of an individual or organisation within a database.
The popularity of this approach can be explained by the fact that it provides a quantifiable measure of privacy. In its formal definition, the ε parameter (epsilon, also called “privacy budget”) describes the “acceptable” amount of information disclosed on an entity.
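For reference, the standard formal definition is usually stated as follows. The notation is an assumption of this sketch and does not come from the article: M is a randomised mechanism answering queries, D and D' are any two databases that differ in the data of a single entity, and S is any set of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
```

In other words, adding or removing one entity's data changes the probability of any given result by a factor of at most e^ε, so the smaller ε is, the less the output can reveal about that entity.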
The privacy budget is set by the database owner; this is a matter of governance rather than a technical choice. Each query consumes part of the budget, so the cumulative privacy loss increases with the number of queries.
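The effect of repeated queries can be sketched with a simple accountant that relies on basic sequential composition, under which the ε values of successive queries add up. The class and its method names are hypothetical, and real systems may use tighter composition rules.

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon  # limit chosen by the database owner
        self.spent = 0.0

    def charge(self, epsilon):
        """Record a query costing `epsilon`; refuse it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total - self.spent  # remaining budget


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    budget.charge(0.4)  # first query
    budget.charge(0.4)  # second query
    try:
        budget.charge(0.4)  # a third query would bring the total to 1.2
    except RuntimeError as error:
        print(error)  # prints: privacy budget exhausted
```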
The desired level of privacy can thus be reached thanks to several mechanisms, such as the addition of random noise to query results.
This noise (fake but plausible data) reduces the risks of associating sensitive attributes with a person or of deducing new information about this person (inference). On the other hand, it can adversely affect the accuracy of query results.
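One widely used way of adding such noise is the Laplace mechanism, sketched below for a counting query. This is an illustration under stated assumptions, not the method of any organisation named in the article: the function names and sample records are hypothetical, and the noise scale of 1/ε relies on the fact that adding or removing one person changes a count by at most 1.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Answer 'how many records satisfy predicate?' with Laplace noise of scale 1/epsilon."""
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    people = [{"age": age} for age in range(100)]
    # The true answer is 35; each call returns a different noisy value around it.
    print(private_count(people, lambda p: p["age"] >= 65, epsilon=0.5))
```

A smaller ε means a larger noise scale, hence stronger privacy but less accurate answers.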
This loss of accuracy is the main drawback of differential privacy. The compromise between data privacy and data utility is better with large data sets, in which the noise weighs less on the results. Its implementation therefore seems well suited to a population census, and in theory it allows the publication of a greater number of statistical tables with more granular (i.e. finer) data, enabling policy makers and researchers to carry out legislative or research work.
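A back-of-the-envelope sketch shows why size helps, assuming a counting query answered with Laplace noise of scale 1/ε (an assumption, since the article does not fix a mechanism). The expected absolute error of such noise equals its scale, whatever the size of the count, so the relative error shrinks as the count grows. The function name is hypothetical.

```python
def expected_relative_error(true_count, epsilon, sensitivity=1.0):
    """Expected |noise| of a Laplace mechanism divided by the true count."""
    expected_absolute_error = sensitivity / epsilon  # mean of |Laplace(0, b)| is b
    return expected_absolute_error / true_count


if __name__ == "__main__":
    for count in (100, 10_000, 1_000_000):
        error = expected_relative_error(count, epsilon=0.5)
        print(f"count={count:>9,}  expected relative error={error:.6%}")
```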
And what about the user side? Personal Data Stores
Personal Data Stores (PDS) are local or online storage systems that enable individuals to access and control the data they generate.
These tools and services, whose design can integrate a certain number of other PETs, allow users to retrieve and manage their data, and to decide with whom they share it, for what purposes and for how long. This is known as “granular consent”.
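As a rough sketch of what granular consent could look like as a data structure, the example below records who may access which category of data, for what purpose and until when. It does not reflect the API of any existing PDS product; every class, field and value is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ConsentGrant:
    recipient: str        # with whom the data is shared
    data_category: str    # which data
    purpose: str          # for what purpose
    expires_at: datetime  # for how long


def is_access_allowed(grants, recipient, data_category, purpose, now):
    """Allow access only if an unexpired grant matches recipient, category and purpose."""
    return any(
        grant.recipient == recipient
        and grant.data_category == data_category
        and grant.purpose == purpose
        and now < grant.expires_at
        for grant in grants
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    grants = [
        ConsentGrant("clinic", "heart_rate", "treatment", now + timedelta(days=30)),
    ]
    print(is_access_allowed(grants, "clinic", "heart_rate", "treatment", now))  # True
    print(is_access_allowed(grants, "clinic", "heart_rate", "marketing", now))  # False
```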
As underlined by the Royal Society in its detailed report on PETs, “PDS enable a distributed system, where the data is stored and processed at the ‘edge’ of the system, rather than centralised”.
This can help avoid a certain number of problems linked to the excessive concentration of data in one place, which could make an organisation a highly attractive target for hackers or create a power asymmetry (benefiting the digital giants for example).
It could also lead to the emergence of new economic models, enabling individuals to monetise their data.
There are already several PDS solutions such as MyDex, MyData, CitizenMe and Solid, the last of which was developed by Tim Berners-Lee, the inventor of the World Wide Web, at the Massachusetts Institute of Technology (MIT).
Reducing the drawbacks
Many reports suggest that privacy-enhancing technologies are promising. Once they are sufficiently mature, not only could they facilitate better protection of personal data, but they could also create new opportunities in terms of data analysis.
A large part of the research being carried out aims to reduce their drawbacks, such as the high cost of the computations performed on encrypted data for homomorphic encryption, or the risk of data utility loss for differential privacy.
For further reading, see the Royal Society of London’s report published in 2019: Protecting privacy in practice: The current use, development and limits of Privacy Enhancing Technologies in data analysis.